This patent document relates generally to systems and methods for optimizing artificial intelligence solutions. Examples of optimizing an artificial intelligence model in a semiconductor solution are provided.
Artificial intelligence solutions are emerging with the advancement of computing platforms and integrated circuit solutions. For example, an artificial intelligence (AI) integrated circuit (IC) may include a processor capable of performing AI tasks in embedded hardware. Hardware-based solutions, as well as software solutions, still encounter the challenge of obtaining an optimal AI model, such as a convolutional neural network (CNN). For example, if the weights of a CNN model are trained outside the chip, they are usually stored in floating point. When weights in floating point are loaded into an AI chip, they usually lose data bits, for example, dropping from 16 or 32 bits to 5 or 8 bits. This loss of data bits compromises the performance of the AI chip because information and data precision are lost.
The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.
As used in this document, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. As used in this document, the term “comprising” means “including, but not limited to.” Unless defined otherwise, all technical and scientific terms used in this document have the same meanings as commonly understood by one of ordinary skill in the art.
Each of the terms “artificial intelligence logic circuit” and “AI logic circuit” refers to a logic circuit that is configured to execute certain AI functions such as a neural network in AI or machine learning tasks. An AI logic circuit can be a processor. An AI logic circuit can also be a logic circuit that is controlled by an external processor and executes certain AI functions.
Each of the terms “integrated circuit,” “semiconductor chip,” “chip,” and “semiconductor device” refers to an integrated circuit (IC) that contains electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit can be a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC), or others. An integrated circuit that contains an AI logic circuit is referred to as an AI integrated circuit.
The term “AI chip” refers to a hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip can be a physical IC. For example, a physical AI chip may include an embedded cellular neural network (CeNN), which may contain parameters of a CNN. The AI chip may also be a virtual chip, i.e., software-based. For example, a virtual AI chip may include one or more processor simulators to implement functions of a desired AI logic circuit of a physical AI chip.
The term “AI model” refers to data that include one or more weights that, when loaded inside an AI chip, are used to execute the AI functions of the AI chip. For example, an AI model for a given CNN may include the weights and/or parameters for one or more convolutional layers of the CNN. In this document, the weights and parameters of an AI model are used interchangeably.
In
In some examples, the communication system 100 may be a centralized system. System 100 may also be a distributed or decentralized system, such as a peer-to-peer (P2P) system. For example, a host device, e.g., 110, 112, 114, and 116, may be a node in a P2P system. In a non-limiting example, a client device, e.g., 120a, 120b, 120c, or 120d, may include a processor and a physical AI chip. In another non-limiting example, multiple AI chips may be installed in a host device. For example, host device 116 may have multiple AI chips installed on one or more PCI boards in the host device or in a USB cradle that may communicate with the host device. Host device 116 may have access to dataset 156 and may communicate with one or more AI chips via PCI board(s), internal data buses, or other communication protocols such as universal serial bus (USB).
In some scenarios, the AI chip may contain an AI model for performing certain AI tasks, e.g., voice or image recognition tasks, or other tasks that may be performed using an AI model. In some examples, an AI model may include a forward propagation neural network, in which information may flow from the input layer to one or more hidden layers of the network to the output layer. For example, an AI model may include a convolutional neural network (CNN) that is trained to perform voice or image recognition tasks. A CNN may include multiple convolutional layers, each of which may include multiple weights and/or parameters. In such case, an AI model may include weights and/or one or more parameters of the CNN model. In some examples, the weights of a CNN model may include a mask and a scalar for a given layer of the CNN model. For example, a kernel in a CNN layer may be represented by a mask that has multiple values in lower precision multiplied by a scalar in higher precision. In some examples, an output channel of a CNN layer may include one or more bias values that, when added to the output of the output channel, adjust the output values to a desired range.
In a non-limiting example, in a CNN model, a computation in a given layer in the CNN may be expressed by Y=w*X+b, where X is input data, Y is output data in the given layer, w is a kernel, and b is a bias. Operation “*” is a convolution. Kernel w may include binary values. For example, a kernel may include 9 cells in a 3×3 mask, where each cell may have a binary value, such as “1” or “−1.” In such case, a kernel may be expressed by multiple binary values in the 3×3 mask multiplied by a scalar. The scalar may include a value having a bit width, such as 12-bit or 8-bit. Other bit lengths are also possible. By multiplying each binary value in the 3×3 mask by the scalar, the kernel can effectively be used in convolution operations at a higher bit length. Alternatively, and/or additionally, a kernel may contain n-value data, such as 7-value data. The bias b may contain a value having multiple bits, such as 12 bits or 20 bits. Other bit lengths are also possible.
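As a non-limiting illustration of this kernel representation, the following sketch (not part of any claimed embodiment; the mask values, scalar, bias, and input size are assumed solely for illustration) builds a 3×3 binary mask, scales it by a scalar, and evaluates Y=w*X+b using a standard 2D convolution:

```python
# Illustrative sketch only: mask values, scalar, bias, and input size are assumed.
import numpy as np
from scipy.signal import convolve2d

# 3x3 binary mask with cell values in {+1, -1}, as described above.
mask = np.array([[ 1, -1,  1],
                 [-1,  1, -1],
                 [ 1,  1, -1]], dtype=np.int8)

scalar = 0.125          # higher-precision scalar (e.g., stored with 12-bit width)
bias = 0.5              # bias b for the output channel

w = scalar * mask       # effective kernel: binary mask multiplied by the scalar

X = np.random.rand(8, 8).astype(np.float32)   # example input data

# Y = w * X + b, where "*" denotes convolution.
Y = convolve2d(X, w, mode="same") + bias
print(Y.shape)
```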
In the case of a physical AI chip, the AI chip may include an embedded cellular neural network that has memory containing the multiple weights and/or parameters in the CNN. In some scenarios, the memory in a physical AI chip may be a one-time-programmable (OTP) memory that allows a user to load a CNN model into the physical AI chip once. Alternatively, a physical AI chip may have a random access memory (RAM) or other types of memory that allow a user to update and load a CNN model into the physical AI chip multiple times.
In the case of a virtual AI chip, the AI chip may include a data structure that simulates the cellular neural network in a physical AI chip. A virtual AI chip can be particularly advantageous when multiple tests need to be run over various CNNs in order to determine a model that produces the best performance (e.g., highest recognition rate or lowest error rate). In a test run, the weights in the CNN can vary and be loaded into the virtual AI chip without the cost associated with a physical AI chip. Only after the CNN model is determined will the CNN model be loaded into a physical AI chip for real-time applications.
In some examples, the weights of an AI model for an AI chip may be trained partially on a computing system external to the AI chip to utilize the computing resources of the computing system. In such case, the weights of the AI model may be trained in floating point, as opposed to the fixed point representation that may be needed in a physical AI chip. In some examples, the trained weights in floating point may be directly downloaded to the AI chip, during which process the floating point weights will lose data precision or bit information from quantization, e.g., from 32 bits to 12 bits. In some examples, one or more gain values may be stored together with the floating point weights and used to convert the weights of the AI model from floating point to fixed point. For example, a gain vector including multiple gain values may be used to convert the weights of an AI model from floating point to fixed point. In some examples, an AI model may include a CNN model that includes multiple convolution layers. In such case, the gain vector may be represented as G={g1, g2, . . . , gn}, where n equals the number of convolution layers in the CNN model. Each of the multiple gain values in the gain vector may be used for a corresponding convolution layer of the CNN model. In some examples, a gain value may be applied to each layer, where the weights of the convolution layer are multiplied by the gain value, and the result of the multiplication is then quantized from floating point to fixed point. In other words, if the trained AI model is represented by (w, b), then the AI model (w×g, b×g) is quantized. The multiplication by the gain value for a layer (or by the gain vector for the AI model) may reduce the quantization error of each layer, such that the fixed point model performance can approach or approximate the floating point model performance. In some examples, the gain vector may be obtained via an optimizer based on the AI model itself and the test data for a specific AI task. The details of the optimizer are further described in
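As a non-limiting illustration, the following sketch applies a per-layer gain and a simple uniform quantizer to hypothetical floating point weights; the quantizer, bit width, and layer data are assumptions and may differ from the converter used by a particular AI chip:

```python
# Illustrative sketch: the quantizer, bit width, and layer data are assumptions.
import numpy as np

def quantize(x, num_bits=12):
    """Uniformly quantize a floating point array to signed fixed point levels."""
    max_level = 2 ** (num_bits - 1) - 1
    peak = np.max(np.abs(x))
    scale = peak / max_level if peak > 0 else 1.0
    return np.round(x / scale) * scale   # values now representable in num_bits

# Hypothetical floating point AI model: one (w, b) pair per convolution layer.
model_fp = [(np.random.randn(3, 3), np.random.randn(1)) for _ in range(4)]

# Gain vector G = {g1, g2, ..., gn}, one gain value per convolution layer.
G = [1.0, 0.8, 1.3, 0.9]

# Apply the gain to each layer, then quantize (w*g, b*g) to fixed point.
model_fixed = [(quantize(w * g), quantize(b * g)) for (w, b), g in zip(model_fp, G)]
```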
Returning to
Returning to
Now,
With further reference to
With further reference to
Returning to
In some scenarios, in determining the performance value of an AI model in fixed point, the output of the AI model may include a feature map of the AI model. The feature map may include the output of the AI model at a given layer. For example, in some scenarios, a CNN model may include multiple convolution layers followed by one or more fully connected layers. In an example implementation, an AI chip may include the multiple convolution layers, whereas the fully connected layers may be implemented in a processing device (e.g., a graphics processing unit (GPU)) outside the AI chip. In such case, the feature map may be selected as the output of the last convolution layer of the AI chip. By comparing the feature map of the AI model in fixed point and the feature map of the AI model in floating point, the performance value can be assessed. For example, the performance value may be based on the correlation of the two feature maps. In another example, the performance value may be based on comparing the two feature maps pixel-by-pixel and determining a sum of differences between corresponding pixels in the feature maps.
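A minimal sketch of both comparison options is shown below, assuming the two feature maps are arrays of equal shape; the function name and scoring choices are illustrative only:

```python
# Illustrative sketch: feature maps are assumed to be arrays of equal shape.
import numpy as np

def performance_from_feature_maps(fmap_fixed, fmap_float):
    """Score a fixed point model against its floating point counterpart."""
    a = fmap_fixed.ravel().astype(np.float64)
    b = fmap_float.ravel().astype(np.float64)

    # Option 1: correlation of the two feature maps (higher is better).
    correlation = np.corrcoef(a, b)[0, 1]

    # Option 2: pixel-by-pixel sum of absolute differences (lower is better).
    sum_of_differences = np.sum(np.abs(a - b))

    return correlation, sum_of_differences
```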
In other scenarios, the output of an AI model may be the final recognition result. In such case, the performance value of the AI model in fixed point may be based on the difference between the recognition result from the AI model in floating point and the recognition result from the AI model in fixed point for the same test data. In some examples, an image recognition result of an AI model may indicate which of multiple classes a given input image belongs to. In determining the performance value, for each input image in the test data, the classification results are compared and an error (or accuracy value) is calculated based on the difference between the two classification results, and a sum is calculated over the accuracy values for multiple input images in the test data.
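A minimal sketch of this recognition-result comparison is shown below, assuming the recognition results are integer class indices; the function name is illustrative only:

```python
# Illustrative sketch: recognition results are assumed to be integer class indices.
import numpy as np

def classification_performance(preds_fixed, preds_float):
    """Accumulate per-image agreement between fixed and floating point results."""
    preds_fixed = np.asarray(preds_fixed)
    preds_float = np.asarray(preds_float)
    # 1 where the fixed point model matches the floating point model, else 0.
    per_image_accuracy = (preds_fixed == preds_float).astype(np.float64)
    return per_image_accuracy.sum()   # sum over all test images
```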
With further reference to
Process 400 may repeat for a number of iterations until the iteration count has exceeded a threshold Tc at 414 and/or the time duration of the process has exceeded a threshold TD at 416. At each iteration, process 400 generates updated gains at 410, and repeats applying the gains to one or more AI chips at 404, determining performance values of the one or more AI chips at 406, and determining optimal gains for each AI chip at 407. For example, G″i,0, G″i,1, . . . , G″i,N-1 represent the gains, such as a gain vector, for each AI chip 0, 1, 2, . . . , N−1, respectively, at the ith iteration, where N represents the number of AI chips. In some examples, each AI chip may be coupled to a respective client device. In some examples, multiple AI chips may be coupled together to a client or a host device.
Let A″i,0, A″i,1, . . . , A″i,N-1 stand for the performance values of the AI model in fixed point based on the gains applied at box 404 to each of the N AI chips at the ith iteration. Then the optimal gain vector may be represented by Gi-n_op, where i stands for the current iteration, n stands for one of the N AI chips, and Gi-n_op=U(A″0,n, A″1,n, . . . , A″i-1,n). Function U may include selecting the gains that result in the best performance among the performance values from the iterations preceding the current iteration i. In some examples, a performance value A may include a single value, described above, indicating the difference between the output of the AI model in fixed point and the output of the corresponding AI model in floating point. U may indicate the highest performance value among the performance values of the AI model in fixed point during all of the previous iterations for an AI chip prior to the current iteration.
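A minimal sketch of function U under these assumptions (per-chip histories of gains and performance values kept as lists) is shown below; the names are illustrative only:

```python
# Illustrative sketch: gain and performance histories are assumed to be lists.
def best_gains_for_chip(gain_history, performance_history):
    """Function U: pick the gains that gave the highest performance so far.

    gain_history[i] and performance_history[i] hold the gain vector and the
    performance value A''_{i,n} of AI chip n at iteration i, for all iterations
    preceding the current one.
    """
    best_iteration = max(range(len(performance_history)),
                         key=lambda i: performance_history[i])
    return gain_history[best_iteration]
```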
With further reference to
In some examples, the process 400 may output the final global gains, to a converter that converts an AI model in floating point to an AI model in fixed point for uploading to an AI chip. In some examples, the global gains may be shared among multiple processing devices on the network, in which any device may use the global gains. An example of a converter is shown as 214 in
As described in some examples, gains may be represented by a 1D vector, which contains all of the gains for the one or more AI chips. When gains are represented by a 1D vector, a subtraction of two gain vectors may result in a 1D vector containing multiple gain values, each of which is a subtraction of two corresponding gain values in the two 1D gain vectors, respectively. An addition of two gain vectors may result in a 1D vector containing multiple gain values, each of which is a sum of two corresponding gain values in the two 1D gain vectors. An average of two gain vectors may result in a 1D vector containing multiple gain values, each of which is an average of the corresponding gain values in the two 1D gain vectors. Similarly, a gain vector may be incremented (added or subtracted) by a perturbation. The resulting gain vector may contain multiple gain values, each of which includes a corresponding gain value in the original gain vector incremented (added or subtracted) by a corresponding value in the perturbation. In some examples, an addition or subtraction of two gain vectors may be performed in a finite field; in other examples, the addition of gain vectors may be performed in a real coordinate space.
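A minimal sketch of this element-wise gain vector arithmetic, performed in a real coordinate space with assumed values, is shown below:

```python
# Illustrative sketch: gain vector values and the perturbation are assumed.
import numpy as np

g_a = np.array([1.0, 0.8, 1.2])            # gain vector for one AI model
g_b = np.array([0.9, 1.1, 1.0])            # gain vector for another AI model
perturbation = np.array([0.05, -0.02, 0.01])

difference = g_a - g_b                     # element-wise subtraction
total = g_a + g_b                          # element-wise addition
average = (g_a + g_b) / 2.0                # element-wise average
perturbed = g_a + perturbation             # gain vector incremented by a perturbation
```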
With further reference to
In some examples, the process may determine a velocity of gains for AI chip n, Vi_n at the current iteration i based on the velocity of gains at its previous iteration Vi-1_n. The new velocity Vi_n may also be determined based on the closeness of the current gains relative to the optimal gains for an AI chip. The new velocity of the gains may also be based on the closeness of the current gains relative to the global gains. The closer the current gains are to the optimal gains and/or the global gains, the lower the velocity of gains for the next iteration may be. For example, a velocity of the gain vector for AI chip n at the current ith iteration may be expressed as:
Vi_n = w*Vi-1_n + c1*r1*(Gi-n_op − Gi-1_n) + c2*r2*(Gglobal − Gi-1_n)
where w is the inertial coefficient, c1 and c2 are acceleration coefficients, and r1 and r2 are random numbers. In some examples, w may be a constant selected from the range [0.8, 1.2], such as 0.9, or other values. Coefficients c1 and c2 may be constants in the range [0, 2], such as 2.0, or other values. Random numbers r1 and r2 may be generated at each iteration i. The determination of the velocity of gains described herein may allow the training process to update the gains at each iteration, moving towards the local optimal AI model (per AI chip) and the global optimal model of the system.
In some examples, gains, such as Gi-1_n, may be a 1D vector containing multiple gain values for one of the multiple AI chips. A subtraction of two gain vectors, such as Gglobal−Gi-1_n, may also be a 1D vector containing multiple gain values, each of which is a subtraction of two corresponding gain values in Gglobal and Gi-1_n. In some examples, r1 and r2 may be diagonal matrices, for example, n×n matrices, in which each diagonal entry corresponds to a different randomly generated value. In some examples, the values of r1 and r2 may be randomly generated between [0, 1]. As such, the training process, such as process 400, becomes an n-dimensional optimization problem. As described herein, the velocity of the gains, e.g., Vi_n, Vi-1_n, may contain the same number of parameters as a gain vector and have the same dimension as the gain vector. Once the velocity Vi_n is determined, the process may increment the gains from the previous iteration by the new velocity to determine the updated gains. For example, the updated gains for AI chip n may be determined as Gi_n=Gi-1_n+Vi_n. Process 400 may determine the updated gains for all of the AI chips n=0, 1, . . . , N−1 in a similar manner. Upon completion of the process at 418, process 400 may further transmit the global gains to a converter, such as 214 in
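A minimal sketch of this velocity-based gain update for a single AI chip is shown below; the coefficient values, random number generation, and function name are assumptions for illustration and are not fixed by this description:

```python
# Illustrative sketch of the velocity and gain update for one AI chip; the
# coefficient values and vector length are assumptions, not fixed by the text.
import numpy as np

def update_gains(v_prev, g_prev, g_chip_opt, g_global,
                 w=0.9, c1=2.0, c2=2.0, rng=np.random.default_rng()):
    """One iteration of the velocity-based gain update for AI chip n."""
    r1 = rng.random(g_prev.shape)      # random values in [0, 1], drawn each iteration
    r2 = rng.random(g_prev.shape)

    # Vi_n = w*Vi-1_n + c1*r1*(Gi-n_op - Gi-1_n) + c2*r2*(Gglobal - Gi-1_n)
    v_new = (w * v_prev
             + c1 * r1 * (g_chip_opt - g_prev)
             + c2 * r2 * (g_global - g_prev))

    g_new = g_prev + v_new             # Gi_n = Gi-1_n + Vi_n
    return v_new, g_new
```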
It is appreciated that the disclosures of various embodiments in
An optional display interface 530 may permit information from the bus 500 to be displayed on a display device 535 in visual, graphic, or alphanumeric format. An audio interface and audio output (such as a speaker) also may be provided. Communication with external devices may occur using various communication ports 540, such as a transmitter and/or receiver, an antenna, an RFID tag, and/or short-range or near-field communication circuitry. A communication port 540 may be attached to a communications network, such as the Internet, a local area network, or a cellular telephone data network.
The hardware may also include a user interface sensor 545 that allows for receipt of data from input devices 550 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device, and/or an audio input device, such as a microphone. Digital image frames may also be received from an image capturing device 555, such as a video camera or still camera, that can be either built-in or external to the system. Other environmental sensors 560, such as a GPS system and/or a temperature sensor, may be installed on the system and communicatively accessible by the processor 505, either directly or via the communication ports 540. The communication ports 540 may also communicate with the AI chip to upload or retrieve data to/from the chip. For example, the global optimal AI model may be shared by all of the processing devices on the network. Any device on the network may receive the global AI model from the network and upload the global AI model, e.g., CNN weights, to the AI chip via the communication port 540 and an SDK (software development kit). Additionally, and/or alternatively, a device on the network may receive global gains, e.g., a gain vector, and a global AI model, both of which may be obtained via one or more training processes. The device may convert the global AI model to an AI model in fixed point by applying the global gains. The communication port 540 may also communicate with any other interface circuit or device that is designed for communicating with an integrated circuit.
Optionally, the hardware may not need to include a memory; instead, programming instructions may be run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, a virtual network, and applications, and the programming instructions for implementing the various functions described in this document may be stored on one or more of those virtual machines on the cloud.
Various embodiments described above may be implemented and adapted to various applications. For example, the AI chip having a CeNN architecture may reside in an electronic mobile device. The electronic mobile device may use the built-in AI chip to produce recognition results and generate performance values. In some scenarios, obtaining the CNN can be done in the mobile device itself, where the mobile device retrieves test data from a dataset and uses the built-in AI chip to perform the training. In other scenarios, the processing device may be a server device in the communication network (e.g., 102 in
The various systems and methods disclosed in this patent document provide advantages over the prior art, whether implemented standalone or combined. For example, using the systems and methods described in
It will be readily understood that the components of the present solution as generally described herein and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of various implementations, as represented in the figures, is not intended to limit the scope of the present disclosure, but is merely representative of various implementations. While the various aspects of the present solution are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The present solution may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the present solution is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present solution should be or are in any single embodiment thereof. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present solution. Thus, discussions of the features and advantages, and similar language, throughout the specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages, and characteristics of the present solution may be combined in any suitable manner in one or more embodiments. One ordinarily skilled in the relevant art will recognize, in light of the description herein, that the present solution can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present solution.
Other advantages can be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications, or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and all combinations of various embodiments that are within the scope and spirit of the invention as defined in the claims.