This patent document relates generally to encoding data into an artificial intelligence integrated circuit. For example, systems and methods for generating time-spectral diagrams in an audio recognition integrated circuit solution are described.
In an artificial intelligence solution, such as a recurrent neural network (RNN) or convolutional neural network (CNN), audio recognition tasks typically require preprocessing raw audio data in the time domain to generate spectral information, such as a spectrogram or Mel-frequency cepstral coefficients (MFCC), before training and recognition tasks are performed. These preprocessing tasks may impose challenges, particularly on an embedded device with limited computing power, as such preprocessing tasks are usually computation intensive and may drain significant resources from the device. For example, for an input image with a size of 224×224 and one channel, 224 Fast Fourier Transforms (FFTs) (1024- or 512-point) are required. On an Android system, this may take the microcontroller 370 ms to compute. For a typical PC with an i7 processor, it may take 15 ms to compute the 224×224 spectrogram for one channel. Whether these preprocessing tasks are performed on an embedded device or a desktop computer, significant computing power may be needed.
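For illustration only, the following non-limiting Python sketch shows the kind of host-side preprocessing described above: building a 224×224 spectrogram with 224 short-time FFTs. The FFT size, hop length, window, and function name are assumptions made for this sketch and are not part of any embodiment described in this document.

```python
import numpy as np

def spectrogram_224(waveform, n_fft=512):
    """Compute a 224x224 magnitude spectrogram using 224 short-time FFTs.

    `waveform` is a 1-D array of raw audio samples; the hop size and FFT
    size are assumed values chosen only for this illustration.
    """
    hop = max(1, (len(waveform) - n_fft) // 223)   # spread 224 frames across the clip
    frames = [waveform[i * hop: i * hop + n_fft] for i in range(224)]
    window = np.hanning(n_fft)
    spec = np.abs(np.fft.rfft(np.stack(frames) * window, axis=1))  # 224 x (n_fft/2 + 1)
    # Keep the first 224 frequency bins to form a 224x224 time-spectral image.
    return spec[:, :224]

# Example: one second of a 16 kHz test tone.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
image = spectrogram_224(audio)
print(image.shape)  # (224, 224)
```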
Additionally, AI integrated circuit solutions may also face challenges in arranging data to be loaded into an AI chip having physical constraints. Meaningful models can be obtained through training only if data are arranged (encoded) properly inside the chip. For example, if intrinsic relationships exist among events that occur proximately in time (e.g., waveform segments in a syllable or in a phrase in a speech), then the intrinsic relationships may be discovered by the training process when the data that are captured proximately in time are arranged to be loaded into the AI chip and processed by the AI chip concurrently.
The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.
As used in this document, the singular forms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used in this document have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term "comprising" means "including, but not limited to."
Each of the terms “artificial intelligence logic circuit” and “AI logic circuit” may include an example of a logic circuit that is configured to execute certain AI functions such as a neural network in AI or machine learning tasks. An AI logic circuit may be a processor. An AI logic circuit may also be a logic circuit that is controlled by an external processor and executes certain AI functions.
Each of the terms “integrated circuit,” “semiconductor chip,” “chip” and “semiconductor device” may include an example of an integrated circuit (IC) that contains electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit may include a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC) or others. An integrated circuit that contains an AI logic circuit is referred to as an AI integrated circuit or an AI chip.
The term “AI chip” may include a hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip may be a physical AI integrated circuit or a virtual chip, i.e., software-based. For example, a virtual AI chip may include one or more process simulators to simulate the operations of a physical AI integrated circuit.
The term "AI model" may include data that include one or more weights that, when loaded inside an AI chip, are used for executing the AI chip. For example, an AI model for a given CNN may include the weights for one or more convolutional layers of the CNN.
Each of the terms "data precision," "precision" and "numerical precision," as used in representing values in a digital representation in a memory, refers to the maximum number of values that the digital representation can represent. If two data values are represented in the same digital representation, for example, as an unsigned integer, a data value represented by more bits in the memory generally has a higher precision than a data value represented by fewer bits. For example, a data value using 5 bits (at most 2^5 = 32 distinct values) has a lower precision than a data value using 8 bits (at most 2^8 = 256 distinct values).
With reference to
System 100 may further include a communication network 108 that is in communication with the processing devices 102a-102d. Each processing device 102a-102d in system 100 may be in electrical communication with other processing devices via the communication network 108. Communication network 108 may include any suitable communication links, such as wired (e.g., serial, parallel, optical, or Ethernet connections) or wireless (e.g., Wi-Fi, Bluetooth, mesh network connections) or any suitable communication network later developed. In some scenarios, the processing devices 102a-102d may communicate with each other via a peer-to-peer (P2P) network or a client/server based communication protocol. System 100 may also include one or more AI models 106a-106b. System 100 may also include one or more databases that contain test data for training the one or more AI models 106a-106b.
In some scenarios, the AI chip may contain an AI model for performing certain AI tasks. For example, an AI model may be a CNN that is trained to perform voice or image recognition tasks. A CNN may include multiple convolutional layers, each of which may include multiple weights. In the case of a physical AI chip, the AI chip may include an embedded cellular neural network that has a memory for containing the multiple weights in the CNN. In some scenarios, the memory in a physical AI chip may be a one-time-programmable (OTP) memory that allows a user to load a CNN model into the physical AI chip once. Alternatively, a physical AI chip may have a random access memory (RAM) or other types of memory that allows a user to load and/or update a CNN model in the physical AI chip.
In the case of a virtual AI chip, the AI chip may include a data structure to simulate the cellular neural network in a physical AI chip. A virtual AI chip can be particularly advantageous when multiple tests need to be run over various CNNs in order to determine a model that produces the best performance (e.g., highest recognition rate or lowest error rate). In each test run, the weights in the CNN can vary and, each time the CNN is updated, the weights in the CNN can be loaded into the virtual AI chip without the cost associated with a physical AI chip. After the CNN model is determined, the final CNN model may be loaded into a physical AI chip for real-time applications.
Each of the processing devices 102a-102d may be any suitable device for performing an AI task (e.g., voice recognition, image recognition, scene recognition, etc.), training an AI model 106a-106b or capturing test data 104. For example, the processing device may be a desktop computer, an electronic mobile device, a tablet PC, a server or a virtual machine on the cloud. Various methods may be implemented in the above-described embodiments in
With reference to
The encoding method may also include loading the input voice data into an AI chip at 204. In loading the input voice data into the AI chip at 204, the input voice data may be loaded into one or more channels in a cellular neural network (CeNN) in the AI chip; examples of various arrangements of voice data will be described in further detail. In some examples, the AI chip may contain an AI model, e.g., a CNN, which may have multiple layers, each having a filter/kernel. For example, a filter/kernel may be a 3 by 3 array, and the AI chip may be programmed to perform a convolution by applying a filter/kernel to each respective layer in the CNN. In some examples, the encoding method may include programming the first N layers of the AI chip to generate time-spectral information at 206. For example, programming may include feeding programming instructions to the AI chip to cause a microprocessor of the AI chip to operate. Programming may also include sending command instructions to the AI chip from a controller (e.g., an external device, such as a mobile or desktop device) to cause the AI chip to operate. In some examples, raw voice data may be received at a first layer of the CNN and propagated through one or more layers (e.g., N layers) of the CNN to generate the time-spectral information. A time-spectral diagram includes a plurality of pixels, each comprising a value that represents an audio intensity of the segment of the audio waveform at a time in the segment and at a frequency. The number N may be any suitable number. This step eliminates the need to preprocess audio data to generate a time-spectral diagram, e.g., a spectrogram, before the audio data are loaded into an AI chip, thus achieving faster computation by utilizing the hardware of the AI chip.
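For illustration only, the following non-limiting Python sketch shows how a time-spectral diagram of the kind described above may be formed, with each pixel holding an intensity value indexed by time and by filter (frequency band). The small sinusoidal filter bank used here is a hypothetical stand-in for the responses produced by the first N trained layers; the sizes and the function name are assumptions for this sketch.

```python
import numpy as np

def time_spectral_diagram(segment, filter_bank):
    """Build a (num_filters x num_samples) intensity map from a raw-audio segment."""
    rows = []
    for filt in filter_bank:                       # one row per frequency band
        response = np.convolve(segment, filt, mode="same")
        rows.append(np.abs(response))              # intensity at each time step
    return np.stack(rows)                          # shape: (num_filters, len(segment))

# Hypothetical filter bank and segment, used only to demonstrate the shapes.
filters = [np.sin(2 * np.pi * f * np.arange(16) / 16) for f in (1, 2, 4, 8)]
segment = np.random.randn(224)                     # a short raw-audio segment
S = time_spectral_diagram(segment, filters)
print(S.shape)                                     # (4, 224): frequency x time
```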
With further reference to
The method may further include outputting the voice recognition result at 216. Outputting the voice recognition result at 216 may include storing a digital representation of the recognition result to a memory device inside or outside the AI chip; the content of the memory can be retrieved by the application running the AI task, an external device or a process. The application running the AI task may be an application running inside the AI integrated circuit, should the AI integrated circuit also have a processor. The application may also run on a processor on the communication network (102c-102d in
In a non-limiting example, the embedded CeNN in the AI chip may be configured to have a maximal number of channels, e.g., 3, 8, 16, 128 or other numbers, and each channel may be configured to include a 2D array having a size, e.g., 224 by 224 pixels, and each pixel value may have a depth, such as, for example, 5 bits. Input data for any AI task using the AI chip must be encoded to adapt to such hardware constraints of the AI chip. For example, loading the input voice data into the AI chip at 204 may include arranging voice data in one dimension (1D) into rows and columns in two dimensions (2D) in a column-wise or row-wise fashion. Assuming the 2D array in each channel of the CeNN includes 224×224 pixels, the method may fill the first column of the input array of the CeNN with the first 224 data points in the voice input data, followed by the second column, which takes the next 224 data points in the voice input data. In a non-limiting example, each CeNN layer may have a number of channels, and the voice input data may also be filled into each layer in a channel-wise fashion, followed by a column-wise or row-wise fashion. The above-described 2D array sizes, channel number and depth for each channel are illustrative only. Other sizes may be possible. Further, the number of 2D arrays for encoding into the CeNN in the AI chip may be smaller than the maximum number of channels of the CeNN in the AI chip.
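For illustration only, the following non-limiting Python sketch arranges a 1D voice buffer into a channel/height/width array in a column-wise fashion, so that the first 224 samples fill the first column, the next 224 samples fill the second column, and so on. The array sizes, zero-padding behavior, and function name are assumptions for this sketch and do not limit the encoding described above.

```python
import numpy as np

def arrange_column_wise(voice, height=224, width=224, channels=1):
    """Arrange a 1-D voice buffer into a (channels, height, width) array column-wise.

    Sizes are illustrative; an actual CeNN imposes its own hardware constraints.
    """
    needed = channels * height * width
    buf = np.zeros(needed, dtype=voice.dtype)
    buf[:min(needed, voice.size)] = voice[:needed]        # zero-pad or truncate
    # Fill each channel, then fill every column of that channel top to bottom:
    # the first `height` samples form column 0, the next `height` form column 1, ...
    planes = buf.reshape(channels, width, height)          # column-major per channel
    return planes.transpose(0, 2, 1)                       # (channels, height, width)

samples = np.arange(224 * 224, dtype=np.float32)
encoded = arrange_column_wise(samples)
print(encoded.shape, encoded[0, 0, 0], encoded[0, 223, 0], encoded[0, 0, 1])
# (1, 224, 224) 0.0 223.0 224.0 -> the first column holds the first 224 samples
```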
In some scenarios, the embedded CeNN in the AI chip may store a CNN that was trained and pre-loaded. The structure of the CNN may correspond to the same constraints of the AI integrated circuit. For example, for the above illustrated example of the embedded CeNN, the CNN may correspondingly be structured to have three channels, each having an array of 224×224 pixels, and each pixel may have a 5-bit value. The training of the CNN may include encoding the training data and programming the AI chip in the same manner as described in the recognition process (e.g., blocks 204, 206), and an example of a training process is further explained below.
With continued reference to
In some examples, the sample training voice data may be loaded into the AI chip in a similar manner as in block 204. For example, loading the sample training voice data into the AI chip at 224 may include arranging the sample training voice data in one dimension (1D) into rows and columns in two dimensions (2D) in a column-wise or row-wise fashion, or in a channel-wise fashion followed by a column-wise or row-wise fashion. Assuming the 2D array in each channel of the CeNN includes 224×224 pixels, the method may fill the first column of the input array of the CeNN with the first 224 data points in the sample training input data, followed by the second column, which takes the next 224 data points in the sample training voice data. In a non-limiting example, each CeNN layer may have a number of channels, and the sample training input data may also be filled into each layer in a channel-wise fashion, followed by a column-wise or row-wise fashion. The above-described 2D array sizes, channel number and depth for each channel are illustrative only and may vary depending on hardware constraints.
In some examples, the training method may include programming the first N layers of the AI chip to generate time-spectral information at 226, similar to the process 206. For example, programming may include feeding programming instructions to the AI chip to cause a microprocessor of the AI chip to operate. Programming may also include sending command instructions to the AI chip from a controller (e.g., an external device, such as a mobile or desktop device) to cause the AI chip to operate. In some examples, similar to the process 206, raw sample training voice data may be received at a first layer of the CNN and propagated through the N layers of the CNN to generate the time-spectral information. The number N may be any suitable number and may be identical to the number N in process 206. This step eliminates the need to preprocess audio data to generate a time-spectral diagram, e.g., a spectrogram, before the audio data are loaded into an AI chip, thus resulting in faster computation by utilizing the hardware of the AI chip.
With further reference to
In another non-limiting example, a voice recognition task may be designed to recognize the content of the voice input, for example, a syllable, a word, a phrase or a sentence. In each of these cases, the CNN may include a multi-class classifier that assigns each segment of input voice data to one of the multiple classes. Correspondingly, the training process also uses the same CNN structure and multi-class classifier and, for each training sample, receives an indication of the class to which the sample belongs.
Alternatively, and/or additionally, in some scenarios, a voice recognition task may include feature extraction, in which the voice recognition result may include, for example, a vector that may be invariant to a given class of samples, e.g., a given person's utterance regardless of the exact word spoken. In a CNN, both training and recognition may use a similar approach. For example, the system may use any of the fully connected layers in the CNN, after the convolution layers and before the softmax layer. In a non-limiting example, let the CNN have six convolution layers followed by four fully connected layers. In some scenarios, the last fully connected layer may be a softmax layer in which the system stores the classification results, and the system may use the second-to-last fully connected layer to store the feature vector. There may be various configurations depending on the size of the feature vector. A large feature vector may result in large capacity and high accuracy for classification tasks, whereas a feature vector that is too large may reduce efficiency in performing the voice recognition tasks.
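For illustration only, the following non-limiting Python sketch (using the PyTorch library) shows one way a feature vector may be read from the second-to-last fully connected layer while the last fully connected layer feeds a softmax classifier. The network here is deliberately much smaller than the six-convolution-layer example above; the layer counts, sizes, and the name ToyVoiceNet are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class ToyVoiceNet(nn.Module):
    """Toy network: convolution layers, then FC layers; the feature vector is the
    output of the last FC layer before the classifier (softmax) layer."""
    def __init__(self, num_classes=10, feature_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 14 * 14, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),   # feature vector lives here
        )
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        features = self.fc(self.conv(x))               # class-invariant feature vector
        probs = torch.softmax(self.classifier(features), dim=1)  # classification result
        return features, probs

net = ToyVoiceNet()
voice_image = torch.randn(1, 1, 224, 224)              # encoded voice data, one channel
features, probs = net(voice_image)
print(features.shape, probs.shape)                     # torch.Size([1, 64]) torch.Size([1, 10])
```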
The system may use other techniques to train the feature vectors directly without using the softmax layer. Such techniques may include Siamese networks and methods used in dimension reduction techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE).
In some examples, input data 302 may be received at the first layer of the CNN. For example, the input data may include the raw audio waveform and contain a number of audio data points. For example, the input data may include a segment of audio data from time T0 to T0+(H×W)×ΔT, where H and W are the height and width of each layer in the CNN. For example, H and W each may have a value of 224. The time difference ΔT may be the time difference between adjacent data points in the input data. For example, ΔT may be the inverse of the sampling rate of the raw audio waveform. Alternatively, ΔT may be different from the inverse of the sampling rate of the raw audio waveform.
In some examples, input data may be convolved with a respective filter/kernel in each layer, and the convolution result is fed into the next layer. For example, in layer 1, X1=X0*w1+b1, where X0 is the input data, w1 is the first filter/kernel, X1 is the output for layer 1, and b1 is the bias term for layer 1. The operation "*" is the convolution. In layer 2, X2=X1*w2+b2, and so on; in layer N, XN=XN-1*wN+bN. Applying these operations to the input data at 302 with filter banks "f1, . . . , fn," the output for each layer may correspond to the frequency axis f at time T0. Similarly, input data at 312 that include a segment of audio data from time T0+ΔT to T0+ΔT+(H×W)×ΔT may be received at the first layer of the CNN and convolved with the wavelet filter banks "f1, f2, . . . , fn" in each layer to generate the result that corresponds to the time-spectral diagram at 320 at time T0+ΔT. Here, in the example in
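For illustration only, the following non-limiting Python sketch propagates one audio segment through N layers, applying a filter and bias at each layer (layer k computes X_k = X_{k-1} * w_k + b_k) and reducing each layer's response to a single intensity value, so that the stacked values form one column of the time-spectral diagram at time T0. The filters, biases, and the reduction to a mean magnitude are assumed stand-ins for the wavelet filter banks f1, . . . , fn described above.

```python
import numpy as np

def forward_layers(x0, filters, biases):
    """Propagate a segment through layers; return one value per filter band."""
    x = x0
    column = []
    for w, b in zip(filters, biases):              # layer k: X_k = X_{k-1} * w_k + b_k
        x = np.convolve(x, w, mode="same") + b
        column.append(np.abs(x).mean())            # one intensity value per band
    return np.array(column)                        # frequency axis f at time T0

segment_T0 = np.random.randn(224 * 224)            # samples from T0 to T0 + (H*W)*dT
filters = [np.hanning(9) * np.cos(2 * np.pi * f * np.arange(9) / 9) for f in (1, 2, 3, 4)]
biases = [0.0] * 4
column_T0 = forward_layers(segment_T0, filters, biases)
print(column_T0.shape)                             # (4,): one value per frequency band
```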
The processes described herein facilitate the use of raw audio data in performing training and recognition tasks. Acoustic sound waves (e.g., raw data) in general have certain properties that exist in certain time ranges/windows. The use of wavelet windows, e.g., f1, f2, . . . , fn in
With further reference to
In some examples, when keeping the kernel pattern fixed in time, a learning algorithm will discover the "wavelet filter banks" through the training of the neural network. For example, if a CNN is trained to detect speakers, the "words" or "contents" may not be relevant in discovering speaker identity; however, a feature (fingerprint) of the spectrogram might be. In other words, a feature in a time-spectral diagram, such as diagram 320 in
The set of filter banks may be “encoded” into multiple kernels in various ways. In some examples, the method may program the CNN to have multiple kernels each containing one of the filters, such as one of f1, f2, . . . fn in
In some examples, the method may program the CNN to use a combination of multiple kernels to represent one filter in the set of filter banks to accommodate more sampling points in each filter. For example, five kernels of a size of 3×3×1 may be used to contain 45 sampling points in a filter. In a non-limiting example, with reference to
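For illustration only, the following non-limiting Python sketch splits a 45-point filter across five 3×3×1 kernels (9 sampling points per kernel) and recovers the filter by reading the kernels back in order, following the example above. The Hanning-shaped filter and function names are assumptions for this sketch.

```python
import numpy as np

def filter_to_kernels(filt, kernel_size=3, num_kernels=5):
    """Split one long filter into several small kernels, in order."""
    points = kernel_size * kernel_size
    assert filt.size == points * num_kernels       # 45 sampling points in this example
    return filt.reshape(num_kernels, kernel_size, kernel_size)

def kernels_to_filter(kernels):
    """Read the kernels back in order to recover the original filter."""
    return kernels.reshape(-1)

wavelet = np.hanning(45)                           # a hypothetical 45-point filter
kernels = filter_to_kernels(wavelet)
print(kernels.shape)                                      # (5, 3, 3)
print(np.allclose(kernels_to_filter(kernels), wavelet))   # True
```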
With further reference to
With further reference to
When each kernel is convolved with the input data y(t), in a column-wise representation, the convolution result is Y=y (*) ω, where y(i, j, k) is convolved with ω(ci, cj, ck), and (*) is the convolution operator. In an element-by-element representation, Y(i, j) = sum over (ci, cj, ck) of y(i+ci, j+cj, k+ck)*ω(ci, cj, ck), where the sum over (ci, cj, ck) is the summation over all possible (ci, cj, ck) values, e.g., all elements of all kernels. In performing summation, in a second layer, such as 508 in
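For illustration only, the following non-limiting Python sketch writes out the element-by-element convolution sum Y(i, j) = sum over (ci, cj, ck) of y(i+ci, j+cj, k+ck)*ω(ci, cj, ck) with explicit loops for a single 3×3 kernel spanning all channels. Padding, stride, and boundary handling are omitted, and the array sizes are assumed for this sketch.

```python
import numpy as np

def conv_element(y, w, i, j):
    """Compute Y(i, j) for one kernel w over input y, by summing over (ci, cj, ck)."""
    ks, _, channels = w.shape
    total = 0.0
    for ci in range(ks):
        for cj in range(ks):
            for ck in range(channels):
                total += y[i + ci, j + cj, ck] * w[ci, cj, ck]
    return total

y = np.random.randn(224, 224, 3)                   # encoded input data, 3 channels
w = np.random.randn(3, 3, 3)                       # one 3x3 kernel over all 3 channels
print(conv_element(y, w, 0, 0))                    # Y(0, 0) for this kernel
```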
With reference to
In some examples, arranging continuous blocks of data in a kernel may treat the blocks of continuous data, such as three blocks, as three independent (uncorrelated) samples. In outputting the convolution result, the three blocks of data may each use an identical filter and share the same weights. The resulting convolution may be treated as the average of three identical filters.
Alternatively, each kernel may use more than one or all of the columns to represent a single filter. For example, a kernel may include three blocks of continuous data (in three columns) as related samples, or a subset of a large sample. A wavelet filter may be represented across multiple blocks of continuous data, such as multiple columns of a kernel as shown in
With further reference to
In arranging the data in a multi-resolution fashion, proper sub-sampling (e.g., regularly skipping a certain fixed number of samples, such as taking one out of every four samples) may be used. Using multi-resolution encoding, data within each larger block, such as 920(1) having C×4×4 samples, may span a period of time in the raw audio data and capture several audio segments in different time scales. Thus, different data samples in different blocks at the same resolution level, such as 920(1), 920(2), 920(3), 920(4), should share the same time-invariant properties, i.e., share the same filters. In such a configuration, only one convolution layer may be used to capture voice features at several time scales within a short time window. The resulting "feature maps" (e.g., outputs of the convolution) may serve as overlapped, sequential inputs for subsequent layers.
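For illustration only, the following non-limiting Python sketch builds blocks of the same waveform at several time scales by regular skipping (sub-sampling), e.g., taking one out of every four samples at each coarser level, so that blocks at coarser levels span longer periods of the raw audio with the same block size. The 4×4 block size, number of levels, and function name are assumptions for this sketch.

```python
import numpy as np

def multi_resolution_blocks(waveform, block=4, levels=3, skip=4):
    """Return one block-by-block view of the waveform per resolution level."""
    blocks = []
    for level in range(levels):
        step = skip ** level                       # 1, 4, 16, ... samples apart
        coarse = waveform[::step][: block * block]
        blocks.append(coarse.reshape(block, block))
    return blocks                                  # the same filters apply at every level

audio = np.arange(1024, dtype=np.float32)
for level, blk in enumerate(multi_resolution_blocks(audio)):
    # Each level spans a longer period of the raw audio with the same block size.
    print(level, blk.shape, float(blk[0, 0]), float(blk[-1, -1]))
```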
With further reference to
With further reference to
In obtaining the coefficients for kernels in various layers in the CNN, in some examples, a linear combination can be used. For example, let data in the second layer L2(X′) be expressed in terms of data in the first layer, e.g., L2(X′)=sum(L1(X)*K1), where K1 is the kernel in the first layer, L1(X) is the data in the first layer, and "*" is the convolution operator. Similarly, data in the third layer L3(X″) may be expressed as L3(X″)=sum(L2(X′)*K2)=sum(sum(L1(X)*K1)*K2), where K2 is the kernel in the second layer. If each of the above formulas is carried out at each layer and the data in the third layer L3(X″) are expressed as a linear combination of the data in the first layer L1(X), then kernel coefficients may be obtained such that the data in the third layer are a valid filtered value of the input data.
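For illustration only, the following non-limiting Python sketch checks, in one dimension, that cascading two layers of convolution, (L1(X)*K1)*K2, equals a single convolution of the first-layer data with the combined kernel K1*K2, which is why coefficients at a deeper layer can be expressed as a linear combination of first-layer data. The signal and kernel lengths are assumptions for this sketch, and "full" convolution is used so that no samples are dropped at the boundaries.

```python
import numpy as np

L1 = np.random.randn(32)                               # data in the first layer
K1 = np.random.randn(5)                                # kernel in the first layer
K2 = np.random.randn(5)                                # kernel in the second layer

L2 = np.convolve(L1, K1, mode="full")                  # L2(X') = L1(X) * K1
L3_cascaded = np.convolve(L2, K2, mode="full")         # L3(X'') = (L1(X) * K1) * K2
L3_combined = np.convolve(L1, np.convolve(K1, K2, mode="full"), mode="full")
print(np.allclose(L3_cascaded, L3_combined))           # True: convolution is associative
```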
Additionally and/or alternatively, each of the filter banks may be learned via training of the CNN. For example, given training targets, such as a lower cross-entropy cost or softmax values, etc., a training algorithm may be used to discover filter values, which may be expressed across several layers. In achieving this, a "feature" is relative to the range of data seen at each layer in a hierarchical (layered) network. In other words, "features" at the lower layers can be expressed by "features" of the upper layers.
It is appreciated that data arrangement may vary in achieving multi-resolution convolution. For example, in each channel, input data may be arranged in a column-wise fashion first, followed by a row-wise fashion. Further,
Multi-resolution encoding may allow each layer of the CNN to "see" a different resolution. For example, a CNN may include three layers, each including a kernel of a size of 2×2. As such, a data point "x" in layer 2 may be affected by a 2×2 data block in layer 1, which may be subsequently affected by a 2×2 data block in layer 0. If a stride of two is used to avoid data reuse in convolution, the regions seen by different layers have different "resolutions" on the input image. Although kernels at different layers may all be 2×2 or 3×3, the "data" they see are different at different layers. In layer 1, "features" on the smallest time scale may be seen, whereas the same "features" in layer 1 may be "seen" on a larger time scale in layer 2. Those "features" may be trained as kernel parameters on each layer in a training process, such as described in
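For illustration only, the following non-limiting Python sketch computes how many input samples one output value "sees" when a 2×2 kernel with a stride of two is applied at every layer, showing that the same kernel-sized features are observed at larger time scales in deeper layers. The kernel size, stride, and function name are assumptions for this sketch.

```python
def receptive_field(layers, kernel=2, stride=2):
    """Input samples per axis seen by one output value after `layers` layers."""
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (kernel - 1) * jump                  # grow by the kernel extent at this scale
        jump *= stride                             # distance between adjacent outputs
    return rf

for n in range(1, 4):
    print(f"after layer {n}: each value sees {receptive_field(n)} input samples per axis")
# after layer 1: 2, after layer 2: 4, after layer 3: 8
```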
In various embodiments in
In some examples, the encoding method may use regular skipping (i.e., skipping a certain number of samples, as in sub-sampling) on the input data or filter values. As a result, different sampling rates on channels, columns or rows may be achieved, which may facilitate capturing frequency properties of the input data within a convolution kernel operation.
The above illustrated embodiments in
An optional display interface 1230 may permit information from the bus 1200 to be displayed on a display device 1235 in a visual, graphic or alphanumeric format. An audio interface and an audio output (such as a speaker) also may be provided. Communications with external devices may occur using various communication devices 1240 such as a transmitter and/or receiver, antenna, an RFID tag and/or short-range or near-field communication circuitry. A communication device 1240 may be attached to a communications network, such as the Internet, a local area network (LAN) or a cellular telephone data network.
The hardware may also include a user interface sensor 1245 that allows for receipt of data from input devices 1250 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device and/or an audio input device, such as a microphone. For example, input device 1250 may include a microphone configured to capture voice input data for loading into the AI chip and generating the time-spectral diagram. Digital image frames may also be received from an image capturing device 1255, such as a video camera or still camera, that can either be built in or external to the system. Other environmental sensors 1260, such as a GPS system and/or a temperature sensor, may be installed on the system and be communicatively accessible by the processor 1205, either directly or via the communication device 1240. The communication ports 1240 may also communicate with the AI chip to upload or retrieve data to/from the chip. For example, the computer system may implement the encoding methods and upload the trained CNN weights or the encoded audio data to the AI chip via the communication port 1240. The communication port 1240 may also communicate with any other interface circuit or device that is designed for communicating with an integrated circuit.
Optionally, the hardware may not need to include a memory, but instead programming instructions may run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, virtual network and applications, and the programming instructions for implementing various functions in the system may be stored on one or more of those virtual machines on the cloud.
Various embodiments described above may be implemented and adapted to various applications. For example, an AI integrated circuit having a cellular neural network architecture may reside in an electronic mobile device. The electronic mobile device may also have a voice or image capturing device, such as a microphone or a video camera, for capturing input audio/video data, and may use the built-in AI chip to generate recognition results. In some scenarios, training for the CNN can be done in the mobile device itself, where the mobile device captures or retrieves training data samples from a database and uses the built-in AI chip to perform the training. In other scenarios, training can be done on a server device or on a cloud. These are only examples of applications in which an AI task can be performed in the AI chip.
The above illustrated embodiments are described in the context of implementing a CNN solution in an AI chip, but can also be applied to various other applications. For example, the current solution is not limited to implementing CNN but can also be applied to other algorithms or architectures inside a chip. The voice encoding methods can still be applied when the bit-width or the number of channels in the chip varies, or when the algorithm changes.
It will be readily understood that the components of the present solution as generally described herein and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of various implementations, as represented in the figures, is not intended to limit the scope of the present disclosure, but is merely representative of various implementations. While the various aspects of the present solution are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The present solution may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the present solution is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present solution should be or are in any single embodiment thereof. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present solution. Thus, discussions of the features and advantages, and similar language, throughout the specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages and characteristics of the present solution may be combined in any suitable manner in one or more embodiments. One ordinarily skilled in the relevant art will recognize, in light of the description herein, that the present solution can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present solution.
Other advantages can be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and all combinations of various embodiments that are within the scope and spirit of the invention as defined in the claims.