This patent document can be exactly reproduced as it appears in the files of the United States Patent and Trademark Office, but the assignee(s) otherwise reserves all rights in any subsets of included original works of authorship in this document protected by 17 USC 102(a) of the U.S. copyright law.
In the following Background, Summary, and Detailed Description, paragraph headings are signifiers that do not limit the scope of an embodiment of a claimed technology (ECIN). The citation or identification of any publication signifies neither relevance nor use as prior art. A paragraph for which the font is all italicized signifies text that exists in one or more patent specifications filed by the assignee(s). A writing enclosed in double quotes (“ ”) signifies an exact copy of a writing that has been expressed as a work of authorship. Signifiers, such as a word or a phrase enclosed in single quotes (‘ ’), signify a term that as of yet has not been defined and that has no meaning to be evaluated for, or has no meaning in that specific use (for example, when the quoted term ‘module’ is first used) until defined.
This disclosure has general significance in the field of artificial intelligence, in particular, significance for the following topics: sequence-to-sequence (Seq2Seq) AI architectural models, natural language processing tasks such as machine translation, text summarization, and question answering. This information is limited to use in the searching of the prior art.
A sequence-to-sequence (Seq2Seq) AI model is an architectural term to define a certain type of neural network architecture that is commonly used for natural language processing tasks such as machine translation, text summarization, and question answering.
It consists of two main components: an encoder, which processes the input sequence, and a decoder, which generates the output sequence. The encoder and decoder are typically implemented using recurrent neural networks (RNNs) or transformer networks.
The encoder takes in the input sequence and produces a fixed-length context vector, which is used by the decoder to generate the output sequence.
Seq2Seq models are trained by providing pairs of input and output sequences and minimizing the difference between the model's predicted output and the true output.
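By way of illustration only, the following is a minimal sketch of the encoder-decoder pattern described above, written in PyTorch; the GRU layers, vocabulary size, and dimensions are illustrative assumptions and are not taken from the disclosure:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden  # fixed-length context vector, shape (1, batch, hidden)

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, context):
        # tgt: (batch, tgt_len); the context vector initializes the decoder state
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)  # logits over the output vocabulary

# Training minimizes the difference between predicted and true output sequences.
encoder, decoder = Encoder(1000), Decoder(1000)
src = torch.randint(0, 1000, (2, 12))
tgt = torch.randint(0, 1000, (2, 9))
logits = decoder(tgt[:, :-1], encoder(src))
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))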
A sequence has a number of tokens. A token is a part of the sequence (e.g., a letter, a word, or part of a word).
A Seq2Seq architecture or model addresses a class of problems in which a prediction is made based on an input sequence of data. One example of a Seq2Seq task is speech transcription, where a word processor creates a printed document based on spoken words.
However, Seq2Seq models impose temporal dependencies, which means that when such models are implemented on hardware processing devices (such as a GPU, CPU, or TPU), there are periods where data must be returned to a host processor to complete the model's processing steps.
Unfortunately, Seq2Seq models may not run efficiently on various compute or accelerator hardware platforms. To illustrate, a hardware processing device runs the encoder in a step-by-step fashion where each word in the sequence is processed to infer what the next word should be. In such processing devices, the output at one time step becomes the input to the next time step.
This relational dependency creates a bottleneck because the architecture works in a step-by-step fashion. In many hardware accelerators, the generated output for each step is sent from the hardware accelerator to a host CPU and then fed back again into the hardware accelerator to perform another inference. That output is then sent back to the host for post processing. A simple intuitive illustration of this process is a car traveling along on a highway for one mile at which time it takes an off-ramp and then, after a requisite stop at the end of the off-ramp, it takes the on-ramp before driving another mile at which time it takes the next off-ramp and on-ramp to continue its drive along the highway. Obviously, this is not an efficient approach to arriving at the intended destination in a timely fashion. Neither is it an efficient approach to performing the inferences and generating an appropriate response.
This Summary, together with any Claims, is a brief set of signifiers for at least one ECIN (which can be a discovery, see 35 USC 100(a); and see 35 USC 100(j)), for use in commerce for which the Specification and Drawings satisfy 35 USC 112.
Instead of using a Seq2Seq model architecture, a preferred embodiment of the claimed technology (ECIN) substitutes a different type of neural network architecture that makes the model efficiently run on a variety of hardware platforms.
More specifically, in one ECIN, the Seq2Seq model is replaced with an equivalent model that does not sacrifice the accuracy of the neural network but eliminates the need to process in a step-by-step fashion.
In one ECIN the equivalent model comprises a neural network architecture for natural language processing. The neural network architecture comprises: a speech-to-text encoder configured to encode an input speech signal; a Bifrost speech recognizable engine configured for processing the encoded speech signal to generate a speech recognized signal corresponding to the input speech signal; and a decoder configured to decode the speech recognized signal and to generate the output sequence.
This Summary does not completely signify any ECIN. While this Summary can signify at least one essential element of an ECIN enabled by the Specification and Figures, the Summary does not signify any limitation in the scope of any ECIN.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The following Detailed Description, Figures, and Claims signify the uses of progress enabled by one or more ECINs. All the Figures are used only to provide knowledge and understanding and do not limit the scope of any ECIN. Such Figures are not necessarily drawn to scale. The Figures can have the same, or similar, reference signifiers in the form of labels (such as alphanumeric symbols, e.g., reference numerals), and can signify a similar or equivalent function or use. Further, reference signifiers of the same type can be distinguished by appending to the reference label a dash and a second label that distinguishes among the similar signifiers. If only the first label is used in the Specification, its use applies to any similar component having the same label irrespective of any other reference labels. A brief list of the Figures is below.
In the Figures, reference signs can be omitted as is consistent with accepted engineering practice; however, a skilled person will understand that the illustrated components are understood in the context of the Figures as a whole, of the accompanying writings about such Figures, and of the embodiments of the claimed technologies.
The Figures and Detailed Description, only to provide knowledge and understanding, signify at least one embodiment of a claimed technology (ECIN). To minimize the length of the Detailed Description, while various features, structures or characteristics can be described together in a single embodiment, they also can be used in other embodiments without being written about. Variations of any of these elements, and modules, processes, machines, systems, manufactures, or compositions disclosed by such embodiments and/or examples are easily used in commerce. The Figures and Detailed Description signify, implicitly or explicitly, advantages and improvements of at least one ECIN for use in commerce.
In the Figures and Detailed Description, numerous specific details can be described to enable at least one ECIN. Any embodiment disclosed herein signifies a tangible form of a claimed technology. To not diminish the significance of the embodiments and/or examples in this Detailed Description, some elements that are known to a skilled person can be combined together for presentation and for illustration purposes and not be specified in detail. To not diminish the significance of these embodiments and/or examples, some well-known processes, machines, systems, manufactures, or compositions are not written about in detail. However, a skilled person can use these embodiments and/or examples in commerce without these specific details or their equivalents. Thus, the Detailed Description focuses on enabling the inventive elements of any ECIN. Where this Detailed Description refers to some elements in the singular tense, more than one element can be depicted in the Figures and like elements are labeled with like numerals.
This means that, for each time step, data arrives at the inputs of each stream at time T1 and result data exits at time T2 for all streams. Each stream performs an equal (or equivalent) amount of compute comprising a sequence of computations. Each sequence of computation may comprise different mathematical operations or similar mathematical operations in a different sequence.
In an embodiment of the present technology, to illustrate, one sequence may comprise the mathematical operations ADD, then MULT, then TRANSPOSE while another stream may compute the mathematical operations of TRANSPOSE, then MULT, then ADD.
In an embodiment of the present technology, each stream performs an identical sequence of mathematical operations.
So, using the highway analogy given above, there is no off-ramp and on-ramp movement of data on the data highway because the data simply moves forward from one end and exits at the other end using five parallel data lanes at the same speed.
Referring still to
In an embodiment of the present technology the compute block 16 comprises a 1D convolution neural network (CNN) that performs dilation parameterization.
A 1D convolutional neural network (CNN) is a type of neural network that is used to process 1-dimensional data, such as time series data. It is composed of one or more convolutional layers, which are responsible for learning the features of the input data. A dilated convolution is a technique that expands the kernel (filter) by inserting holes between its consecutive elements. In simpler terms, it is the same as convolution, but it involves pixel skipping, so as to cover a larger area of the input.
Dilated convolution is a type of convolution operation used in convolutional neural networks (CNNs) that enables the network to have a larger receptive field without increasing the number of parameters.
In a regular convolution operation, a filter of a fixed size slides over the input feature map, and the values in the filter are multiplied with the corresponding values in the input feature map to produce a single output value. The receptive field of a neuron in the output feature map is defined as the area in the input feature map that the filter can “see.” The size of the receptive field is determined by the size of the filter and the stride of the convolution.
In contrast, in a dilated convolution operation, the filter is “dilated” by inserting gaps between the filter values. The dilation rate determines the size of the gaps, and it is a hyperparameter that can be adjusted. When the dilation rate is 1, the dilated convolution reduces to a regular convolution.
The increased dilation rate effectively increases the receptive field of the filter without increasing the number of parameters, because the filter is still the same size, but with gaps between the values. This can be useful in situations where a larger receptive field is needed, but increasing the size of the filter would lead to an increase in the number of parameters and computational complexity.
More specifically, the CNN layer employs a “dilation” parameter to control the spacing between the values in the input that are being convolved.
In a standard 1D convolution, the convolutional filter slides over the input, with a fixed stride, and computes the dot product of the filter weights and the input values at each position. This process is repeated for every position in the input, and the output is a new feature map with a reduced spatial dimensionality.
On the other hand, a 1D convolutional filter with dilation rate 2 is a filter that has gaps of size 2 between its values. This means that the filter considers every other value in the input data, effectively doubling the receptive field of the filter. The output of a 1D convolutional filter with dilation rate 2 is a feature map that represents the learned features of the input data, with a larger receptive field than a regular 1D convolutional filter. More generally, a 1D convolutional filter with dilation rate n is a filter that has gaps of size n between its values.
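By way of illustration only, the following sketch shows the effect of the dilation parameter on a 1D convolution; the channel counts, kernel size, and input length are illustrative assumptions:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 100)  # (batch, channels, time)

conv_regular = nn.Conv1d(16, 32, kernel_size=3, dilation=1)
conv_dilated = nn.Conv1d(16, 32, kernel_size=3, dilation=2)

# Both filters have exactly the same number of parameters ...
assert sum(p.numel() for p in conv_regular.parameters()) == \
       sum(p.numel() for p in conv_dilated.parameters())

# ... but the dilated filter spans 5 input positions instead of 3:
# receptive field = (kernel_size - 1) * dilation + 1
print(conv_regular(x).shape)  # torch.Size([1, 32, 98])
print(conv_dilated(x).shape)  # torch.Size([1, 32, 96])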
Referring still to
In an embodiment of the present technology, referring still to
The kernel size in a convolutional neural network (CNN) refers to the size of the filter that is applied to the input data. The kernel size is a hyperparameter that can be adjusted to control the size of the receptive field of the filter. A larger kernel size will result in a larger receptive field, which means that the filter can “see” more of the input data.
By way of example, the dilation basically controls which part of the input each stream is processing at a particular time. If, in one embodiment, the kernel size of the convolution is 3 in the first stream, then subsequent streams will have a kernel size of 3 but the bigger dilation will cause a different window at a particular time to select the input data for the parameterized convolutions.
In an embodiment of the present technology, referring still to
A multi-headed self-attention block is a module for attention mechanisms that runs through an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension.
This module is commonly used in natural language processing (NLP) tasks such as machine translation, text classification, and sentiment analysis.
Intuitively, multiple attention heads allow for attending to parts of the sequence differently (e.g., longer-term dependencies versus shorter-term dependencies).
In an embodiment of the present technology, referring still to
Because the output of the convolutions is different, the input to the multi-headed self-attention network for each stream will be different. Similarly, the output from each stream is also different, but each stream works on an input of the same size, so the compute in this case is also equal.
Multi-headed self-attention networks are useful for natural language processing tasks because they allow the network to attend to different parts of the input simultaneously and learn different representations for different parts of the input.
More specifically, multi-head self-attention is a technique used in language translation that splits word embedding into multiple chunks, called heads. Each head is passed through a separate set of attention weights.
The attention heads output their own attention filters, which in turn output their own filtered value matrices. Each attention filter zooms in on different combinations of linguistic features.
Here are some steps for using multi-head self-attention in language translation: (i) the word embedding is split into multiple chunks, called heads; (ii) the self-attention is applied on each head; (iii) all outputs are concatenated together into a tensor of shape (batch size, sequence length, values dimensionality); and (iv) the result is passed through one final dense layer.
The optimal number of heads for multi-head attention is typically chosen based on the specific task and the available computational resources. Using more heads allows the model to attend to more aspects of the input sequence in parallel, potentially leading to better performance.
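By way of illustration only, the following is a minimal sketch of a multi-headed self-attention module using the PyTorch nn.MultiheadAttention layer; the embedding dimension, number of heads, and sequence length are illustrative assumptions:

import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8  # illustrative sizes, not taken from the disclosure
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 50, embed_dim)  # (batch, sequence length, embedding)

# Self-attention: the same tensor serves as query, key, and value.
# Each of the 8 heads attends to the sequence independently; the outputs
# are concatenated and passed through a final linear projection.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)   # torch.Size([2, 50, 256])
print(attn_weights.shape)  # torch.Size([2, 50, 50]) (averaged over heads)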
In an embodiment of the present technology, as shown in
In an embodiment of the present technology, more specifically, the feedforward linear layer looks at the five streams and makes one common decision based on the input.
In an embodiment of the present technology, the number of streams and the size of each stream are configurable and can be changed in accordance with each specific problem. Once the problem is defined, for example, a speech recognition problem, the level of accuracy of the speech recognition system can be adjusted simply by increasing the number of streams run in parallel and by increasing the size of each stream by introducing more convolutions and bigger attention blocks. Thus, the model has a larger representational capacity to make predictions on increasingly complex sequences.
In an embodiment of the present technology, model 14 (
In an embodiment of the present technology, model 14 (
For example, in the simplest case, phonemes are treated as letters from A-Z, and the temporal phoneme probability matrix 22 generates a probability that a given letter has been spoken at the first time step, and then generates a probability that a given letter has been spoken at the next time step, and so on.
Thus, the output of the temporal phoneme probability matrix is post processed and converted into actual text.
According to an embodiment of the present technology,
In an embodiment of the present technology, the encoder 24 of
In an embodiment of the present technology, the encoder 24 of
In an embodiment of the present technology, the encoder 24 of
In an embodiment of the present technology, the encoder 24 of
Field Programmable Gate Arrays (FPGAs) are a type of hardware that can be programmed to implement neural networks. FPGAs can be reconfigured to best fit a neural network's processing demands.
Indeed, FPGAs are a flexible, low-latency architecture that enables deep learning acceleration in a power-efficient solution. They provide adaptable, high-performance solutions that can exceed traditional GPUs and CPUs for certain workloads.
FPGAs are made up of many programmable logic blocks such as look-up tables (LUT), memory, and reconfigurable connections. The circuitry inside an FPGA chip is not hard etched, so it can be reprogrammed as needed.
FPGAs are a good choice for specific applications that demand hardware acceleration and real-time processing. For example, FPGA-accelerated large language models can be used for a wide range of natural language processing tasks, including text generation, translation, summarization, and sentiment analysis.
Referring still to
At each time step, the encoder takes in the current word in the input sequence and updates its hidden state. The hidden state captures the information about the input sequence up to that point, which is then used to compute the hidden state at the next time step. This process is repeated until the end of the input sequence is reached.
The encoder's final hidden state is typically used as the context vector that is typically passed to the decoder. The context vector captures the meaning of the entire input sequence, so it provides the decoder with a compact representation of the input that the decoder can use to generate the output sequence. The encoder outputs numerical values that represent the meaning of speech.
Additionally, the encoder also generates a set of hidden states, one for each input token. This set of hidden states is also passed to the decoder, the decoder uses these hidden states to attend to the relevant parts of the input sequence.
The attention mechanism computes a weighted sum of the encoder's hidden states, where the weights are computed based on the similarity between the decoder's current hidden state and the encoder's hidden states. The similarity between the decoder's current hidden state and the encoder's hidden states is determined by using the inner product (or dot product).
This weighted sum is used as input to the decoder at each time step, allowing the decoder to focus on the most relevant parts of the input sequence.
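By way of illustration only, the following sketch computes the dot-product attention weights and weighted sum described above; the hidden size and sequence length are illustrative assumptions:

import torch
import torch.nn.functional as F

hidden = 256
encoder_states = torch.randn(1, 40, hidden)   # one hidden state per input token
decoder_state = torch.randn(1, hidden)        # decoder's current hidden state

# Similarity via inner (dot) product between the decoder state and each encoder state.
scores = torch.einsum("bh,bth->bt", decoder_state, encoder_states)
weights = F.softmax(scores, dim=-1)           # attention weights sum to 1

# Weighted sum of encoder states becomes the decoder's attention input at this step.
context = torch.einsum("bt,bth->bh", weights, encoder_states)
print(context.shape)  # torch.Size([1, 256])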
The encoder is responsible for understanding the inputs, for example in the case of a speech transcription program. The encoder understands speech at a fundamental level, and then the decoder, which runs in a step-by-step manner, converts the understanding of the encoder into actual text. The encoder provides real-time understanding, and the decoder is responsible for converting that understanding into actual output.
As noted earlier, most such Seq2Seq models are very inefficient because they run in a step-by-step fashion. More specifically, the Seq2Seq model runs in a step-by-step fashion to convert the output of the encoder into actual text; this means that the decoder generates one word of the output sequence at a time, based on the context vector produced by the encoder and the previously generated words.
Referring still to
Deploying deep neural networks (DNNs), e.g., when targeting constrained devices, can be prohibitive due to steep computational requirements of state-of-the-art DNN models. To address this issue, an across-stack approach is utilized, with algorithmic improvements giving better accuracy with fewer operations and novel compression techniques reducing model size further.
DNN inference accelerators can bring improvements for the hardware layer of the systems stack, with reconfigurable accelerators such as MAERI and Eyeriss v2 promising improved performance by adjusting logic paths for a given DNN model architecture. However, finding optimal hardware configurations is still an active area of research.
STONNE, a cycle-accurate simulator for DNN accelerators with reconfigurable dataflow patterns, allows researchers to explore the design space of flexible accelerator architectures.
However, it currently requires significant manual effort to use, such as the requirement to rewrite the PyTorch model definition so it can be parsed by the system, as well as being limited to PyTorch support only. Additionally, the mapping tools to generate optimized dataflow configurations are not directly integrated in STONNE, such as mRNA for MAERI, and thus require further manual steps.
To address these usability issues of STONNE, Bifrost, a framework for the end-to-end evaluation and optimization of reconfigurable DNN inference accelerators, is introduced.
As well as automating many of the more tedious manual steps of using STONNE, Bifrost also allows more DNN models to be run, adds a module for automatically generating optimized mappings for reconfigurable accelerators, as well as the ability to leverage existing mapping tools.
Apache Tensor Virtual Machine (TVM) is an open-source compiler framework for machine learning that aims to enable machine learning engineers to optimize and run computations efficiently on any hardware backend. Apache TVM works with deep learning frameworks to provide end-to-end compilation to different backends.
Apache TVM aims to bridge the gap between the creation of machine learning models and launching them into production. It automates the time-consuming work of tuning models to various backend hardware, specifically CPUs, GPUs, and specialized accelerators.
Apache TVM supports runtime bindings for programming languages like Python and Java. There are two main ways to access the features of Apache TVM: the Python API and the C++ API.
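By way of illustration only, the following is a minimal sketch of compiling and running a small PyTorch model through the TVM Python API; the model, input name, target, and shapes are illustrative assumptions, and the calls shown are from the public TVM API:

import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Trace a small PyTorch model so TVM's Relay frontend can ingest it.
model = torch.nn.Linear(80, 32).eval()
example = torch.randn(1, 80)
scripted = torch.jit.trace(model, example)

mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 80))])
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Run the compiled module on the chosen backend.
device = tvm.cpu()
runtime = graph_executor.GraphModule(lib["default"](device))
runtime.set_input("input0", tvm.nd.array(example.numpy()))
runtime.run()
print(runtime.get_output(0).shape)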
Bifrost is built on STONNE and Apache TVM, a state-of-the-art machine learning compiler framework that enables researchers to transparently execute any of the wide range of DNN models compatible with TVM (from frameworks such as PyTorch, TensorFlow, and ONNX) using STONNE. DNN layers not accelerated by the chosen hardware accelerator in STONNE are executed using an implementation from TVM, which allows end-to-end evaluation and easy verification of correctness. Bifrost also extends the autotuning module of TVM, AutoTVM, for design space and dataflow exploration, for example varying tile sizes to reduce clock cycle counts.
Additionally, Bifrost can integrate specialized mapping tools, such as Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator (mRNA) for the Multiply-Accumulate Engine with Reconfigurable Interconnects (MAERI), which may provide more optimal mappings in less time, assuming that a specialized mapping tool is available for the target hardware architecture.
At each step, the Bifrost engine 32 uses the context vector, which represents the entire input sequence, and its own hidden state, which represents the previously generated words, to produce a probability distribution over the vocabulary for the next word in the output sequence.
The word with the highest probability is then chosen and added to the output sequence, and the Bifrost engine 32 hidden state is updated. This process is repeated until the Bifrost engine 32 generates the end-of-sequence token or a maximum length is reached.
Therefore, the Bifrost engine 32 generates the output sequence one word at a time, based on the context vector and its own hidden state, in a step-by-step fashion. This is also known as “auto-regressive” generation, as the Bifrost engine 32 generates the next word based on the previously generated words in the sequence.
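By way of illustration only, the following is a schematic sketch of the auto-regressive generation loop just described; decoder_step is a hypothetical callable standing in for a single decoder step and does not represent the actual Bifrost engine implementation:

import torch

def generate(decoder_step, context, bos_id=1, eos_id=2, max_len=50):
    """Greedy auto-regressive generation: each step conditions on the
    context vector and on the tokens produced so far."""
    tokens = [bos_id]
    hidden = None
    for _ in range(max_len):
        # decoder_step (hypothetical) returns vocabulary logits and an updated hidden state.
        logits, hidden = decoder_step(tokens[-1], context, hidden)
        next_token = int(torch.argmax(logits, dim=-1))
        if next_token == eos_id:  # stop at the end-of-sequence token
            break
        tokens.append(next_token)
    return tokens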
Signal conditioning is a process of data acquisition that manipulates a signal to prepare it for the next stage of processing. A signal conditioner is an instrument that performs this process. It converts one type of electrical or mechanical signal (input-signal) into another (output-signal).
Referring still to
In one embodiment of the present technology, as illustrated in
The increasing reliance of modern industry on artificial intelligence has resulted in a growing demand for specialized microprocessors that perform the tensor calculations (such as vector-matrix and matrix-matrix multiplications) important to many artificial intelligence techniques such as gradient descent techniques for the training of artificial neural networks. Some of these tensor processors perform over one trillion floating point operations (teraflops) per second, which not surprisingly, require large amounts of power.
In one embodiment of the present technology, as illustrated in
The speech-recognized signals generated by Bifrost engine 32 are output to a decoder 30 that efficiently runs a standard algorithm.
In one embodiment of the present technology, decoder 30 of
In one embodiment of the present technology, the decoder 30 of
In one embodiment of the present technology, decoder 30 of
In one embodiment of the present technology,
In one embodiment of the present technology,
The system of
Although a particular configuration of components is described herein, in other embodiments the system of
For example, while
The user device of
In one embodiment, the prediction model is specified as a TensorFlow model, the compiler is a TensorFlow compiler, and the processor is a tensor processor.
In another embodiment, the prediction model is specified as a PyTorch model, the compiler is a PyTorch compiler.
In other embodiments, other machine learning specification languages and compilers are used. For example, in some embodiments, the prediction model defines nodes representing operators (e.g., arithmetic operators, matrix transformation operators, Boolean operators, etc.), tensors representing operands (e.g., values that the operators modify, such as scalar values, vector values, and matrix values, which may be represented in integer or floating-point format), and weight values that are generated and stored in the model after training.
In some embodiments, where the processor is a tensor processor having a functional slice architecture, the compiler generates an explicit plan for how the processor will execute the program, by translating the program into a set of operations that are executed by the processor, specifying when each instruction will be executed, which functional slices will perform the work, and which stream registers will hold the operands. This type of scheduling is known as “deterministic scheduling.” This explicit plan for execution includes information for explicit prediction of excessive power usage by the processor when executing the program.
The assembler of
In some embodiments, the assembler of
The processor of
In one embodiment, the processor is a tensor processor having a functional slice architecture.
In one embodiment, such processor is Groq TSP tensor processor having a functional slice architecture.
In one embodiment, the processor of
In one embodiment of the present technology,
In one embodiment of the present technology,
In one embodiment of the present technology, the processor is an application specific integrated circuit (ASIC) and corresponds to the processor illustrated in
The functional units of processor (also referred to as “functional tiles”) are aggregated into a plurality of functional process units (hereafter referred to as “slices”), each corresponding to a particular function type in some embodiments. For example, different functional slices of the processor correspond to processing units for MEM (memory), VXM (vector execution module), MXM (matrix execution module), NIM (numerical interpretation module), and SXM (switching and permutation module). In other embodiments, each tile may include an aggregation of functional units such as a tile having both MEM and execution units by way of example.
As illustrated in
The instructions in a given instruction queue are executed only by functional units in the queue's associated slice and are not executed by another slice of the processor. In other embodiments, each functional unit has an associated ICU that controls the execution flow of the instructions.
The processor also includes communication lanes to carry data between the functional units of different slices. Each communication lane connects to each of the slices of processor.
In some embodiments, a communication lane that connects a row of functional units of adjacent slices is referred to as a “super-lane,” and comprises multiple data lanes, or “streams,” each configured to transport data values along a particular direction.
For example, in some embodiments, each functional unit of the processor is connected to corresponding functional units on adjacent slices by a super-lane made up of multiple lanes. In other embodiments, the processor includes communication devices, such as a router, to carry data between adjacent functional units.
By arranging the functional units of the processor into different functional slices, the on-chip instruction and control flow of the processor is decoupled from the data flow. Since many types of data are acted upon by the same set of instructions, what is important for visualization is visualizing the flow of instructions, not the flow of data.
In some embodiments,
In some embodiments, the functional units in the same slice execute instructions in a ‘staggered’ fashion where instructions are issued tile-by-tile within the slice over a period of N cycles. For example, the ICU for a given slice may, during a first clock cycle, issue an instruction to a first tile of the slice (e.g., the bottom tile of the slice closest to the ICU), which is passed to subsequent functional units of the slice over subsequent cycles. That is, each row of functional units (corresponding to functional units along a particular super-lane) of the processor executes the same set of instructions, albeit offset in time, relative to the functional units of an adjacent row.
The functional slices of the processor are arranged such that operand data read from a memory slice is intercepted by different functional slices as the data moves across the chip, and results flow in the opposite direction where they are then written back to memory. For example, a first data flow from a first memory slice flows in a first direction (e.g., towards the right), where it is intercepted by a VXM slice that performs a vector operation on the received data. The data flow then continues to an MXM slice which performs a matrix operation on the received data. The processed data then flows in a second direction opposite from the first direction (e.g., towards the left), where it is again intercepted by VXM slice to perform an accumulate operation, and then written back to the memory slice.
In some embodiments, the functional slices of the processor are arranged such that data flow between memory and functional slices occurs in both the first and second directions. For example, a second data flow originating from a second memory slice travels in the second direction, where the data is intercepted and processed by the VXM slice before traveling to the second MXM slice. The results of the matrix operation performed by the second MXM slice then flow in the first direction back towards the second memory slice.
In some embodiments, stream registers are located along a super-lane of the processor. The stream registers are located between functional slices of the processor to facilitate the transport of data (e.g., operands and results) along each super-lane. For example, within the memory region of the processor, stream registers are located between sets of four MEM units. The stream registers are architecturally visible to the compiler and serve as the primary hardware structure through which the compiler has visibility into the program's execution. Each functional unit of the set contains stream circuitry configured to allow the functional unit to read or write to the stream registers in either direction of the super-lane.
In some embodiments, each stream register is implemented as a collection of registers, corresponding to each stream of the super-lane, and sized based upon the basic data type used by the processor (e.g., if the TSP's basic data type is an INT8, each register may be 8-bits wide).
In some embodiments, in order to support larger operands (e.g., FP16 or INT32), multiple registers are collectively treated as one operand, where the operand is transmitted over multiple streams of the super-lane.
All of these functional features (super-lanes of functional units, slices of instruction flow, and handling of different types of integers and floating-point numbers), occurring trillions of times a second, create complicated power flows and possible disruptive power fluctuations that could negatively impact the performance of the processor. However, given the deterministic nature of executions by the processor, any disruptive power fluctuations (such as voltage droop) can be determined before execution of the program, with information (such as processor instructions, and timing for such instructions) about such fluctuations being supplied by the compiler to the processor, for the processor to use during program execution to mitigate the fluctuations.
The present technology provides fast and accurate transcriptions because, unlike traditional transformer-based models that rely on a slow and sequential recurrence network during inference, the Groq Bifrost engine (of
DeepSpeech2 is a set of speech recognition models based on the Baidu DeepSpeech2 model. The preprocessing part takes a raw audio waveform signal and converts it into a log-spectrogram of size (N_timesteps, N_frequency_features). N_timesteps depends on the original audio file's duration, and N_frequency_features can be assigned in the model's configuration file as the “num_audio_features” parameter.
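By way of illustration only, the following sketch converts a raw waveform into a log-spectrogram of size (N_timesteps, N_frequency_features) using torchaudio; the file name, window size, and hop length are illustrative assumptions rather than values mandated by the disclosure:

import torch
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical audio file

# Short-time Fourier transform followed by a log, producing a
# (N_timesteps, N_frequency_features) log-spectrogram.
n_fft = 320        # e.g., 20 ms windows at 16 kHz (assumed)
hop_length = 160   # e.g., 10 ms hop (assumed)
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop_length)(waveform)
log_spec = torch.log(spec + 1e-6).squeeze(0).transpose(0, 1)
print(log_spec.shape)  # (N_timesteps, n_fft // 2 + 1 frequency features)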
The Deep Neural Network (DNN) part produces a probability distribution P_t(c) over vocabulary characters c per each time step t. DeepSpeech2 is trained with CTC loss.
CTC is an algorithm used to train deep neural networks in speech recognition, handwriting recognition and other sequence problems. CTC is used when it is not known a priori how the input aligns with the output (how the characters in the transcript align to the audio).
The Groq Bifrost engine (of
Logits are the outputs of a neural network before the activation function is applied. They are the unnormalized probabilities of the item belonging to a certain class. Logits are often used in classification tasks, where the goal is to predict the class label of an input. Greedy decoding takes the maximum probable character at each instant and uses it as the output. This is a simple implementation that does not consider context.
Connectionist temporal classification (CTC) is a type of neural network output and associated scoring function. CTC is used for training recurrent neural networks (RNNs) to tackle sequence problems where the timing is variable.
Greedy CTC decoding is a decoding method that performs greedy decoding on the logits given as input.
The greedy decoder can predict incorrectly spelled words like “affrayd” and “shoktd.” The transcript with the lexicon-constrained beam search decoder produces a more accurate result consisting of real words.
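By way of illustration only, the following is a minimal greedy CTC decoder over a logits matrix; it collapses repeated symbols and removes the blank symbol, and because it uses no lexicon it can emit misspellings such as those noted above. The toy alphabet and logits are illustrative assumptions:

import numpy as np

def ctc_greedy_decode(logits, alphabet, blank_id=0):
    """logits: (timesteps, num_classes) unnormalized scores."""
    best = np.argmax(logits, axis=-1)          # most probable class per time step
    decoded, previous = [], blank_id
    for idx in best:
        # Collapse repeats and drop blanks, per the CTC decoding rule.
        if idx != previous and idx != blank_id:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)

# Example with a toy 4-symbol alphabet; index 0 is the CTC blank.
alphabet = ["_", "a", "b", "c"]
logits = np.log(np.array([[0.1, 0.8, 0.05, 0.05],
                          [0.1, 0.8, 0.05, 0.05],
                          [0.7, 0.1, 0.1, 0.1],
                          [0.1, 0.1, 0.1, 0.7]]))
print(ctc_greedy_decode(logits, alphabet))  # "ac"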
In one embodiment of the present technology, the decoder 30 of
Thus, the Groq Bifrost engine can avoid and/or mitigate any post-processing transactions or any sort of recurrence during inference.
More specifically, as was explained above, the speech-to-text encoder 24 of
Unlike other platforms based on GPU or CPU processing units, with the disclosed embodiments there is no need to know the length of the speech signals beforehand. Also, when a model is processed by a streaming processor such as the TSP available from Groq, Inc., there is low latency for a smooth user experience.
The speech-to-text encoder 24 of
To implement the engine of the disclosed embodiments on the Groq TSP, there are several steps.
In an embodiment of the present technology,
In an embodiment of the present technology,
In an embodiment of the present technology
In an embodiment of the present technology
In an embodiment of the present technology
The following defines a set of global configuration parameters for a speech recognition model and sets up the environment for training the model.
The configuration parameters include:
The code also sets up the environment for training the model by importing necessary libraries, such as ‘torch’, ‘torchaudio’, ‘numpy’, and ‘ipython’. The code also sets environment variables for GroqFlow, including ‘GROQFLOW_SKIP_SDK_CHECK’, ‘GROQMODEL_BACKEND’, and ‘GROQFLOW_BAKE_SDK’, and sets the seed for consistency.
The function of this code is to define the configuration parameters and set up the environment for training a speech recognition model using GroqFlow.
In this flow graph, the first block represents the import of necessary libraries, including ‘torch’, ‘torchaudio’, ‘numpy’, and ‘ipython’.
Block 2 defines the configuration parameters for the model, including the target batch size, batch size for single GPU, and various other parameters related to the model's architecture and training.
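By way of illustration only, the following is a minimal sketch of the setup described here and in the third and fourth blocks described further below; the specific values assigned to the GroqFlow environment variables and the seed are assumptions, not values taken from the disclosure:

import os
import random

import numpy as np
import torch
import torchaudio  # noqa: F401  (used by the preprocessing blocks below)
import IPython.display as ipd  # noqa: F401

# GroqFlow environment variables named in this description; the assigned
# values are placeholders.
os.environ["GROQFLOW_SKIP_SDK_CHECK"] = "1"
os.environ["GROQMODEL_BACKEND"] = "auto"
os.environ["GROQFLOW_BAKE_SDK"] = "1"

# Seed everything for consistency (the fourth block described below).
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)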
In this section, a model for speech-to-text conversion is defined that consists of several components:
1. Feature Extractor: The first component of the model is a feature extractor, which processes the input audio data and extracts relevant features. The feature extractor used in this implementation is a pre-trained model from the “facebook/s2t-medium-librispeech-asr” repository. The specific model used is a convolutional neural network (CNN) that takes the input audio data and outputs a sequence of feature maps.
2. Tokenizer: The second component of the model is a tokenizer, which converts the input audio data into a sequence of phonemes (units of sound). The tokenizer used in this implementation is a PhonemeTokenizer, which maps the input audio data to a sequence of phonemes based on their acoustic properties.
3. Multi-Stream Model: The third component of the model is a multi-stream model, which processes the sequence of phonemes and outputs a sequence of hidden states. The multi-stream model is defined using the MultiStreamModel class, and it consists of several components:
4. Encoder Model: The fourth component of the model is an encoder model, which processes the input audio data and outputs a sequence of hidden states. The encoder model used in this implementation is a pre-trained Speech2TextForConditionalGeneration model from the “facebook/s2t-medium-librispeech-asr” repository. The encoder model is a transformer-based model that takes the output of the feature extractor and outputs a sequence of hidden states.
5. Language Model Head: The fifth component of the model is a language model head, which predicts the next word in the sequence given the previous words. The language model head used in this implementation is the lm_head attribute of the encoder model. The language model head is a linear layer that takes the output of the encoder model and outputs a probability distribution over the vocabulary of the model.
6. Overall Model: The overall model combines the feature extractor, multi-stream model, encoder model, and language model head. The overall model is defined using the ModifiedS2T class, and it takes the following parameters:
The model is then loaded from a checkpoint using the torch.load function, and the model is set to evaluation mode using the model.eval( ) function.
The flow graph shows the audio data entering the feature extractor, which extracts relevant features from the audio data. The output of the feature extractor is then passed through the tokenizer, which converts the audio data into a sequence of phonemes. The output of the tokenizer is then passed through the multi-stream model, which processes the sequence of phonemes and outputs a sequence of hidden states. The output of the multi-stream model is then passed through the encoder model, which processes the sequence of hidden states and outputs a sequence of hidden states. Finally, the output of the encoder model is passed through the language model head, which predicts the next word in the sequence given the previous words. The overall model combines the feature extractor, multi-stream model, encoder model, and language model head to generate the final predictions.
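By way of illustration only, the components named above might be assembled as sketched below; PhonemeTokenizer, MultiStreamModel, and ModifiedS2T are the class names quoted in this description but are not defined in this document, so their constructors, the argument order, and the checkpoint path are assumptions:

import torch
from transformers import (Speech2TextFeatureExtractor,
                          Speech2TextForConditionalGeneration)

repo = "facebook/s2t-medium-librispeech-asr"

# Pre-trained pieces named in the description above.
feature_extractor = Speech2TextFeatureExtractor.from_pretrained(repo)
s2t = Speech2TextForConditionalGeneration.from_pretrained(repo)

# Class names quoted in the description; constructors and argument order are assumptions.
modified_s2t = ModifiedS2T(feature_extractor=feature_extractor,
                           tokenizer=PhonemeTokenizer(),
                           multi_stream=MultiStreamModel(),
                           encoder=s2t,
                           lm_head=s2t.lm_head)

# Per the description, the trained model is then loaded from a checkpoint with
# torch.load and set to evaluation mode; the checkpoint path is hypothetical.
model = torch.load("bifrost_checkpoint.pt", map_location="cpu")
model.eval()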
The third block sets environment variables for GroqFlow, including: ‘GROQFLOW_SKIP_SDK_CHECK’, ‘GROQMODEL_BACKEND’, and ‘GROQFLOW_BAKE_SDK’.
The Unformatted Third Block of Code.
MAX_SEQ_LEN = 2048

dummy_inputs_encoder = {
    "input_features": torch.rand(1, MAX_SEQ_LEN, 80, dtype=torch.float32),
    "attention_mask": torch.ones(1, MAX_SEQ_LEN, dtype=torch.int32),
}

# prepare dummy inputs to compile bifrost
decoder_model = torch.nn.Sequential(model.decoder, model.lm_head, model.linear)
dummy_inputs_decoder = {
    "input": model.encoder(**dummy_inputs_encoder)[0].permute(0, 2, 1).detach()
}

# groq the model
compiler_flags = ["--ia-mode=all"]
build_name = f"s2t_bifrost"
groq_bifrost_model = groqit(
    decoder_model,
    dummy_inputs_decoder,
    build_name=build_name,
    num_chips=8,
    rebuild="never",
)
This code block is preparing dummy inputs for the encoder and decoder models and then using the Groq compiler to build a Groq model from the given PyTorch model.
The below represents a flow diagram illustrating the functionality of the code:
The flow diagram shows the input data entering the feature extractor, which extracts relevant features from the input data. The output of the feature extractor is then passed through the encoder model, which processes the sequence of phonemes and outputs a sequence of hidden states. The output of the encoder model is then passed through the decoder model, which generates the final predictions.
The code then prepares dummy inputs for the encoder and decoder models by creating random inputs with the same shape and size as the expected inputs. The dummy inputs are created using the ‘torch.rand’ function and are passed through the encoder and decoder models to generate dummy outputs.
The Groq compiler is then used to build a Groq model from the given PyTorch model. The Groq compiler takes the PyTorch model, dummy inputs, and various flags as input and generates a Groq model that can be run on the Groq accelerator. The ‘build_name’ parameter is used to specify the name of the Groq model, and the ‘num_chips’ parameter is used to specify the number of chips to use for the Groq model. The ‘rebuild’ parameter is used to specify whether the Groq model should be rebuilt when the PyTorch model changes.
The resulting Groq model can be used for inference on the Groq accelerator, which can significantly speed up the inference process compared to using the PyTorch model on a CPU.
The fourth block sets the seed for consistency.
Text that is truncated along the right margin of
Text that is truncated along the right margin is shown below:
This code block is responsible for preparing the input data for the speech-to-text model. It loads audio files from a dictionary ‘test_files’ and processes them using a feature extractor and a s2t encoder. The processed data is then stored in a dictionary ‘active_files’.
The code first loops through the keys of ‘test_files’ and loads each audio file using ‘torchaudio.load( )’, which returns the audio data and sample rate. The audio data and sample rate are then passed through a feature extractor function ‘feature_extractor( )’, which takes the audio data and sample rate as input and returns a tuple containing the extracted features and the attention mask.
The code then checks if the length of the extracted features is less than or equal to a maximum sequence length ‘max_seq_len’. If it is, the key is added to a dictionary ‘active_files’. The code then pads the extracted features and attention mask with zeros to make their length equal to ‘max_seq_len’.
The padded features and attention mask are then passed through the s2t encoder ‘model.encoder( )’ to generate the encoder outputs. The encoder outputs are then converted to a tensor and stored in ‘active_files’ under the key “encoder_outputs_cpu”.
Below is a flow diagram illustrating the functionality of this code block:
The flow diagram shows the audio files being loaded from ‘test_files’ and passed through the feature extractor and s2t encoder to generate the encoder outputs, which are then stored in ‘active_files’.
The diagram illustrates the flow of data through the different components of the code block, with each component modifying or processing the data in some way before passing it on to the next component.
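By way of illustration only, the following is a sketch consistent with the preprocessing described above; the feature extractor call signature, the padding details, and the dictionary layout are assumptions:

import torch
import torchaudio

max_seq_len = MAX_SEQ_LEN  # 2048, as defined when compiling the Bifrost decoder above
active_files = {}

for key in test_files.keys():
    # Load the audio data and its sample rate.
    waveform, sample_rate = torchaudio.load(key)

    # Extract features and the attention mask with the feature extractor.
    features = feature_extractor(waveform.squeeze(0).numpy(),
                                 sampling_rate=sample_rate,
                                 return_attention_mask=True,
                                 return_tensors="pt")
    input_features = features["input_features"]
    attention_mask = features["attention_mask"]

    # Keep only clips whose feature length fits within max_seq_len.
    if input_features.shape[1] <= max_seq_len:
        active_files[key] = {}
        pad = max_seq_len - input_features.shape[1]
        input_features = torch.nn.functional.pad(input_features, (0, 0, 0, pad))
        attention_mask = torch.nn.functional.pad(attention_mask, (0, pad))

        # Run the s2t encoder and cache its outputs for the Bifrost decoder,
        # permuted to match the decoder's expected layout (see the dummy inputs above).
        with torch.no_grad():
            encoder_outputs = model.encoder(input_features=input_features,
                                            attention_mask=attention_mask)[0]
        active_files[key]["encoder_outputs_cpu"] = encoder_outputs.permute(0, 2, 1)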
The next code block performs two steps: 1. It runs the Bifrost decoder model on the CPU to generate the decoder outputs for each active file.
2. It decodes the text for each active file using the decoder outputs.
Below is a flow diagram illustrating the functionality of this code block:
The flow diagram shows the active files being passed through the Bifrost decoder model to generate the decoder outputs, which are then used to decode the text for each active file. The decoded text is then printed to the console.
Here's a technical description of the functionality of each line of the code block:
1. ‘for key in active_files.keys( ):’: This line loops through the keys of the ‘active_files’ dictionary.
2. ‘active_files[key][“bifrost_outputs_cpu”]=decoder_model(active_files[key][“encoder_outputs_cpu”])’: This line runs the Bifrost decoder model on the CPU to generate the decoder outputs for the current active file. The decoder outputs are stored in the ‘bifrost_outputs_cpu’ key of the ‘active_files’ dictionary.
3. ‘for key in active_files.keys( ):’: This line loops through the keys of the ‘active_files’ dictionary again.
4. ‘ipd.display(ipd.Audio(key))’: This line displays the audio file associated with the current key using the IPython display system.
5. ‘active_files[key][“text”]=decoder.decode(active_files[key][“bifrost_outputs_cpu”].squeeze(0).detach( ).numpy( ), beam_width=500)’: This line decodes the text for the current active file using the decoder outputs generated by the Bifrost decoder model. The decoded text is stored in the ‘text’ key of the ‘active_files’ dictionary.
6. ‘print(active_files[key][“text”])’: This line prints the decoded text for the current active file to the console.
The code block can be summarized as follows: it runs the Bifrost decoder model on the CPU to generate decoder outputs for each active file, and then decodes the text for each active file using the decoder outputs. The decoded text is then printed to the console.
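For readability, the lines quoted above assemble into the following block (straight quotes and formatting added; the statements are as quoted, and the nesting under each loop is inferred from the description):

for key in active_files.keys():
    # Run the Bifrost decoder model on the CPU for each active file.
    active_files[key]["bifrost_outputs_cpu"] = decoder_model(
        active_files[key]["encoder_outputs_cpu"])

for key in active_files.keys():
    # Play back the audio and decode the logits with the beam-search decoder.
    ipd.display(ipd.Audio(key))
    active_files[key]["text"] = decoder.decode(
        active_files[key]["bifrost_outputs_cpu"].squeeze(0).detach().numpy(),
        beam_width=500)
    print(active_files[key]["text"])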
The Unformatted Sixth Block of Code.
# import os
os.environ['PYTHONPATH'] = '/opt/groq/runtime/site-packages/'

# run bifrost on groq
bifrost_outputs_groq = groq_bifrost_model.run_abunch(
    [{"input": active_files[key]["encoder_outputs_cpu"]} for key in active_files.keys()]
)
This code block is responsible for running the Bifrost decoder model on the Groq accelerator to generate the decoder outputs for each active file.
The below is a technical description of the functionality of this code block:
1. ‘os.environ[‘PYTHONPATH’]=‘/opt/groq/runtime/site-packages/’’: This line sets the Python path environment variable to include the Groq runtime site-packages directory. This allows the code to import the Groq Bifrost model.
2. ‘bifrost_outputs_groq=groq_bifrost_model.run_abunch([{“input”: active_files[key][“encoder_outputs_cpu”]} for key in active_files.keys( )])’: This line runs the Groq Bifrost model on the Groq accelerator to generate the decoder outputs for each active file. The input to the model is the encoder outputs for each active file, which are stored in the ‘encoder_outputs_cpu’ key of the ‘active_files’ dictionary. The output of the model is a list of decoder outputs, which are stored in the ‘bifrost_outputs_groq’ variable.
Below is a flow diagram illustrating the functionality of this code block:
The flow diagram shows the active files being passed through the Groq Bifrost model to generate the decoder outputs. The decoder outputs are then stored in the ‘bifrost_outputs_groq’ variable.
Below is a summary of the functionality of this code block: it runs the Bifrost decoder model on the Groq accelerator to generate the decoder outputs for each active file. The decoder outputs are then stored in the ‘bifrost_outputs_groq’ variable.
1. ‘os.environ[‘PYTHONPATH’]=‘/opt/groq/runtime/site-packages/’’: This line sets the ‘PYTHONPATH’ environment variable to the directory containing the Groq runtime site-packages.
2. ‘bifrost_outputs_groq=groq_bifrost_model.run_abunch([{“input”: active_files[key][“encoder_outputs_cpu”]} for key in active_files.keys( )])’: This line runs the Bifrost decoder model on the Groq accelerator to generate the decoder outputs for each active file. The input to the model is the encoder outputs for each active file, which are stored in the ‘encoder_outputs_cpu’ key of the ‘active_files’ dictionary. The output of the model is a list of decoder outputs, which are stored in the ‘bifrost_outputs_groq’ variable.
3. ‘for key, bifrost_output_groq in zip(active_files.keys( ), bifrost_outputs_groq):’: This line loops over the keys of the ‘active_files’ dictionary and the corresponding decoder outputs in the ‘bifrost_outputs_groq’ list.
4. ‘active_files[key][“bifrost_outputs_groq”]=bifrost_output_groq’: This line stores the decoder outputs for each active file in the ‘bifrost_outputs_groq’ key of the ‘active_files’ dictionary.
5. ‘for key in active_files.keys( ):’: This line loops over the keys of the ‘active_files’ dictionary.
6. ‘active_files[key][“text”]=decoder.decode(active_files[key][“bifrost_outputs_groq”].squeeze(0).detach( ).numpy( ), beam_width=500)’: This line runs the decoder on the decoder outputs for each active file to generate the final decoded text. The decoder outputs are first squeezed to remove any unnecessary dimensions, then detached from the computation graph so that gradients are not tracked, and finally converted to a numpy array. The ‘decoder.decode( )’ function takes the decoder outputs and the beam width as input and returns the decoded text.
7. ‘print(active_files[key][“text”])’: This line prints the decoded text for each active file to the console.
The flow diagram shows the active files being passed through the Bifrost decoder model to generate the decoder outputs, which are then passed through the decoder to generate the final decoded text. The decoded text is then printed to the console.
This code block runs the Bifrost decoder model on the Groq accelerator to generate the decoder outputs for each active file, and then runs the decoder on the decoder outputs to generate the final decoded text. The decoded text is then printed to the console.
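For readability, lines 3 through 7 quoted above assemble into the following continuation of the sixth block (straight quotes and formatting added; the nesting under each loop is inferred from the description):

for key, bifrost_output_groq in zip(active_files.keys(), bifrost_outputs_groq):
    # Store the Groq decoder outputs for each active file.
    active_files[key]["bifrost_outputs_groq"] = bifrost_output_groq

for key in active_files.keys():
    # Decode the Groq outputs into text with the beam-search decoder.
    active_files[key]["text"] = decoder.decode(
        active_files[key]["bifrost_outputs_groq"].squeeze(0).detach().numpy(),
        beam_width=500)
    print(active_files[key]["text"])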
The final block defines the model and optimizer and trains the model.
A comparison of Bifrost-Speech2Text running on Groq hardware and GPUs is a natural point of interest. However, it is suspected that customers may be more interested in evaluating the performance of “Bifrost-Speech2Text running on Groq hardware” versus “Original Speech2Text running on GPUs.”
In an enterprise environment, streaming and stateful transcription of audio are crucial. To simulate this setting, 20 seconds of audio are concatenated in batches, resulting in a total audio length exceeding two days. The compute performance is evaluated across three dimensions:
1. Low latency for a seamless user experience, equivalent to the transcription delay; the time it takes for text to appear as you speak.
2. High throughput, as enterprise systems must support hundreds of millions of queries per second. A denial-of-service (DOS) attack in an enterprise setting is particularly detrimental.
Due to the incompatibility of the Speech2Text encoder with Groq hardware (output mismatch), Bifrost is run on Groq with the Speech2Text encoder running on a GPU. Additionally, the performance of the entire Bifrost-Speech2Text model running on Groq hardware is estimated by comparing the speedup of Bifrost against the GPU and applying the same factor to the encoder. To optimize time efficiency, the performance numbers of the models across the three device types have been preemptively logged. The code below is used to present these numbers.
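By way of illustration only, the following is a minimal sketch of presenting such pre-logged numbers; the model and metric names are assumptions, and the measurements themselves are not reproduced here and must be filled in from the logged results:

# The numbers themselves are not reproduced in this document; fill them in
# from the logged measurements before running.
logged_results = {
    "Bifrost-Speech2Text on Groq": {"latency_ms": None, "throughput_qps": None},
    "Bifrost-Speech2Text on GPU":  {"latency_ms": None, "throughput_qps": None},
    "Original Speech2Text on GPU": {"latency_ms": None, "throughput_qps": None},
}

print(f"{'model / device':<30}{'latency (ms)':>14}{'throughput (qps)':>18}")
for name, metrics in logged_results.items():
    print(f"{name:<30}{str(metrics['latency_ms']):>14}"
          f"{str(metrics['throughput_qps']):>18}")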
The Detailed Description signifies in isolation the individual features, structures, functions, or characteristics described herein and any combination of two or more such features, structures, functions or characteristics, to the extent that such features, structures, functions or characteristics or combinations thereof are enabled by the Detailed Description as a whole in light of the knowledge and understanding of a skilled person, irrespective of whether such features, structures, functions or characteristics, or combinations thereof, solve any problems disclosed herein, and without limitation to the scope of the Claims of the patent. When an ECIN comprises a particular feature, structure, function, or characteristic, it is within the knowledge and understanding of a skilled person to use such feature, structure, function, or characteristic in connection with another ECIN whether or not explicitly described, for example, as a substitute for another feature, structure, function, or characteristic.
In view of the Detailed Description, a skilled person will understand that many variations of any ECIN can be enabled, such as function and structure of elements, described herein while being as useful as the ECIN. One or more elements of an ECIN can be substituted for one or more elements in another ECIN, as will be understood by a skilled person. Writings about any ECIN signify its use in commerce, thereby enabling other skilled people to similarly use this ECIN in commerce.
This Detailed Description is fitly written to provide knowledge and understanding. It is neither exhaustive nor limiting of the precise structures described but is to be accorded the widest scope consistent with the disclosed principles and features. Without limitation, any and all equivalents described, signified or Incorporated by Reference (or explicitly incorporated) in this patent application are specifically incorporated into the Detailed Description. In addition, any and all variations described, signified, or incorporated with respect to any one ECIN also can be included with any other ECIN. Any such variations include both currently known variations as well as future variations, for example any element used for enablement includes a future equivalent element that provides the same function, regardless of the structure of the future equivalent element.
This application claims the benefit of priority to U.S. Provisional Application No. 63/481,224, filed Jan. 24, 2023, and entitled “MODEL SUBSTITUTION FOR EFFICIENT DEVELOPMENT OF A SOLUTION,” the entirety of which is expressly incorporated herein by reference.
Number | Date | Country
63/481,224 | Jan. 24, 2023 | US