The present application claims the benefit of priority from the commonly owned Greece Provisional Patent Application No. 20220100243, filed Mar. 21, 2022, the contents of which are expressly incorporated herein by reference in their entirety.
The present disclosure is generally related to encoding and/or decoding data.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets, and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice packets, data packets, or both, over wired or wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.
One common use of such wireless devices is communications (e.g., voice, video, and/or data communications). In wireless communications, a device that has data to send generates a signal that represents the data as a set of bits. Often, the signal also includes other information, such as packet headers. Because wireless devices are often power constrained (e.g., battery powered) and because wireless communications resources (e.g., radiofrequency channels) can be crowded, it may be desirable to send particular data using as few bits as possible. However, many techniques for representing data using fewer bits are lossy. That is, encoding the data to be transmitted using fewer bits leads to a less faithful representation of the data. Thus, there may be tension between a goal of sending a higher fidelity representation of the data to be transmitted (e.g., using more bits) and sending data efficiently (e.g., using fewer bits).
According to a particular aspect, a device includes a memory and one or more processors coupled to the memory. The one or more processors are operably configured to generate a first input data state for data samples in a time series of data samples of a portion of an audio data stream. The one or more processors are also operably configured to provide the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The one or more processors are further operably configured to generate a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet.
According to another particular aspect, a method includes generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream. The method also includes providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The method further includes generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet.
According to another particular aspect, an apparatus includes means for generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream. The apparatus also includes means for providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The apparatus further includes means for generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet.
According to another particular aspect, a non-transitory computer-readable medium stores instructions executable by one or more processors to generate a first input data state for data samples in a time series of data samples of a portion of an audio data stream. Execution of the instructions also causes the one or more processors to provide the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. Execution of the instructions further causes the one or more processors to generate a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet.
According to another particular aspect, a device includes a memory and one or more processors coupled to the memory. The one or more processors are operably configured to receive, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The one or more processors are also operably configured to generate a reconstructed first data sample based on the first output data state. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. The one or more processors are further operably configured to generate a reconstructed second data sample based on the second output data state. The reconstructed second data sample corresponds to a second data sample in the time series of data samples.
According to another particular aspect, a method includes receiving, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The method also includes generating a reconstructed first data sample based on the first output data state. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. The method further includes generating a reconstructed second data sample based on the second output data state. The reconstructed second data sample corresponds to a second data sample in the time series of data samples.
According to another particular aspect, an apparatus includes means for receiving a packet that includes a first encoded frame bundled with a second encoded frame. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. The apparatus also includes means for generating a reconstructed first data sample based on the first output data state. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. The apparatus further includes means for generating a reconstructed second data sample based on the second output data state. The reconstructed second data sample corresponds to a second data sample in the time series of data samples.
According to another particular aspect, a non-transitory computer-readable medium stores instructions executable by one or more processors to receive, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate. According to one implementation, the first bottleneck and the second bottleneck can correspond to a common bottleneck that is operable to dynamically change a bitrate from the first bitrate to the second bitrate. Execution of the instructions also causes the one or more processors to generate a reconstructed first data sample based on the first output data state. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. Execution of the instructions further causes the one or more processors to generate a reconstructed second data sample based on the second output data state. The reconstructed second data sample corresponds to a second data sample in the time series of data samples.
An encoder can be used to encode data samples into frames that are transmitted to a receiving device. In some scenarios, usage of an encoder can be inefficient because frames are not bundled in a packet. For example, a separate packet (including a large header section) can be used to send each encoded frame to the receiving device. Using a separate packet for each frame results in many bits being allocated to headers as opposed to data portions of the frames, which is inefficient. In other scenarios, where frames encoded using a feedback recurrent autoencoder (FRAE) are bundled into a packet, the FRAE typically uses the same number of bits to encode each frame in the packet. Using the same number of bits to encode each frame in the packet can also result in allocation of a relatively large number of bits, which is inefficient and can result in an increased transmission bandwidth.
Aspects disclosed herein enable a bundled multi-rate feedback autoencoder to encode frames that are bundled into a packet while selectively allocating a different number of bits to each frame. For example, the aspects disclosed herein exploit redundant data in frames bundled into a packet by allocating a relatively large number of bits to a reference frame and allocating a smaller number of bits to other frames in the packet (e.g., “predicted frames”) that can be reconstructed using data from the reference frame. To allocate a different number of bits to different frames and encode the frames at a similar frequency to facilitate packetizing the frames, data associated with the reference frames can be encoded at a different bitrate than data associated with the predicted frames. For example, the bundled multi-rate feedback autoencoder can include a multi-rate bottleneck layer that encodes data at different bitrates. To illustrate, a first bottleneck can encode data for a first frame at a first bitrate and a second bottleneck can encode data for a second frame at a second bitrate that is greater than the first bitrate. By encoding data for the second frame at a higher bitrate, a greater number of bits can be allocated to the second frame while encoding each frame at a similar frequency (e.g., a packet frequency). Thus, the techniques described herein enable frames encoded using a bundled multi-rate feedback autoencoder to be bundled into a packet, which improves header efficiency. Additionally, the techniques described herein enable a different number of bits to be allocated to each frame within the packet, which improves transmission bandwidth.
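To make the bit accounting concrete, the following sketch compares sending each frame in its own packet against bundling multi-rate frames into one packet. It is a minimal illustration in Python, and every size in it (the header length, M, and N) is a hypothetical value chosen for this example rather than a number specified by this disclosure.

    # Hypothetical sizes for illustration only.
    HEADER_BITS = 320            # e.g., a 40-byte header per packet
    FRAMES_PER_PACKET = 5

    # Equal-rate coding, one packet per frame: every frame pays for a header.
    equal_bits_per_frame = 64
    separate_packets = FRAMES_PER_PACKET * (HEADER_BITS + equal_bits_per_frame)

    # Bundled multi-rate coding: one header, one large reference frame (M bits),
    # and smaller predicted frames (N bits), with M >> N.
    M, N = 64, 16
    bundled_packet = HEADER_BITS + M + (FRAMES_PER_PACKET - 1) * N

    print(separate_packets)      # 1920 bits
    print(bundled_packet)        # 448 bits

Under these assumed sizes, bundling amortizes the header across five frames, and the multi-rate allocation shrinks four of the five payloads, illustrating both of the efficiency gains described above.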
Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, smoothing, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from another component, block, or device), and/or retrieving (e.g., from a memory register or an array of storage elements).
Unless expressly limited by its context, the term “producing” is used to indicate any of its ordinary meanings, such as calculating, generating, and/or providing. Unless expressly limited by its context, the term “providing” is used to indicate any of its ordinary meanings, such as calculating, generating, and/or producing. Unless expressly limited by its context, the term “coupled” is used to indicate a direct or indirect electrical or physical connection. If the connection is indirect, there may be other blocks or components between the structures being “coupled.” For example, a loudspeaker may be acoustically coupled to a nearby wall via an intervening medium (e.g., air) that enables propagation of waves (e.g., sound) from the loudspeaker to the wall (or vice-versa).
The term “configuration” may be used in reference to a method, apparatus, device, system, or any combination thereof, as indicated by its particular context. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (ii) “equal to” (e.g., “A is equal to B”). In case (i), where “A is based on B” includes “A is based on at least B,” this may include a configuration in which A is coupled to B. Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.” The term “at least one” is used to indicate any of its ordinary meanings, including “one or more”. The term “at least two” is used to indicate any of its ordinary meanings, including “two or more”.
The terms “apparatus” and “device” are used generically and interchangeably unless otherwise indicated by the particular context. Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” may be used to indicate a portion of a greater configuration. The term “packet” may correspond to a unit of data that includes a header portion and a payload portion. Any incorporation by reference of a portion of a document shall also be understood to incorporate definitions of terms or variables that are referenced within the portion, where such definitions appear elsewhere in the document, as well as any figures referenced in the incorporated portion.
As used herein, the term “communication device” refers to an electronic device that may be used for voice and/or data communication over a wireless communication network. Examples of communication devices include speaker bars, smart speakers, cellular phones, personal digital assistants (PDAs), handheld devices, headsets, wireless modems, laptop computers, personal computers, etc.
Particular aspects are described herein with reference to the drawings. In the description, common features are designated by common reference numbers throughout the drawings. In some drawings, multiple instances of a particular type of feature are used. Although these features are physically and/or logically distinct, the same reference number is used for each, and the different instances are distinguished by addition of a letter to the reference number. When the features as a group or a type are referred to herein (e.g., when no particular one of the features is being referenced), the reference number is used without a distinguishing letter. However, when one particular feature of multiple features of the same type is referred to herein, the reference number is used with the distinguishing letter. For example, referring to FIG. 1, multiple bottlenecks are illustrated and associated with reference numbers 108A, 108B, 108C, 108D, and 108E. When referring to a particular one of these bottlenecks, such as the bottleneck 108A, the distinguishing letter “A” is used. However, when referring to any arbitrary one of these bottlenecks or to these bottlenecks as a group, the reference number 108 is used without a distinguishing letter.
According to one implementation, the system 100 corresponds to a bundled multi-rate feedback autoencoder architecture. The system 100 includes an encoder portion 180A and a decoder portion 190A that provides feedback to the encoder portion 180A. The encoder portion 180A of the system 100 includes one or more frontend neural network preprocessing layers 102, a bidirectional gated recurrent unit (GRU) layer 105, and a bottleneck layer 107.
The bidirectional GRU layer 105 includes a bidirectional GRU network that is implemented over time instances 106A-106E. For example, the bidirectional GRU layer 105 includes a bidirectional GRU network that is implemented over a time instance 106A, a time instance 106B, a time instance 106C, a time instance 106D, and a time instance 106E. Although the bidirectional GRU network of the bidirectional GRU layer 105 is illustrated as being implemented at five time instances 106A-106E, in other implementations, the bidirectional GRU network of the bidirectional GRU layer 105 can be implemented at fewer time instances 106 or at more time instances 106. As a non-limiting example, the bidirectional GRU network of the bidirectional GRU layer 105 can be implemented at four (4) time instances 106. As another non-limiting example, the bidirectional GRU network of the bidirectional GRU layer 105 can be implemented at six (6) time instances 106.
According to some implementations, the frontend neural network preprocessing layer(s) 102 can include frontend GRUs, as opposed to fully connected layers, and the number of outputs of the frontend neural network preprocessing layer(s) 102 can differ from the number of data samples 120 provided to the frontend neural network preprocessing layer(s) 102. For example, the frontend neural network preprocessing layer(s) 102 can summarize the inputs (e.g., the data samples 120) or up-sample the inputs based on the number of times the frontend GRUs (within the frontend neural network preprocessing layer(s) 102) are tapped. In these scenarios, there may not be a one-to-one correspondence between the data samples 120A-120E and the time instances 106A-106E.
The bottleneck layer 107 includes a plurality of bottlenecks 108A-108E. For example, the bottleneck layer 107 includes a bottleneck 108A, a bottleneck 108B, a bottleneck 108C, a bottleneck 108D, and a bottleneck 108E. Although five bottlenecks 108 are illustrated, in other implementations, the bottleneck layer 107 can include fewer (or additional) bottlenecks 108. As a non-limiting example, the bottleneck layer 107 can include four (4) bottlenecks 108. As another non-limiting example, the bottleneck layer 107 can include six (6) bottlenecks 108. The architecture for the bottlenecks 108 is described in greater detail with respect to FIG. 3.
The decoder portion 190A of the system 100 includes a bidirectional GRU layer 109 and one or more backend neural network postprocessing layers 112. The bidirectional GRU layer 109 includes a bidirectional GRU network that is implemented over time instances 110A-110E. For example, the bidirectional GRU layer 109 includes a bidirectional GRU network that is implemented over a time instance 110A, a time instance 110B, a time instance 110C, a time instance 110D, and a time instance 110E. Although the bidirectional GRU network of the bidirectional GRU layer 109 is illustrated as being implemented at five time instances 110A-110E, in other implementations, the bidirectional GRU network of the bidirectional GRU layer 109 can be implemented at fewer time instances 110 or at more time instances 110. As a non-limiting example, the bidirectional GRU network of the bidirectional GRU layer 109 can be implemented at four (4) time instances 110. As another non-limiting example, the bidirectional GRU network of the bidirectional GRU layer 109 can be implemented at six (6) time instances 110. According to some implementations, the backend neural network postprocessing layer(s) 112 can include backend GRUs, as opposed to fully connected layers, and the number of outputs of the backend neural network postprocessing layer(s) 112 can differ from the inputs provided to the backend neural network postprocessing layer(s) 112. In these scenarios, there may not be a one-to-one correspondence between the reconstructed data samples 126A-126E and the time instances 110A-110E.
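The encoder-side stack described above can be sketched in a few lines. This is a minimal sketch assuming hypothetical layer sizes, a single-layer GRU, and a linear frontend; the variable names mirror the reference numerals (layers 102 and 105, data samples 120, input data states 122) but do not represent the actual trained model of this disclosure.

    import torch
    import torch.nn as nn

    SAMPLE_DIM, HIDDEN_DIM, TIME_STEPS = 32, 64, 5   # hypothetical dimensions

    frontend_102 = nn.Linear(SAMPLE_DIM, HIDDEN_DIM)           # preprocessing layer(s) 102
    encoder_gru_105 = nn.GRU(HIDDEN_DIM, HIDDEN_DIM,
                             batch_first=True, bidirectional=True)  # GRU layer 105

    data_samples_120 = torch.randn(1, TIME_STEPS, SAMPLE_DIM)  # samples 120A-120E
    preprocessed = frontend_102(data_samples_120)
    input_states_122, _ = encoder_gru_105(preprocessed)        # states 122A-122E
    print(input_states_122.shape)  # torch.Size([1, 5, 128]); 2x hidden size (bidirectional)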
A data stream that includes data arranged in a time series can be provided to the encoder portion 180A of the system 100. For example, the data stream can include a time series of data samples 120, where each data sample 120 represents a time-windowed portion of data. Although described as “data samples,” in other implementations, each data sample 120 can correspond to a “frame” of the data stream. As illustrated in FIG. 1, the time series includes a data sample 120A, a data sample 120B, a data sample 120C, a data sample 120D, and a data sample 120E. Because neighboring data samples 120 in the time series can be similar to one another, the data samples 120 can exhibit temporal redundancies.
To reduce the amount of bits that are used to encode the data samples 120, the system 100 can select a data sample (e.g., a reference frame data sample) that is to be encoded into a reference frame (e.g., the frame 420A) and can select data samples (e.g., predicted frame data samples) that are to be encoded into predicted frames (e.g., the frames 420B-420E). For ease of illustration and description, reference frames and data samples associated with reference frames are shaded in gray. The above-described temporal redundancies are exploited during encoding and decoding of the data samples 120 by using the reference frame to predict the other frames.
To illustrate, the system 100 can designate M bits to encode the reference frame data sample and can designate N bits to encode each of the predicted frame data samples, where M is significantly greater than N (e.g., M>>N). For illustrative purposes, the data sample 120A is designated as the reference frame data sample, indicated by the gray shading, and the data samples 120B-120E are designated as the predicted frame data samples. Thus, in the example of FIG. 1, M bits are used to encode the data sample 120A into the reference frame, and N bits are used to encode each of the data samples 120B-120E into the predicted frames.
However, to account for the increased number of bits used to encode the reference frame data sample 120A, the reference frame data sample 120A has to be encoded at an increased bitrate (compared to the bitrate of the predicted frame data samples 120B-120E) if each of the data samples 120A-120E is to be encoded at the same frequency (e.g., a packet frequency). Thus, as described below, the bottleneck 108A associated with the reference frame data sample 120A can operate at a higher bitrate than the bottlenecks 108B-108E associated with the predicted frame data samples 120B-120E such that each data sample 120 is encoded into a frame at the same frequency (e.g., the packet frequency) so as to facilitate bundling the corresponding frames 420A-420E into a packet 430, as illustrated in FIG. 4.
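As a worked example of this rate relationship, suppose (hypothetically) that each data sample 120 covers a 20 ms window, that M = 64 bits, and that N = 16 bits. Every bottleneck then emits exactly one frame per window, but the per-bottleneck bitrates differ by a factor of four:

    # Illustrative arithmetic only; the frame duration and bit budgets are assumptions.
    frame_duration_s = 0.02          # assume 20 ms per data sample
    M_bits, N_bits = 64, 16          # reference vs. predicted frame budgets

    ref_bitrate = M_bits / frame_duration_s     # 3200 bits/s for bottleneck 108A
    pred_bitrate = N_bits / frame_duration_s    # 800 bits/s for bottlenecks 108B-108E
    print(ref_bitrate, pred_bitrate)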
The data samples 120A-120E are provided to the one or more frontend neural network preprocessing layers 102. In some implementations, the one or more frontend neural network preprocessing layers 102 include one or more fully connected layers. As described herein, a “fully connected layer” is a feed-forward neural network that includes multiple input nodes and generates one or more outputs based on weighting and mapping functions. According to some implementations, a fully connected layer can include multiple node levels (e.g., input level nodes, intermediate level nodes, and output level nodes) that have unique weighting and mapping patterns. For ease of explanation, the fully connected layers are described as receiving one or more inputs and generating one or more outputs based on neural network operations. However, it should be understood that the architecture of each fully connected layer described herein can be unique and can have unique weighting and mapping patterns so as to generate simple or complex neural networks. The one or more frontend neural network preprocessing layers 102 can receive the data samples 120A-120E and generate corresponding neural network encoded data that is provided to the bidirectional GRU layer 105.
The bidirectional GRU layer 105 is configured to generate an input data state 122A-122E at each time instance 106A-106E, respectively. The input data states 122A-122E are provided to corresponding bottlenecks 108A-108E of the bottleneck layer 107. As a non-limiting example, the bidirectional GRU layer 105 is configured to generate an input data state 122A associated with the data sample 120A at the time instance 106A, the bidirectional GRU layer 105 is configured to generate an input data state 122B associated with the data sample 120B at the time instance 106B, the bidirectional GRU layer 105 is configured to generate an input data state 122C associated with the data sample 120C at the time instance 106C, the bidirectional GRU layer 105 is configured to generate an input data state 122D associated with the data sample 120D at the time instance 106D, and the bidirectional GRU layer 105 is configured to generate an input data state 122E associated with the data sample 120E at the time instance 106E.
The bidirectional GRU layer 105 can also generate the respective input data states 122A-122E based on data associated with previous time steps and future time steps. For example, the bidirectional GRU layer 105 can access data states from neighboring time instances 106 to generate each respective input data state 122A-122E. Thus, the bidirectional GRU layer 105 can generate the input data states 122A-122E in a manner that accounts for data states associated with the other data samples 120.
Additionally, the input data states 122A-122E can be generated based on feedback 150 (e.g., an output data state 124) from previously decoded data samples. To illustrate, the decoder portion 190A of the system 100 can provide feedback 150 to the bidirectional GRU layer 105. The feedback 150 can include data states (e.g., output data states) of previously decoded packets. As a result, the bidirectional GRU layer 105 can encode data associated with the data samples 120A-120E (e.g., the outputs of the one or more frontend neural network preprocessing layers 102) in a manner that accounts for previously encoded/decoded data samples. Although not illustrated in FIG. 1, the feedback 150 can include a left data state and a right data state from a previously decoded frame, as further described with respect to FIG. 5.
In FIG. 1, the bottleneck layer 107 operates as a multi-rate bottleneck layer that encodes data at different bitrates.
The bottlenecks 108B-108E are associated with a first bitrate and the bottleneck 108A is associated with a second bitrate that is greater than the first bitrate. That is, because more bits are used to encode the data sample 120A into a reference frame (e.g., the frame 420A) than to encode the data samples 120B-120E into predicted data frames (e.g., the frames 420B-420E), the bitrate of the bottleneck 108A associated with the data sample 120A is higher than the bitrate of the bottlenecks 108B-108E associated with the other data samples 120B-120E. Although illustrated as four separate bottlenecks, the bottlenecks 108B-108E can be a single bottleneck that encodes the input data states 122B-122E at different time instances. According to one implementation, to reduce the bitrate of the bottlenecks 108B-108E compared to the bitrate of the bottleneck 108A, a smaller number of bits (e.g., units) is allocated to latent codes generated at the bottlenecks 108B-108E compared to the number of bits (e.g., units) allocated to latent codes generated at the bottleneck 108A, as further described with respect to FIG. 3.
The bottleneck 108A is configured to generate an output data state 124A based on the input data state 122A. According to one implementation and as further described with respect to FIG. 3, the bottleneck 108A can quantize a latent representation of the input data state 122A to generate the output data state 124A. In a similar manner, the bottlenecks 108B-108E are configured to generate output data states 124B-124E based on the input data states 122B-122E, respectively.
The decoder portion 190A of the system 100 is configured to reconstruct the data samples 120A-120E based on the output data states 124A-124E. To illustrate, the output data states 124 are provided to the bidirectional GRU layer 109. The bidirectional GRU layer 109 can use the feedback 150 (e.g., an output data state from a previous packet) to initialize the bidirectional GRU layer 109. Based on the feedback 150, the bidirectional GRU layer 109 can perform decoding operations on the output data states 124 to generate outputs that are provided to the one or more backend neural network postprocessing layers 112. Operation of the bidirectional GRU layer 109 is described in greater detail with respect to FIG. 5.
It should be appreciated that the system 100 of FIG. 1 enables the frames 420A-420E to be jointly encoded at different bitrates and bundled into a single packet 430, which reduces the number of bits allocated to packet headers and thereby improves header efficiency.
The system 100 further enables customized bit allocation for the encoding of different data samples 120. For example, a greater number of bits is allocated to a reference frame data sample (e.g., the data sample 120A) than to the predicted frame data samples (e.g., the data samples 120B-120E). Thus, if the data samples 120A-120E are encoded into five respective frames (e.g., the frames 420A-420E of FIG. 4) that are bundled into the packet 430, the number of bits allocated to each frame can be customized based on whether the frame is a reference frame or a predicted frame, which improves transmission bandwidth.
According to some implementations, system complexity is reduced by sharing the frontend neural network preprocessing layers 102 and the backend neural network postprocessing layers 112 across time steps. For example, the frontend neural network preprocessing layers 102 and the backend neural network postprocessing layers 112 can perform pre-processing operations and post-processing operations in parallel. As a result, the system 100 can experience lower memory usage and reduced complexity. Additionally, by sharing the frontend neural network preprocessing layers 102 and the backend neural network postprocessing layers 112 across time steps, the system 100 can adapt to network conditions on the fly by changing the bitrate with reduced weight loading. For example, the bitrate of the bottlenecks 108 can be changed in response to a change in network conditions, but the frontend neural network preprocessing layers 102, the bidirectional GRU layer 105, and the backend neural network postprocessing layers 112 can remain unchanged.
The system 200 of FIG. 2 corresponds to another implementation of a bundled multi-rate feedback autoencoder architecture. The system 200 includes an encoder portion 180B and a decoder portion 190B. Compared to the system 100 of FIG. 1, the bidirectional GRU layer 105 is replaced with an encoder-side attention mechanism 205, and the bidirectional GRU layer 109 is replaced with a decoder-side attention mechanism 209. Thus, the encoder portion 180B includes the one or more frontend neural network preprocessing layers 102, the encoder-side attention mechanism 205, and the bottleneck layer 107, and the decoder portion 190B includes the decoder-side attention mechanism 209 and the one or more backend neural network postprocessing layers 112.
The encoder-side attention mechanism 205 can receive the outputs of the frontend neural network preprocessing layers 102 and receive the feedback 150 from decoded frames of a previous packet. According to one implementation, the encoder-side attention mechanism 205 includes a transformer. The encoder-side attention mechanism 205 can have direct access to each token (e.g., each input data state 122A-122E) instead of having access to neighboring tokens. For example, in FIG. 2, the encoder-side attention mechanism 205 can attend to each of the tokens when generating the input data state 122 for a particular time instance, rather than relying only on data states propagated from neighboring time instances as in the bidirectional GRU layer 105 of FIG. 1.
The decoder-side attention mechanism 209 can receive the output data states 124A-124E from the bottlenecks 108A-108E and can receive the feedback 150 from decoded frames of a previous packet. Based on the feedback 150, the decoder-side attention mechanism 209 can perform decoding operations on the output data states 124A-124E to generate outputs that are provided to the backend neural network postprocessing layers 112 for processing. According to one implementation, the decoder-side attention mechanism 209 includes a transformer.
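A minimal sketch of this substitution is shown below, using an off-the-shelf self-attention layer as a stand-in for the attention mechanisms 205, 209. The dimensions and head count are hypothetical, and the standard transformer encoder layer is only one possible realization of an attention mechanism with direct access to every token.

    import torch
    import torch.nn as nn

    HIDDEN_DIM, TIME_STEPS = 64, 5   # hypothetical dimensions

    # Every token attends to every other token, rather than only to
    # neighboring time instances as in a recurrent layer.
    attention_205 = nn.TransformerEncoderLayer(d_model=HIDDEN_DIM, nhead=4,
                                               batch_first=True)

    tokens = torch.randn(1, TIME_STEPS, HIDDEN_DIM)  # outputs of layers 102
    input_states_122 = attention_205(tokens)         # states 122A-122E
    print(input_states_122.shape)                    # torch.Size([1, 5, 64])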
The bottleneck 108A includes a fully connected layer 302A, a quantizer 304A, one or more codebooks 306A, and a fully connected layer 308A. The input data state 122A is provided to the fully connected layer 302A. The fully connected layer 302A is configured to generate a pre-quantization latent 350A based on the input data state 122A. The pre-quantization latent 350A can correspond to an encoding indicative of an array of floating point values. The pre-quantization latent 350A is provided to the quantizer 304A. The quantizer 304A is configured to map each floating point value of the pre-quantization latent 350A to a representative value of the one or more codebooks 306A to generate a post-quantization latent 352A. According to one implementation, the post-quantization latent 352A can correspond to the output data state 124A of the bottleneck 108A. According to another implementation, the post-quantization latent 352A is provided to the fully connected layer 308A, and the fully connected layer 308A can generate the output data state 124A based on the post-quantization latent 352A.
The bottleneck 108B includes a fully connected layer 302B, a quantizer 304B, one or more codebooks 306B, and a fully connected layer 308B. The input data state 122B is provided to the fully connected layer 302B. The fully connected layer 302B is configured to generate a pre-quantization latent 350B based on the input data state 122B. The pre-quantization latent 350B can correspond to an encoding indicative of an array of floating point values. The pre-quantization latent 350B is provided to the quantizer 304B. The quantizer 304B is configured to map each floating point value of the pre-quantization latent 350B to a representative value of the one or more codebooks 306B to generate a post-quantization latent 352B. According to one implementation, the post-quantization latent 352B can correspond to the output data state 124B of the bottleneck 108B. According to another implementation, the post-quantization latent 352B is provided to the fully connected layer 308B, and the fully connected layer 308B can generate the output data state 124B based on the post-quantization latent 352B.
As described above with respect to FIG. 1, the bottleneck 108A operates at a higher bitrate than the bottleneck 108B. To enable the higher bitrate, a larger number of bits (e.g., units) can be allocated to the latent codes generated at the bottleneck 108A than to the latent codes generated at the bottleneck 108B. For example, the one or more codebooks 306B associated with the bottleneck 108B can have a smaller size than the one or more codebooks 306A associated with the bottleneck 108A.
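The sketch below illustrates the fully connected layer, quantizer, and codebook structure described above, and how a smaller latent size and codebook reduce the bits per frame. It is a minimal illustration: the dimensions, the scalar per-unit codebook, and the nearest-neighbor quantization rule are assumptions, not the trained components of this disclosure.

    import torch
    import torch.nn as nn

    class Bottleneck(nn.Module):
        def __init__(self, state_dim, latent_units, codebook_size):
            super().__init__()
            self.fc_in = nn.Linear(state_dim, latent_units)            # layer 302
            self.codebook = nn.Parameter(torch.randn(codebook_size))   # codebook 306
            self.fc_out = nn.Linear(latent_units, state_dim)           # layer 308

        def forward(self, input_state):
            pre_q = self.fc_in(input_state)               # pre-quantization latent 350
            # Quantizer 304: map each float to its nearest codebook entry.
            dist = (pre_q.unsqueeze(-1) - self.codebook) ** 2
            post_q = self.codebook[dist.argmin(dim=-1)]   # post-quantization latent 352
            return self.fc_out(post_q)                    # output data state 124

    # More latent units and a larger codebook mean more bits per frame:
    # 16 units * log2(16) = 64 bits vs. 8 units * log2(4) = 16 bits.
    bottleneck_108A = Bottleneck(state_dim=128, latent_units=16, codebook_size=16)
    bottleneck_108B = Bottleneck(state_dim=128, latent_units=8, codebook_size=4)

    input_state_122 = torch.randn(1, 128)
    print(bottleneck_108A(input_state_122).shape)   # torch.Size([1, 128])
    print(bottleneck_108B(input_state_122).shape)   # torch.Size([1, 128])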
The frame generator 402 is configured to receive the output data states 124A-124E from each bottleneck 108A-108E and generate corresponding frames 420A-420E. To illustrate, the frame generator 402 can generate the frame 420A based on the output data state 124A. For example, the output data state 124A can correspond to an encoded version of the data sample 120A that is included in the frame 420A. The frame generator 402 can also generate the frame 420B based on the output data state 124B. For example, the output data state 124B can correspond to an encoded version of the data sample 120B that is included in the frame 420B. The frame generator 402 can also generate the frame 420C based on the output data state 124C. For example, the output data state 124C can correspond to an encoded version of the data sample 120C that is included in the frame 420C. The frame generator 402 can also generate the frame 420D based on the output data state 124D. For example, the output data state 124D can correspond to an encoded version of the data sample 120D that is included in the frame 420D. The frame generator 402 can also generate the frame 420E based on the output data state 124E. For example, the output data state 124E can correspond to an encoded version of the data sample 120E that is included in the frame 420E.
As described above, the frame 420A can correspond to a reference frame that includes more bits than the frames 420B-420E. Because of the additional bits associated with the reference frame 420A, the bottleneck 108A associated with generation of the corresponding output data state 124A operates at a higher bitrate than the bottlenecks 108B-108E associated with generation of the other output data states 124B-124E.
The packet generator 404 is configured to receive each frame 420A-420E from the frame generator 402. The packet generator 404 can further be configured to bundle the frames 420A-420E into the packet 430 to be transmitted to a receiving device. For example, the packet generator 404 can operate as a frame bundler that bundles (e.g., combines) the frames 420A-420E into a single packet 430. Because the frames 420A-420E are jointly encoded and temporal redundancies are exploited during the encoding process, the frames 420A-420E are smaller in size than frames generated without exploiting temporal redundancies and the number of bits that make up the packet 430 is reduced.
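A minimal sketch of the bundling step follows, assuming (hypothetically) that each encoded frame has already been serialized to bytes and that a two-byte header suffices. The disclosure does not define a wire format, so the field layout here is illustrative only.

    import struct

    def bundle_packet(reference_frame, predicted_frames):
        # One header amortized over all bundled frames: a frame count and the
        # reference-frame length (both hypothetical header fields).
        header = struct.pack("!BB", 1 + len(predicted_frames), len(reference_frame))
        return header + reference_frame + b"".join(predicted_frames)

    frame_420A = b"\x00" * 8            # 64-bit reference frame
    frames_420B_E = [b"\x00" * 2] * 4   # four 16-bit predicted frames
    packet_430 = bundle_packet(frame_420A, frames_420B_E)
    print(len(packet_430))              # 18 bytes: 2-byte header + 8 + 4 * 2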
The system 500 includes the bidirectional GRU layer 109 and the one or more backend neural network postprocessing layers 112. At each time instance 110, the bidirectional GRU layer 109 is configured to generate left and right hidden data states to facilitate decoding of the data samples 120. Thus, in the example of FIG. 5, the bidirectional GRU layer 109 generates left data states 520A-520E and right data states 522A-522E over the time instances 110A-110E.
The bidirectional GRU layer 109 can be initialized based on a data state from a previous frame or packet. For example, the bidirectional GRU layer 109 can receive a left data state 530 from a previous frame and a right data state 532 from the previous frame. According to one implementation, the data states 530, 532 from the previous frame correspond to the feedback 150 provided to the bidirectional GRU layer 109. The bidirectional GRU layer 109 can use the data states 530, 532 from the previous frame in a manner that accounts for previously encoded/decoded data samples.
Based on left data state 530 from the previous frame and the output data state 124E, the bidirectional GRU layer 109 is configured to generate a left data state 520E. According to some implementations, the bidirectional GRU layer 109 can generate the left data state 520E based on a left data state 520D. Based on the left data state 520E, the output data state 124D, the left data state 520C, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the left data state 520D. Based on the left data state 520D, the output data state 124C, the left data state 520B, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the left data state 520C. Based on the left data state 520C, the output data state 124B, the left data state 520A, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the left data state 520B. Based on the left data state 520B, the output data state 124A, or both, the bidirectional GRU layer 109 is configured to generate the left data state 520A.
Based on the right data state 532 from the previous frame and the output data state 124A, the bidirectional GRU layer 109 is configured to generate a right data state 522A. According to some implementations, the bidirectional GRU layer 109 can generate the right data state 522A based on a right data state 522B. Based on the right data state 522A, the output data state 124B, the right data state 522C, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the right data state 522B. Based on the right data state 522B, the output data state 124C, the right data state 522D, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the right data state 522C. Based on the right data state 522C, the output data state 124D, the right data state 522E, or a combination thereof, the bidirectional GRU layer 109 is configured to generate the right data state 522D. Based on the right data state 522D, the output data state 124E, or both, the bidirectional GRU layer 109 is configured to generate the right data state 522E.
The left data state 520A and the right data state 522A can be provided to the one or more backend neural network postprocessing layers 112, and the one or more backend neural network postprocessing layers 112 are configured to generate the reconstructed data sample 126A based on the data states 520A, 522A. In a similar manner, the left data state 520B and the right data state 522B can be provided to the one or more backend neural network postprocessing layers 112, and the one or more backend neural network postprocessing layers 112 are configured to generate the reconstructed data sample 126B based on the data states 520B, 522B. Similarly, the left data state 520C and the right data state 522C can be provided to the one or more backend neural network postprocessing layers 112, and the one or more backend neural network postprocessing layers 112 are configured to generate the reconstructed data sample 126C based on the data states 520C, 522C.
In a similar manner, the left data state 520D and the right data state 522D can be provided to the one or more backend neural network postprocessing layers 112, and the one or more backend neural network postprocessing layers 112 are configured to generate the reconstructed data sample 126D based on the data states 520D, 522D. Similarly, the left data state 520E and the right data state 522E can be provided to the one or more backend neural network postprocessing layers 112, and the one or more backend neural network postprocessing layers 112 are configured to generate the reconstructed data sample 126E based on the data states 520E, 522E.
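The left and right recursions can be sketched as follows, with a simple averaging update standing in for the learned GRU cell; the real update functions are trained parameters that this disclosure does not reduce to a formula, so the arithmetic below is purely illustrative.

    import torch

    def left_pass(output_states, left_init):
        states, prev = [], left_init
        for out in reversed(output_states):      # 124E -> 124A
            prev = 0.5 * (prev + out)            # placeholder for the GRU update
            states.append(prev)
        return list(reversed(states))            # left states 520A-520E

    def right_pass(output_states, right_init):
        states, prev = [], right_init
        for out in output_states:                # 124A -> 124E
            prev = 0.5 * (prev + out)            # placeholder for the GRU update
            states.append(prev)
        return states                            # right states 522A-522E

    outputs_124 = [torch.randn(64) for _ in range(5)]        # states 124A-124E
    left_states = left_pass(outputs_124, torch.zeros(64))    # seeded by state 530
    right_states = right_pass(outputs_124, torch.zeros(64))  # seeded by state 532
    # Each reconstructed sample 126 is produced from the pair (520, 522) by the
    # backend postprocessing layers 112 (omitted here).
    print(len(left_states), len(right_states))               # 5 5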
It should be appreciated that the system 500 of FIG. 5 enables each reconstructed data sample 126 to be generated in a manner that accounts for the output data states 124 of neighboring time instances 110 and for data states from previously decoded frames.
In the example of FIG. 6, a transmitting device 602 is configured to send encoded data to a receiving device 652 via a transmission medium 632. The transmitting device 602 includes a feature extractor 606, a subsystem 610, the frame generator 402, the packet generator 404, a modem 628, and a transmitter 630.
The data stream 604 in FIG. 6 can correspond to an audio data stream. In other implementations, the data stream 604 can include other types of data, such as video data or game data.
The feature extractor 606 is configured to generate the data samples 120 based on the data stream 604. The data samples 120 include data representing a portion (e.g., a single data frame, multiple data frames, or a segment or subset of a data frame) of the data stream 604. The feature extraction technique(s) used by the feature extractor 606 may include, for example, data aggregation, interpolation, compression, windowing, domain transformation, sampling, smoothing, statistical analysis, etc. To illustrate, when the data stream 604 includes voice data or other audio data, the feature extractor 606 may be configured to determine time-domain or frequency-domain spectral information descriptive of a time-windowed portion of the data stream 604. In this example, the data samples 120 may include the spectral information. As one non-limiting example, the data samples 120 may include data describing a cepstrum of voice data of the data stream 604, data describing pitch associated with the voice data, other data indicating characteristics of the voice data, or a combination thereof. As another illustrative example, when the data stream 604 includes video data, game data, or both, the feature extractor 606 may be configured to determine pixel information associated with an image frame of the data stream 604. In the same or other examples, the data samples 120 may include other information, such as metadata associated with the data stream 604, compression data (e.g., keyframe identifiers), or other information used by the subsystem 610 to encode the data samples 120.
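As one illustrative realization of the feature extractor 606 for audio, the sketch below computes a log-magnitude spectrum for each 20 ms window. The window length, sample rate, and choice of spectral features are assumptions for this example; as noted above, the disclosure leaves the feature extraction technique open.

    import numpy as np

    def extract_features(audio, sample_rate=16000, window_ms=20.0):
        window = int(sample_rate * window_ms / 1000)
        n_frames = len(audio) // window
        frames = audio[: n_frames * window].reshape(n_frames, window)
        spectra = np.abs(np.fft.rfft(frames * np.hanning(window), axis=1))
        return np.log1p(spectra)   # one feature vector per time-windowed portion

    stream_604 = np.random.randn(16000)           # 1 second of audio at 16 kHz
    data_samples_120 = extract_features(stream_604)
    print(data_samples_120.shape)                 # (50, 161): 50 windows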
The subsystem 610 includes an encoder portion 180 of a bundled multi-rate feedback autoencoder. The encoder portion 180 can correspond to the encoder portion 180A of the system 100, the encoder portion 180B of the system 200, or both. In some implementations, the encoder portion 180 can include the bottleneck layer 107. In other implementations, the bottleneck layer 107 can be coupled to the encoder portion 180. In a similar manner as described with respect to FIGS. 1-3, the encoder portion 180 can generate the output data states 124 based on the data samples 120, and the output data states 124 are provided to the frame generator 402.
The frame generator 402 is configured to generate the frame 420A (e.g., the reference frame) and the frame 420B (e.g., the predicted frame). It should be understood that the frame generator 402 can generate additional frames (e.g., the frames 420C-420E), as described with respect to FIG. 4.
The modem 628 is configured to modulate a baseband signal, according to a particular communication protocol, to generate signals representing the packet 430 and a previous packet 634. The transmitter 630 is configured to send the signals representing the packets 430, 634 via the transmission medium 632. The transmission medium 632 may include a wireline medium, an optical medium, or a wireless medium. To illustrate, the transmitter 630 may include or correspond to a wireless transmitter configured to send the signals via free-space propagation of electromagnetic waves.
According to one implementation, bitrates of the bottlenecks 108 in the bottleneck layer 107 can be dynamically changed based on network conditions associated with the transmission medium 632. As a non-limiting example, if the network is congested such that packets are more frequently lost or delayed, the bitrates can be increased to allocate additional bits to the frames. As another example, if the network has a relatively large bandwidth such that packets are rarely lost or delayed, the bitrates can be decreased.
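One possible control policy for such adaptation is sketched below. The loss-rate thresholds and bit budgets are hypothetical tuning choices, not values specified by the disclosure; only the direction of the adjustment (more bits under congestion-driven loss, fewer on a clean channel) follows the description above.

    def select_bitrates(packet_loss_rate):
        # Return hypothetical (reference_bits, predicted_bits) per frame.
        if packet_loss_rate > 0.05:    # congested: allocate additional bits
            return 96, 24
        if packet_loss_rate < 0.01:    # clean channel: decrease the budgets
            return 48, 12
        return 64, 16                  # default budgets

    M_bits, N_bits = select_bitrates(packet_loss_rate=0.02)
    print(M_bits, N_bits)              # 64 16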
In the example of FIG. 6, the packets 430, 634 are sent to the receiving device 652 via the transmission medium 632. In some circumstances, one or more of the packets 430, 634 may be lost or delayed during transmission, or the packets 430, 634 may arrive at the receiving device 652 in a different order than the order in which they were transmitted.
In FIG. 6, the receiving device 652 includes a receiver 654, a modem 656, a depacketizer 658, one or more buffers 660, a decoder controller 665, one or more decoder networks 670, a renderer 678, and a user interface device 680.
The receiver 654 is configured to receive the signals representative of packets 430, 634 and to provide the signals (after initial signal processing, such as amplification, filtering, etc.) to the modem 656. As noted above, the receiving device 652 may not receive all of the packets 430, 634 sent by the transmitting device 602. Additionally, or in the alternative, the packets 430, 634 may be received in a different order than they are transmitted by the transmitting device 602.
The modem 656 is configured to demodulate the signals to generate bits representing the received packets 430, 634 and to provide the bits representing the received data packets to the depacketizer 658. The depacketizer 658 is configured to extract one or more data frames 420 from the payload of each received packet 430, 634 and to store the frames 420 at the buffer(s) 660. For example, in FIG. 6, the depacketizer 658 can extract the frames 420A and 420B from the payload of the packet 430 and store the extracted frames 420A, 420B at the buffer(s) 660.
In the example illustrated in FIG. 6, the decoder controller 665 is configured to retrieve the frames 420 from the buffer(s) 660 and to coordinate decoding of the frames 420 by the decoder network(s) 670.
To decode a particular data sample, the decoder controller 665 extracts the output data states 124 from the frames 420 and provides the output data states 124 to a decoder 672 of the decoder networks 670. The decoder 672 can include the components of the system 500 and can operate in a substantially similar manner. For example, the decoder 672 can generate the reconstructed data samples 126 based on the output data states 124 in a similar manner as described with respect to FIG. 5.
The reconstructed data samples 126 may be stored at the buffer(s) 660 (e.g., at one or more playout buffers 674). At a playback time, the renderer 678 retrieves the reconstructed data samples 126 from the buffer(s) 660 and processes the reconstructed data samples 126 to generate output signals, such as audio signals, video signals, game update signals, etc. The renderer 678 provides the signals to a user interface device 680 to generate a user perceivable output based on the reconstructed data samples 126. For example, the user perceivable output may include one or more of a sound, an image, or a vibration. In some implementations, the renderer 678 includes or corresponds to a game engine that generates the user perceivable output in response to modifying a game state based on the reconstructed data samples 126.
In the example of FIG. 7, a method 700 of encoding data includes generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream, at block 702. For example, referring to FIG. 1, the bidirectional GRU layer 105 generates the input data states 122A-122E for the data samples 120A-120E in the time series of data samples 120 of a portion of the audio data stream.
The method 700 also includes providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck, at block 704. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate. According to some implementations, the second bitrate is distinct from the first bitrate. For example, referring to FIG. 1, the input data state 122B is provided to the bottleneck 108B, and the input data state 122A is provided to the bottleneck 108A. The bottleneck 108B is associated with a first bitrate, and the bottleneck 108A is associated with a second bitrate that is greater than the first bitrate.
According to one implementation, the method 700 includes allocating a smaller number of units to latent codes generated at the first bottleneck than to latent codes generated at the second bottleneck to reduce the bitrate of the first bottleneck. For example, referring to FIG. 3, a smaller number of bits (e.g., units) can be allocated to the latent codes generated at the bottleneck 108B than to the latent codes generated at the bottleneck 108A.
According to one implementation of the method 700, a first codebook associated with the first bottleneck has a smaller size than a second codebook associated with the second bottleneck. For example, referring to FIG. 3, the codebook(s) 306B associated with the bottleneck 108B can have a smaller size than the codebook(s) 306A associated with the bottleneck 108A.
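The codebook relationship can be made concrete with a short worked example: under nearest-neighbor quantization, each latent unit costs log2(codebook size) bits, so the hypothetical sizes below reproduce the 64-bit reference and 16-bit predicted budgets used in the earlier sketches.

    import math

    def frame_bits(latent_units, codebook_size):
        # Bits per frame = units * bits per unit; the sizes here are assumptions.
        return latent_units * int(math.log2(codebook_size))

    print(frame_bits(latent_units=16, codebook_size=16))  # reference frame: 64 bits
    print(frame_bits(latent_units=8, codebook_size=4))    # predicted frame: 16 bits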
According to one implementation, the method 700 includes dynamically changing the first bitrate and the second bitrate based on network conditions. For example, if the network is congested such that packets are more frequently lost or delayed, the second bitrate can be increased to allocate additional bits to the reference frame 420A. As another example, if the network has a relatively large bandwidth such that packets are rarely lost or delayed, the second bitrate can be decreased.
The method 700 also includes generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck, at block 706. The first encoded frame and the second encoded frame are bundled in a packet. For example, referring to FIGS. 1, 3, and 4, the frame generator 402 generates the frame 420B based on the output data state 124B from the bottleneck 108B and generates the frame 420A based on the output data state 124A from the bottleneck 108A. The packet generator 404 bundles the frame 420A and the frame 420B in the packet 430.
The method 700 of FIG. 7 improves header efficiency by enabling frames encoded at different bitrates to be bundled into a single packet. Additionally, the method 700 of FIG. 7 enables a different number of bits to be allocated to each frame within the packet, which improves transmission bandwidth.
The method 700 of FIG. 7 may be implemented by one or more processors that execute instructions, such as the processor(s) 910 of FIG. 9 executing the instructions 922.
In the example of FIG. 8, a method 800 of decoding data includes receiving, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame, at block 802. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. For example, referring to FIG. 6, the receiving device 652 receives the packet 430 that includes the frame 420B bundled with the frame 420A.
The method 800 also includes generating a reconstructed first data sample based on the first output data state, at block 804. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. For example, referring to FIGS. 5 and 6, the decoder 672 generates the reconstructed data sample 126B based on the output data state 124B.
The method 800 also includes generating a reconstructed second data sample based on the second output data state, at block 806. The reconstructed second data sample corresponds to a second data sample in the time series of data samples. For example, referring to FIGS. 5 and 6, the decoder 672 generates the reconstructed data sample 126A based on the output data state 124A.
The method 800 of FIG. 8 may be implemented by one or more processors that execute instructions, such as the processor(s) 1010 of FIG. 10 executing the instructions 1022.
In the illustrated implementation 900, the device 902 includes a memory 920 (e.g., one or more memory devices) that includes instructions 922 and one or more codebooks 306. The device 902 also includes one or more processors 910 coupled to the memory 920 and configured to execute the instructions 922 from the memory 920. In this implementation 900, the feature extractor 606, the subsystem 610, the encoder portion 180 of the bundled multi-rate feedback autoencoder, the frame generator 402, and the packet generator 404 may correspond to or be implemented via the instructions 922. For example, when the instructions 922 are executed by the processor(s) 910, the processor(s) 910 may generate an input data state 122A-122E for each data sample 120A-120E in a time series of data samples 120 of a portion of an audio data stream. The processor(s) 910 may also provide at least one input data state 122B to a first bottleneck 108B and at least one other input data state 122A to a second bottleneck 108A. The first bottleneck 108B can be associated with a first bitrate and the second bottleneck 108A can be associated with a second bitrate that is distinct from the first bitrate. The processor(s) 910 may also generate a first encoded frame 420B based on a first output data state 124B from the first bottleneck 108B and a second encoded frame 420A based on a second output data state 124A from the second bottleneck 108A. The first encoded frame 420B and the second encoded frame 420A are bundled in a packet 430.
In the illustrated implementation 1000, the device 1002 includes a memory 1020 (e.g., one or more memory devices) that includes instructions 1022 and one or more buffers 660. The device 1002 also includes one or more processors 1010 coupled to the memory 1020 and configured to execute the instructions 1022 from the memory 1020. In this implementation 1000, the depacketizer 658, the decoder controller 665, the decoder network(s) 670, the decoder(s) 672, and/or the renderer 678 may correspond to or be implemented via the instructions 1022. For example, when the instructions 1022 are executed by the processor(s) 1010, the processor(s) 1010 may receive a packet 430 that includes a first encoded frame 420B bundled with a second encoded frame 420A. The first encoded frame 420B can include a first output data state 124B generated from a first bottleneck 108B of a bundled multi-rate feedback autoencoder, and the second encoded frame 420A can include a second output data state 124A generated from a second bottleneck 108A of the bundled multi-rate feedback autoencoder. The first bottleneck can be associated with a first bitrate and the second bottleneck can be associated with a second bitrate that is distinct from the first bitrate. The processor(s) 1010 may further generate a reconstructed first data sample 126B based on the first output data state 124B. The reconstructed first data sample 126B can correspond to a first data sample 120B in a time series of data samples 120 of a portion of an audio data stream 604. The processor(s) 1010 may further generate a reconstructed second data sample 126A based on the second output data state 124A. The reconstructed second data sample 126A can correspond to a second data sample 120A in the time series of data samples 120.
Referring to FIG. 11, a block diagram of a particular illustrative implementation of a device 1100 is depicted. In various implementations, the device 1100 may have more or fewer components than illustrated in FIG. 11. In an illustrative implementation, the device 1100 may correspond to the transmitting device 602 of FIG. 6, the receiving device 652 of FIG. 6, or both.
In a particular implementation, the device 1100 includes a processor 1106 (e.g., a CPU). The device 1100 may include one or more additional processors 1110 (e.g., one or more DSPs, one or more GPUs, or a combination thereof). The processor(s) 1110 may include a speech and music coder-decoder (CODEC) 1108. The speech and music codec 1108 may include a voice coder (“vocoder”) encoder 1136, a vocoder decoder 1138, or both. In a particular aspect, the vocoder encoder 1136 includes the encoder portion 180 of the bundled multi-rate feedback autoencoder. In a particular aspect, the vocoder decoder 1138 includes the decoder portion of the bundled multi-rate feedback autoencoder.
The device 1100 also includes a memory 1186 and a CODEC 1134. The memory 1186 may include instructions 1156 that are executable by the one or more additional processors 1110 (or the processor 1106) to implement the functionality described with reference to the transmitting device 602 of FIG. 6, the receiving device 652 of FIG. 6, or both.
The device 1100 may include a display 1128 coupled to a display controller 1126. A speaker 1196 and a microphone 1194 may be coupled to the CODEC 1134. The CODEC 1134 may include a digital-to-analog converter (DAC) 1102 and an analog-to-digital converter (ADC) 1104. In a particular implementation, the CODEC 1134 may receive an analog signal from the microphone 1194, convert the analog signal to a digital signal using the analog-to-digital converter 1104, and provide the digital signal to the speech and music codec 1108 (e.g., as the data stream 604 of FIG. 6). The speech and music codec 1108 may also provide digital signals to the CODEC 1134, and the CODEC 1134 may convert the digital signals to analog signals using the digital-to-analog converter 1102 and provide the analog signals to the speaker 1196.
In a particular implementation, the device 1100 may be included in a system-in-package or system-on-chip device 1122 that corresponds to the transmitting device 602 of FIG. 6, the receiving device 652 of FIG. 6, or a combination thereof. In a particular implementation, the device 1100 also includes a modem 1140 coupled, via a transceiver 1150, to an antenna 1190.
In a particular implementation, the memory 1186, the processor 1106, the processors 1110, the display controller 1126, the CODEC 1134, and the modem 1140 are included in the system-in-package or system-on-chip device 1122. In a particular implementation, an input device 1130 and a power supply 1144 are coupled to the system-in-package or system-on-chip device 1122. Moreover, in a particular implementation, as illustrated in FIG. 11, the display 1128, the input device 1130, the speaker 1196, the microphone 1194, the antenna 1190, and the power supply 1144 are external to the system-in-package or system-on-chip device 1122.
The device 1100 may include a smart speaker (e.g., the processor 1106 may execute the instructions 1156 to run a voice-controlled digital assistant application), a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a DVD player, a tuner, a camera, a navigation device, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, a vehicle, or any combination thereof.
In conjunction with the described implementations, an apparatus includes means for generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream. For example, the means for generating the input data states includes the frontend neural network preprocessing layers 102, the bidirectional GRU layer 105, the encoder-side attention mechanism 205, the encoder portion 180 of the bundled multi-rate feedback autoencoder, the transmitting device 602, the device 902, the processor(s) 910, the processor 1106, the processor(s) 1110, the speech and music codec 1108, the vocoder encoder 1136, one or more other circuits or components configured to generate the input data states, or any combination thereof.
The apparatus also includes means for providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck. The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate that is distinct from the first bitrate. For example, the means for providing includes the frontend neural network preprocessing layers 102, the bidirectional GRU layer 105, the encoder-side attention mechanism 205, the encoder portion 180 of the bundled multi-rate feedback autoencoder, the transmitting device 602, the device 902, the processor(s) 910, the processor 1106, the processor(s) 1110, the speech and music codec 1108, the vocoder encoder 1136, one or more other circuits or components configured to provide the input data states to bottlenecks, or any combination thereof.
The apparatus further includes means for generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet. For example, the means for generating includes the frame generator 402, the transmitting device 602, the device 902, the processor(s) 910, the processor 1106, the processor(s) 1110, the speech and music codec 1108, the vocoder encoder 1136, one or more other circuits or components configured to generate the first and second encoded frames, or any combination thereof.
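A minimal Python sketch of the encoder-side means enumerated above follows, assuming toy dimensions; the `ToyEncoder` class and its linear bottlenecks are illustrative stand-ins for the preprocessing layers 102, the bidirectional GRU layer 105, the bottlenecks, and the frame generator 402, not their actual implementation.

```python
# Minimal sketch: a frontend plus bidirectional GRU produces distinct input
# data states, two bottlenecks at different bitrates produce output data
# states, and both encoded frames are returned together as one "packet".
# Dimensions and names are assumptions, not the actual elements 102/105/402.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, sample_dim=320, state_dim=64):
        super().__init__()
        self.frontend = nn.Linear(sample_dim, 128)               # preprocessing layers
        self.gru = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.low_bottleneck = nn.Linear(2 * 64, state_dim // 2)  # lower-bitrate bottleneck
        self.high_bottleneck = nn.Linear(2 * 64, state_dim)      # higher-bitrate bottleneck

    def forward(self, samples):                                  # (batch, 2, sample_dim)
        states, _ = self.gru(torch.tanh(self.frontend(samples)))
        first_input, second_input = states[:, 0], states[:, 1]   # distinct input data states
        first_out = self.low_bottleneck(first_input)             # smaller output data state
        second_out = self.high_bottleneck(second_input)          # larger output data state
        return {"frames": (first_out, second_out)}               # frames bundled together

encoder = ToyEncoder()
packet = encoder(torch.randn(1, 2, 320))
print([f.shape for f in packet["frames"]])  # [torch.Size([1, 32]), torch.Size([1, 64])]
```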
In conjunction with the described implementations, an apparatus includes means for receiving a packet that includes a first encoded frame bundled with a second encoded frame. The first encoded frame includes a first output data state generated from a first bottleneck of a feedback autoencoder, and the second encoded frame includes a second output data state generated from a second bottleneck of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate that is distinct from the first bitrate. For example, the means for receiving the packet includes the receiver 654, the modem 656, the depacketizer 658, the input interface 1004, the processor(s) 1010, the antenna 1190, the transceiver 1150, the modem 1140, the processor(s) 1110, the processor 1106, one or more other circuits or components configured to receive the packet, or any combination thereof.
The apparatus also includes means for generating a reconstructed first data sample based on the first output data state. The reconstructed first data sample corresponds to a first data sample in a time series of data samples of a portion of an audio data stream. For example, the means for generating the reconstructed first data sample includes the bidirectional GRU layer 109, the backend neural network postprocessing layers 112, the decoder controller 665, the decoder network(s) 670, the decoder 672, the processor(s) 1010, the processor(s) 1110, the processor 1106, one or more other circuits or components configured to generate the reconstructed first data sample, or any combination thereof.
The apparatus further includes means for generating a reconstructed second data sample based on the second output data state. The reconstructed second data sample corresponds to a second data sample in the time series of data samples. For example, the means for generating the reconstructed second data sample includes the bidirectional GRU layer 109, the backend neural network postprocessing layers 112, the decoder controller 665, the decoder network(s) 670, the decoder 672, the processor(s) 1010, the processor(s) 1110, the processor 1106, one or more other circuits or components configured to generate the reconstructed second data sample, or any combination thereof.
In some implementations, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a device, cause the one or more processors to generate an input data state (e.g., the input data states 122) for each data sample (e.g., the data samples 120) in a time series of data samples of a portion of an audio data stream (e.g., the data stream 604). Execution of the instructions also causes the one or more processors to provide at least one input data state to a first bottleneck (e.g., the bottleneck 108B) and at least one other input data state to a second bottleneck (e.g., the bottleneck 108A). The first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate that is distinct from the first bitrate. Execution of the instructions further causes the one or more processors to generate a first encoded frame (e.g., the frame 420B) based on a first output data state (e.g., the output data state 124B) from the first bottleneck and a second encoded frame (e.g., the frame 420A) based on a second output data state (e.g., the output data state 124A) from the second bottleneck. The first encoded frame and the second encoded frame are bundled in a packet (e.g., the packet 430).
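For illustration only, the byte-level sketch below shows one way two encoded frames could be bundled in and recovered from a single packet. The two-byte length prefix is an assumption introduced for this sketch; the disclosure does not specify a wire format for the packet 430.

```python
# Hedged byte-level sketch of bundling two encoded frames in one packet.
# The 2-byte big-endian length header is an assumption of this sketch.
import struct

def bundle(first_frame: bytes, second_frame: bytes) -> bytes:
    """Prefix each encoded frame with its length and concatenate (packet role)."""
    return (struct.pack("!H", len(first_frame)) + first_frame +
            struct.pack("!H", len(second_frame)) + second_frame)

def unbundle(packet: bytes) -> tuple:
    """Recover both encoded frames from a bundled packet (depacketizer role)."""
    n1 = struct.unpack_from("!H", packet, 0)[0]
    first = packet[2:2 + n1]
    n2 = struct.unpack_from("!H", packet, 2 + n1)[0]
    second = packet[4 + n1:4 + n1 + n2]
    return first, second

predicted = b"\x01" * 12   # smaller frame, e.g., from the lower-bitrate bottleneck
reference = b"\x02" * 40   # larger frame, e.g., from the higher-bitrate bottleneck
packet = bundle(predicted, reference)
assert unbundle(packet) == (predicted, reference)
```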
In some implementations, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a device, cause the one or more processors to receive, at a decoder network, a packet (e.g., the packet 430) that includes a first encoded frame (e.g., the frame 420B) bundled with a second encoded frame (e.g., the frame 420A). The first encoded frame includes a first output data state (e.g., the output data state 124B) generated from a first bottleneck (e.g., the bottleneck 108B) of a feedback autoencoder, and the second encoded frame includes a second output data state (e.g., the output data state 124A) generated from a second bottleneck (e.g., the bottleneck 108A) of the feedback autoencoder. The first bottleneck is associated with a first bitrate, and the second bottleneck is associated with a second bitrate that is distinct from the first bitrate. Execution of the instructions also causes the one or more processors to generate a reconstructed first data sample (e.g., the reconstructed data sample 126B) based on the first output data state. The reconstructed first data sample corresponds to a first data sample (e.g., the data sample 120B) in a time series of data samples (e.g., the data samples 120) of a portion of an audio data stream (e.g., the data stream 604). Execution of the instructions further causes the one or more processors to generate a reconstructed second data sample (e.g., the reconstructed data sample 126A) based on the second output data state. The reconstructed second data sample corresponds to a second data sample (e.g., the data sample 120A) in the time series of data samples.
Particular aspects of the disclosure are described below in sets of interrelated examples:
According to Example 1, a device includes: a memory; and one or more processors coupled to the memory and operably configured to: generate a first input data state for data samples in a time series of data samples of a portion of an audio data stream; provide the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck, the first bottleneck associated with a first bitrate and the second bottleneck associated with a second bitrate; and generate a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck, the first encoded frame and the second encoded frame bundled in a packet.
Example 2 includes the device of Example 1, wherein the first bottleneck and the second bottleneck are integrated into a bottleneck layer of a feedback autoencoder.
Example 3 includes the device of any of Examples 1 to 2, wherein the first and second input data states correspond to first and second encoder hidden states generated at a bidirectional gated recurrent unit (GRU) layer of the feedback autoencoder.
Example 4 includes the device of any of Examples 1 to 3, wherein the first bitrate is distinct from the second bitrate.
Example 5 includes the device of any of Examples 1 to 4, wherein the one or more processors are operably configured to allocate a smaller number of bits to latent codes generated at the first bottleneck than to latent codes generated at the second bottleneck.
Example 6 includes the device of any of Examples 1 to 5, wherein a first codebook associated with the first bottleneck has a smaller size than a second codebook associated with the second bottleneck.
Example 7 includes the device of any of Examples 1 to 6, wherein the packet comprises a predicted frame and a reference frame, wherein an input data state associated with the predicted frame is provided to the first bottleneck, and wherein an input data state associated with the reference frame is provided to the second bottleneck.
Example 8 includes the device of any of Examples 1 to 7, wherein a bit size of the predicted frame is less than a bit size of the reference frame.
Example 9 includes the device of any of Examples 1 to 8, wherein the input data state for each frame of the packet is generated using an attention mechanism.
Example 10 includes the device of any of Examples 1 to 9, wherein the attention mechanism comprises a transformer.
Example 11 includes the device of any of Examples 1 to 10, wherein the one or more processors are operably configured to dynamically change the first bitrate and the second bitrate based on network conditions.
Example 12 includes a method comprising: generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream; providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck, the first bottleneck associated with a first bitrate and the second bottleneck associated with a second bitrate; and generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck, the first encoded frame and the second encoded frame bundled in a packet.
Example 13 includes the method of Example 12, wherein the first bottleneck and the second bottleneck are integrated into a bottleneck layer of a feedback autoencoder.
Example 14 includes the method of any of Examples 12 to 13, wherein the first and second input data states correspond to first and second encoder hidden states generated at a bidirectional gated recurrent unit (GRU) layer of the feedback autoencoder.
Example 15 includes the method of any of Examples 12 to 14, wherein the first bitrate is distinct from the second bitrate.
Example 16 includes the method of any of Examples 12 to 15, further comprising allocating a smaller number of bits to latent codes generated at the first bottleneck than to latent codes generated at the second bottleneck.
Example 17 includes the method of any of Examples 12 to 16, wherein a first codebook associated with the first bottleneck has a smaller size than a second codebook associated with the second bottleneck.
Example 18 includes the method of any of Examples 12 to 17, wherein the packet comprises a predicted frame and a reference frame, wherein an input data state associated with the predicted frame is provided to the first bottleneck, and wherein an input data state associated with the reference frame is provided to the second bottleneck.
Example 19 includes the method of any of Examples 12 to 18, wherein a bit size of the predicted frame is less than a bit size of the reference frame.
Example 20 includes the method of any of Examples 12 to 19, wherein the input data state for each frame of the packet is generated using an attention mechanism.
Example 21 includes the method of any of Examples 12 to 20, wherein the attention mechanism comprises a transformer.
Example 22 includes the method of any of Examples 12 to 21, further comprising dynamically changing the first bitrate and the second bitrate based on network conditions.
Example 23 includes a non-transitory computer-readable medium storing instructions executable by one or more processors to: generate a first input data state for data samples in a time series of data samples of a portion of an audio data stream; provide the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck, the first bottleneck associated with a first bitrate and the second bottleneck associated with a second bitrate; and generate a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck, the first encoded frame and the second encoded frame bundled in a packet.
Example 24 includes the non-transitory computer-readable medium of Example 23, wherein the first bottleneck and the second bottleneck are integrated into a bottleneck layer of a feedback autoencoder.
Example 25 includes the non-transitory computer-readable medium of any of Examples 23 to 24, wherein the first and second input data states correspond to first and second encoder hidden states generated at a bidirectional gated recurrent unit (GRU) layer of the feedback autoencoder.
Example 26 includes the non-transitory computer-readable medium of any of Examples 23 to 25, wherein the first bitrate is distinct from the second bitrate.
Example 27 includes the non-transitory computer-readable medium of any of Examples 23 to 26, wherein the instructions, when executed, further cause the one or more processors to allocate a smaller number of bits to latent codes generated at the first bottleneck than to latent codes generated at the second bottleneck.
Example 28 includes the non-transitory computer-readable medium of any of Examples 23 to 27, wherein a first codebook associated with the first bottleneck has a smaller size than a second codebook associated with the second bottleneck.
Example 29 includes the non-transitory computer-readable medium of any of Examples 23 to 28, wherein the packet comprises a predicted frame and a reference frame, wherein an input data state associated with the predicted frame is provided to the first bottleneck, and wherein an input data state associated with the reference frame is provided to the second bottleneck.
Example 30 includes the non-transitory computer-readable medium of any of Examples 23 to 29, wherein a bit size of the predicted frame is less than a bit size of the reference frame.
Example 31 includes the non-transitory computer-readable medium of any of Examples 23 to 30, wherein the input data state for each frame of the packet is generated using an attention mechanism.
Example 32 includes the non-transitory computer-readable medium of any of Examples 23 to 31, wherein the attention mechanism comprises a transformer.
Example 33 includes the non-transitory computer-readable medium of any of Examples 23 to 32, wherein the instructions, when executed, further cause the one or more processors to dynamically change the first bitrate and the second bitrate based on network conditions.
Example 34 includes an apparatus comprising: means for generating a first input data state for data samples in a time series of data samples of a portion of an audio data stream; means for providing the first input data state to a first bottleneck and a second input data state, different from the first input data state, to a second bottleneck, the first bottleneck associated with a first bitrate and the second bottleneck associated with a second bitrate; and means for generating a first encoded frame based on a first output data state from the first bottleneck and a second encoded frame based on a second output data state from the second bottleneck, the first encoded frame and the second encoded frame bundled in a packet.
Example 35 includes the apparatus of Example 34, wherein the first bottleneck and the second bottleneck are integrated into a bottleneck layer of a feedback autoencoder.
Example 36 includes the apparatus of any of Examples 34 to 35, wherein the first and second input data states correspond to first and second encoder hidden states generated at a bidirectional gated recurrent unit (GRU) layer of the feedback autoencoder.
Example 37 includes the apparatus of any of Examples 34 to 36, wherein the first bitrate is distinct from the second bitrate.
Example 38 includes the apparatus of any of Examples 34 to 37, wherein a smaller number of bits is allocated to latent codes generated at the first bottleneck than to latent codes generated at the second bottleneck.
Example 39 includes the apparatus of any of Examples 34 to 38, wherein a first codebook associated with the first bottleneck has a smaller size than a second codebook associated with the second bottleneck.
Example 40 includes the apparatus of any of Examples 34 to 39, wherein the packet comprises a predicted frame and a reference frame, wherein an input data state associated with the predicted frame is provided to the first bottleneck, and wherein an input data state associated with the reference frame is provided to the second bottleneck.
Example 41 includes the apparatus of any of Examples 34 to 40, wherein a bit size of the predicted frame is less than a bit size of the reference frame.
Example 42 includes the apparatus of any of Examples 34 to 41, wherein the input data state for each frame of the packet is generated using an attention mechanism.
Example 43 includes the apparatus of any of Examples 34 to 42, wherein the attention mechanism comprises a transformer.
Example 44 includes the apparatus of any of Examples 34 to 43, further comprising means for dynamically changing the first bitrate and the second bitrate based on network conditions.
Example 45 includes a device comprising: a memory; and one or more processors coupled to the memory and configured to execute instructions from the memory to: receive, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame, the first encoded frame comprising a first output data state generated from a first bottleneck of a feedback autoencoder, the second encoded frame comprising a second output data state generated from a second bottleneck of the feedback autoencoder, wherein the first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate; generate a reconstructed first data sample based on the first output data state, the reconstructed first data sample corresponding to a first data sample in a time series of data samples of a portion of an audio data stream; and generate a reconstructed second data sample based on the second output data state, the reconstructed second data sample corresponding to a second data sample in the time series of data samples.
Example 46 includes the device of Example 45, wherein the first output data state is distinct from the second output data state.
Example 47 includes the device of any of Examples 45 to 46, wherein the first output data state and the second output data state are received at a bidirectional gated recurrent unit (GRU) layer of the decoder network.
Example 48 includes the device of any of Examples 45 to 47, wherein the first bitrate is less than the second bitrate.
Example 49 includes the device of any of Examples 45 to 48, wherein the first output data state and the second output data state are received by an attention mechanism.
Example 50 includes the device of any of Examples 45 to 49, wherein the attention mechanism comprises a transformer.
Example 51 includes a method comprising: receiving, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame, the first encoded frame comprising a first output data state generated from a first bottleneck of a feedback autoencoder, the second encoded frame comprising a second output data state generated from a second bottleneck of the feedback autoencoder, wherein the first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate; generating a reconstructed first data sample based on the first output data state, the reconstructed first data sample corresponding to a first data sample in a time series of data samples of a portion of an audio data stream; and generating a reconstructed second data sample based on the second output data state, the reconstructed second data sample corresponding to a second data sample in the time series of data samples.
Example 52 includes the method of Example 51, wherein the first output data state is distinct from the second output data state.
Example 53 includes the method of any of Examples 51 to 52, wherein the first output data state and the second output data state are received at a bidirectional gated recurrent unit (GRU) layer of the decoder network.
Example 54 includes the method of any of Examples 51 to 53, wherein the first bitrate is less than the second bitrate.
Example 55 includes the method of any of Examples 51 to 54, wherein the first output data state and the second output data state are received by an attention mechanism.
Example 56 includes the method of any of Examples 51 to 55, wherein the attention mechanism comprises a transformer.
Example 57 includes a non-transitory computer-readable medium storing instructions executable by one or more processors to: receive, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame, the first encoded frame comprising a first output data state generated from a first bottleneck of a feedback autoencoder, the second encoded frame comprising a second output data state generated from a second bottleneck of the feedback autoencoder, wherein the first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate; generate a reconstructed first data sample based on the first output data state, the reconstructed first data sample corresponding to a first data sample in a time series of data samples of a portion of an audio data stream; and generate a reconstructed second data sample based on the second output data state, the reconstructed second data sample corresponding to a second data sample in the time series of data samples.
Example 58 includes the non-transitory computer-readable medium of Example 57, wherein the first output data state is distinct from the second output data state.
Example 59 includes the non-transitory computer-readable medium of any of Examples 57 to 58, wherein the first output data state and the second output data state are received at a bidirectional gated recurrent unit (GRU) layer of the decoder network.
Example 60 includes the non-transitory computer-readable medium of any of Examples 57 to 59, wherein the first bitrate is less than the second bitrate.
Example 61 includes the non-transitory computer-readable medium of any of Examples 57 to 60, wherein the first output data state and the second output data state are received by an attention mechanism.
Example 62 includes the non-transitory computer-readable medium of any of Examples 57 to 61, wherein the attention mechanism comprises a transformer.
Example 63 includes an apparatus comprising: means for receiving, at a decoder network, a packet that includes a first encoded frame bundled with a second encoded frame, the first encoded frame comprising a first output data state generated from a first bottleneck of a feedback autoencoder, the second encoded frame comprising a second output data state generated from a second bottleneck of the feedback autoencoder, wherein the first bottleneck is associated with a first bitrate and the second bottleneck is associated with a second bitrate; means for generating a reconstructed first data sample based on the first output data state, the reconstructed first data sample corresponding to a first data sample in a time series of data samples of a portion of an audio data stream; and means for generating a reconstructed second data sample based on the second output data state, the reconstructed second data sample corresponding to a second data sample in the time series of data samples.
Example 64 includes the apparatus of Example 63, wherein the first output data state is distinct from the second output data state.
Example 65 includes the apparatus of any of Examples 63 to 64, wherein the first output data state and the second output data state are received at a bidirectional gated recurrent unit (GRU) layer of the decoder network.
Example 66 includes the apparatus of any of Examples 63 to 65, wherein the first bitrate is less than the second bitrate.
Example 67 includes the apparatus of any of Examples 63 to 66, wherein the first output data state and the second output data state are received by an attention mechanism.
Example 68 includes the apparatus of any of Examples 63 to 67, wherein the attention mechanism comprises a transformer.
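To make the bit-allocation relationship recited in Examples 5, 6, 16, 17, 27, 28, 38, and 39 concrete: an index into a codebook of size K can be transmitted in ceil(log2 K) bits, so a bottleneck with a smaller codebook spends fewer bits per latent code. The following minimal Python sketch uses illustrative codebook sizes (64 and 1024 entries) that are assumptions of this sketch, not sizes taken from the disclosure.

```python
# Hedged sketch: a latent code drawn from a smaller codebook can be indexed
# with fewer bits, as in the codebook-size examples above. Sizes illustrative.
import math
import numpy as np

rng = np.random.default_rng(0)
small_codebook = rng.standard_normal((64, 32))     # first bottleneck: 64 entries
large_codebook = rng.standard_normal((1024, 32))   # second bottleneck: 1024 entries

def quantize(latent, codebook):
    """Return the nearest codebook index and the bits needed to transmit it."""
    index = int(np.argmin(np.linalg.norm(codebook - latent, axis=1)))
    return index, math.ceil(math.log2(len(codebook)))

latent = rng.standard_normal(32)
_, low_bits = quantize(latent, small_codebook)     # 6 bits per latent code
_, high_bits = quantize(latent, large_codebook)    # 10 bits per latent code
print(low_bits, high_bits)                         # 6 10
```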
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor-executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application; such implementation decisions are not to be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed implementations is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein and is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.
Foreign Application Priority Data:
Number: 20220100243 | Date: Mar. 2022 | Country: GR | Kind: national

PCT Filing Information:
Filing Document: PCT/US23/61086 | Filing Date: Jan. 23, 2023 | Country: WO