This invention relates to adaptive real-time video streaming, particularly methods and systems using deep reinforcement learning for adaptive bitrate selection.
In real-time video systems, such as video conferencing, cloud gaming, and virtual reality (VR), videos are encoded at the sender and streamed over the Internet to the receiver. Since network conditions across the Internet change dynamically and vary noticeably among different end users, an adaptive bitrate (ABR) algorithm is usually deployed in such systems to adapt the sending bitrate and combat network dynamics.
Widely deployed ABR algorithms include, for example, GCC (Google Congestion Control) and BBR (Bottleneck Bandwidth and Round-trip propagation time). These existing ABR algorithms typically include congestion detection, slow start, and quick recovery.
Due to the tight millisecond-level latency restriction for real-time video streaming, HTTP-based video streaming systems (such as the HTTP Live Streaming (“HLS”) and Dynamic Adaptive Streaming over HTTP (“DASH”) protocols), which operate at chunk-level granularity, are not suited for real-time video streaming, because they need to prepare video segments in advance, which introduces at least another layer of delay. For this reason, the conventional buffer-based, rate-based, or even learning-based ABR algorithms for HTTP protocols are not suited for low-delay/real-time video scenarios such as cloud gaming and video conferencing.
In conventional real-time streaming systems, after a video session is established, the streaming server (video server) first streams compressed video to a service gateway, which forwards the video stream to a client. The client periodically returns its playback status and current network Quality of Service (QoS) parameters to the service gateway. Using an existing adaptive bitrate (ABR) algorithm, the service gateway outputs a target bitrate to the streaming server for bitrate adaptation. The existing ABR algorithms use a variety of different inputs (e.g., playback status and network QoS parameters) to change the bitrate for future streaming. In this type of system, the client plays back the video frames instantly upon receipt to guarantee real-time interaction. To meet the low-latency requirement, the service gateway in conventional real-time streaming systems requests the streaming server to force an Instantaneous Decoding Refresh (IDR) or Random Access frame to restart a new group of pictures (GoP) over TCP if no new frames are received over a certain time period. The policies produced by ABR algorithms heavily influence video streaming performance. For real-time interaction scenarios, the user's quality of experience (QoE) depends greatly on the video streaming performance.
The existing ABR algorithms face multiple challenges. For example, only network QoS parameters are considered in these algorithms to derive policies, which may fail to produce consistent user QoE. As an example, Google Congestion Control (GCC) only takes delay and packet loss rate into consideration to perform congestion control and bitrate adaptation, without considering other relevant factors such as playback status and the user's QoE requirements.
Existing ABR algorithms also have no knowledge of the underlying network, so they are mainly heuristic algorithms and have difficulty determining the optimal bitrate to avoid frame freezing and improve video quality. When there is no congestion, the bitrate is increased conservatively to achieve higher video quality. Once the bitrate is overly adjusted, the performance decreases sharply from its peak. The bitrate then drops to a significantly lower level, and another round of conservative bitrate growth is triggered when network conditions improve. Since the existing algorithms (such as GCC) have no knowledge of the underlying network, they tend to be trapped in this vicious circle of bitrate adaptation, resulting in low QoE and network underutilization.
The Deep Reinforcement Learning (DRL)-based ABR algorithm discussed herein overcomes these constraints of the conventional ABR algorithms; improves bitrate adaptation, user QoE, and network utilization; and draws on DRL techniques that have proven advantageous in fields such as information theory, game theory, and automatic control, with applications such as AlphaGo and cloud video gaming.
The present invention relates to a deep reinforcement learning-based ABR algorithm, hereinafter referred to as Adaptive Real-time Streaming (ARS). ARS uses deep reinforcement learning tools to observe the features of the underlying network in real time. ARS learns to make subsequent ABR decisions automatically by observing the performance of past decisions, without using any pre-programmed control rules about the operating environment or heuristically probing the network. In one embodiment, the ARS system utilizes TCP or UDP to conduct an end-to-end process of streaming a real-time video (for example, gaming video). The ARS system includes a Streaming Server, a Forwarder, and a user end. The ARS system also includes an ARS Controller, which receives network/playback status and performs the ABR algorithm. The user end sends the playback status to the Forwarder and the ARS Controller periodically. The ARS Controller in the service gateway uses ARS to determine the bitrate for the next chunk of video data and outputs the target bitrate to the streaming server for bitrate adaptation.
In one embodiment, the ARS system using UDP also includes a Network Address Translation (NAT) module, which performs UDP address traversal in the phase of session establishment between the user end and the Forwarder.
In one embodiment, the ARS system using TCP also includes a Frame Buffer to manage the real-time video stream sent to the user end through the Forwarder.
In one embodiment, the ARS system employs reinforcement learning tools to train and optimize the ABR algorithm.
In one embodiment, each user end serves as an agent, which takes an action At (i.e., streaming at a certain bitrate) in the environment.
In another embodiment, two categories of states St, namely the network QoS and the playback status, are provided to the agent from the environment. For example, the network QoS parameters comprise the round-trip time (RTT), the received bitrate, the packet loss rate, the retransmission packet count, and so on. The playback status includes the received frame rate, the maximum received frame interval, and the minimum received frame interval.
In another embodiment, the environment provides a reward Rt to the agent, based on which the agent decides the next action At+1 so as to keep increasing the reward Rt. The action frequency is confined to once per second or per GoP to enable fast reaction to network changes. This is supported by the fact that video encoding operates in real time in real-time video streaming systems. The decision is made following a control policy, which is generated using a neural network. Hence, ARS does not need a network estimator, which is normally included in conventional video streaming systems to estimate the bitrate for the next moment using ABR algorithms. ARS instead maps “raw” observations (i.e., states) through the neural network to perform bitrate adaptation for the next ground (“ground” denotes a bitrate adaptation event occurring once per second or per GoP).
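As a minimal illustration of this estimator-free mapping, the following Python sketch shows how a history of raw state observations could be fed to a policy network that outputs a probability distribution over candidate bitrates; the `PolicyNetwork` object, its `predict` method, and the bitrate ladder below are hypothetical placeholders rather than components prescribed by ARS.

```python
import numpy as np

# Hypothetical candidate bitrate ladder in kbps (illustrative values only).
BITRATES_KBPS = [300, 750, 1200, 1850, 2850, 4300]

def select_next_bitrate(policy_network, state_history):
    """Map raw state observations directly to a bitrate for the next ground.

    `policy_network` is assumed to expose a `predict` method returning the
    policy pi(S_t, A_t) as a probability vector over the candidate bitrates;
    no separate network estimator is involved.
    """
    probs = policy_network.predict(state_history)          # shape: (len(BITRATES_KBPS),)
    action = np.random.choice(len(BITRATES_KBPS), p=probs)  # sample from the policy
    return BITRATES_KBPS[action]
```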
In a further embodiment, ARS balances a variety of QoE goals to determine the reward Rt, such as maximizing video quality (i.e., using the highest average bitrate), minimizing video freezing events (i.e., minimizing scenarios where the received frame rate is less than the sending frame rate), maintaining video quality smoothness (i.e., avoiding frequent bitrate fluctuations), and minimizing video latency (i.e., achieving the minimum interactive delay).
In another embodiment, to accelerate the training speed, ARS enables multiple agents to train the ABR algorithms concurrently.
In another embodiment, ARS supports training of the ABR algorithms both online and offline.
In a further embodiment, to further accelerate the training speed, ABR algorithms are trained offline in a simulation environment that closely models the network dynamics of video streaming with real client applications.
In another embodiment, ARS supports a variety of different training algorithms (such as DQN (Deep Q-learning Network), REINFORCE, Q-learning and A3C (Asynchronous advantage actor-critic)) in the abstract reinforcement learning framework.
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
In the ARS system using UDP, as shown in
As shown in
Two categories of states St including the network QoS (such as the round-trip time (RTT), the received bitrate, the packet loss rate, the retransmission packet count) and the playback status (such as received frame rate, maximum received frame interval, and minimum received frame interval) are provided to the agent 301/302/303/304 by the environment 321.
Specifically, RTT is calculated by combining the transmission delay (which is derived by dividing the current sending bitrate by the current throughput), the queuing delay (which is derived by considering lost packet retransmission), the propagation delay, and the processing delay. The packet loss rate is calculated during video packet transmission according to the frame size and the current throughput. Due to packet loss, retransmission packets are repeatedly sent from the Streaming Server to the user end until they are received or overdue, which is also counted by ARS. The received frame rate and the maximum/minimum frame interval are inferred based on the packet receiving condition. These status observations are further normalized to the range [−1, 1] to speed up the training process.
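A minimal sketch of this observation pipeline is shown below; the field names and normalization bounds are illustrative assumptions and are not values specified by ARS.

```python
import numpy as np

def normalize(value, max_abs):
    """Clip and scale an observation into [-1, 1]; max_abs is an assumed bound."""
    return float(np.clip(value / max_abs, -1.0, 1.0))

def build_state(qos, playback):
    """Assemble one normalized state observation from raw QoS/playback readings.

    The dictionary keys and bounds below are hypothetical placeholders chosen
    for illustration only.
    """
    return np.array([
        normalize(qos["rtt_ms"], 1000.0),                    # round-trip time
        normalize(qos["recv_bitrate_kbps"], 10000.0),        # received bitrate
        normalize(qos["loss_rate"], 1.0),                    # packet loss rate
        normalize(qos["retrans_count"], 100.0),              # retransmitted packets
        normalize(playback["recv_fps"], 60.0),               # received frame rate
        normalize(playback["max_frame_interval_ms"], 500.0), # max frame interval
        normalize(playback["min_frame_interval_ms"], 500.0), # min frame interval
    ])
```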
The environment 321 also provides a reward Rt to the agent, based on which the agent 301/302/303/304 decides the next action At+1 at time t+1, so as to keep increasing the reward. ARS balances a variety of QoE goals to determine the reward Rt. As an example, Equation (1) below represents an ARS QoE metric considering the past N grounds for a real-time video streaming session.
QoE = \sum_{t=1}^{N} \alpha_t\, q(r_t) \;-\; \mu \sum_{t=1}^{N} \alpha_t F_t \;-\; k \sum_{t=1}^{N} \alpha_t \left| q(r_t) - q(r_{t-1}) \right| \;-\; \iota \sum_{t=1}^{N} \alpha_t L_t \qquad (1)
In Equation (1), within the first term, rt represents the sending bitrate in the t-th ground and q(rt) maps that sending bitrate to the quality perceived by a user. The mapping q(·) could be a linear, logarithmic, or other function. In the second term, Ft represents the freezing time that results from streaming the video in the t-th ground at bitrate rt. The third term penalizes changes in video quality in favor of smoothness, and the final term penalizes the end-to-end interaction latency Lt at bitrate rt. In other words, the QoE or reward can be computed by subtracting the freezing penalty, the smoothness penalty, and the latency penalty from the bitrate utility. μ, k, and ι denote the freezing, smoothness, and latency penalty factors, respectively. The parameter αt is introduced as a temporal significance factor to weight the QoE factors in the time domain for reward computation.
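The reward of Equation (1) can be computed directly from the per-ground statistics. The sketch below assumes a logarithmic quality mapping q(·) and unit penalty factors purely for illustration; it also starts the smoothness sum at the second ground since r_0 is unavailable in this example.

```python
import math

def qoe_reward(bitrates, freezes, latencies, alphas, mu=1.0, k=1.0, iota=1.0):
    """Compute the QoE reward of Equation (1) over the past N grounds.

    bitrates[t], freezes[t], and latencies[t] play the roles of r_t, F_t, and
    L_t; alphas[t] is the temporal significance factor. The logarithmic q() and
    the default penalty factors mu, k, iota are illustrative choices only.
    """
    q = lambda r: math.log(r)  # one possible perceptual quality mapping
    n = len(bitrates)
    utility    = sum(alphas[t] * q(bitrates[t]) for t in range(n))
    freezing   = mu   * sum(alphas[t] * freezes[t] for t in range(n))
    # Smoothness penalty: quality changes between consecutive grounds
    # (starts at the second ground because r_0 is not available here).
    smoothness = k    * sum(alphas[t] * abs(q(bitrates[t]) - q(bitrates[t - 1]))
                            for t in range(1, n))
    latency    = iota * sum(alphas[t] * latencies[t] for t in range(n))
    return utility - freezing - smoothness - latency
```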
In another embodiment, apart from the regular agents 301/302/303, a central agent 304 is included to handle the tuples (St, At, Rt) received from the regular agents and to compute updated network parameters via a gradient descent method. By jointly considering the gradients produced by these regular agents in the central agent 304, for example by averaging them, the oscillation of the reward curve over epochs decreases, allowing the control policy to converge faster. With the resulting gradient, the parameters or weights of the neural network are updated and then passed to the regular agents 301/302/303 to update their own networks.
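A minimal sketch of this central aggregation step is shown below; it assumes the gradients are exchanged as flattened NumPy vectors and that a plain gradient-descent step is applied, whereas the actual optimizer and parameter layout are implementation choices.

```python
import numpy as np

def central_update(param_vector, agent_gradients, learning_rate=1e-4):
    """Aggregate gradients from the regular agents and apply one descent step.

    `agent_gradients` is a list of flattened gradient vectors, one per regular
    agent; averaging them jointly reduces the oscillation of the reward curve.
    The flattened representation and the learning rate are assumptions of this
    sketch, not values fixed by ARS.
    """
    mean_grad = np.mean(np.stack(agent_gradients), axis=0)
    updated = param_vector - learning_rate * mean_grad
    # The updated parameters would then be broadcast back to the regular
    # agents so that they can refresh their own networks.
    return updated
```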
In a further embodiment, ARS supports training of the ABR algorithms both online and offline. In the online scenario, training could take place using actual video streaming user ends. Using a pre-trained offline model as a prior, ARS enables the ABR algorithms to be updated periodically as new actual data arrives, even after the algorithms have been deployed in the real environment. By collecting real environment statuses, ARS can more effectively train a specific ABR algorithm that best suits the user's actual network conditions. Each specific ABR algorithm could be individually trained on its underlying network and dedicated to that underlying network to improve the accuracy and performance of ARS.
Normally, ABR algorithms can only be trained and updated after all video packets have been completely streamed, resulting in a very slow training speed. Training a general ABR algorithm applicable to all users calls for more training work on diverse types of network environments and more training samples and time. In addition, it incurs extra computational overhead for the devices in which ARS is deployed, whether at the server side or the user end side. To overcome these constraints, in one embodiment, ABR algorithms are trained offline in a simulation environment that closely models the dynamics of video streaming with real client applications, further accelerating the training speed. The training set used for simulation is obtained by simulating real video streaming processes to obtain state observations (i.e., the network QoS and the playback status) over various patterns of network environment. For example, a corpus of network throughput traces is first created by combining several public bandwidth datasets (i.e., FCC, Norway, 3G/HSDPA, and 4G/Belgium), and these network throughput traces are then used to simulate the actual network conditions. The network throughput traces are down-sampled to augment the sample size. To make the simulation faithful to the actual environment, ARS uses real video sequences encoded at diverse fine-grained bitrates. By streaming these videos over simulated networks whose throughput traces closely follow the actual network environment, the network QoS parameters and playback status can be obtained.
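The following sketch outlines one possible trace-driven simulation loop under heavily simplified assumptions; the trace format and the crude loss/freeze model below are illustrative stand-ins rather than the ARS simulator itself.

```python
class TraceDrivenEnv:
    """Minimal trace-driven streaming simulator (illustrative only).

    `trace` is a list of (duration_s, bandwidth_kbps) tuples, which could be
    derived from public bandwidth datasets such as FCC, Norway, or HSDPA. The
    freeze and loss approximations below are deliberately simplified.
    """

    def __init__(self, trace):
        self.trace = trace
        self.idx = 0

    def step(self, send_bitrate_kbps):
        """Advance one ground at the given sending bitrate and report status."""
        duration_s, bandwidth_kbps = self.trace[self.idx]
        self.idx = (self.idx + 1) % len(self.trace)

        # Crude approximations of QoS/playback status for one ground.
        recv_bitrate = min(send_bitrate_kbps, bandwidth_kbps)
        loss_rate = max(0.0, 1.0 - bandwidth_kbps / send_bitrate_kbps)
        freeze_s = duration_s * loss_rate  # time spent waiting for frames
        return {"recv_bitrate_kbps": recv_bitrate,
                "loss_rate": loss_rate,
                "freeze_s": freeze_s}
```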
In another embodiment, ARS also supports a variety of different training algorithms to train the agent in an abstract reinforcement learning framework. Taking A3C as an example, which is a state-of-the-art actor-critic method involving training two neural networks, the basic training algorithm of ARS using an A3C network in the agent is illustrated in
In a further embodiment, the agent selects actions based on a policy, defined as a probability distribution over actions π: π(St, At)→[0,1], where π(St, At) is the probability that action At is taken in state St. ARS can use a neural network (NN), including a convolutional neural network (CNN) or a recurrent neural network (RNN), to generate the policy with a manageable number of adjustable parameters θ as the policy parameters. The actor network 412 in
An example RNN framework used in ARS comprises five layers: an Input layer 401, where the states are reshaped with the temporal components of each state type serving as another dimension; a First RNN layer 421/424, where the tensor from the previous layer is passed to a GRU network whose time step equals the number of past grounds considered, and all the sequential results are passed to the next layer; a Second RNN layer 422/425, where the sequential tensor from the previous layer is passed to another GRU network and only the latest results are passed to the next layer; a Fully connected layer 423/426, where the tensor from the previous layer is passed into a dense layer with full connection; and an Output layer 424/427, a fully connected layer, where the tensor from the previous layer is reshaped to a new tensor with dimension (1, ActionDimension) using the softmax activation function 427 in the actor network 412, or to a tensor with dimension (1, 1) using the linear activation function 424 in the critic network 411.
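One way such a five-layer GRU network could be assembled is sketched below using the tf.keras API; the use of tf.keras and the layer widths of 64 units are assumptions of this sketch, not values fixed by ARS.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_network(num_state_types, num_past_grounds, action_dim, actor=True):
    """Build the five-layer GRU network described above (illustrative widths)."""
    # Input layer: one row per past ground, one column per state type.
    inputs = layers.Input(shape=(num_past_grounds, num_state_types))

    # First RNN layer: GRU over the past grounds, all sequential outputs kept.
    x = layers.GRU(64, return_sequences=True)(inputs)

    # Second RNN layer: GRU that passes on only the latest result.
    x = layers.GRU(64, return_sequences=False)(x)

    # Fully connected layer.
    x = layers.Dense(64, activation="relu")(x)

    # Output layer: softmax over actions for the actor, a scalar value for the critic.
    if actor:
        outputs = layers.Dense(action_dim, activation="softmax")(x)
    else:
        outputs = layers.Dense(1, activation="linear")(x)
    return Model(inputs, outputs)
```

For example, `build_network(7, 8, 6, actor=True)` would yield an actor observing eight past grounds of seven state types and emitting a softmax distribution over six candidate bitrates, while `actor=False` produces the corresponding critic.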
After applying each action, the simulated environment provides the agent (such as the agents 301/302/303/304 in
Once an ABR algorithm has been trained and optimized, it can be deployed in an ARS system. Besides being implemented in the service gateway (which can be implemented in any suitable device, such as an edge server) as shown in
By using DRL-based ARS to handle ABR control in real-time video streaming systems, ARS optimizes its policy for different network characteristics and QoE metrics directly from user QoE, without relying on fixed heuristics or inaccurate network models or patterns. By considering both network QoS factors and playback statuses with DRL technology, ARS achieves higher performance in terms of user QoE compared to existing closed-form ABR algorithms.
It should be noted that one or more of the methods described herein may be implemented in and/or performed using any DRL algorithm, such as DQN (Deep Q-learning Network), REINFORCE, Q-learning, or A3C (Asynchronous advantage actor-critic). The neural network (NN) used in the ARS systems is not limited to the forms and operations discussed herein.
The electronic device 500 includes a processor 520 that controls operation of the electronic device 500. The processor 520 may also be referred to as a CPU. Memory 510, which may include read-only memory (ROM), random access memory (RAM), or any type of device that may store information, provides instructions 515a (e.g., executable instructions) and data 525a to the processor 520. A portion of the memory 510 may also include non-volatile random access memory (NVRAM). The memory 510 may be in electronic communication with the processor 520.
Instructions 515b and data 525b may also reside in the processor 520. Instructions 515b and data 525b loaded into the processor 520 may also include instructions 515a and/or data 525a from memory 510 that were loaded for execution or processing by the processor 520. The instructions 515b may be executed by the processor 520 to implement the systems and methods disclosed herein.
The electronic device 500 may include one or more communication interfaces 530 for communicating with other electronic devices. The communication interfaces 530 may be based on wired communication technology, wireless communication technology, or both. Examples of communication interfaces 530 include a serial port, a parallel port, a Universal Serial Bus (USB), an Ethernet adapter, an IEEE 1394 bus interface, a small computer system interface (SCSI) bus interface, an infrared (IR) communication port, a Bluetooth wireless communication adapter, a wireless transceiver in accordance with 3rd Generation Partnership Project (3GPP) specifications and so forth.
The electronic device 500 may include one or more output devices 550 and one or more input devices 540. Examples of output devices 550 include a speaker, printer, etc. One type of output device that may be included in an electronic device 500 is a display device 560. Display devices 560 used with configurations disclosed herein may utilize any suitable image projection technology, such as a cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), gas plasma, electroluminescence or the like. A display controller 565 may be provided for converting data stored in the memory 510 into text, graphics, and/or moving images (as appropriate) shown on the display 560. Examples of input devices 540 include a keyboard, mouse, microphone, remote control device, button, joystick, trackball, touchpad, touchscreen, lightpen, etc.
The various components of the electronic device 500 are coupled together by a bus system 570, which may include a power bus, a control signal bus and a status signal bus, in addition to a data bus. However, for the sake of clarity, the various buses are illustrated in
The term “computer-readable medium” refers to any available medium that can be accessed by a computer or a processor. The term “computer-readable medium,” as used herein, may denote a computer- and/or processor-readable medium that is non-transitory and tangible. By way of example, and not limitation, a computer-readable or processor-readable medium may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer or processor. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray® disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
It should be noted that one or more of the methods described herein may be implemented in and/or performed using hardware. For example, one or more of the methods or approaches described herein may be implemented in and/or realized using a chipset, an application-specific integrated circuit (ASIC), a large-scale integrated circuit (LSI) or integrated circuit, etc.
Each of the methods disclosed herein comprises one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another and/or combined into a single step without departing from the scope of the claims. In other words, unless a specific order of steps or actions is required for proper operation of the method that is being described, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various modifications, changes and variations may be made in the arrangement, operation and details of the systems, methods, and apparatus described herein without departing from the scope of the claims.
This application claims priority to the following patent application, which is hereby incorporated by reference in its entirety for all purposes: U.S. Patent Provisional Application No. 62/769,534, filed on Nov. 19, 2018.