This invention relates to a method and apparatus for predictive coding of 360-degree video signals.
(Note: This application references several different publications as indicated throughout the specification by one or more reference numbers within brackets, e.g., [x]. A list of these different publications ordered according to these reference numbers can be found below in the section entitled “References.” Each of these publications is incorporated by reference herein.)
The spherical video signal, or 360° (360-degree) video signal, is video captured by omnidirectional or multiple cameras, comprising visual information on a sphere that encloses the viewer. This enables the user to view in any desired direction. Spherical video is emerging as the next important multimedia format, revolutionizing different areas including social media, gaming, business, health, and education, as well as numerous other virtual reality and augmented reality applications. In many applications, such as robotics, navigation systems, entertainment, and gaming, the dominant component of the motion in the spherical video is due to camera motion, and often specifically camera translation. Spherical video dominated by camera motion is often the application scenario envisioned by large-scale multimedia distributors such as Google™/YouTube™ and Facebook™; 360° video-based game developers such as Microsoft™ and Facebook™; and other broadcast providers such as ESPN™ and BBC™. Given the significance of these applications, there is considerable need for compression tools tailored to this scenario.
With increased field of view, 360° video applications require acquisition at higher resolution than standard 2D video applications. Given the enormous amount of data consumed by spherical video, practical applications critically depend on powerful compression algorithms that are tailored to the characteristics of this signal. In the absence of codecs that are tailored to spherical video, prevalent approaches simply project the spherical video onto a plane or set of planes via a projection format, such as Equirectangular Projection [1] or Equiangular Cubemap Projection [2], and then use standard video codecs to compress the projected video. A key observation is that a uniform sampling in the projected domain induces a varying sampling density on the sphere, which further varies across different projection formats. A brief review of some popular projection formats is provided next:
Equirectangular Projection (ERP): This format is obtained by taking the latitude and longitude coordinates of a point on the sphere as its 2D Cartesian coordinates on the plane. The ERP projection is illustrated in
Cubemap Projection: In standard cubemap projection, points are radially projected onto faces of a cube enclosing the sphere as illustrated in
The Joint Video Exploration Team (JVET) document [3] provides a more detailed description of these formats including procedures to map back and forth between the sphere and the plane for each projection format.
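By way of illustration only, the sphere-to-plane correspondences for ERP and for a single cubemap face can be sketched as below. This is a minimal Python sketch under one assumed axis convention, not the 360Lib implementation referenced in [3]; conventions and face layouts vary between tools.

```python
import numpy as np

def erp_to_sphere(u, v, width, height):
    """Map an ERP pixel (u, v) to a unit vector on the sphere."""
    lon = (u / width - 0.5) * 2.0 * np.pi        # longitude in [-pi, pi)
    lat = (0.5 - v / height) * np.pi             # latitude in [-pi/2, pi/2]
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def sphere_to_erp(p, width, height):
    """Map a unit vector on the sphere back to ERP pixel coordinates."""
    lon = np.arctan2(p[1], p[0])
    lat = np.arcsin(np.clip(p[2], -1.0, 1.0))
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return u, v

def sphere_to_cube_front_face(p):
    """Radially project a sphere point onto the cube face at x = 1 (front face)."""
    assert p[0] > 0.0, "point must lie on the front hemisphere"
    return p[1] / p[0], p[2] / p[0]              # face coordinates in [-1, 1] x [-1, 1]
```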
Modern video coders such as H.264 [4] and HEVC [5] use motion compensated prediction, or “inter-prediction,” to exploit temporal redundancies, resulting in significant compression gains. Standard video codecs use a (piecewise) translational motion model for inter prediction, while some nonstandard approaches have considered extensions to affine motion models that may be able to handle more complex motion, at a potentially significant cost in side-information (see recent approaches in [6, 7]). Still, in 360° video, the amount of warping induced by the projection varies for different regions of the sphere. In the current scenario involving camera translation, this results in complex non-linear motion of objects in the projected video, for which both the translational motion model and its affine extension are ineffective. Moreover, a motion vector in the projected domain does not have any meaningful physical interpretation. Thus, a new motion compensated prediction technique that is tailored to the setting of 360° video signals with camera translation is needed.
Notable attempts to meet the challenges in motion compensation for spherical video include:
Translation in 3D space: Li et al. proposed a 3D translational motion model for the cubemap projection [8]. In this approach, the centers of both the current coding block and the reference block are mapped to the sphere, and the 3D displacement between these two vectors is calculated. The remaining pixels in the current coding block are then also mapped to the sphere and all are translated by the 3D displacement vector calculated for the centers. However, after displacement, only the block center is guaranteed to be on the sphere, which necessitates an additional step of projecting the displaced pixels onto the sphere, which in turn causes distortion. This method does not exploit properties of perceived motion on the sphere due to camera motion.
Rotation on the sphere: Vishwanath et al. introduced a rotation motion model for spherical video in [9]. A block of pixels on the projected plane is mapped back to the sphere. The block is then rotated on the sphere about a specified axis, and mapped back to the reference frame in the projected domain. Since rotation is a unitary operation, it preserves the shape and size of the object on the sphere. This approach significantly outperforms its predecessors, but still does not account for the nature of the perceived motion on the sphere due to camera translation.
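For illustration, the core operation of such a rotation model, rotating pixel positions on the sphere about a specified axis, can be expressed with Rodrigues' rotation formula. The sketch below is a generic implementation under assumed conventions, not the code of the cited work; the axis and angle would be supplied by the motion search.

```python
import numpy as np

def rotate_on_sphere(points, axis, angle):
    """Rotate unit vectors 'points' (N x 3 array) about 'axis' by 'angle' radians."""
    k = axis / np.linalg.norm(axis)
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # Rodrigues' formula: p' = p*cos(a) + (k x p)*sin(a) + k*(k . p)*(1 - cos(a))
    return (points * cos_a
            + np.cross(k, points) * sin_a
            + np.outer(points @ k, k) * (1.0 - cos_a))
```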
Tosic et al. propose in [10] a multi-resolution motion estimation algorithm to match omnidirectional images while operating on the sphere. However, their motion model is largely equivalent to operating in the ERP projected domain, and suffers from the suboptimalities associated with this projection.
A closely related problem is that of motion-compensated prediction in video captured with fish-eye cameras, where the projection to a plane also causes significant warping. Interesting approaches have been proposed to address this problem in [11, 12], but these do not apply to motion under different projection geometries for 360° videos.
The method in [13] processes the motion side information produced by a standard codec operating on video after projection by ERP, to identify two static (low motion) regions that are antipodal on the sphere. It then rotates the sphere so as to align the polar axis with these static regions and re-performs coding using ERP for the new orientation. This method is restricted to ERP. Moreover, it does not attempt to exploit properties of camera motion and is hence suboptimal for the problem at hand. It was shown to offer benefits, but for a small subset of the tested video sequences.
The suboptimal motion compensation procedures employed by the standard approach and other recent approaches, and specifically the inability to fully exploit the properties of perceived motion on the sphere due to camera motion, strongly motivate the present invention whose objectives are to devise and exploit new and effective motion models tailored to the critical needs of spherical video coding dominated by camera motion.
The present invention provides an effective solution for motion estimation and compensation to enable substantially better compression of spherical videos dominated by camera motion. Standard video coders perform motion compensated prediction in the projected domain and suffer from considerable suboptimality. Other state-of-the-art technologies fail to exploit the nature of motion perceived on the sphere due to camera motion. The invention introduces appropriately designed geodesic translation of pixels on the sphere, which captures the perceived motion of objects on the sphere due to either pure camera motion or combined camera and object motion. The invention builds on the realization that, given camera and object motions, an object's pixels are perceived to move on the sphere along certain geodesics, which all intersect at the point where a line determined by the camera and object velocity vectors intersects the sphere. Preferred embodiments of the invention comprise calculation of the optimal line (determined by camera and object motion) and thereby identification of the geodesics along which object pixels are expected to translate on the sphere. In one embodiment, this line is chosen to lie along the camera velocity vector. In another embodiment, it is chosen to lie along a vector obtained by subtracting an object's velocity vector from the velocity vector of the camera. The video codec's motion vectors, in a preferred embodiment of the present invention, specify the amount of translation to apply to pixels along their geodesics. It is important to note that camera or object translation causes perspective distortions. For example, as the distance between the object and the camera decreases, the object appears to grow larger. The invention, with its geodesic motion model on the sphere, perfectly accounts for these perspective distortions. Moreover, in the case of pure camera motion with stationary surroundings, there is a unique set of geodesics along which pixels can move, as determined by the camera velocity vector, and the video codec only requires 1D motion vectors to specify block motion to the decoder, resulting in significant bit-rate savings in side-information. An additional significant benefit is that the invention performs motion compensation on the sphere, regardless of the projection geometry in use, which makes it universally applicable to all current projection geometries, as well as any that may be discovered in the future. Experimental results demonstrate that the preferred embodiments of the invention achieve considerable performance gains over prevalent motion models, across different projection geometries and a variety of spherical video sequences.
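As a purely illustrative sketch of the axis selection described above, the helper below assumes a unit sphere centred at the camera, so that the line along the (relative) velocity meets the sphere at two antipodal points; the function name and interface are hypothetical.

```python
import numpy as np

def geodesic_axis_points(camera_velocity, object_velocity=None):
    """Return the two antipodal sphere points anchoring the geodesics (hypothetical helper)."""
    v = np.asarray(camera_velocity, dtype=float)
    if object_velocity is not None:
        # Combined camera and object motion: use the relative velocity.
        v = v - np.asarray(object_velocity, dtype=float)
    d = v / np.linalg.norm(v)
    return d, -d   # every geodesic used for translation passes through these two points
```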
The present invention provides an apparatus and method for processing a multimedia data stream, comprising: a codec for processing a multimedia data stream comprised of a plurality of frames, wherein the codec comprises an encoder, a decoder, or both an encoder and a decoder; the encoder processes the multimedia data stream to generate encoded data and the decoder processes the encoded data to reconstruct the multimedia data stream; the multimedia data stream contains a spherical video signal with dominant camera translation; and the encoder or the decoder comprises a motion-compensated predictor, which predicts a portion of a current frame from a corresponding portion of one or more reference frames, after motion compensation; and the motion compensation comprises translating the pixels along geodesics on the sphere, where the geodesics are the shortest paths on the sphere from the pixels to the two points where the sphere is intersected by a line determined by the motion of the camera and surrounding objects.
In one embodiment of the present invention, the line determined by the motion of the camera and surrounding objects is along the velocity vector of the camera.
In another embodiment, the line determined by the motion of the camera and surrounding objects is along a vector obtained by subtracting the velocity vector of a surrounding object from the velocity vector of the camera.
The line determined by the motion of the camera and surrounding objects varies from one portion of the current frame to another portion of the current frame.
The motion compensation further comprises rotation of the pixels about an axis.
In one embodiment of the present invention, the axis coincides with the line determined by the motion of the camera and surrounding objects.
In another embodiment of the present invention, the axis coincides with the axis of rotation of the camera.
The motion-compensated predictor further performs interpolation in the one or more reference frames to enable motion compensation at a sub-pixel resolution.
The encoded data comprises information, which specifies for a portion of the current frame, the amount of translation of pixels along the geodesics on the sphere.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is to be understood that other embodiments may be utilized, and structural changes may be made without departing from the scope of the present invention.
Overview
The practicality of many virtual reality applications involving spherical videos with camera motion critically depends on efficient compression techniques for this scenario. In this invention, the nature of perceived motion of objects on the sphere due to camera motion is carefully analyzed and exploited for effective compression. The present invention defines a geodesic translation model to characterize motion of objects on the sphere. The analysis shows that an object is perceived to move along geodesics that all intersect at the point where a line determined by the camera and object velocity vectors intersects the sphere. The preferred embodiment of an encoder according to the present invention thus finds this line that is determined by the camera and object velocity vectors, identifies the geodesics representing the shortest path on the sphere between an object pixel and the points where the line intersects the sphere, and signals to the decoder the amount of translation of a block along these geodesics. Motion compensation, according to the present invention, operates completely on the sphere and is thus agnostic to the projection geometry, making it applicable to any projection format.
Embodiments of the present invention have been evaluated after incorporation within existing coding platforms, such as HEVC. Experimental results show considerable gains, and provide strong evidence for their effectiveness.
Technical Description
Illustration of Perceived Motion of Objects on the Sphere:
Consider a simple case in which a viewer is at the origin, enclosed by a sphere as shown in
The above analysis provides valuable insights into the perceived motion of objects on the sphere, and implies that it can be effectively captured in terms of translation of pixels along their respective geodesics. These geodesics are determined by the two intersection points where a line that depends on the camera and object velocity vectors intersects the sphere. Thus, the present invention performs motion compensation on the sphere, where it captures the true nature of perceived object motion, in contrast to standard techniques that perform motion compensation in the projected plane, where motion is warped and lacks precise physical meaning. The overall paradigm for the motion compensated prediction consists of several steps, each employed in one or more embodiments of the invention, as illustrated in
Consider a portion of the current frame, typically a block of pixels, in the projected domain, which is to be predicted from the reference frame. As noted above, an example of such a block 400 in the ERP domain is illustrated in
One or more embodiments of the present invention employ spherical coordinates that are defined with respect to the line 401 determined by the camera and object motions. In the case of static objects, the line 401 is along the direction of the camera velocity vector ‘vc’ as illustrated in
The following governs how the video codec's motion vector side information is employed by one or more embodiments of the present invention. Given a motion vector (m, n), the first component ‘m’ is used to specify the change in azimuth and the second component ‘n’ is used to specify the change in elevation, wherein azimuth and elevation are defined with respect to line 401, which is determined by the physical motion of the camera and the object. A pixel with spherical coordinates (θij, φij) will be motion compensated to a point whose spherical coordinates (θ′ij, φ′ij) are given by:
θ′ij = θij + m·Δθ
φ′ij = φij + n·Δφ
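As a minimal worked instance of these equations, the sketch below applies a motion vector (m, n) to one pixel, where theta and phi denote azimuth and elevation measured with respect to line 401 and delta_theta, delta_phi correspond to the step sizes Δθ and Δφ; the function is illustrative only.

```python
def geodesic_translate(theta, phi, m, n, delta_theta, delta_phi):
    """Apply motion vector (m, n) to a pixel's spherical coordinates about the geodesic axis."""
    theta_new = theta + m * delta_theta   # change in azimuth (rotation about the axis)
    phi_new = phi + n * delta_phi         # change in elevation (translation along the geodesic)
    return theta_new, phi_new
```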
In another embodiment of the present invention, the motion compensation further comprises a step of rotating a block about an axis 406 by an angle ‘a’ as illustrated in
A preferred embodiment of this invention for motion compensation is summarized in the algorithm below; an illustrative code sketch follows the listed steps.
1. Map the block of pixels in the current coding unit onto the sphere.
2. Calculate spherical coordinates with respect to the line 401, as determined by the camera and object velocity vectors.
3. Identify the geodesics that intersect at the point where the line 401 intersects the sphere.
4. Translate pixels in the block along the geodesics identified in Step 3 and, in one or more embodiments, further perform rotation about axis 406.
5. After motion compensation on the sphere, map the pixels to the reference frame in the projected geometry.
6. Perform interpolation in the reference frame to obtain the required predicted values.
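The following Python sketch walks through Steps 1–6 for an ERP block, assuming a unit sphere centred at the camera and an axis direction along line 401. It reuses the erp_to_sphere and sphere_to_erp helpers from the earlier projection sketch, uses the polar angle about the axis in place of elevation, and substitutes bilinear interpolation for the Lanczos-2 filter used in the experiments; names and interfaces are illustrative, not the actual codec integration.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def sphere_coords_about_axis(p, axis_dir):
    """Spherical coordinates of unit vector p, with the polar angle measured from axis_dir."""
    z = unit(axis_dir)
    # Build an orthonormal frame (x, y, z) whose z-axis is the geodesic axis (line 401).
    helper = np.array([1.0, 0.0, 0.0]) if abs(z[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = unit(np.cross(helper, z))
    y = np.cross(z, x)
    theta = np.arctan2(p @ y, p @ x)              # azimuth about the axis
    phi = np.arccos(np.clip(p @ z, -1.0, 1.0))    # polar angle from the axis
    return theta, phi, (x, y, z)

def sphere_point_from_coords(theta, phi, frame):
    """Inverse of sphere_coords_about_axis for the same orthonormal frame."""
    x, y, z = frame
    return np.sin(phi) * np.cos(theta) * x + np.sin(phi) * np.sin(theta) * y + np.cos(phi) * z

def bilinear_sample(img, u, v):
    """Simple bilinear sampling with edge clamping (stand-in for the Lanczos-2 filter)."""
    H, W = img.shape[:2]
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    u0, u1 = np.clip(u0, 0, W - 1), np.clip(u0 + 1, 0, W - 1)
    v0, v1 = np.clip(v0, 0, H - 1), np.clip(v0 + 1, 0, H - 1)
    return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u1]
            + (1 - du) * dv * img[v1, u0] + du * dv * img[v1, u1])

def predict_block_erp(ref_frame, block_pixels, mv, axis_dir, d_theta, d_phi):
    """Steps 1-6: map pixels to the sphere, translate along geodesics, map back, interpolate."""
    H, W = ref_frame.shape[:2]
    m, n = mv
    pred = np.empty(len(block_pixels))
    for i, (u, v) in enumerate(block_pixels):
        p = erp_to_sphere(u, v, W, H)                               # Step 1
        theta, phi, frame = sphere_coords_about_axis(p, axis_dir)   # Steps 2-3
        theta2, phi2 = theta + m * d_theta, phi + n * d_phi         # Step 4: geodesic translation
        q = sphere_point_from_coords(theta2, phi2, frame)
        u2, v2 = sphere_to_erp(q, W, H)                             # Step 5
        pred[i] = bilinear_sample(ref_frame, u2, v2)                # Step 6
    return pred
```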
To obtain experimental results, the preferred embodiment of this invention was implemented in HM-16.14 [14]. The geometry mappings were performed using the projection conversion tool of [15]. Results are provided for the Random-Access profile in HEVC. The Lanczos-2 filter was used at the projected coordinates for interpolation in the reference frame. Sphere padding [16] was also employed in the reference frame for improved prediction along the frame edges for all the competing methods. The step sizes Δθ and Δφ were chosen to be π/H for ERP. For EAC, since each face has a field of view of π/2, Δθ and Δφ were chosen to be π/(2W), where W is the width of each face.
Thirty frames of five 360° video sequences were encoded at four QP values (22, 27, 32, and 37) in both ERP and EAC. All the sequences in ERP were at 2K resolution and the sequences in EAC had a face width of 512. The distortion was measured in terms of Weighted-Spherical PSNR (WS-PSNR) as advocated in [17]. Bitrate reduction was calculated as per [18]. The preferred embodiment of this invention provided a significant overall bitrate reduction of about 23% on average for ERP and about 6% on average for EAC.
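For reference, WS-PSNR on an ERP luma plane can be computed along the lines of the sketch below, where squared errors are weighted by the cosine of each row's latitude so that over-sampled polar regions do not dominate the score. This is an illustrative implementation, not the exact tool of [17]; a peak value of 255 assumes 8-bit content.

```python
import numpy as np

def ws_psnr_erp(ref, rec, peak=255.0):
    """Weighted-Spherical PSNR for an ERP luma plane (per-row cosine-latitude weights)."""
    H, W = ref.shape
    rows = np.arange(H)
    w = np.cos((rows + 0.5 - H / 2.0) * np.pi / H)      # down-weight over-sampled poles
    w = np.repeat(w[:, None], W, axis=1)
    wmse = np.sum(w * (ref.astype(float) - rec.astype(float)) ** 2) / np.sum(w)
    return 10.0 * np.log10(peak ** 2 / wmse)
```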
The hardware and software environment includes a computer 602 and may include peripherals. The computer 602 comprises a general purpose hardware processor 604A and/or a special purpose hardware processor 604B (hereinafter alternatively collectively referred to as processor 604) and a memory 607, such as random access memory (RAM). The computer 602 may be coupled to, and/or integrated with, other devices, including input/output (I/O) devices such as a keyboard 612 and a cursor control device 614 (e.g., a mouse, a pointing device, pen and tablet, touch screen, multi-touch device, etc.), a display 617, a speaker 618 (or multiple speakers or a headset), a microphone 620, and/or video capture equipment 622 (such as a camera). In yet another embodiment, the computer 602 may comprise a multi-touch device, mobile phone, gaming system, internet enabled television, television set top box, multimedia content delivery server, or other internet enabled device executing on various platforms and operating systems.
In one embodiment, the computer 602 operates by the general purpose processor 604A performing instructions defined by the computer program 610 under control of an operating system 608. The computer program 610 and/or the operating system 608 may be stored in the memory 607 and may interface with the user and/or other devices to accept input and commands and, based on such input and commands and the instructions defined by the computer program 610 and operating system 608, to provide output and results.
Alternatively, some or all of the operations performed by the computer 602 according to the computer program 610 instructions may be implemented in a special purpose processor 604B, wherein some or all of the computer program 610 instructions may be implemented via firmware instructions stored in a read only memory (ROM), a programmable read only memory (PROM) or flash memory, or in memory 607. The special purpose processor 604B may also comprise an application specific integrated circuit (ASIC) or other dedicated hardware or circuitry.
The encoder 504, the transmission/reception or storage/retrieval 508, and/or the decoder 512, and any related components, may be performed within/by computer program 610 and/or may be executed by processors 604. Alternatively, or in addition, the encoder 504, the transmission/reception or storage/retrieval 508, and/or the decoder 512, and any related components, may be part of computer 602 or accessed via computer 602.
Output/results may be played back on video display 617 or provided to another device for playback or further processing or action.
Of course, those skilled in the art will recognize that any combination of the above components, or any number of different components, peripherals, and other devices, may be used with the computer 602.
Block 702 represents a signal to be processed (coded and/or decoded). The signal comprises a video data stream, or other multimedia data streams comprised of a plurality of frames.
Block 704 represents a coding step or function, which processes the signal in an encoder 504 to generate encoded data 706.
Block 708 represents a decoding step or function, which processes the encoded data 706 in a decoder 512 to generate a reconstructed multimedia data stream 710.
In one embodiment, the multimedia data stream contains a spherical video signal comprising visual information on a sphere that encloses a viewer, and the encoder 504 or the decoder 512 comprises a motion-compensated predictor, which predicts a portion of a current frame of the spherical video signal from a corresponding portion of one or more reference frames of the spherical video signal, after motion compensation, and the motion compensation comprises translating the pixels along geodesics on the sphere, where the geodesics are along shortest paths on the sphere from the pixels to two points where the sphere is intersected by a line determined by motion of the camera and surrounding objects. In another embodiment, the line determined by the motion of the camera and surrounding objects is along a velocity vector of the camera. In another embodiment, the line determined by the motion of the camera and surrounding objects is along a vector obtained by subtracting a velocity vector of one of the surrounding objects from the velocity vector of the camera. In another embodiment, the line determined by the motion of the camera and surrounding objects varies from one portion of the current frame to another portion of the current frame. In another embodiment, the motion compensation further comprises rotation of the pixels about an axis. In another embodiment, the motion-compensated predictor further performs interpolation in the one or more reference frames to enable motion compensation at a sub-pixel resolution. In another embodiment, the encoded data 706 comprises information, which specifies for a portion of the current frame, a distance that pixels are to be translated along the geodesics on the sphere.
The following references are incorporated by reference herein to the description and specification of the present application.
In conclusion, embodiments of the present invention provide an efficient and effective solution for motion compensated prediction of spherical video dominated by camera motion. The solution involves a per-pixel geodesic translation motion model that perfectly captures the perceived motion of objects on the sphere due to physical motions of the camera and object, as well as the resulting perspective distortions. The effectiveness of such an approach has been demonstrated for different projection formats with HEVC based coding.
Accordingly, embodiments of the invention enable performance improvement in various multimedia related applications, including for example, multimedia storage and distribution (e.g., YouTube™, Facebook™, Microsoft™). Further embodiments may also be utilized in multimedia applications that involve spherical video.
In view of the above, embodiments of the present invention disclose methods and devices for motion compensated prediction of spherical video and particularly in cases where the dynamics are dominated by camera motion.
Although the present invention has been described in connection with the preferred embodiments, it is to be understood that modifications and variations may be utilized without departing from the principles and scope of the invention, as those skilled in the art will readily understand. Accordingly, such modifications may be practiced within the scope of the invention and the following claims, and the full range of equivalents of the claims.
This concludes the description of the preferred embodiment of the present invention. The foregoing description of one or more embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto and the full range of equivalents of the claims. The attached claims are presented merely as one aspect of the present invention. The Applicant does not disclaim any claim scope of the present invention through the inclusion of this or any other claim language that is presented or may be presented in the future. Any disclaimers, expressed or implied, made during prosecution of the present application regarding these or other changes are hereby rescinded for at least the reason of recapturing any potential disclaimed claim scope affected by these changes during prosecution of this and any related applications. Applicant reserves the right to file broader claims in one or more continuation or divisional applications in accordance within the full breadth of disclosure, and the full range of doctrine of equivalents of the disclosure, as recited in the original specification.
This application claims the benefit under 35 U.S.C. Section 119(e) of the following co-pending and commonly-assigned U.S. provisional patent application(s), which is/are incorporated by reference herein: Provisional Application Ser. No. 62/688,771, filed on Jun. 22, 2018, by Kenneth Rose, Tejaswi Nanjundaswamy, and Bharath Vishwanath, entitled “Method and Apparatus for Predictive Coding of 360-degree Video Dominated by Camera Motion,” attorneys' docket number 30794.0687USP1.