The present disclosure relates to the field of encoding of video streams. In particular, the present disclosure relates to methods and devices for encoding both lower- and higher-resolution images of a scene within a same video stream.
When monitoring a scene using a video camera, certain regions of the scene may, for various reasons, be considered as being more interesting than others. Such regions of interest (ROIs) may for example include faces of humans, facial features such as e.g. one or more eyes, license plates of vehicles, and similar, and may be either fixed or dynamically defined using e.g. object detection and/or tracking.
To consume less bandwidth, contemporary cameras may be capable of outputting such ROIs at higher resolution, while other, non- or less interesting regions are output at lower resolution. For example, during surveillance, ROIs may be output (and e.g. stored) at sufficiently high resolution to be used as evidence if needed. As another example, license plates may be detected by a video camera (i.e. “on the edge”), and images of the detected license plates at sufficiently high resolution may be output by the camera and provided to a server responsible for e.g. license plate number recognition or similar.
In contemporary solutions, a video camera may be configured to capture images of the scene at a higher resolution, such as e.g. HD, full HD, 4K, 8K, and similar, and to output a lower-resolution overview of the scene by first downsampling the captured images before encoding them in an output video stream. To also provide the higher-resolution ROIs, the camera may be further configured to produce higher-resolution JPEG crops of the ROIs, and to e.g. send these crops in parallel with the lower-resolution overview of the scene. Issues pertinent to such a procedure may include e.g. higher bitrate consumption and lack of synchronization between the lower-resolution overview and the higher-resolution ROIs.
The present disclosure aims at improving upon such contemporary technology.
To at least partially resolve some of the above-identified issues with contemporary technology, the present disclosure provides an improved method, device, computer program and computer program product of/for encoding a video stream as defined in and by the accompanying independent claims. Various embodiments of the method, device, computer program and computer program product are defined in and by the accompanying dependent claims.
According to a first aspect of the present disclosure, there is provided a method for encoding a video stream. The method includes obtaining one or more subsequent images of a scene captured by at least one camera, wherein each of the one or more images has a respective first resolution. The method further includes, for each of the one or more images, identifying one or more regions of interest (ROIs) in the image, and adding a set of video frames to an encoded video stream. The set of video frames (for each of the one or more images) includes at least: i) a first video frame encoding at least part of the image at a respective second resolution lower than the first resolution; ii) a second video frame marked as a no-display frame, and being an inter-frame referencing the first video frame of the set and including motion vectors for upscaling of the one or more ROIs; and iii) a third video frame encoding the one or more ROIs of the image at a respective third resolution higher than the respective second resolution, the third video frame being an inter-frame referencing the second video frame of the set of video frames.
As used herein, a “no-display frame” is to be understood as a frame which is in any way flagged to instruct a decoder that the frame is not to be rendered as part of a decoded video stream, but that the frame is still available such that information may be obtained from it and used for the decoding of one or more other image frames which are to be displayed (i.e. not marked as no-display frames). As used herein, “motion vectors” are to be understood as information provided as part of one encoded frame and serving as instructions for where data useful for rendering one or more particular regions of the frame can be found in one or more other frames. For example, a motion vector may be considered as a two-dimensional vector used for inter-prediction that provides an offset from the coordinates in the decoded image to the coordinates in a reference image. For example, a motion vector may be used to represent a macroblock in the decoded image based on the position of this macroblock (or a similar one) in the reference image. It is envisaged that any video coding standard which supports the above concepts of no-display frames, motion vectors, and the possibility of one frame referencing one or more other (previous and/or later) frames, may be used to realize the disclosed method. Examples of such standards include (but are not necessarily limited to) High Efficiency Video Coding (HEVC) H.265, Advanced Video Coding (AVC) H.264, VP8, VP9, AV1, and Versatile Video Coding (VVC) H.266. For example, in H.265, a “no-display frame” may be created by setting pic_output_flag syntax element in the slice header to false, or e.g. by setting the no_display flag in the SEI header equal to true.
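Purely as an illustrative sketch (not any codec's actual API, and not part of the claimed subject-matter), the behaviour of such a motion vector, i.e. a two-dimensional offset from coordinates in the decoded image to coordinates in a reference image, may be modelled as follows; all names and the block size are hypothetical:

```python
def predict_block(reference, dst_x, dst_y, mv, block=2):
    """Return the block of `reference` that a motion vector (dx, dy)
    points to, for a block whose top-left corner in the decoded image
    is (dst_x, dst_y). `reference` is a list of pixel rows."""
    dx, dy = mv
    src_x, src_y = dst_x + dx, dst_y + dy
    # copy the referenced block out of the reference image
    return [row[src_x:src_x + block]
            for row in reference[src_y:src_y + block]]
```

A zero vector reproduces the co-located block, while a non-zero vector fetches the block from an offset position in the reference image, which is the mechanism later reused for upscaling ROIs.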
As will be described in more detail later herein, the envisaged method of encoding a video stream improves upon currently available technology in that it makes it possible to provide both a lower-resolution overview of a scene and higher-resolution crops of one or more ROIs within a same encoded video stream, with lower bitrate consumption and without synchronization issues.
In one or more embodiments of the method, the method may include generating the encoded video stream using a layered type of coding such as scalable video coding (SVC). For at least one set of video frames, the first and second video frames may be inserted in a base layer of the encoded video stream, and the third video frame may be inserted in an enhancement layer of the encoded video stream. Using an enhancement layer may e.g. help to provide the third video frame at a resolution higher than that used to encode the first and/or second video frames, and without limitations to the total area of ROIs.
In one or more embodiments of the method, the respective third resolution may equal the respective first resolution, i.e. the resolution of the ROIs may match that used by the camera to capture the scene.
In one or more embodiments of the method, for at least one set, the operation ii) of adding the second video frame may further include rearranging the upscaled one or more ROIs to cover more pixels of the second video frame. Phrased differently, the position of a ROI in the second video frame may not necessarily match the position of the ROI in the corresponding first video frame. By rearranging the upscaled one or more ROIs, the available pixels of the second video frame may be better utilized, e.g. allowing the one or more ROIs to be upscaled without the one or more ROIs starting to overlap.
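The rearrangement idea may be sketched as follows, purely for illustration: ROI boxes are packed side by side and scaled by the largest integer factor for which all of them still fit in the second video frame, so that the available pixel area is used without overlap. A real encoder would of course work on a macroblock grid; all names are hypothetical.

```python
def pack_rois(rois, frame_w, frame_h):
    """Place (w, h) ROI boxes left-to-right in a frame_w x frame_h
    frame, each scaled by the largest integer factor (up to 8x) that
    still fits all of them. Returns (scale, [(x, y, w, h), ...]) or
    None if even 1x does not fit."""
    for scale in range(8, 0, -1):
        x, placed = 0, []
        for w, h in rois:
            sw, sh = w * scale, h * scale
            if x + sw > frame_w or sh > frame_h:
                break  # this scale does not fit; try a smaller one
            placed.append((x, 0, sw, sh))
            x += sw
        else:
            return scale, placed
    return None
```

For example, two 4x4 ROIs in a 32x16 frame can both be upscaled 4x and placed next to each other without overlapping.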
In one or more embodiments of the method, for at least one set of video frames, the operation iii) of adding the third video frame may further include inserting one or more skip-blocks for parts of the third video frame not encoding the one or more ROIs.
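The skip-block idea may be illustrated with a simple grid-of-labels model (hypothetical names, not any codec's actual syntax): blocks of the third video frame that fall outside every ROI crop are marked as skip-blocks, so that only the crops themselves cost bits.

```python
def label_blocks(frame_w, frame_h, crops, block=16):
    """Label each block of a frame 'coded' if its top-left corner falls
    inside any (x, y, w, h) crop, else 'skip'."""
    labels = {}
    for by in range(0, frame_h, block):
        for bx in range(0, frame_w, block):
            inside = any(x <= bx < x + w and y <= by < y + h
                         for x, y, w, h in crops)
            labels[(bx, by)] = 'coded' if inside else 'skip'
    return labels
```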
In one or more embodiments of the method, for at least one set of video frames, the first video frame may be an inter-frame referencing the first video frame of a previous set of video frames.
In one or more embodiments of the method, for at least one set of video frames, the third video frame may further reference the third video frame of a previous set of video frames.
In one or more embodiments of the method, the method may be performed by a same camera used to capture the one or more images of the scene. Phrased differently, the method may be performed on the edge, and e.g. without requiring additional, intermediate video processing equipment.
In one or more embodiments of the method, the first and/or third video frame may also be marked as a no-display frame. In general, if e.g. the video stream is not layered, it may be desirable to show (after decoding) only the first video frames providing the lower-resolution overview of the scene, or only the third video frames providing the higher-resolution ROIs of the scene, etc. It is also envisaged that e.g. the first and/or third video frame(s) may be marked using other types of metadata, such as a particular flag indicating whether the video frame corresponds to an overview or to ROIs, and similar. A suitable decoder may then select whether to display only the overview, only the ROIs, or both the overview and the ROIs (preferably on different parts of a screen or on different screens). Other examples include configuring the decoder such that it knows in what sequence the video frames are arranged, such that it may selectively render only the overview, only the ROIs, or both. For example, the decoder may be informed that the video frames arrive ordered as “1-2-3-1-2-3-1-2-3-1-2-3 . . . ” (where “1” indicates a first video frame, “2” a second video frame, and “3” a third video frame), or similar. If the video stream is layered, whether to show only the overview, only the ROIs, or both, may be controlled accordingly by marking (or not marking) the first and third video frames as no-display frames. In general, the envisaged method still enables providing the information required to decode and render both a lower-resolution overview of the scene as well as higher-resolution ROIs, in a same encoded video stream.
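A decoder that knows the repeating “1-2-3” ordering can select frames by position, as in the following purely illustrative sketch (frame objects are stand-ins for decoded frames):

```python
def select_frames(stream, want):
    """stream: iterable of frames in repeating (overview, no-display,
    ROI) order; want: 'overview' or 'roi'. Returns only the frames of
    the requested kind, skipping the no-display frames at position 2."""
    pos = {'overview': 0, 'roi': 2}[want]
    return [f for i, f in enumerate(stream) if i % 3 == pos]
```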
According to a second aspect of the present disclosure, there is provided a device for encoding a video stream. The device includes a processor (e.g. processing circuitry) and a memory storing instructions that, when executed by the processor, cause the device to: a) obtain (e.g. by receiving from another device, reading from an external or internal memory, reading and interpreting a received signal, receiving from one or more camera image sensors, and similar) one or more subsequent images of a scene captured by at least one camera, each of the one or more images having a respective first resolution, and, b) for each of the one or more images: —identify one or more regions of interest (ROIs) in the image (i.e., one or more ROIs of the scene); —add a set of video frames to an encoded video stream, the set comprising at least: i) a first video frame encoding at least part of the image at a respective second resolution lower than the first resolution; ii) a second video frame marked as a no-display frame, and being an inter-frame referencing the first video frame of the set of video frames and including motion vectors for upscaling of the one or more ROIs, and iii) a third video frame encoding the one or more ROIs of the image at a respective third resolution higher than the respective second resolution, the third video frame being an inter-frame referencing the second video frame of the set of video frames. The device of the second aspect may therefore be configured to perform the method of the first aspect as disclosed herein.
In one or more embodiments of the device, the instructions stored in the memory of the device may be further such that they, when executed by the processor, cause the device to perform any embodiment of the method (of the first aspect) as disclosed herein.
In one or more embodiments of the device, the device may be a camera (configured) for capturing the one or more images of the scene. The envisaged method may therefore be performed on the edge, within the camera itself, without intermediate video processing equipment. In other embodiments, the device may e.g. be a server which receives the one or more images of the scene from e.g. the video camera capturing the scene, or e.g. from a storage wherein such images have earlier been received from the camera, and similar.
According to a third aspect of the present disclosure, there is provided a computer program for encoding a video stream. The computer program is configured to, when executed by a processor of a device (such as the device of the second aspect), cause the device to: a) obtain (e.g. by receiving from another device, reading from an external or internal memory, reading and interpreting a received signal, receiving from one or more camera image sensors, and similar) one or more subsequent images of a scene captured by at least one camera, each of the one or more images having a respective first resolution, and, b) for each of the one or more images: —identify one or more regions of interest (ROIs) in the image (i.e., one or more ROIs of the scene); —add a set of video frames to an encoded video stream, the set comprising at least: i) a first video frame encoding at least part of the image at a respective second resolution lower than the first resolution; ii) a second video frame marked as a no-display frame, and being an inter-frame referencing the first video frame of the set of video frames and including motion vectors for upscaling of the one or more ROIs, and iii) a third video frame encoding the one or more ROIs of the image at a respective third resolution higher than the respective second resolution, the third video frame being an inter-frame referencing the second video frame of the set of video frames. The computer program of the third aspect may therefore be configured to cause a device as envisaged herein (or any other suitable device) to perform the method of the first aspect as disclosed herein.
In one or more embodiments of the computer program, the computer program is further configured to, when executed by the processor of the device, cause the device to perform any embodiment of the method of the first aspect as disclosed herein.
According to a fourth aspect of the present disclosure, there is provided a computer program product, including a computer-readable storage medium storing a computer program (e.g. computer program code) according to the third aspect (or any embodiment thereof disclosed herein). As used herein, the computer-readable storage medium may e.g. be non-transitory, and be provided as e.g. a hard disk drive (HDD), solid state drive (SSD), USB flash drive, SD card, CD/DVD, and/or as any other storage medium capable of non-transitory storage of data. In other embodiments, the computer-readable storage medium may be transitory and e.g. correspond to a signal (electrical, optical, mechanical, or similar) present on e.g. a communication link, wire, or similar means of signal transferring, in which case the computer-readable storage medium is of course more of a data carrier than a data storing entity.
Other objects and advantages of the present disclosure will be apparent from the following detailed description, the drawings and the claims. Within the scope of the present disclosure, it is envisaged that all features and advantages described with reference to e.g. the method of the first aspect are relevant for, apply to, and may be used in combination with also the device of the second aspect, the computer program of the third aspect, and the computer program product of the fourth aspect, and vice versa.
Exemplifying embodiments will now be described below with reference to the accompanying drawings, in which:
In the drawings, like reference numerals will be used for like elements unless stated otherwise. Unless explicitly stated to the contrary, the drawings show only such elements that are necessary to illustrate the example embodiments, while other elements, in the interest of clarity, may be omitted or merely suggested. As illustrated in the Figures, the (absolute or relative) sizes of elements and regions may be exaggerated or understated vis-à-vis their true values for illustrative purposes and, thus, are provided to illustrate the general structures of the embodiments.
If the camera 120 is for example a surveillance camera, a speed control camera, or similar, it may be assumed that not all parts of the scene 100 are of equal interest to a person (or computer) viewing the scene 100 as captured by the camera 120, e.g. by decoding the encoded video stream 310 output by the encoder 200. To the contrary, there may be one or more regions of the scene 100 that are of particular interest, i.e. so-called regions of interest (ROIs). For example, if the camera 120 is a speed control camera, the most interesting parts of the scene 100 may include vehicles, and in particular the license plates of the vehicles such that each vehicle (and e.g. its owner) may be identified. If the camera 120 is a surveillance camera, the more interesting parts of the scene 100 may include faces of persons in the scene 100, in order to e.g. identify the persons based on known facial characteristics (facial recognition), and similar. Other regions of the scene 100, such as e.g. sky, lawns, walls, empty streets, etc., may be of less interest.
As certain regions (or parts) of the scene 100 may be more interesting than others, it may be beneficial if the depictions of these ROIs can be provided at higher resolution. For example, facial recognition software, or software configured to identify license plate numbers from images of license plates, may perform better on higher-resolution images than on lower-resolution images. One solution may be to capture the full images 130 at higher resolution and to avoid compressing (and/or reducing the resolution of) these images as part of the encoding process performed by the encoder 200. However, such an overall resolution increase may quickly render the bandwidth required to satisfactorily transfer the encoded video stream 310 over e.g. the Internet too high, and may result in the video stream 310 not being transferred at all, or being transferred at a rate which results in stuttering and/or lag for the receiver when attempting to decode and view the video stream 310. Another, more plausible, solution is instead of course to only transfer image data belonging to the ROIs at a higher resolution, while reducing the resolution for image data belonging to parts of the scene 100 other than the ROIs. Such contemporary solutions include generating and encoding a lower-resolution overview of the scene 100, and transferring, in parallel therewith, higher-resolution crops (compressed using e.g. the Joint Photographic Experts Group, JPEG, standard or similar). However, the bandwidth required to transfer such higher-resolution JPEG crops may still be higher than desired, and issues may arise related to how to e.g. synchronize such higher-resolution crops with the images of the lower-resolution overview.
As will now be described in more detail, the present disclosure aims at improving on such contemporary solutions by providing an improved way of encoding a video stream, suitable to handle the situation wherein it is desirable to provide both a lower-resolution overview of the scene as well as higher-resolution crops/excerpts of one or more ROIs of the scene. The envisaged method will be described below with reference first also to
The method 500 includes, as part of an operation S501, obtaining at least one image of the scene 100 at a first resolution.
An operation S503 of the method 500 includes generating a first video frame 230 encoding at least part of the image 132 at a second resolution that is lower than the first resolution. This may e.g. be performed by providing the image 132 to a downscaling module 202 that may be provided/implemented as part of e.g. the encoder 200 or any other suitable device in accordance with the present disclosure. The first video frame 230 encodes a downscaled image 220 of the scene 100, with Nx horizontal/width pixels and Ny vertical/height pixels. The second resolution, defined by Nx and Ny, is lower than the first resolution, e.g. Nx<Mx and Ny<My, or at least Nx×Ny<Mx×My. The first video frame 230 may be an intra-frame not referencing any previous (or later) video frame, or the first video frame 230 may be an inter-frame referencing one or more previous video frames and/or one or more later video frames in the encoded video stream.
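As a minimal, purely illustrative sketch of the downscaling of operation S503 (assuming an image represented as a list of pixel rows, and nearest-neighbour sampling rather than the camera's actual pipeline):

```python
def downscale(image, nx, ny):
    """Downscale an image of My rows x Mx columns to ny x nx pixels by
    nearest-neighbour sampling; requires nx*ny < mx*my as in the text."""
    my, mx = len(image), len(image[0])
    assert nx * ny < mx * my, "second resolution must be lower than first"
    return [[image[(y * my) // ny][(x * mx) // nx] for x in range(nx)]
            for y in range(ny)]
```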
An operation S504 of the method 500 includes generating a second video frame.
The bottom portion of
Instead of encoding the actual crops 212 and 213 directly in the second video frame 250, the method 500 as envisaged herein uses the motion vectors 240 for generating the up-scaled crops 212 and 213 from the lower-resolution overview of the scene 100 encoded in the first video frame 230. This may be achieved by e.g. deciding the target size and position of the crops 212 and 213 when the second video frame 250 is decoded, and then deciding, for each macroblock or similar of the second video frame 250, where in the first video frame 230 (and the lower-resolution image 220 encoded therein) the corresponding features are to be found. This is schematically illustrated by the arrows corresponding to the first motion vectors 242 (for the first ROI 210) and the second motion vectors 244 (for the second ROI 211). Here, it may be seen that e.g. a macroblock belonging to the lower-right corner of the crop 212 (if assuming that the first ROI 210 is a rectangle) is defined by a motion vector 242 pointing towards the corresponding lower-right corner of the first ROI 210 in the first video frame 230. Similarly, a macroblock belonging to the lower-left corner of the crop 212 is defined by a motion vector pointing towards the corresponding lower-left (macroblock) corner of the first ROI 210 in the first video frame 230, and the same applies also to the motion vector(s) 244 for the second crop 213, pointing towards corresponding features/macroblocks of the second ROI 211 of the first video frame 230, etc. In
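The per-macroblock computation just described may be sketched as follows, purely for illustration: for each block of an upscaled crop at its target position, the offset back to the corresponding point of the (smaller) ROI in the first video frame is computed. Integer-pel vectors and the block size are simplifying assumptions.

```python
def upscale_motion_vectors(roi, crop, block=16):
    """roi, crop: (x, y, w, h) boxes, the ROI in the first frame and its
    upscaled target in the second frame. Returns {(bx, by): (dx, dy)},
    one offset per block of the crop, pointing into the ROI."""
    rx, ry, rw, rh = roi
    cx, cy, cw, ch = crop
    vectors = {}
    for by in range(cy, cy + ch, block):
        for bx in range(cx, cx + cw, block):
            # map the block's corner back proportionally into the ROI
            sx = rx + (bx - cx) * rw // cw
            sy = ry + (by - cy) * rh // ch
            vectors[(bx, by)] = (sx - bx, sy - by)
    return vectors
```

E.g. doubling a 16x16 ROI at the origin into a 32x32 crop yields a zero vector for the top-left block and vectors pointing back and up-left for the other blocks, mirroring the arrows 242/244 described above.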
In particular, the motion vectors 240 are such that when the second video frame 250, which is an inter-frame referencing the first video frame 230, is decoded, the resulting image encoded by the second video frame 250 will include up-scaled (and potentially also rearranged) crops 212 and 213 of the ROIs 210 and 211 of the first video frame 230. The resolution of the image encoded by the second video frame 250 may be the same as the resolution of the image encoded by the first video frame 230, e.g. Nx×Ny pixels. An example of a resulting decoded image obtained after decoding the second video frame 250 is shown on the bottom-right of
A second video frame 250 as envisaged herein may be considered as a “synthetic” or “auxiliary” video frame, and may be generated using e.g. software and manually inserted into the encoded video stream. Further details about how to generate such synthetic frames, as well as about the use of motion vectors to transform (e.g. scale, translate, rearrange) features of images, may be found in Applicant's own previously published applications US20190116371A1 and US20190116382A1.
In an operation S505, the method 500 includes generating a third video frame encoding the ROIs 210 and 211 at a third resolution that is higher than the second resolution.
Herein, that the third video frame 270 encodes the ROIs 210 and 211 at a higher (third) resolution than the second resolution means that, in at least one dimension, more pixels are used to represent the ROIs 210 and 211 in the third video frame 270 than in the downscaled overview image 220 of the scene 100 provided/encoded in the first video frame 230. For example, the resolution of the crops 214 and 215 may be equal to the resolution of the image 132 as captured by the camera 120. Depending on whether e.g. a layered type of coding is used or not, the overall resolution (i.e., full width/height) of the image 260 encoded in the third video frame 270 may be the same as, or e.g. higher than, that of the image 220 encoded by the first video frame 230. The overall resolution of the image 260 may e.g. correspond to Kx horizontal/width pixels and Ky vertical/height pixels. If using a layered coding, the third video frame 270 may e.g. be included as part of an enhancement layer, such that Kx>Mx and Ky>My, or at least Kx×Ky>Mx×My. If using e.g. a non-layered coding, it may be such that Kx=Mx and Ky=My, or at least Kx×Ky=Mx×My. In any case, the resolution of the actual ROI crops 214 and 215 will be higher than that of the ROIs 210 and 211 in the downscaled overview image 220, either due to a proportion of the area of the crops 214 and 215 to the total area of the image 260 being higher than a proportion of the ROIs 210 and 211 to the total area of the image 220, or by Kx>Mx (and/or Ky>My), or both.
The image 260 may be decoded based on information found in e.g. the image 132 as captured by the camera 120. However, instead of directly encoding the image 260 in the third video frame 270, the third video frame 270 is an inter-frame which references the second video frame 250, such that any redundant information between the image 260 and the image 231 encoded using the motion vectors 240 of the second video frame 250 may be utilized to more efficiently encode the third video frame 270 in terms of e.g. bandwidth and memory requirements. The arrangement and/or scaling of the crops 214 and 215 may e.g. match the arrangement and/or scaling of the crops 212 and 213 of the second video frame 250, or may be different therefrom. If required, the third video frame 270 may include motion vectors allowing identification of where the information for each macroblock required to decode the crops 214 and 215 is to be found in the (decoded) second video frame 250. In the third video frame 270, macroblocks not relevant for the crops 214 and 215 may e.g. be provided as empty blocks, as skip-blocks, or similar.
In an operation S506 of the method 500, the first, second and third video frames 230, 250 and 270 are used to form a set of video frames added to the encoded video stream output from the encoder 200. The frames 230, 250 and 270 may e.g. be added in the order they are generated or in any other suitable order, and may e.g. be added together or individually. Generally herein, the exact order of the various video frames is not necessarily important, as long as sufficient information is somehow provided for the decoder to figure out in which order to display the decoded images in order to e.g. play the decoded video stream in a way which is representative of the sequence of images of the scene captured by the camera 120.
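The assembly of operation S506 may be sketched, purely for illustration, with frame records as simple dicts (all field names are hypothetical, not any codec's actual syntax): the second frame of each set is flagged no-display and references the first, and the third references the second.

```python
def add_set(stream, overview, mv_frame, roi_frame):
    """Append one set of three frames to the encoded stream (a list).
    'ref' holds the index of the referenced frame within the stream."""
    set_index = len(stream) // 3
    stream.append({'type': 'overview', 'set': set_index,
                   'data': overview, 'no_display': False})
    stream.append({'type': 'mv', 'set': set_index, 'data': mv_frame,
                   'no_display': True, 'ref': len(stream) - 1})
    stream.append({'type': 'roi', 'set': set_index, 'data': roi_frame,
                   'no_display': False, 'ref': len(stream) - 1})
    return stream
```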
In an optional operation S507, the method 500 may include checking whether there are more images available and ready to be encoded. If yes, as illustrated by the dashed arrow 510, the method 500 may loop back to operation S501 and receive a new such image, and continue to generate a new set of video frames for the new image, etc. If there are no more images available for encoding, the method 500 may (as illustrated by the dashed arrow 512) e.g. stop, or repeat checking for new images (operation S507) until one or more such new images of the scene arrive.
As a result of receiving a second (e.g. later) image 132b of the scene 100 at another (e.g. later) time instance, the method 500 repeats in order to add, to a second set of video frames 280b, a first video frame 230b encoding a lower-resolution, downscaled version/overview image 220b of the second image 132b. Likewise, a second (no-display) video frame 250b of the second set 280b includes motion vectors 240b and references the first video frame 230b in order to generate upscaled versions of the ROIs identified in the second image 132b. Similarly, a third video frame 270b of the second set 280b references the second video frame 250b in order to more efficiently encode (as part of an image 260b) higher-resolution versions/crops of the ROIs identified in the second image 132b, and so on. In this example video stream 300, all images may have a same resolution (i.e. size), and the higher-resolution ROIs may be formed by the crops being scaled such that they consume a larger portion of the overall image area than in the overview image, i.e. as both the scaled ROI crops obtained by decoding the second video frames and the scaled ROI crops obtained by decoding the third video frames are wider and/or higher (in terms of pixels) than their counterparts in the overview images in the first video frames.
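The repeating set structure of such a stream may be modelled as follows, purely for illustration; the frame identifiers are hypothetical, and the first frame of each set optionally referencing the previous set's first frame (with the very first being an intra-frame) is one possible variant:

```python
def build_stream(num_images):
    """Model of the stream: per image a triplet (first, no-display
    second, third); each second references its first, each third its
    second, and each first the previous first ('ref': None = intra)."""
    stream = []
    for i in range(num_images):
        first = {'id': f'1-{i}',
                 'ref': None if i == 0 else f'1-{i - 1}',
                 'no_display': False}
        second = {'id': f'2-{i}', 'ref': f'1-{i}', 'no_display': True}
        third = {'id': f'3-{i}', 'ref': f'2-{i}', 'no_display': False}
        stream += [first, second, third]
    return stream
```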
Optionally, to further enhance coding efficiency, the first video frame of a particular set of video frames may reference the first video frames of e.g. one or more previous or one or more later sets of video frames. As an example, in
Herein, it is also envisaged to provide a device, computer program and computer program product for encoding a video stream, as will now be described in more detail with reference also to
The device 600 may for example be a monitoring camera mounted or mountable on a building, e.g. in the form of a PTZ-camera or e.g. a fisheye-camera capable of providing a wider perspective of the scene, or any other type of monitoring/surveillance camera. The device 600 may for example be a body camera, action camera, dashcam, or similar, suitable for mounting on persons, animals and/or various vehicles, or similar. The device 600 may for example be a smartphone or tablet which a user can carry and use to film a scene. In any such examples of the device 600, it is envisaged that the device 600 may include all necessary components (if any) other than those already explained herein, as long as the device 600 is still able to perform the method 500 or any embodiments thereof as envisaged herein.
In general terms, each functional module 610a-g may be implemented in hardware or in software. Preferably, one or more or all functional modules 610a-g may be implemented by the processing circuitry 610, possibly in cooperation with the storage medium/memory 612 and/or the communications interface 616. The processing circuitry 610 may thus be arranged to fetch, from the memory 612, instructions as provided by a functional module 610a-g, and to execute these instructions, thereby performing any operations of the method 500 performed by/in the device 600 as disclosed herein.
Generally, the device 600 may e.g. be an encoder such as the encoder 200, or at least include encoding functionality.
In the example of
In summary of the various embodiments presented herein, the present disclosure provides an improved way of encoding a video stream in order to provide both a lower-resolution overview of a scene as well as one or more higher-resolution crops of one or more ROIs of the scene, as part of a same encoded video stream. By inserting synthetic/auxiliary no-display video frames with motion vectors for upscaling (and possibly also rearranging) of the ROIs, a basis is created from which higher-resolution crops of the ROIs may more efficiently be encoded, to thereby enable provision of such lower- and higher-resolution images at lower bitrate than if e.g. providing the higher-resolution images as JPEG crops in parallel with a video stream encoding only the lower-resolution overview images. The present disclosure and the method, device, computer program and computer program product envisaged herein also eliminate potential issues with how to synchronize the higher-resolution ROI images with the lower-resolution overview images, as all video frames are inserted as part of a same encoded video stream, and synchronization is thus obtained “out of the box”. Another advantage is that the envisaged method is standard compliant, in the sense that an ordinary video player may still decode and play back e.g. the overview-part of the video stream even if the video stream is also provided with the higher-resolution ROIs. A more advanced decoder/video player, such as a decoder/video player specifically configured to handle the enhanced video stream as envisaged herein, may however unlock the full capability of the video stream, and render visible both the overview and the higher-resolution ROIs.
Although features and elements may be described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements. Additionally, variations to the disclosed embodiments may be understood and effected by the skilled person in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims.
In the claims, the words “comprising” and “including” do not exclude other elements, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that certain features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be used to advantage.
Priority: Number 23204444.6; Date: Oct 2023; Country: EP; Kind: regional.