The present invention relates to a method, a device, and a computer program for improving encapsulating and parsing of media data, making it possible to optimize the indexing and transmission of portions of encapsulated media content data.
The invention relates to encapsulating, parsing, and streaming media content data, e.g. according to ISO Base Media File Format as defined by the MPEG standardization organization, to provide a flexible and extensible format that facilitates interchange, management, editing, and presentation of group of media content and to improve its delivery for example over an IP network such as the Internet using adaptive http streaming protocol.
The International Organization for Standardization Base Media File Format (ISO BMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media content data or bit-streams either for local storage or for transmission via a network or via another bit-stream delivery mechanism. This file format has several extensions, e.g. Part 15 (ISO/IEC 14496-15), which describes encapsulation tools for various NAL (Network Abstraction Layer) unit-based video encoding formats. Examples of such encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding), L-HEVC (Layered HEVC), and VVC (Versatile Video Coding). This file format is object-oriented. It is composed of building blocks called boxes (or data structures), each of which is identified by a four-character code; these boxes are sequentially or hierarchically organized and define descriptive parameters of the encoded timed media content data or bit-stream, such as timing and structure parameters. In the file format, the overall presentation over time is called a movie. The movie is described by a movie box (with four-character code ‘moov’) at the top level of the media or presentation file. This movie box represents an initialization information container containing a set of various boxes describing the presentation. It may be logically divided into tracks represented by track boxes (with four-character code ‘trak’). Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media content data pertaining to the presentation (frames of video, for example). Within each track, each timed unit of media content data is called a sample; this might be a frame of video, a sample of audio, or a set of timed metadata. Samples are implicitly numbered in sequence. The actual sample data are stored in boxes called Media Data boxes (with four-character code ‘mdat’) or Identified Media Data boxes (with four-character code ‘imda’) at the same level as the movie box. The movie may also be fragmented, i.e. organized temporally as a movie box containing information for the whole presentation followed by a list of movie fragment and Media Data box pairs or movie fragment and Identified Media Data box pairs. Within a movie fragment (box with four-character code ‘moof’) there is a set of track fragments (box with four-character code ‘traf’), zero or more per movie fragment. The track fragments in turn contain zero or more track run boxes (‘trun’), each of which documents a contiguous run of samples for that track fragment.
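For the sake of illustration, a simple non-fragmented media file may thus be organized as follows, only a few of the possible boxes being shown:

    ftyp    // file type and compatibility information
    moov    // movie box: initialization information for the whole presentation
      trak  // track box of a first track (e.g. video), identified by its track_ID
      trak  // track box of a second track (e.g. audio)
    mdat    // media data box containing the samples of the tracks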
Media data encapsulated with ISOBMFF can be used for adaptive streaming with HTTP. For example, MPEG DASH (for “Dynamic Adaptive Streaming over HTTP”) and Smooth Streaming are HTTP adaptive streaming protocols enabling segment or fragment based delivery of media files. In the following, it is considered that media data designate encapsulated data comprising metadata and media content data (the latter designating the bit-stream that is encapsulated). The MPEG DASH standard (see “ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats”) makes it possible to establish a link between a compact description of the content(s) of a media presentation and the HTTP addresses. Usually, this association is described in a file called a manifest file or description file. In the context of DASH, this manifest file is also called the MPD file (for Media Presentation Description). When a client device gets the MPD file, the description of each encoded and deliverable version of media content can easily be determined by the client. By reading or parsing the manifest file, the client is aware of the kind of media content components proposed in the media presentation and is aware of the HTTP addresses for downloading the associated media content components. Therefore, it can decide which media content components to download (via HTTP requests) and to play (decoding and playing after reception of the media data segments). DASH defines several types of segments, mainly initialization segments, media segments, or index segments. Initialization segments contain setup information and metadata describing the media content, typically at least the ‘ftyp’ and ‘moov’ boxes of an ISOBMFF media file. A media segment contains the media data. It can be for example one or more ‘moof’ plus ‘mdat’ or ‘imda’ boxes of an ISOBMFF file or a byte range in the ‘mdat’ or ‘imda’ box of an ISOBMFF file. A media segment may be further subdivided into sub-segments (also corresponding to one or more complete ‘moof’ plus ‘mdat’ or ‘imda’ boxes). The DASH manifest may provide segment URLs or a base URL to the file with byte ranges to segments for a streaming client to address these segments through HTTP requests. The byte range information may be provided by index segments or by specific ISOBMFF boxes such as the Segment Index box ‘sidx’ or the SubSegment Index box ‘ssix’.
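For the sake of illustration, an indexed ISOBMFF media segment may thus be organized as follows, the number of movie fragments per sub-segment being an example:

    styp    // segment type box (optional)
    sidx    // segment index: byte offsets and durations of the sub-segments
    ssix    // sub-segment index: mapping of levels to byte ranges (optional)
    moof    // movie fragment box of the first fragment of the sub-segment
    mdat    // media data box of the first fragment
    moof    // following fragment of the sub-segment
    mdat    // ...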
As illustrated, a server 100 comprises an encapsulation module 105 connected, via a network interface (not represented), to a communication network 110 to which is also connected, via a network interface (not represented), a de-encapsulation module 115 of a client 120.
Server 100 processes data, e.g. video and/or audio data, for streaming or for storage. To that end, server 100 obtains or receives data comprising, for example, an original sequence of images 125, encodes the sequence of images into media content data (or bit-stream) using a media encoder (e.g. video encoder), not represented, and encapsulates the media content data in one or more media files or media segments 130 using encapsulation module 105. The encapsulation process consists of storing the media content data in ISOBMFF boxes and generating and/or storing associated metadata describing the media content data. Encapsulation module 105 comprises at least one of a writer or a packager to encapsulate the media content data. The media encoder may be implemented within encapsulation module 105 to encode received data or may be separate from encapsulation module 105.
Client 120 is used for processing data received from communication network 110, or read from a storage device, for example for processing media file 130. After the received data have been de-encapsulated in de-encapsulation module 115 (also known as a parser), the de-encapsulated data (or parsed data), corresponding to media content data or a bit-stream, are decoded, forming, for example, audio and/or video data that may be stored, rendered (e.g. played or displayed), or output. The media decoder may be implemented within de-encapsulation module 115 or it may be separate from de-encapsulation module 115. The media decoder may be configured to decode one or more media content data or bit-streams in parallel.
It is noted that media file 130 may be communicated to de-encapsulation module 115 in different ways. In particular, encapsulation module 105 may generate media file 130 with a media description (e.g. DASH MPD) and communicate (or stream) it directly to de-encapsulation module 115 upon receiving a request from client 120.
For the sake of illustration, media file 130 may encapsulate media content data (e.g. encoded audio or video) into boxes according to ISO Base Media File Format (ISOBMFF, ISO/IEC 14496-12 and ISO/IEC 14496-15 standards). In such a case, media file 130 may correspond to one or more media files (indicated by a FileTypeBox ‘ftyp’), as illustrated in
In the example illustrated in
It is recalled that levels represent specific features of subsets of the media content data or bit-stream (e.g. scalability layers) and obey the following constraint: samples corresponding to level n may only depend on samples of levels m, where m is smaller than or equal to n. The feature actually associated with a given level value is determined from the level assignment box ‘leva’ located in the movie box ‘moov’. For each level, the level assignment box ‘leva’ provides an assignment type. This assignment type indicates the mechanism used to specify the assignment of a feature to a level. For the sake of illustration, the assignment of levels to partial sub-segments (i.e. to byte ranges) may be based on sample groups, tracks, or sub-tracks.
While these file formats and these methods for transmitting media data have proven to be efficient, there is a continuous need to improve the selection of the data to be sent to a client while reducing the complexity of the description of the indexing, reducing the requested bandwidth, and taking advantage of the increasing processing capabilities of the client devices.
The present invention has been devised to address one or more of the foregoing concerns.
In this context, there is provided a solution for improving indexing of portions of encapsulated media content data.
According to a first aspect of the invention there is provided a method for encapsulating media data, the media data comprising metadata and data associated with the metadata, the metadata being descriptive of the associated data, the media data comprising a plurality of segments, at least one segment comprising a plurality of sub-segments, the method being carried out by a server and comprising:
Accordingly, the method of the invention makes it possible to improve indexing of encapsulated data and thus, to improve data transmission efficiency and versatility.
According to some embodiments, a same level value is associated with at least two non-contiguous byte ranges of the at least one of the sub-segments.
According to some embodiments, the feature type value indicates that the features associated with level values are defined within metadata descriptive of data of the segments.
According to some embodiments, the feature type value indicates that the level values are representative of dependency levels.
According to some embodiments, the feature type value indicates that the level values are representative of track dependency levels. A track identifier may be associated with a level value.
According to some embodiments,
According to some embodiments, the feature type value indicates that the level values are representative of data integrity of data of the corresponding byte range.
According to some embodiments, the metadata descriptive of partial sub-segments of the at least one of the sub-segments further comprise a flag indicating that an end portion of a byte range can be ignored for decoding the encapsulated media data.
According to some embodiments, the feature type value is a first feature type value, the at least one of the sub-segments being referred to as a first sub-segment, metadata descriptive of partial sub-segments of the sub-segments further comprising a second feature type value representative of features associated with level values of a second sub-segment of the at least one segment, different from the first sub-segment.
According to some embodiments, the metadata descriptive of partial sub-segments of the at least one of the sub-segments belong to a box of the ‘ssix’ type, the media data being encapsulated according to ISOBMFF. The metadata descriptive of data of the segments may belong to a box of the ‘leva’ type.
According to a second aspect of the invention there is provided a method for transmitting media data, the media data comprising metadata and data associated with the metadata, the metadata being descriptive of the associated data, the media data comprising a plurality of segments, at least one segment comprising a plurality of sub-segments, the method comprising encapsulating the media data according to the method described above.
According to a third aspect of the invention there is provided a method for processing received encapsulated media data, the media data being encapsulated according to the method described above.
The methods of the second and third aspects of the invention make it possible to improve indexing of encapsulated data and thus, to improve data transmission efficiency and versatility.
According to a fourth aspect of the invention there is provided a method for processing received encapsulated media data, the media data comprising metadata and data associated with the metadata, the metadata being descriptive of the associated data, the media data comprising a plurality of segments, at least one segment comprising a plurality of sub-segments, the method being carried out by a client and comprising:
Accordingly, the method of the invention makes it possible to improve indexing of encapsulated data and thus, to improve data transmission efficiency and versatility.
According to a fifth aspect of the invention there is provided a device for encapsulating, transmitting, or receiving encapsulated media data, the device comprising a processing unit configured for carrying out each of the steps of the method described above.
The fifth aspect of the present invention has advantages similar to those mentioned above.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “module” or “system”. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
According to some embodiments, the invention makes it possible to reduce the complexity of the description of the indexing of multiple byte ranges for the same level, for instance to signal multiple Stream Access Points (SAP) within a sub-segment. The invention also makes it possible to introduce new level values or to change the feature associated with a level on the fly.
This is obtained by providing means to set predefined feature types (also denoted predefined level assignment types) and to use a segment index box ‘sidx’ and a sub-segment index box ‘ssix’ without requiring the definition of a level assignment box ‘leva’ and, possibly, of its associated sample groups.
As illustrated, a first request and response (steps 500 and 505) aim at providing the streaming manifest to the client, that is to say the media presentation description. From the manifest, the client can determine the initialization segments that are required to set up and initialize its decoder(s). Next, the client requests one or more of the initialization segments identified according to the selected media components through HTTP requests (step 510). The server replies with metadata (step 515), typically those available in the ISOBMFF ‘moov’ box and its sub-boxes. The client performs the set-up (step 520) and may request index information from the server (step 525). This is the case for example in DASH profiles where indexed media segments are in use, e.g. the live profile. To achieve this, the client may rely on an indication in the MPD (e.g. indexRange) providing the byte range for the index information. When the media data are encapsulated according to ISOBMFF, the segment index information may correspond to the SegmentIndex box ‘sidx’ and optionally an associated new version of the sub-segment index box ‘ssix’ according to some embodiments of the invention, as described hereafter. In the case where the media data are encapsulated according to MPEG-2 TS, the indication in the MPD may be a specific URL referencing an Index Segment.
Next, the client receives the requested segment index from the server (step 530). From this index, the client may compute byte ranges (step 535) to request movie fragments or portions of a movie fragment at a given time (e.g. corresponding to a given time range) or corresponding to a given feature of the bit-stream (e.g. a point to which the client can seek (e.g. a random-access point or stream access point), a scalability layer, a temporal sub-layer, or a spatial sub-part such as an HEVC tile or VVC subpicture). The client may issue one or more requests to get one or more movie fragments or portions of movie fragments (typically portions of data within the Media data box) for the selected media components in the MPD (step 540). The server replies to the requests by sending one or more sets of data byte ranges comprising ‘moof’ boxes, ‘mdat’ boxes, or portions of ‘mdat’ boxes (step 545). It is observed that the requests for the movie fragments may be made directly without requesting the index, for example when media segments are described as a segment template and no index information is available.
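For the sake of illustration of steps 535 and 540, assuming the computed byte ranges show that a desired movie fragment spans bytes 150000 to 199999 of a segment file (the file name and offsets below being purely illustrative), the client may issue a request such as:

    GET /media/video_rep1_seg12.m4s HTTP/1.1
    Host: streaming.example.com
    Range: bytes=150000-199999

The server then answers with a 206 (Partial Content) response carrying only the requested bytes.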
Upon reception of the requested data, the client decodes and renders the corresponding media data and prepares the request for the next time interval (step 550). This may consist of getting a new index, sometimes even getting an MPD update, or simply requesting the next media segments as indicated in the MPD (e.g. following a SegmentList or a SegmentTemplate description).
As illustrated, a first step is directed to encoding media content data including one or more bit-stream features (e.g. points to which the client can seek (i.e. random-access points or stream access points), scalability layers, temporal sub-layers, and/or spatial sub-parts such as HEVC tiles or VVC sub-pictures) (step 600). Multiple alternatives of the encoded media content can potentially be generated, for example in terms of quality, resolution, etc. The encoding step results in bit-streams that are encapsulated (step 605). The encapsulation step comprises generating structured boxes containing metadata describing the placement and timing of the media content data. The encapsulation step (605) may also comprise generating indexes to make it possible to access sub-parts of the encoded media content, for example as described by reference to
Next, one or more media files or media segments resulting from the encapsulation step are described in a streaming manifest (step 610), for example in an MPD. Next, the media files or segments with their description are published on a streaming server for distribution to clients (step 615).
A file writer may only conduct steps 600 and 605 to produce encapsulated media data and save them on a storage device.
As illustrated, a first step is directed to requesting and obtaining a media presentation description (step 700). Next, the client gets initialization information (e.g. the initialization segments) from the server and initializes its player(s) and/or decoder(s) (step 705) by using items of information of the obtained media description and initialization segments.
Next, the client selects one or more media components to play from the media description (step 710) and requests information on these media components, for example index information (step 715) including for instance a ‘sidx’ box, a ‘ssix’ box, and optionally a ‘leva’ box, the latter two being modified according to some embodiments of the invention. Next, after having parsed the received index information (step 720), the client may determine byte ranges for the data to request, corresponding to portions of the selected media components (step 725). Next, the client issues requests for the data that are actually needed (step 730).
As described by reference to
A file parser may only conduct steps 705 to 725 to access portions of data from encapsulated media content data located on a local storage device.
According to an aspect of some embodiments of the invention, a new version of the level assignment box ‘leva’ is defined to authorize multiple byte ranges for a given level.
According to the example illustrated in
When version 0 of the level assignment box ‘leva’ is used, within a fraction, data for each level appear contiguously, and data for levels appear in increasing order of level values. All data in a fraction are assigned to levels. When the new version 1 or higher of the level assignment box ‘leva’ is used, data for each level need not be stored contiguously and data for levels may be stored in random order of level value. Some data in a fraction may have no level assigned, in which case the level is unknown, but it is not one of the levels defined by the level assignment box.
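By way of illustration, the box syntax may remain that of the level assignment box defined in ISO/IEC 14496-12, only the version number distinguishing the new behaviour; the listing below is a simplified sketch in which some assignment_type cases are elided:

    aligned(8) class LevelAssignmentBox extends FullBox('leva', version, 0) {
        // version 0: data for each level are contiguous and in increasing level order
        // version 1 or higher: multiple, possibly non-contiguous byte ranges per level
        unsigned int(8) level_count;
        for (j = 1; j <= level_count; j++) {
            unsigned int(32) track_id;
            unsigned int(1)  padding_flag;
            unsigned int(7)  assignment_type;
            if (assignment_type == 0) {
                unsigned int(32) grouping_type;           // level defined by a sample group
            } else if (assignment_type == 1) {
                unsigned int(32) grouping_type;           // parameterized sample group
                unsigned int(32) grouping_type_parameter;
            }
            // other assignment_type values (e.g. track or sub-track based
            // assignment) are elided in this sketch
        }
    }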
According to particular embodiments, a new version of the sub-segment index box ‘ssix’ is defined to authorize either multiple byte ranges for a given level, with the level assignment provided by a ‘leva’ box, or single or multiple byte ranges for a given level, through predefined feature types (also denoted level assignment types), without defining a ‘leva’ box.
According to this new version, the sub-segment index box ‘ssix’ provides a mapping of levels to byte ranges of the indexed sub-segment, as specified by a level assignment box ‘leva’ (located in the movie box ‘moov’) or as indicated in the ‘ssix’ box itself. The indexed sub-segments are described by a segment index box ‘sidx’. In other words, this ‘ssix’ box provides a compact index describing how the data in a sub-segment are ordered in partial sub-segments, according to levels. It enables a client to easily access data for partial sub-segments by downloading ranges of data in the sub-segment.
According to some embodiments, there is zero or one sub-segment index box ‘ssix’ per segment index box ‘sidx’ that indexes only leaf sub-segments, i.e. that indexes only sub-segments (but no segment indexes). A sub-segment index box ‘ssix’, if any, is the next box after the associated segment index box ‘sidx’. A sub-segment index box ‘ssix’ documents the sub-segments that are indicated in the immediately preceding segment index box ‘sidx’.
It is observed here that, in general, the media data constructed from the byte ranges are incomplete, i.e. they do not conform to the media format of the entire sub-segment.
According to some embodiments and for version 0 of the ‘ssix’ box, each level is assigned to exactly one partial sub-segment according to an increasing order of level values, i.e. byte ranges associated with one level are contiguous and samples of a partial sub-segment may depend on any sample of preceding partial sub-segments in the same sub-segment (but cannot depend on samples of following partial sub-segments in the same sub-segment). This implies that all data for a given level require a single byte range to be retrieved.
According to some embodiments of the invention, for the new version 1 or higher of the ‘ssix’ box, multiple byte ranges, possibly discontinuous, associated with the same level, may be described. As a consequence, obtaining all the data corresponding to a given level may require multiple byte ranges to be retrieved.
It is noted that when a partial sub-segment is accessed in this way, for any assignment_type value other than three in the level assignment box ‘leva’, the final media data box may be incomplete, that is, less data than indicated by the length indication of the media data box are present. Therefore, the length stored within the media data box may need to be adjusted or padding may be needed.
It is also noted that the byte ranges corresponding to partial sub-segments may include both movie fragment boxes and media data boxes. The first partial sub-segment, i.e. the partial sub-segment associated with the lowest level, corresponds to a movie fragment box as well as (parts of) media data box(es), whereas subsequent partial sub-segments (partial sub-segments associated with higher levels) may correspond to (parts of) media data box(es) only.
According to particular embodiments of the invention and for version 0 of the sub-segment index box ‘ssix’, the presence of the level assignment box ‘leva’ in the movie box ‘moov’ is required and the level assignment box ‘leva’ must have a version equal to 0.
Still according to particular embodiments of the invention and for version 1 or higher of the sub-segment index box ‘ssix’, the presence of the level assignment box ‘leva’ is only required for a feature type (or level_assignment_type) equal to 0, in which case the level assignment box ‘leva’ must have a version set to 1. The presence of the level assignment box ‘leva’ is not required for the other feature type values.
Still according to particular embodiments of the invention, the semantics of the attributes in the new version of the ‘ssix’ box may be defined as described hereafter.
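A possible syntax carrying these attributes is sketched below, field names, bit widths, and placements being given by way of example only:

    aligned(8) class SubsegmentIndexBox extends FullBox('ssix', version, 0) {
        if (version >= 1) {
            unsigned int(8) feature_type;   // predefined level assignment type; the value 0
                                            // indicates that levels are defined by a 'leva' box
            unsigned int(1) incomplete;     // the listed byte ranges may not cover the
                                            // complete sub-segment
            unsigned int(7) reserved;
        }
        unsigned int(32) subsegment_count;
        for (i = 1; i <= subsegment_count; i++) {
            unsigned int(32) range_count;
            for (j = 1; j <= range_count; j++) {
                if (version >= 1) {
                    unsigned int(1) lsc;    // per-range flags discussed hereafter; their
                    unsigned int(1) lbs;    // exact placement in the syntax is an
                    unsigned int(1) rbs;    // assumption made for this sketch
                    unsigned int(5) reserved;
                }
                unsigned int(8)  level;      // level of the byte range
                unsigned int(24) range_size; // size in bytes of the byte range
            }
        }
    }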
Alternatively, the flags lsc, lbs, and rbs can be removed from the box syntax and defined as parts of the FullBox flags instead.
In a variant, the incomplete flag is optional or could be removed since this information can be deduced by cross-checking the sum of byte ranges of a sub-segment with the sub-segment size documented in the ‘sidx’ box.
In a variant, different values of the incomplete flag or of the feature type can be signalled for each sub-segment within a segment by declaring them within the subsegment_count loop in the new version of the ‘ssix’ box.
Still alternatively, it is possible to define more than one sub-segment index box ‘ssix’ with version 1 or higher per segment index box ‘sidx’ that indexes only leaf sub-segments. In such cases, the multiple sub-segment index boxes ‘ssix’ all document the sub-segments that are indicated in the immediately preceding segment index box ‘sidx’ and each sub-segment index box uses a different predefined feature type, referenced 920 in
According to another aspect of the invention, the data of a sample or of a NALU (Network Abstraction Layer (NAL) unit) within a sample that are actually corrupted or lost are signalled. Data corruption may happen, for example, when data are received through an error-prone communication means. To signal corrupted data in the bit-stream to be encapsulated, a new sample group description with grouping_type ‘corr’ may be defined. This sample group ‘corr’ can be defined in any kind of track (e.g. video, audio, or metadata). For the sake of illustration, an entry of this sample group description may be defined as shown below.
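A minimal sketch of such an entry may be the following, the field width being given by way of example only:

    class CorruptedSampleInfoEntry() extends SampleGroupDescriptionEntry('corr') {
        unsigned int(8) corrupted;   // corruption state of the associated data
    }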
where corrupted is a parameter that indicates the corruption state of the associated data.
According to some embodiments, value 1 means that the entire set of data is lost. In such a case, the associated data size (sample size, or NAL size) should be set to 0. Value 2 means that the data are corrupted in such a way that they cannot be recovered by a resilient decoder (for example, loss of a slice header of a NAL unit). Value 3 means that the data are corrupted, but that they may still be processed by an error-resilient decoder. Value 0 is reserved.
According to some embodiments, no associated grouping_type_parameter is defined for CorruptedSampleInfoEntry. If some data are not associated with an entry in CorruptedSampleInfoEntry, this means these data are not corrupted and not lost.
A SampleToGroup Box ‘sbgp’ with grouping_type equal to ‘corr’ allows associating a CorruptedSampleInfoEntry with each sample and indicating if the sample contains corrupted or lost data.
This sample group description with grouping_type ‘corr’ can also be advantageously combined with the NALU mapping mechanism composed of a SampleToGroup box ‘sbgp’ and a sample group description box ‘sgpd’, both with grouping_type ‘nalm’, and of sample group description entries NALUMapEntry. A NALU mapping mechanism with a grouping_type_parameter set to ‘corr’ allows signalling corrupted NALUs in a sample. The groupID of the NALUMapEntry map entry indicates the index, beginning from one, in the sample group description of the CorruptedSampleInfoEntry. A groupID set to zero indicates that no entry is associated therewith (the identified data are present and not corrupted).
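For the sake of illustration, for a sample containing three NAL units, the combination may look as follows, the indexes and mapping being examples only:

    sbgp: grouping_type = 'nalm', grouping_type_parameter = 'corr'
    sgpd with grouping_type = 'nalm', NALUMapEntry:
        NAL unit 1 -> groupID = 0   // present and not corrupted
        NAL unit 2 -> groupID = 1   // first CorruptedSampleInfoEntry of the 'corr'
                                    // sample group description
        NAL unit 3 -> groupID = 0   // present and not corrupted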
This sample group ‘corr’ with or without NALU mapping may be used in a media file even if no indexing is performed.
This sample group ‘corr’ with or without NALU mapping may also be used in a track with a sample entry of type ‘icpv’ (signalling an incomplete track) to provide more information on which samples or NALUs in a sample (when combined with NALU mapping) are corrupted or missing.
In an alternative, when the sample group ‘corr’ is combined with the NALU mapping, it may be defined as a virtual sample group, i.e. no sample group description box ‘sgpd’ is defined with grouping_type ‘corr’ and entries CorruptedSampleInfoEntry. Instead, when a SampleToGroupBox of grouping_type ‘nalm’ contains a grouping_type_parameter equal to the virtual sample group ‘corr’, the most-significant 2 bits of the groupID in the NALUMapEntry in the SampleGroupDescriptionBox with grouping_type ‘nalm’ directly provide the corrupted parameter value (as described above) associated with the NAL unit(s) mapped to this groupID.
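In other words, a parser may recover the corruption state directly from the groupID, for example as follows (assuming the 32-bit groupID field of NALUMapEntry):

    corrupted = (groupID >> 30) & 0x3;   // most-significant 2 bits carry the corrupted value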
In an alternative embodiment, the sample group ‘corr’ can be extended to signal codec-specific information describing the type of corruptions or losses in data of a sample. This item of information can be specified for each derived ISOBMFF specification (e.g. storage of NAL unit structured video in ISOBMFF ISO/IEC 14496-15, Omnidirectional MediA Format (OMAF) ISO/IEC 23090-2, Carriage of Visual Volumetric Video-based Coding (V3C) Data ISO/IEC 23090-10) or for each video codec, audio codec, or metadata specification (e.g. AVC, MVC, HEVC, VVC, AV1, VP9, AAC, MP3, MPEG-H 3D audio, XMP...). Each specification can define what should be indicated for such corrupted data in a sample.
For the sake of illustration and according to this alternative embodiment, an entry of a sample group description with grouping_type ‘corr’ may be defined as shown below.
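The sketch reflects the two parameters described hereafter, the field widths being given by way of example only:

    class CorruptedSampleInfoEntry() extends SampleGroupDescriptionEntry('corr') {
        unsigned int(8)  corrupted;             // corruption state, as defined above
        unsigned int(32) codec_specific_param;  // codec specific description of the corruption
    }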
where corrupted indicates the corruption state of the associated data, with the same semantics as described above, and where codec_specific_param provides codec specific information on the type of corruption or loss.
If no data are associated with a CorruptedSampleInfoEntry entry by a sample group with the grouping_type ‘corr’, or if data are associated with a group_description_index = 0 by a sample group with the grouping_type ‘corr’, this means that the data are not corrupted.
The processing of a sample with the corrupted parameter equal to 1 or 2 is context and implementation specific.
As an example, for NALU-based video formats (e.g. AVC, SVC, MVC, HEVC, VVC, or EVC, whose storage in ISOBMFF is specified in ISO/IEC 14496-15), the codec_specific_param parameter of the CorruptedSampleInfoEntry entry can be defined as a bit mask, with most significant bit first, of corruption flags, as shown below.
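The following set is given by way of example; only SEICorruptedFlag and ParameterSetCorruptedFlag are relied upon elsewhere in this description, the other names and all bit positions being illustrative:

    ParameterSetCorruptedFlag (e.g. 0x80): at least one parameter set NAL unit (e.g. VPS, SPS, PPS, or APS) of the sample is corrupted;
    SEICorruptedFlag (e.g. 0x40): at least one SEI NAL unit of the sample is corrupted;
    SliceHeaderCorruptedFlag (e.g. 0x20): the slice header of at least one VCL NAL unit of the sample is corrupted;
    SliceDataCorruptedFlag (e.g. 0x10): the slice data of at least one VCL NAL unit of the sample are corrupted.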
As another example, it is also possible to define codec specific corruption signalling that remains generic across several codecs; in this case, the codec_specific_param parameter of the CorruptedSampleInfoEntry entry can be defined as a bit mask, with most significant bit first, of generic corruption flags, as shown below.
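By way of example only, such generic flags may distinguish the broad classes of data that can be corrupted, the names and bit positions below being illustrative:

    HeaderCorruptedFlag (e.g. 0x80): configuration or header data are corrupted;
    MetadataCorruptedFlag (e.g. 0x40): descriptive metadata carried within the bitstream are corrupted;
    PayloadCorruptedFlag (e.g. 0x20): coded media payload data are corrupted.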
A codec_specific_param parameter with value 0 means that no information is available for describing the corruption.
A CorruptedSampleInfoEntry entry may be used with a sample group of the grouping_type ‘nalm’ and a NALUMapEntry, using the grouping_type_parameter ‘corr’. The groupID of the NALUMapEntry map entry indicates the index, starting from 1, in the sample group description of the grouping_type ‘corr’ of the CorruptedSampleInfoEntry entry. A groupID of 0 indicates that no entry is associated (the data identified by the sample group of grouping_type ‘nalm’ is present and not corrupted).
More generally, a CorruptedSampleInfoEntry entry may be used with any sample group providing a functionality similar to the sample group of the grouping_type ‘nalm’, i.e. that allows associating properties with sub-units of a sample, e.g. NAL units, subpictures, tiles, slices, or Open Bitstream Units.
In a variant, the ParameterSetCorruptedFlag flag may be split per NAL type, i.e. different values of the codec_specific_param bit-mask may be defined for each type of parameter set NAL units to signal if this specific type of parameter set NAL units is corrupted (e.g. the bit-masks DCICorruptedFlag, VPSCorruptedFlag, SPSCorruptedFlag, PPSCorruptedFlag, APSCorruptedFlag, OPICorruptedFlag, etc.).
In another variant, a specific value of the bit-mask codec_specific_param can be defined to signal that Picture Header NAL units are corrupted.
In the following,
This sample group description or alternatives with grouping_type ‘corr’ can also be used to signal corrupted data within a partial sub-segment and its corresponding byte range defined by a sub-segment index box ‘ssix’. A level value can be assigned to a CorruptedSampleInfoEntry through a level assignment box by setting the assignment type to zero (i.e. using sample groups) and the grouping type to ‘corr’.
As another alternative, rather than relying on the level assignment box ‘leva’, a new value of predefined feature type can be defined in the version 1 of the sub-segment index box ‘ssix’. For the sake of illustration, such a predefined feature type may correspond to the value three, signalling that each level value corresponds to a data integrity level.
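For instance, the level values may mirror the semantics of the corrupted parameter described above, the mapping below being given by way of example only:

    level 0: the data of the byte range are present and not corrupted;
    level 1: the data of the byte range are entirely lost;
    level 2: the data of the byte range are corrupted and cannot be recovered by a resilient decoder;
    level 3: the data of the byte range are corrupted but may still be processed by an error-resilient decoder.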
Accordingly, it is possible to signal whether a partial sub-segment is corrupted or not without going through a level assignment box and without defining a sample group of grouping_type ‘corr’.
In a variant, when the level indicates that the byte range is corrupted, an additional codec_specific_param parameter may also be defined with the same semantics as described above to indicate codec specific information on the corruption of the byte range.
Still according to another aspect of the invention, the parameter set NAL units (e.g. Video Parameter Set (VPS), Sequence Parameter Set (SPS), Picture Parameter Set (PPS), etc.) are indexed in the encapsulated bit-stream. To ease their indexing and to avoid multiplying the number of byte ranges (e.g. to avoid having one byte range per NAL unit), they can be grouped together in a continuous byte range. This can be done by defining an array of NAL units in the decoder configuration record in sample entries, but in such a case, the sample entries are all defined in the initial movie box ‘moov’ and cannot be updated on the fly. However, when the bit-stream is fragmented and encapsulated into multiple media segments, it may be useful to be able to update the array of parameter set NAL units per fragment.
According to some embodiments of the invention, the sample description box may be declared not only in the movie box ‘moov’ but also in the movie fragment box ‘moof’. It is then possible to declare new sample entries with an updated array of parameter set NAL units at movie fragment level. Samples are associated with a sample entry via a sample description index value. The range of values for the sample description index is split into two ranges to allow distinguishing sample entries defined in the movie box ‘moov’ from sample entries defined in a movie fragment box ‘moof’.
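For instance, mirroring the convention used in ISOBMFF for fragment-local sample group description indexes, the split may be defined as follows, the threshold value being an example only:

    values 1 to 0x10000: sample entries defined in the sample description box 'stsd' of the movie box 'moov';
    values 0x10001 and above: sample entries defined in the sample description box 'stsd' of the current movie fragment box 'moof', the value 0x10001 identifying the first of these entries.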
The sample entries given in a sample description box ‘stsd’ defined in a movie fragment are only valid for the corresponding movie fragment.
The updated parameter set NAL units defined in a movie fragment can be easily retrieved by using a sub-segment index box with version 1, a feature type equal to 1, and a level 0 to index the movie fragment containing the array of parameter set NAL units.
The ability to define new sample entries in a movie fragment box ‘moof’ in addition to the movie box ‘moov’ (denoted as dynamic sample entries) may be used in a media file even if no indexing is performed in order to provide updates of parameter sets without mixing corresponding non-VCL NALUs with VCL NAL units for the samples.
Having dynamic sample entries provides an alternative to in-band signalling of parameter sets or to the use of a dedicated parameter set track. This could be useful, for example in the VVC coding format, for Adaptation Parameter Set (APS) NALUs, which may be much more dynamic than other Parameter Set NALUs (e.g. Sequence Parameter Set (SPS) or Picture Parameter Set (PPS) NALUs).
In an alternative, new sample entry types may be reserved to indicate that tracks with those sample entry types contain dynamic sample entries.
In a variant use case, the ability to declare new sample entries in a movie fragment provides, for instance, a means to update over time the table of metadata keys (located in a Metadata Key Table Box declared in a sample entry of type ‘mebx’) in a multiplexed timed metadata track.
As illustrated, a first step is directed to determining whether the feature type is equal to zero (step 1000). If the feature type is equal to zero, the level attribute is interpreted according to the level assignment defined by the level assignment box ‘leva’ as defined in ISO/IEC 14496-12 (step 1005).
On the contrary, if the feature type is not equal to zero, a second test is carried out to determine whether the feature type is equal to one (step 1010). If the feature type is equal to one, the level attribute is interpreted as a dependency level (step 1015).
If the feature type is not equal to one, a third test is carried out to determine whether the feature type is equal to two (step 1020). If the feature type is equal to two, the level attribute is interpreted as a multitrack dependency level (step 1025). In such a case, the level attribute is composed of two items of information, a level (also denoted dependency level) as defined for the feature type equal to one and an identifier of the track to which the data of the byte range belong (step 1030).
Next, if the level attribute is interpreted as a dependency level or as a multitrack dependency level, the definition of the dependency level is obtained.
As illustrated, if a level value is equal to zero (reference 1035), this means that the associated byte range contains one or more complete file-level boxes (e.g. a movie fragment box, reference 1040). Media data boxes are not included in level 0 byte ranges.
If a level value is equal to one (reference 1045), this means that the associated data are independently decodable (SAP 1, 2 or 3, reference 1050). Byte ranges assigned to level 1 may contain the initial part of the sub-segment (e.g. movie fragment box). The beginning of a byte range assigned to level 1 coincides with the beginning of a top-level box in the sub-segment.
If the level value is equal to two (reference 1055), this means that the associated data are independently decodable (SAP 1, 2 or 3, reference 1060). The beginning of a byte range assigned to level 2 does not coincide with the beginning of a top-level box in the sub-segment.
If the level value is equal to N (step 1055), N being greater than two, this means that the associated data require data from the preceding byte ranges with lower levels (level N-1 and below) to be processed (step 1065), stopping at the last specified level 0 byte range if specified, otherwise at the last specified level 1 or 2 byte range if specified, otherwise at the first byte range. Byte ranges assigned to levels other than 2 may contain a movie fragment box.
As suggested with a dashed line arrow, the meaning of the level value is estimated for each byte-range.
According to this example, the level assignment is used to identify the byte ranges corresponding to the stream access points referenced 1105 and 1110 (e.g. instantaneous decoding refresh (IDR) frames) in the sub-segment referenced 1100. The feature type is set to the predefined value 1 (identifying dependency levels). In this example, there is no explicit range for the movie fragment box ‘moof’. The first byte range begins with a file-level box, the movie fragment box ‘moof’. It also includes the beginning of the media data box ‘mdat’ (i.e. its box header comprising its four-character code and the size) and the data corresponding to the first IDR frame (reference 1105).
The level value assigned to this first byte range is set to one since the byte range begins with a top-level box and contains independently decodable media data (SAP 1, 2, or 3). The second byte range, between the two IDR frames, is composed of predictively coded P-frames that depend on the decoding of the first IDR frame. Any level value N greater than two can be used to identify this byte range. The level value indicates that this byte range may depend on preceding byte ranges with level values smaller than N, up to the previous independently decodable media data, if any. The third byte range corresponds to the second IDR frame (reference 1110). It is assigned the level value two to indicate that this byte range does not begin with a top-level box and contains independently decodable media data (SAP 1, 2, or 3). The client can use this indication to jump directly to this stream access point. The fourth byte range, corresponding to another set of P-frames depending on the IDR frame 1110, is assigned a level N greater than two to signal their dependence on preceding byte ranges with level values smaller than N, up to the previous independently decodable media data (i.e. the IDR frame 1110).
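Expressed as entries of the sub-segment index box ‘ssix’, the indexing of this sub-segment may thus be the following, the range sizes in bytes being purely illustrative:

    range 1: level = 1, range_size = 2200    // 'moof' box, 'mdat' box header, and first IDR frame (SAP)
    range 2: level = 4, range_size = 51000   // P-frames depending on the first IDR frame
    range 3: level = 2, range_size = 1800    // second IDR frame (SAP, not beginning with a top-level box)
    range 4: level = 4, range_size = 48000   // P-frames depending on the second IDR frame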
This example is similar to the one illustrated in
This example illustrates a low latency DASH sub-segment 1160 composed of two chunks referenced 1165 and 1170 (each chunk corresponding to a media fragment). In this example, there is no explicit byte range for the initial ‘moof’. The feature type in the ‘ssix’ is set to the predefined value one (identifying dependency levels). Only the first chunk contains an IDR frame. Accordingly, the first chunk is divided into two byte ranges. The first byte range is assigned level one, indicating that the byte range begins with a file-level box and contains independently decodable data (SAP 1, 2 or 3). The second byte range is assigned level three (i.e. a value greater than two) because it contains dependently decodable data. A third byte range contains the complete second chunk 1170. This second chunk only contains predictively coded P-frames that depend by definition on frames of the preceding byte range. To signal this, the third byte range is assigned level value four because its data depend on data from the byte range assigned level three.
In these examples, each of the three sub-segments referenced 1200, 1210, and 1220 contains data corresponding to different tracks (described by track fragment boxes ‘traf’ with track_ID=1 and track_ID=2 (noted ID=1 and ID=2 respectively in
It is noted that the level only gives dependency information within a track and not dependency information between tracks.
The level value assigned to each byte range in the ‘ssix’ is divided into two parts: a first part contains the level assigned to the byte range (similar to the levels defined with the feature type equal to one) and a second part contains the track identifier (track_ID) corresponding to the data of this byte range.
The track_ID within the level attribute allows the client to select byte ranges pertaining to a given track only.
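For example, assuming a 32-bit level attribute, the two parts may be packed and recovered as follows, the bit split being an assumption made for illustration:

    level_attribute  = (track_ID << 8) | dependency_level;
    track_ID         = level_attribute >> 8;     // track to which the data of the byte range belong
    dependency_level = level_attribute & 0xFF;   // level as defined for the feature type equal to one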
As illustrated in
In
As illustrated, the first byte range includes the movie fragment ‘moof’ that is common to both tracks, but also the data of an IDR frame corresponding to data of the track having its identifier set to one (ID=1). In such a case, the track identifier (track_ID) assigned to the byte range is set to the identifier of the track to which the data of the IDR frame belong. Accordingly, the track identifier of the first byte range is one (track_ID=1).
The example illustrated in
The timed media content data represent timed data units of media content data (e.g. frames or parts of frames of a video bitstream such as, e.g., tiles, subpictures, blocks, open bitstream units or NAL units, or samples of an audio bitstream) encapsulated into a media file conformant with ISOBMFF and derived standards. Each timed media content data unit may be encapsulated as a sample, or several timed media content data units may be encapsulated as a sample, and stored in a data container box (e.g. MediaDataBox ‘mdat’ 1300 or IdentifiedMediaDataBox ‘imda’).
In this example, the corruption state of each sample is signalled by defining a SampleToGroupBox ‘sbgp’ 1350 and a SampleGroupDescriptionBox ‘sgpd’ 1360, both boxes having the same grouping_type, e.g. ‘corr’, identifying a corrupted sample group.
The SampleToGroupBox ‘sbgp’ 1350 describes a sequence of groups of samples and associates with each group an index to a description entry (CorruptedSampleInfoEntry) in the associated SampleGroupDescriptionBox ‘sgpd’ 1360. Three groups of samples are defined by the SampleToGroupBox 1350 (and noted (a), (b), (c) for the sake of illustration).
The first group is composed of two samples (sample_count = 2) 1310 and 1320 and is associated with the group description index 0, indicating that those samples are present and not corrupted.
The second group is composed of a single sample 1330 (sample_count = 1) and is associated with the first entry (CorruptedSampleInfoEntry) of the SampleGroupDescriptionBox 1360. This first entry indicates that this sample has been lost (corrupted = 1), i.e. this sample has no media content data and the sample size is equal to zero.
The third group is also composed of a single sample 1340 (sample_count = 1) and is associated with the second entry (CorruptedSampleInfoEntry) of the SampleGroupDescriptionBox 1360. This second entry indicates that this sample is corrupted (corrupted = 2) and provides codec-specific information on the type of corruption. The type of corruption can be the type of media content data that is corrupted (e.g. type of headers or type of descriptive metadata or data in the bitstream). As illustrated, two flag values (SEICorruptedFlag and ParameterSetCorruptedFlag) are set in the bit-mask codec_specific_param, indicating that at least one SEI NAL unit and at least one parameter set NAL unit are corrupted in the associated sample.
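In terms of box content, this example may thus be summarized as follows, the values reflecting the three groups (a), (b), and (c) described above:

    SampleToGroupBox 'sbgp' (grouping_type = 'corr'):
        entry 1: sample_count = 2, group_description_index = 0   // samples 1310 and 1320: present, not corrupted
        entry 2: sample_count = 1, group_description_index = 1   // sample 1330: lost
        entry 3: sample_count = 1, group_description_index = 2   // sample 1340: corrupted
    SampleGroupDescriptionBox 'sgpd' (grouping_type = 'corr'):
        entry 1: corrupted = 1                                   // data entirely lost, sample size 0
        entry 2: corrupted = 2, codec_specific_param = SEICorruptedFlag | ParameterSetCorruptedFlag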
As illustrated, a first step 1400 is directed to obtaining a plurality of timed media content data units that may suffer from loss or data corruption during the obtaining step. For example, this may happen when a media bitstream is received from an error-prone network using a non-reliable protocol (e.g. the Real-time Transport Protocol (RTP) or File Delivery over Unidirectional Transport (FLUTE)). This may also happen when a media bitstream is read from corrupted file storage.
At step 1410, it is determined whether at least one timed media content data unit is lost or corrupted. This can be determined by parsing the obtained media content data in order to detect missing data or syntax errors in the bitstream. This can also be determined by information provided by the storage device, the network or the transport protocol (e.g. RTCP feedbacks, checksum failure, Forward Error Correction failure, missing packets, etc.).
If a data corruption or loss is detected at step 1410, a first indication is generated at step 1420 to signal whether the timed media content data unit is fully lost or partially corrupted (e.g. as illustrated by the ‘corrupted’ parameter in
At step 1430, a second indication is generated to provide codec specific information on the type of corruption or type of media content data (e.g. as illustrated by the ‘codec_specific_param’ parameter in
At step 1440, the obtained plurality of timed media content data units and the first and second indications are encapsulated into a media file, e.g. according to ISOBMFF or an ISOBMFF-based or derived specification.
As illustrated, a first step 1500 is directed to obtaining a media file comprising a plurality of timed media content data units. The media file can be obtained by reading it on a storage device or by receiving it from the network (e.g. using a TCP or UDP based protocol).
At step 1510, it is checked whether there is a first indication signalling that at least one timed media content data unit of the plurality of timed data units is corrupted or lost. The first indication may be obtained by parsing the descriptive metadata of the media file (e.g. the MovieBox ‘moov’ of an ISOBMFF file).
At step 1520, after having obtained a first indication indicating that at least one timed media content data unit of the plurality of timed data units is corrupted, a second indication is obtained, this second indication providing codec specific information on the type of corruption (or type of media content data that is corrupted).
At step 1530, it is determined whether the processing to be performed on the plurality of timed media content data units is resilient to the type of corruption (for example, loss of a slice header of a NAL unit), i.e. whether the corrupted data can be recovered during the processing or cannot be recovered. The processing can correspond to parsing, decoding, or displaying the bitstream represented by the plurality of timed media content data units.
At step 1540, if it is determined that the processing may be resilient to signalled types of corruption, the media file is de-encapsulated and the plurality of timed media content data units are processed.
The second indication is useful to avoid starting the processing of corrupted timed media content data when the types of corruption cannot be recovered by the processing.
Therefore, according to these embodiments, the invention provides a method for encapsulating timed media content data, the timed media content data comprising a plurality of timed media content data units, the method being carried out by a server and comprising:
According to some embodiments, the generated indication is a first generated indication, the method further comprising generating a second indication upon determining that at least one timed media content data unit is corrupted, the second indication being a parameter of the sample group of the predetermined type, according to ISOBMFF or any ISOBMFF derived specification, signalling a type of corruption. The type of corruption may depend on a codec used to encode the timed media content data units. A second indication may be generated for each corrupted timed media content data unit.
According to some embodiments, a timed media content data unit is a sample, a frame, a tile, a subpicture, a block, an open bitstream unit, or a NAL unit.
Still according to some embodiments, the sample group of the predetermined type comprises the number of timed media content data units that are not lost and not corrupted, the number of timed media content data units that are lost, the number of timed media content data units that are corrupted and that are associated with a second indication, and/or the number of timed media content data units that are corrupted and that are not associated with a second indication.
Still according to the embodiments described above, the invention provides a method for processing encapsulated timed media content data, the timed media content data comprising a plurality of timed media content data units, the method being carried out by a client and comprising:
According to some embodiments, the obtained indication is a first obtained indication, the method further comprising obtaining a second indication, the second indication being a parameter of the sample group of the predetermined type, according to ISOBMFF or any ISOBMFF derived specification, signalling a type of corruption, the obtained timed media content data units being processed as a function of the obtained first and second indications to generate a media bitstream complying with a predetermined standard. The type of corruption may depend on a codec used to encode the timed media content data units. A second indication may be obtained for each corrupted timed media content data unit.
According to some embodiments, a timed media content data unit is a sample, a frame, a tile, a subpicture, a block, an open bitstream unit, or a NAL unit.
Still according to the embodiments described above, the invention provides a computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing each of the steps of the method described above when loaded into and executed by the programmable apparatus.
Still according to the embodiments described above, the invention provides a non-transitory computer-readable storage medium storing instructions of a computer program for implementing each of the steps of the method described above.
Still according to the embodiments described above, the invention provides a device for encapsulating timed media content data or processing encapsulated timed media content data, the device comprising a processing unit configured for carrying out each of the steps of the method described above.
The executable code may be stored either in read only memory 1606, on the hard disk 1610, or on a removable digital medium such as a disk, for example. According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 1612, in order to be stored in one of the storage means of the communication device 1600, such as the hard disk 1610, before being executed.
The central processing unit 1604 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 1604 is capable of executing instructions from main RAM memory 1608 relating to a software application after those instructions have been loaded from the program ROM 1606 or the hard disk (HD) 1610, for example. Such a software application, when executed by the CPU 1604, causes the steps of the flowcharts shown in the previous figures to be performed.
In this embodiment, the apparatus is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.