The present embodiments generally relate to extended reality scene description and extended reality scene rendering.
Extended reality (XR) is a technology enabling interactive experiences where the real-world environment and/or video content is enhanced by virtual content, which can be defined across multiple sensory modalities, including visual, auditory, haptic, etc. During runtime of the application, the virtual content (3D content or an audio/video file, for example) is rendered in real time in a way that is consistent with the user context (environment, point of view, device, etc.). Scene graphs (such as the one proposed by Khronos/glTF and its extensions defined in the MPEG Scene Description format, or Apple/USDZ, for instance) are a possible way to represent the content to be rendered. They combine a declarative description of the scene structure linking real-environment objects and virtual objects on the one hand, and binary representations of the virtual content on the other hand. Scene description frameworks ensure that the timed media and the corresponding relevant virtual content are available at any time during the rendering of the application. Scene descriptions can also carry data at scene level describing how a user can interact with the scene objects at runtime for immersive XR experiences.
According to an embodiment, a method is provided, comprising: obtaining at least a parameter, at node level, used to indicate that an object corresponding to a node is in proximity of another object corresponding to another node, from a description for an extended reality scene; and activating a trigger to an action responsive to said object being in proximity of said another object.
According to another embodiment, an apparatus is provided, comprising one or more processors and at least one memory coupled to said one or more processors, wherein said one or more processors are configured to: obtain at least a parameter, at node level, used to indicate that an object corresponding to a node is in proximity of another object corresponding to another node, from a description for an extended reality scene; and activate a trigger to an action responsive to said object being in proximity of said another object.
One or more embodiments also provide a computer program comprising instructions which when executed by one or more processors cause the one or more processors to perform the method according to any of the embodiments described herein. One or more of the present embodiments also provide a computer readable storage medium having stored thereon instructions for processing scene description according to the methods described herein.
One or more embodiments also provide a computer readable storage medium having stored thereon video data generated according to the methods described above. One or more embodiments also provide a method and apparatus for transmitting or receiving the video data generated according to the methods described herein.
Various XR applications may apply to different contexts and real or virtual environments. For example, in an industrial XR application, a virtual 3D content item (e.g., a piece A of an engine) is displayed when a reference object (piece B of an engine) is detected in the real environment by a camera rigged on a head-mounted display device. The 3D content item is positioned in the real world with a position and a scale defined relative to the detected reference object.
For example, in an XR application for interior design, a 3D model of a piece of furniture is displayed when a given image from the catalog is detected in the input camera view. The 3D content is positioned in the real world with a position and scale defined relative to the detected reference image. In another application, an audio file might start playing when the user enters an area close to a church (being real or virtually rendered in the extended real environment). In another example, an ad jingle file may be played when the user sees a can of a given soda in the real environment. In an outdoor gaming application, various virtual characters may appear, depending on the semantics of the scenery observed by the user. For example, bird characters are suitable for trees, so if the sensors of the XR device detect real objects described by the semantic label ‘tree’, birds can be added flying around the trees. In a companion application implemented by smart glasses, a car noise may be launched in the user's headset when a car is detected within the field of view of the user camera, in order to warn the user of the potential danger. Furthermore, the sound may be spatialized in order to make it arrive from the direction in which the car was detected.
An XR application may also augment video content rather than a real environment. The video is displayed on a rendering device and virtual objects described in the node tree are overlaid when timed events are detected in the video. In such a context, the node tree comprises only descriptions of virtual objects.
Device 130 comprises the following elements that are linked together by a data and address bus 131:
In accordance with an example, the power supply is external to the device. In each of the mentioned memories, the word “register” used in the specification may correspond to an area of small capacity (a few bits) or to a very large area (e.g., a whole program or a large amount of received or decoded data). The ROM 133 comprises at least a program and parameters. The ROM 133 may store algorithms and instructions to perform techniques in accordance with the present principles. When switched on, the CPU 132 uploads the program into the RAM and executes the corresponding instructions.
The RAM 134 comprises, in a register, the program executed by the CPU 132 and uploaded after switch-on of the device 130, input data in a register, intermediate data in different states of the method in a register, and other variables used for the execution of the method in a register.
Device 130 is linked, for example via bus 131, to a set of sensors 137 and to a set of rendering devices 138. Sensors 137 may be, for example, cameras, microphones, temperature sensors, Inertial Measurement Units, GPS receivers, hygrometry sensors, IR or UV light sensors or wind sensors. Rendering devices 138 may be, for example, displays, speakers, vibrators, heaters, fans, etc.
In accordance with examples, the device 130 is configured to implement a method according to the present principles, and belongs to a set comprising:
In XR applications, a scene description is used to combine an explicit and easy-to-parse description of a scene structure with binary representations of media content.
In time-based media streaming, the scene description itself can be time-evolving to provide the relevant virtual content for each sequence of a media stream. For instance, for advertising purposes, a virtual bottle can be displayed on a table during a video sequence where people are seated around the table. This kind of behavior can be achieved by relying on the framework defined in the Scene Description for MPEG media document.
Although the MPEG-I Scene Description framework ensures that the timed media and the corresponding relevant virtual content are available at any time, there is no description of how a user can interact with the scene objects at runtime for immersive XR experiences.
In our previous work, a solution was proposed to augment the time-evolving scene description by adding “behavior” data. These behaviors are related to pre-defined virtual objects on which runtime interactivity is allowed for user-specific XR experiences. These behaviors are also time-evolving and are updated through the existing scene description update mechanism.
A behavior comprises:
Behavior 410 takes place at scene level. A trigger is linked to nodes and to the nodes' child nodes. In the example of
Different formats can be used to represent the node tree. For example, the MPEG-I Scene Description framework using the Khronos glTF extension mechanism may be used for the node tree. In this example, an interactivity extension may apply at the glTF scene level and is called MPEG_scene_interactivity. The corresponding semantics are provided in Table 1, where ‘M’ in the ‘Usage’ column indicates that the field is mandatory in an XR scene description format and ‘O’ indicates that the field is optional.
In this example, items of the array of field ‘triggers’ are defined according to Table 2.
As can be seen from the example of Table 2, the proximity can be handled by a proximity trigger at the scene level, with attributes distanceLowerLimit and distanceUpperLimit.
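As an illustration only, such a scene-level proximity trigger entry could look as follows, written here as a Python literal. Only the distanceLowerLimit and distanceUpperLimit attributes are taken from the description above; the "type" label, the node indices and the unit are assumptions of this sketch, the normative fields being those of Table 2.

```python
# Hypothetical scene-level proximity trigger entry (illustrative sketch only).
scene_level_proximity_trigger = {
    "type": "proximity",          # assumed label for the trigger type
    "nodes": [2, 5],              # indices of the nodes monitored by the trigger
    "distanceLowerLimit": 0.0,    # assumed to be expressed in meters
    "distanceUpperLimit": 1.5
}
```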
In another embodiment, the node “U” can be a node of the scene. The application is responsible for checking the threshold min and max on nodes to determine whether to activate the trigger at runtime.
The proximity criteria will be computed between the user (e.g., the camera) and the considered nodes. Behaviors take place at scene level. With this mechanism, a set of nodes is considered to compute the activation of the trigger. Using Table 2 as an example, the distance between the nodes referenced in the “nodes” field of “triggers” and the “User” node, for example, is computed and compared with distanceLowerLimit and distanceUpperLimit. If the distance lies between distanceLowerLimit and distanceUpperLimit, then the trigger is activated. However, in this generic mechanism for a proximity trigger, the parameters are defined at scene level and therefore the same proximity computation is used for each node.
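A minimal sketch of this scene-level check is given below, assuming the illustrative trigger structure shown earlier and a hypothetical node_centroid() helper. Whether all referenced nodes or only one of them must be within range is not specified here; this sketch requires all of them.

```python
import math

def node_centroid(node):
    # Hypothetical helper returning the world-space centroid of a node's geometry.
    return node["centroid"]

def is_proximity_trigger_active(trigger, user_node, nodes):
    user_position = node_centroid(user_node)
    for node_index in trigger["nodes"]:
        distance = math.dist(user_position, node_centroid(nodes[node_index]))
        # The scene-level limits are applied identically to every referenced node.
        if not (trigger["distanceLowerLimit"] <= distance <= trigger["distanceUpperLimit"]):
            return False
    return True
```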
However, the creator of a virtual scene may require defining node-specific proximity criteria depending on the nature and/or size of the object geometry related to that node; for example, proximity criteria may differ for a large or a small object, or the angle of approach may be considered.
During computation, there may be different solutions to manage occlusion (an object between the user and the target), for example, by adding a visibility trigger, or adding, in the attribute list, a set of nodes not considered during the computation (for example, in
This disclosure provides a solution to specialize the proximity criteria for any node of a scene description. In one embodiment, we propose to augment a scene description by specializing the criteria of proximity at the node level for dedicated proximity triggering. For example, the scene description augmentation at the node level includes one or more of the following parameters:
This node-level information also encompasses the child nodes of this node, if present.
In the following, the proposed augmentation is described using the MPEG-I Scene Description framework with the Khronos glTF extension mechanism to support additional scene description features. However, the present principles may be applied to other existing or upcoming descriptions of XR scenes.
The “MPEG_node_proximity_trigger” extension is defined at node level. When this extension is carried by a node, specific attributes are used for the computation of the trigger. The semantics of MPEG_node_proximity_trigger at node level are provided in TABLE 3.
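For illustration only, a node carrying this extension might look as follows, written as a Python literal. The attribute names and values inside the extension are assumptions of this sketch, chosen to reflect the weight, frustum and specific-mesh criteria discussed in this description; the normative semantics are those of TABLE 3.

```python
# Hypothetical node carrying the node-level proximity extension (illustrative only).
node_with_proximity_extension = {
    "name": "virtual_bottle",
    "mesh": 4,
    "extensions": {
        "MPEG_node_proximity_trigger": {
            "weight": 2.0,        # assumed scaling factor for the scene-level distance limits
            "useFrustum": True,   # assumed flag restricting activation to a frustum
            "proximityMesh": 7    # assumed index of a specific mesh defining the trigger volume
        }
    }
}
```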
The most efficient placement to define geospatial coordinates is at scene level, but alternative placement could be envisaged (e.g., at the node level).
Parameters distanceLowerLimit and distanceUpperLimit are defined at scene level. Hence, the same values are applied to the set of nodes handled by the trigger (attribute “nodes”). With the weight attribute, it is possible to apply a scaling factor to the distance parameters at node level: the lower and upper distance limits are multiplied by the node's weight. This is useful if the designer wishes to give a different weight to the nodes, for example, to take into account their size. The same result can be achieved by setting a trigger per node at scene level.
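A minimal sketch of such a per-node scaling, assuming the illustrative field names used in the earlier sketches rather than a normative schema:

```python
def scaled_limits(trigger, node):
    # Sketch only: field names follow the earlier illustrations, not a normative schema.
    extension = node.get("extensions", {}).get("MPEG_node_proximity_trigger", {})
    weight = extension.get("weight", 1.0)  # default: no scaling
    # Both the lower and the upper scene-level limits are multiplied by the node weight,
    # e.g. a large object may use weight > 1 to trigger from farther away.
    return (trigger["distanceLowerLimit"] * weight,
            trigger["distanceUpperLimit"] * weight)
```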
We propose to add a frustum to activate the proximity trigger, as illustrated in
With this approach, it is possible to capture objects inside the frustum (the portion of the pyramid between the near plane and the far plane in
For simplicity, in the following examples, the frustum is represented as a triangle, the near plane is set to 0 and the far plane is set to a fixed value.
A syntax example of such proximity extensions in the MPEG-I Scene Description is provided in TABLE 4. Fields introduced by the present methods are in bold.
At the scene level:
At the node level:
Another syntax example of proximity extensions in the MPEG-I Scene Description is provided in TABLE 5.
At the scene level:
At the node level:
During runtime, the application iterates over each defined behavior (which could be defined at scene level). If a “proximity extended” node is affected, the attributes listed in TABLE 3 are used to compute the activation of the proximity trigger.
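A minimal sketch of this runtime iteration is given below; it reuses the is_proximity_trigger_active() check sketched earlier, and the behavior/trigger layout as well as the launch_action() dispatcher are assumptions of this sketch.

```python
def process_behaviors(scene, user_node):
    for behavior in scene.get("behaviors", []):
        for trigger in behavior.get("triggers", []):
            if trigger.get("type") != "proximity":
                continue
            # Nodes carrying the node-level extension contribute their own
            # attributes (TABLE 3), e.g. a per-node weight, to this computation.
            if is_proximity_trigger_active(trigger, user_node, scene["nodes"]):
                for action in behavior.get("actions", []):
                    launch_action(action)  # hypothetical action dispatcher
```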
To compute proximity at scene level, the Euclidean distance from the node to the centroid of the user can be used.
To check the presence in the frustum, we loop through each plane of the frustum and compute the signed distance from the position of the centroid to this plane.
A specific mesh can be provided; this mesh defines a volume used for the trigger calculation and is not necessarily centered on the related node geometry. To check the presence in a specific mesh, for example a polyhedron, it is possible to check that the user's centroid lies on the inner side of all the hyperplanes defining the polyhedron.
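Both the frustum test and the convex trigger-mesh test above reduce to checking the user's centroid against a set of planes. A minimal sketch follows, assuming each plane is given as a pair (normal, d) for the equation n·x + d = 0 with an inward-pointing normal; this plane representation is an assumption of the sketch.

```python
def point_inside_convex_volume(point, planes):
    # Each plane is given as (normal, d) for the equation n·x + d = 0,
    # with normals assumed to point towards the inside of the volume.
    for normal, d in planes:
        signed_distance = sum(n * p for n, p in zip(normal, point)) + d
        if signed_distance < 0.0:
            return False  # the centroid lies on the outer side of this plane
    return True
```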
When a weight is applied, the lower and the upper distances are multiplied by this value. This is useful to take into account the size of a node.
At runtime, for each concerned node, the proximity criteria related to the trigger parameters are checked (1360). Depending on the parameters, the trigger may be activated. In particular, if the proximity criteria are satisfied, then the trigger is activated (1370) and the associated actions are launched (1380). The monitoring of the proximity criteria continues during the runtime, as the nodes and the user may move around.
Various numeric values are used in the present application. The specific values are for example purposes and the aspects described are not limited to these specific values.
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., such as, for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.
The implementations and aspects described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device.
Processors also include communication devices, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
Additionally, this application may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
As will be evident to one of ordinary skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.