The present disclosure relates to an information processing device, a method, and a program.
For the purpose of achieving audio reproduction with a higher sense of realism, for example, MPEG-H 3D Audio has been known as an encoding technique to transmit plural pieces of audio data prepared for each audio object (refer to Non-Patent Literature 1).
Plural pieces of encoded audio data are provided to a user together with image data, included, for example, in a content file such as an ISO base media file format (ISOBMFF) file, a standard of which is defined in Non-Patent Literature 2 below.
On the other hand, multi-view contents enabled to display images while switching viewpoints have recently been becoming common. In sound reproduction of such a multi-view content, there have been cases in which positions of audio objects do not match before and after a viewpoint switch, giving a sense of awkwardness to a user.
Accordingly, in the present disclosure, an information processing device, an information processing method, and a program that are capable of reducing a sense of awkwardness given to a user, by performing a position correction of an audio object at the time of switching viewpoints among plural viewpoints, are proposed.
According to the present disclosure, an information processing device is provided that includes: a metadata-file generating unit that generates a metadata file including viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
Moreover, according to the present disclosure, an information processing method is provided that is performed by an information processing device, the method including: generating a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
Moreover, according to the present disclosure, a program is provided that causes a computer to implement a function of generating a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
As explained, according to the present disclosure, a sense of awkwardness given to a user can be reduced by performing a position correction of an audio object at the time of switching viewpoints among plural viewpoints.
Note that the effect described above is not necessarily limiting, and any effect described in the present application, or other effects that can be understood from the present application, may be produced together with the above effect or instead of it.
Hereinafter, exemplary embodiments of the present disclosure will be explained in detail with reference to the accompanying drawings. Note that common reference symbols are assigned to components having substantially the same functional configurations throughout the present specification and the drawings, and duplicated explanation will be thereby omitted.
Moreover, in the present application and the drawings, plural components having substantially the same functional configurations can be distinguished thereamong by adding different alphabets at the end of the same reference symbols. However, when it is not necessary to particularly distinguish respective plural components having substantially the same functional configurations, only the same reference symbol is assigned.
Explanation will be given in the following order.
First, the background of the present disclosure will be explained.
Multi-view contents enabled to display images while switching viewpoints have recently been becoming common. Such a multi-view content includes not only a two-dimensional image, but also a 360° whole sky image that is taken by a whole sky camera or the like, as images corresponding to respective viewpoints. When a 360° whole sky image is displayed, a partial range is cut out from the whole sky image, and the cut-out display image is displayed based on, for example, an input by a user or a viewing position and direction of a user determined by sensing. Of course, also when a 2D image is displayed, a display image obtained by cutting out a partial range from the 2D image can be displayed.
A use case in which a user views such a multi-view content including both a 360° whole sky image and a 2D image while changing a cut-out range for a display image will be explained, referring to
In the example illustrated in
Moreover, in
When the number of pixels of a display image is smaller than the number of pixels of a display device, enlargement processing is performed for display. The number of pixels of a display image is determined by the number of pixels of the cut-out source and the size of the cut-out range; when the number of pixels of the 360° whole sky image G10 is small, or when the size of the range to be cut out for the display image G14 is small, the number of pixels of the display image G14 is also small. In such a case, degradation of image quality, such as blurriness, can occur in the display image G14 as illustrated in
When a range corresponding to the display image G14 is contained in the 2D image G20 and the number of pixels of the 2D image G20 is large, a viewpoint switch can be considered. By switching the viewpoint to display the 2D image G20 and then further increasing the zoom factor or the like, a display image G22 that is obtained by cutting out, from the 2D image G20, a range R1 corresponding to the display image G14 can be displayed. The display image G22 displays the range corresponding to the display image G14, is expected to cause less degradation in image quality than the display image G14, and can withstand viewing at a further increased zoom factor.
When a 360° whole sky image is to be displayed, degradation of image quality can occur not only when the zoom factor is large, but also when the zoom factor is small. For example, when the zoom factor is small, a distortion included in a display image that is cut out from a 360° whole sky image can be significantly noticeable. In such a case also, switching to a 2D image is effective.
However, when the display is switched from the display image G14 to the 2D image G20 itself, the size of the subject changes, and a sense of awkwardness can therefore be given to a user. Accordingly, it is preferable that the display be switched directly from the display image G14 to the display image G22 at the time of switching the viewpoints. For example, to switch the display directly from the display image G14 to the display image G22, it is necessary to identify the size and the position of a center C of the range R1 corresponding to the display image G14 in the 2D image G20.
When viewpoints are switched within a 360° whole sky image, a display angle of view (angle of view of zoom factor 1) that enables a subject to be seen about the same as that in the real world can be calculated and, therefore, the sizes of the subject can be matched between before and after the switch.
However, in the case of a 2D image, the image can be stored in a zoomed state at the time of shooting, and information about the angle of view at the time of shooting is not necessarily provided. In that case, a shot image is zoomed in and out for display on the reproduction side, and the true zoom factor (display angle of view) of the currently displayed image with respect to the real world is the product of the zoom factor at the time of shooting and the zoom factor at the time of reproduction. When the zoom factor at the time of shooting is unknown, the true zoom factor of the currently displayed image with respect to the real world is also unknown. Therefore, it becomes impossible to match the sizes of the subject before and after a switch in a use case in which a viewpoint switch is performed. Note that such a phenomenon can occur at a viewpoint switch between a 360° whole sky image that can be zoomed or rotated and a 2D image, or between plural 2D images.
To make the subject appear in sizes equivalent to each other between before and after a viewpoint switch, it is necessary to acquire a value of a display magnification of the image before a switch, and to appropriately set a display magnification of the image after the switch to be the same as the value.
A display magnification of an image viewed by a user is determined by three parameters: the angle of view at the time of shooting, the cut-out angle of view of the display image from the original image, and the display angle of view of the display device at the time of reproduction. Moreover, the true display magnification (display angle of view) of an image finally viewed by a user with respect to the real world can be calculated as follows.
True Display Angle of View = (Angle of View at Shooting) × (Cut-Out Angle of View from Original Image of Display Image) × (Display Angle of View of Display Device)
In the case of a 360° whole sky image, the angle of view at the time of shooting is 360°. Furthermore, as for the cut-out angle of view, the corresponding angle of view can be calculated from the number of pixels in the cut-out range. Moreover, because information about the angle of view of the display device is determined by the reproduction environment, the final display magnification can be calculated.
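To make the calculation above concrete, the following is a minimal sketch in Python, assuming an equirectangular 360° whole sky image in which horizontal pixels map linearly onto the shooting angle of view; the function and parameter names are illustrative and not part of the present disclosure.

```python
def cutout_angle_deg(image_width_px: int, cutout_width_px: int,
                     shooting_angle_deg: float = 360.0) -> float:
    # For an equirectangular 360-degree whole sky image, horizontal pixels
    # map linearly onto the shooting angle of view.
    return shooting_angle_deg * cutout_width_px / image_width_px


def display_magnification(cutout_deg: float, device_angle_deg: float) -> float:
    # A magnification of 1.0 means the subject subtends the same angle on
    # the display device as it does in the real world.
    return device_angle_deg / cutout_deg


# Example: a 640-pixel cut-out from a 3840-pixel-wide whole sky image spans
# 60 degrees of the real world; shown on a display device occupying 30 degrees
# of the viewer's field of view, the true display magnification is 0.5.
angle = cutout_angle_deg(3840, 640)        # 60.0
print(display_magnification(angle, 30.0))  # 0.5
```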
On the other hand, in the case of a 2D image, information about the angle of view at the time of shooting generally cannot be obtained, or is often lost during creation. Moreover, while the cut-out angle of view can be acquired as a relative position within the original image, the corresponding angle of view as an absolute value in the real world cannot be acquired. Therefore, it is difficult to acquire the final display magnification.
Furthermore, in a viewpoint switch between a 360° whole sky image and a 2D image, it is necessary to match the directions of the subject. Accordingly, direction information at the time of shooting of the 2D image is also necessary. If a 360° whole sky image is an image conforming to the omnidirectional media application format (OMAF), direction information is recorded as metadata, but direction information generally cannot be acquired from 2D images.
As described, to enable the sizes of a subject to be matched between a 360° whole sky image and a 2D image at a viewpoint switch involving zooming, information of the angle of view and the direction at the time when the 2D image was shot is necessary.
In reproduction of a multi-view content, it is preferable that the position of a sound source (hereinafter also referred to as an audio object) be appropriately changed according to zooming or a viewpoint switch. In MPEG-H 3D Audio described in Non-Patent Literature 1 above, a mechanism of correcting the position of an audio object in response to zooming of an image is defined. Hereinafter, such a mechanism will be explained.
In MPEG-H 3D Audio, the following two position correcting functions for an audio object are provided.
(First Correcting Function): A position of an audio object is corrected when a display angle of view at the time of content creation, at which the positioning of image and sound was performed, differs from a display angle of view at the time of reproduction.
(Second Correcting Function): A position of an audio object is corrected, following zooming of an image at the time of reproduction.
First, the first correcting function described above will be explained, referring to
In the example illustrated in
As illustrated in
The example of displaying a content thus created at the display angle of view of 120° is illustrated in
Subsequently, the second correcting function described above will be explained, referring to
In MPEG-H 3D Audio, the two position correcting functions of an audio object explained above are provided. However, with these position correcting functions, there are cases in which a position correction of an audio object cannot be performed appropriately when a viewpoint switch is performed along with zooming.
A position correction of an audio object necessary in a use case assuming a viewpoint switch performed along with zooming will be explained, referring to
In the example illustrated in
In the example illustrated in
In the example illustrated in
Furthermore, performing a viewpoint switch with respect to the 360° whole sky image in the example illustrated in
However, as described above, the true display magnification with respect to the real world at the time of reproduction of a 2D image is unknown, and the true display magnification at the time of reproduction of the 2D image does not necessarily coincide with that of the 360° whole sky image after the viewpoint switch described above. Therefore, with such a viewpoint switch, the sizes of the subject do not match.
Moreover, a mismatch can also occur in the position of an audio object before and after the viewpoint switch, and a sense of awkwardness can be given to the user. Therefore, it is preferable that correction to match the positions of an audio object also be performed between before and after a viewpoint switch, along with matching the sizes of the subject.
In the example illustrated in
Furthermore, in the example illustrated in
Furthermore, in
As described above, when a position of an audio object is determined based on an image at such a zoom factor that the true display magnification with respect to the real world is not 1 fold, the display image G24 to be displayed at the time of reproduction, and a rotation angle of the display image G24 with respect to the real world are unknown. Accordingly, a moving angle of the audio object that is moved in accordance with a move of the cut-out range with respect to the real world is also unknown.
However, when it is transitioned from a state in which the display image G24 is displayed to a state in which the display image G26 is displayed, it is possible to correct the position of the audio object by using the position correcting function of an audio object provided in MPEG-H 3D Audio as explained referring to
Focusing on the circumstances described above, respective embodiments according to the present disclosure have been achieved. According to the respective embodiments explained hereinafter, it is possible to reduce a sense of awkwardness given to a user by performing position correction of an audio object at a viewpoint switch among multiple viewpoints. In the following, a basic principle of the technique according to the present disclosure (hereinafter, also referred to as present technique) common among the respective embodiments of the present disclosure will be explained.
<<2-1. Overview of Present Technique>>
When a display image G16 that is obtained by cutting out a range R5 of the display image G12 is displayed from a state in which the display image G12 is displayed, deterioration of image quality can occur. Therefore, a viewpoint switch to the viewpoint of the 2D image G20 is considered. At this time, in the present technique, a range R6 corresponding to the display image G16 in the 2D image G20 is automatically identified, and the display image G24 in which the size of the subject is kept is thereby displayed, without displaying the entire 2D image G20. Furthermore, in the present technique, the size of the subject is also kept when a viewpoint switch from the viewpoint of the 2D image G20 to the 2D image G30 is performed. In the example illustrated in
Moreover, in the present technique, at the viewpoint switch described above, the position correction of an audio object is performed, and reproduction is performed at a position of a sound source according to the viewpoint switch. According to such a configuration, a sense of awkwardness given to the user's sense of hearing can be reduced.
To achieve the effects explained referring to
<<2-2. Multi-View Zoom-Switch Information>>
One example of the multi-view zoom-switch information will be explained, referring to
As illustrated in
The image type information is information indicating the type of the image relating to the viewpoint associated with the multi-view zoom-switch information, and can indicate, for example, a 2D image, a 360° whole sky image, or another type.
The shooting-related information is information about the shooting of the image relating to the viewpoint associated with the multi-view zoom-switch information. For example, the shooting-related information includes shooting position information relating to the position of the camera used to take the image. Moreover, the shooting-related information includes shooting direction information relating to the direction of the camera used to take the image. Furthermore, the shooting-related information includes shooting angle-of-view information relating to the angle of view (horizontal angle of view, vertical angle of view) of the camera used to take the image.
The angle-of-view information at the time of content creation is information of the display angle of view (horizontal angle of view, vertical angle of view) at the time of content creation. It is also reference angle-of-view information relating to the angle of view of the screen that is referred to when position information of an audio object relating to the viewpoint associated with the viewpoint switch information is determined. Moreover, the angle-of-view information at the time of content creation may be information corresponding to mae_ProductionScreenSizeData() in MPEG-H 3D Audio.
By using the shooting-related information, and the angle-of-view information at the time of content creation, display while keeping a size of a subject is enabled, and the position correction of an audio object is enabled.
The switch-destination viewpoint information is information relating to a switch destination viewpoint to which the viewpoint associated with the multi-view zoom-switch information can be switched. As illustrated in
The switch-destination viewpoint information may be, for example, information to switch to a switch destination viewpoint. In the example illustrated in
For example, in the example illustrated in
The threshold information may be information of a threshold of, for example, a maximum display magnification. For example, in the region R11 of the viewpoint VP1, when the display magnification becomes 3-fold or larger, the viewpoint switch to the viewpoint VP2 is performed. Moreover, in the region R12 of the viewpoint VP1, when the display magnification becomes 2-fold or larger, the viewpoint switch to the viewpoint VP3 is performed.
As above, one example of the switch-destination viewpoint information has been explained, referring to
For example, the switch-destination viewpoint information may be set in multiple stages. Furthermore, the switch-destination viewpoint information may be set such that viewpoints are mutually switchable. For example, it may be set such that the viewpoint VP1 and the viewpoint VP2 can be mutually switched, and the viewpoint VP1 and the viewpoint VP3 can be mutually switched.
Moreover, the switch-destination viewpoint information may be set such that different paths can be taken among viewpoints. For example, it may be set such that it can be switched from the viewpoint VP1 to the viewpoint VP2, and from the viewpoint VP2 to the viewpoint VP3, and from the viewpoint VP3 to the viewpoint VP1.
Furthermore, when viewpoints are mutually switchable, a hysteresis may be provided in the switch-destination viewpoint information by varying the threshold information depending on the direction of the switch. For example, the threshold for switching from the viewpoint VP1 to the viewpoint VP2 may be set to 3-fold, and the threshold for switching from the viewpoint VP2 back to the viewpoint VP1 to 2-fold. According to such a configuration, frequent back-and-forth viewpoint switching is less likely to occur, and a sense of awkwardness given to a user can be further reduced.
Moreover, regions in the switch-destination viewpoint information may overlap each other. In the example illustrated in
Furthermore, the threshold information included in the switch-destination viewpoint information may be information of a minimum display magnification, not just the maximum display magnification. For example, in the example illustrated in
Moreover, the maximum display magnification or the minimum display magnification may be set in a region having no switch destination viewpoint. In such a case, a zoom change may be stopped at the maximum display magnification or at the minimum display magnification.
Furthermore, when the image subject to the viewpoint switch is a 2D image, the switch-destination viewpoint information may include information of a default initial display range to be displayed right after the switch. As described later, while a display magnification and the like at a switch destination viewpoint can be calculated, a content creator may intentionally configure a default display range for each switch destination viewpoint. For example, in the example illustrated in
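To make the structure of the multi-view zoom-switch information concrete, a minimal sketch in Python follows. The field names and types are hypothetical; the present disclosure prescribes the items to be carried (image type information, shooting-related information, angle-of-view information at the time of content creation, and switch-destination viewpoint information), not a concrete syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Region:
    # An axis-aligned region of the image, in pixels.
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class SwitchDestination:
    region: Region                  # region of the current image this entry covers
    max_magnification: float        # threshold at which the viewpoint switch fires
    viewpoint_id: str               # identification information of the switch destination
    initial_range: Optional[Region] = None  # default initial display range (2D images)


@dataclass
class ZoomSwitchInfo:
    image_type: str                                # e.g., "2D" or "360"
    shooting_position: Tuple[float, float, float]  # camera position at shooting
    shooting_direction: Tuple[float, float]        # camera direction (yaw, pitch)
    shooting_angle_of_view: Tuple[float, float]    # horizontal/vertical angle of view
    creation_angle_of_view: Tuple[float, float]    # display angle of view at content creation
    destinations: List[SwitchDestination] = field(default_factory=list)
```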
First, an image type is set, and the image type information is added (S102). Subsequently, the position, direction, and angle of view of the camera at the time of shooting are set, and the shooting-related information is added (S104). At step S104, the shooting-related information may be set by referring to the camera position, direction, and zoom value at the time of shooting, a 360° whole sky image shot at the same time, and the like.
Subsequently, an angle of view at the time of content creation is set, and the angle-of-view information at the time of content creation is added (S106). As described above, the angle-of-view information at the time of content creation is a screen size (display angle of view of a screen) referred to when a position of an audio object is determined. For example, to eliminate an influence of misregistration caused by zooming, full-screen display may be applied without cutting out an image, at the time of content creation.
Subsequently, the switch-destination viewpoint information is set (S108). The content creator sets a region in an image corresponding to each viewpoint, and sets a threshold of a display magnification at which the viewpoint switch occurs, and identification information of a viewpoint switch destination.
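Continuing the sketch above, the generation flow (S102 to S108) amounts to filling in one such structure per viewpoint. The following hedged example reproduces the viewpoint VP1 configuration described earlier (a switch to the viewpoint VP2 in the region R11 at a 3-fold magnification, and to the viewpoint VP3 in the region R12 at a 2-fold magnification); the pixel coordinates are invented for illustration.

```python
vp1_info = ZoomSwitchInfo(
    image_type="360",                     # S102: image type information
    shooting_position=(0.0, 0.0, 1.5),    # S104: shooting-related information
    shooting_direction=(0.0, 0.0),
    shooting_angle_of_view=(360.0, 180.0),
    creation_angle_of_view=(90.0, 60.0),  # S106: angle of view at content creation
    destinations=[                        # S108: switch-destination viewpoint information
        SwitchDestination(Region(1200, 400, 1800, 900), 3.0, "VP2"),  # region R11
        SwitchDestination(Region(2400, 500, 2900, 950), 2.0, "VP3"),  # region R12
    ],
)
```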
As above, the generation flow of the multi-view zoom-switch information at the time of content creation has been explained. The generated multi-view zoom-switch information is included in a content file or a metadata file as described later, and is provided to a device that performs reproduction in the respective embodiments of the present disclosure. In the following, a viewpoint switch flow using the multi-view zoom-switch information at the time of reproduction will be explained, referring to
First, information of a viewing screen that is used for reproduction is acquired (S202). The information of a viewing screen may be a display angle of view from a viewing position, and can be uniquely determined by a reproduction environment.
Subsequently, the multi-view zoom-switch information relating to the viewpoint of the image currently being displayed is acquired (S204). The multi-view zoom-switch information is stored in a metadata file or a content file as described later. An acquisition method of the multi-view zoom-switch information in the respective embodiments of the present disclosure will be explained later.
Subsequently, information of a cut-out range of a display image, a direction of the display image, and an angle of view are calculated (S208). The information of a cut-out range of the display image may include, for example, information of a center position and a size of the cut-out range.
Subsequently, it is determined whether the cut-out range of the display image calculated at S208 is included in any of regions of the switch-destination viewpoint information included in the multi-view zoom-switch information (S210). When the cut-out range of the display image is not included in any region (NO at S210), the viewpoint switch is not performed, and the flow is ended.
Subsequently, a display magnification of the display image is calculated. For example, the display magnification can be calculated based on the information of the size of the image before the cut-out and the cut-out range of the display image. Subsequently, the display magnification of the display image is compared with the threshold of the display magnification included in the switch-destination viewpoint information (S212). In the example illustrated in
On the other hand, when the display magnification of the display image is larger than the threshold (YES at S212), the viewpoint switch to a switch destination viewpoint indicated by the switch-destination viewpoint information is started (S214). A cut-out position and an angle of view of the display image at the switch destination viewpoint are calculated based on the information of a direction and an angle of view of the display image before the switch, the shooting-related information included in the multi-view zoom-switch information, and the angle-of-view information at the time of content creation (S216).
The display image at the switch destination viewpoint is cut out and displayed based on the information of the cut-out position and the angle of view calculated at step S216 (S218). Moreover, the position of an audio object is corrected based on the information of the cut-out position and the angle of view calculated at step S216, and audio is output (S220).
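The reproduction-side decision (steps S210 to S214) can be summarized by the following sketch, which reuses the structures defined in the earlier sketches. How the cut-out range and the display magnification are obtained (S202 to S208) depends on the reproduction environment, so they are passed in here as arguments.

```python
from typing import Optional, Tuple


def decide_viewpoint_switch(info: ZoomSwitchInfo,
                            cutout_center: Tuple[float, float],
                            magnification: float) -> Optional[str]:
    """Return the destination viewpoint ID, or None when no switch applies."""
    x, y = cutout_center
    for dest in info.destinations:
        # S210: is the cut-out range of the display image inside this region?
        if not dest.region.contains(x, y):
            continue
        # S212: compare the display magnification with the threshold.
        if magnification >= dest.max_magnification:
            return dest.viewpoint_id  # S214: start the viewpoint switch
    return None  # NO at S210 or S212: the viewpoint switch is not performed
```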
As above, the basic principle of the present technique common among the respective embodiments of the present disclosure has been explained. Subsequently, the respective embodiments of the present disclosure will be specifically explained in the following.
<3-1. Configuration Example>
(System Configuration)
The generating device 100 is an information processing device that generates a content file and a metadata file that are adaptive to streaming by MPEG-DASH. The generating device 100 according to the present embodiment may be used for content creation (position determination of an audio object), or may receive an image signal, an audio signal, and position information of an audio object from another device for content creation. A configuration of the generating device 100 will be described later, referring to
The distribution server 200 functions as an HTTP server, and is an information processing device that performs streaming by MPEG-DASH. For example, the distribution server 200 performs streaming of a content file and a metadata file generated by the generating device 100 to the client 300 based on MPEG-DASH. A configuration of the distribution server 200 will be described later, referring to
The client 300 is an information processing device that receives the content file and the metadata file generated by the generating device 100 from the distribution server 200, and performs reproduction thereof.
The output device 400 is a device that displays a display image and performs audio output by a reproduction control of the client 300.
The output device 400A may be, for example, a television, or the like. A user may be able to perform operation, such as zooming and rotation, through a controller and the like connected to the output device 400A, and information of the operation can be transmitted from the output device 400A to the client 300A.
Moreover, the output device 400B may be a head mounted display (HMD) that is mounted on a user's head. The output device 400B has a sensor to acquire information, such as a position and an orientation (posture) of the head of the user on which it is mounted, and the information can be transmitted from the output device 400B to the client 300B.
Furthermore, the output device 400C is a mobile display terminal, such as a smartphone or a tablet, and has a sensor to acquire information, such as a position and an orientation (posture), when, for example, the user holds the output device 400C in a hand and moves it.
As above, the system configuration example of the information processing system according to the present embodiment has been explained. The above configuration explained referring to
(Functional Configuration of Generating Device)
The generating unit 110 performs processing related to an image and an audio object, and generates a content file and a metadata file. As illustrated in
The image-stream encoding unit 111 acquires an image signal of multiple viewpoints (multi-view image signal), and a parameter at shooting (for example, the shooting-related information) from another device through the communication unit 130, or from the storage unit 140 in the generating device 100, and performs encoding processing. The image-stream encoding unit 111 outputs an image stream and the parameter at the shooting to the content-file generating unit 113.
The audio-stream encoding unit 112 acquires an audio object signal and position information of respective audio objects from another device through the communication unit 130, or from the storage unit 140 in the generating device 100, and performs encoding processing. The audio-stream encoding unit 112 outputs the audio stream to the content-file generating unit 113.
The content-file generating unit 113 generates a content file based on the information provided from the image-stream encoding unit 111 and the audio-stream encoding unit 112. The content file generated by the content-file generating unit 113 may be, for example, an MP4 file, and in the following, an example in which the content-file generating unit 113 generates an MP4 file will be mainly explained. In the present embodiment, the MP4 file may be an ISO Base Media File Format (ISOBMFF) file, a standard of which is defined by ISO/IEC 14496-12.
The MP4 file generated by the content-file generating unit 113 may be a segment file, which is data in a unit that can be distributed by MPEG-DASH.
The content-file generating unit 113 outputs the generated MP4 file to the communication unit 130 and the metadata-file generating unit 114.
The metadata-file generating unit 114 generates a metadata file including the multi-view zoom-switch information described above based on the MP4 file generated by the content-file generating unit 113. Moreover, a metadata file generated by the metadata-file generating unit 114 may be an MPD (media presentation description) file, a standard of which is defined by ISO/IEC 23009-1.
Furthermore, the metadata-file generating unit 114 according to the present embodiment may store the multi-view zoom-switch information in a metadata file, in association with each viewpoint included in plural switchable viewpoints (viewpoints of a multi-view content). A storage example of the multi-view zoom-switch information in the metadata file will be described later.
The metadata-file generating unit 114 outputs the generated MPD file to the communication unit 130.
The control unit 120 is a functional component that controls the entire processing performed by the generating device 100 in a centralized manner. Note that what is controlled by the control unit 120 is not particularly limited. For example, the control unit 120 may control processing generally performed by a general-purpose computer, a PC, a tablet PC, and the like.
Moreover, when the generating device 100 is used at the time of content creation, the control unit 120 may perform processing related to generation of the position information of object audio data, and generation of the multi-view zoom-switch information explained with reference to
The communication unit 130 performs various kinds of communications with the distribution server 200. For example, the communication unit 130 transmits an MP4 file and an MPD file generated by the generating device 100 to the distribution server 200. What is communicated by the communication unit 130 is not limited to these.
The storage unit 140 is a functional component that stores various kinds of information. For example, the storage unit 140 stores the multi-view zoom-switch information, a multi-view image signal, an audio object signal, an MP4 file, an MPD file, and the like, or stores a program or a parameter used by respective functional components of the generating device 100, and the like. What is stored by the storage unit 140 is not limited to these.
(Functional Configuration of Distribution Server)
The control unit 220 is a functional component that controls the entire processing performed by the distribution server 200 in a centralized manner, and performs control related to streaming distribution by MPEG-DASH. For example, the control unit 220 causes various kinds of information stored in the storage unit 240 to be transmitted to the client 300 through the communication unit 230, based on request information from the client 300 received through the communication unit 230, or the like. What is controlled by the control unit 220 is not particularly limited. For example, the control unit 220 may control processing generally performed by a general-purpose computer, a PC, a tablet PC, and the like.
The communication unit 230 performs various kinds of communications with the generating device 100 and the client 300. For example, the communication unit 230 receives an MP4 file and an MPD file from the generating device 100. Moreover, the communication unit 230 transmits, to the client 300, an MP4 file or an MPD file according to request information received from the client 300, in accordance with a control of the control unit 220. What is communicated by the communication unit 230 is not limited to these.
The storage unit 240 is a functional component that stores various kinds of information. For example, the storage unit 240 stores an MP4 file, an MPD file, and the like received from the generating device 100, or stores a program or a parameter used by the respective functional components of the distribution server 200, and the like. What is stored by the storage unit 240 is not limited to these.
(Functional Configuration of Client)
The processing unit 310 is a functional component that performs processing related to reproduction of a content. The processing unit 310 may perform, for example, processing related to the viewpoint switch explained with reference to
The metadata-file acquiring unit 311 is a functional component that acquires an MPD file (metadata file) from the distribution server 200 prior to reproduction of a content. More specifically, the metadata-file acquiring unit 311 generates request information of the MPD file based on a user operation or the like, and transmits the request information to the distribution server 200 through the communication unit 350, thereby acquiring the MPD file from the distribution server 200. The metadata-file acquiring unit 311 provides the acquired MPD file to the metadata-file processing unit 312.
The metadata file acquired by the metadata-file acquiring unit 311 includes the multi-view zoom-switch information as described above.
The metadata-file processing unit 312 is a functional component that performs processing related to the MPD file provided from the metadata-file acquiring unit 311. More specifically, the metadata-file processing unit 312 recognizes information necessary for acquiring an MP4 file or the like (for example, a URL) based on an analysis of the MPD file. The metadata-file processing unit 312 provides the information to the segment-file-selection control unit 313.
The segment-file-selection control unit 313 is a functional component that selects a segment file (MP4 file) to be acquired. More specifically, the segment-file-selection control unit 313 selects a segment file to be acquired based on various information provided from the metadata-file processing unit 312 described above. For example, the segment-file-selection control unit 313 according to the present embodiment selects a segment file of a switch destination viewpoint when a viewpoint switch is caused by the viewpoint switch processing explained with reference to
The image processing unit 320 acquires a segment file based on information selected by the segment-file-selection control unit 313, and performs image processing.
As illustrated in
The audio processing unit 330 acquires a segment file based on the information selected by the segment-file-selection control unit 313, and performs audio processing.
As illustrated in
The control unit 340 is a functional component that controls the entire processing performed by the client 300 in a centralized manner. For example, the control unit 340 may control various kinds of processing based on an input performed by a user using an input unit (not illustrated), such as a mouse and a keyboard. What is controlled by the control unit 340 is not particularly limited. For example, the control unit 340 may control processing generally performed by a general-purpose computer, a PC, a tablet PC, and the like.
The communication unit 350 performs various kinds of communications with the distribution server 200. For example, the communication unit 350 transmits request information provided by the processing unit 310 to the distribution server 200. Moreover, the communication unit 350 functions as a receiving unit also, and receives an MPD file, an MP4 file, and the like as a response to the request information from the distribution server 200. What is communicated by the communication unit 350 is not limited to these.
The storage unit 360 is a functional component that stores various kinds of information. For example, the storage unit 360 stores the MPD file, the MP4 file, and the like acquired from the distribution server 200, or stores a program or a parameter used by the respective functional components of the client 300, and the like. Information stored by the storage unit 360 is not limited to these.
<3-2. Storage Example of Multi-View Zoom-Switch Information in Metadata File>
As above, a configuration example of the present embodiment has been explained. Subsequently, a storage example of the multi-view zoom-switch information in a metadata file generated by the metadata-file generating unit 114 in the present embodiment will be explained.
First, a layer structure of an MPD file will be explained.
In Representation, information of an encoding speed of an image and an audio, an image size, and the like is stored. In Representation, plural pieces of SegmentInfo are stored. SegmentInfo includes information relating to a segment that is obtained by dividing a stream into plural files. SegmentInfo includes an Initialization segment that indicates initial information, such as a data compression method, and a Media segment that indicates a segment of a moving image or a sound.
As above, a layer structure of an MPD file has been explained. The metadata-file generating unit 114 according to the present embodiment may store the multi-view zoom-switch information in the MPD file described above.
(Example of Storing in AdaptationSet)
As described above, because the multi-view zoom-switch information is present per viewpoint, it is preferably stored in the MPD file in association with each viewpoint. In a multi-view content, each viewpoint can correspond to an AdaptationSet. Therefore, the metadata-file generating unit 114 according to the present embodiment may store the multi-view zoom-switch information, for example, in AdaptationSet described above. In such a configuration, the client 300 can acquire the multi-view zoom-switch information at the time of reproduction.
As indicated on the fourth line, the eighth line, and the twelfth line in
Furthermore, as indicated on the fourth line, the eighth line, and the twelfth line in
Moreover, the MPD file generated by the metadata-file generating unit 114 according to the present embodiment is not limited to the example illustrated in
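As one way to picture such an MPD, the following sketch attaches the information to an AdaptationSet as a SupplementalProperty using Python's standard XML library. The schemeIdUri and the layout of value are hypothetical stand-ins; the actual identifiers are the ones defined in the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical scheme identifier for the multi-view zoom-switch information.
SCHEME = "urn:example:multiview-zoom-switch"

adaptation_set = ET.Element("AdaptationSet", id="1", mimeType="video/mp4")
ET.SubElement(
    adaptation_set,
    "SupplementalProperty",
    schemeIdUri=SCHEME,
    # Placeholder serialization of the fields described in section 2-2.
    value="2D,<shooting info>,<creation angle of view>,<switch destinations>",
)
print(ET.tostring(adaptation_set, encoding="unicode"))
```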
(Example of Storing in Period, Associating with AdaptationSet)
As indicated on the third to the fifth lines in
As for schemeIdUri of EssentialProperty indicated in
For example, in
(Modification)
As above, the storage example of the multi-view zoom-switch information in an MPD file by the metadata-file generating unit 114 according to the present embodiment has been explained, but the present embodiment is not limited to the example.
For example, as a modification, the metadata-file generating unit 114 may generate another metadata file different from the MPD file, in addition to the MPD file, and may store the multi-view zoom-switch information in that metadata file. Furthermore, the metadata-file generating unit 114 may store, in the MPD file, access information for accessing the metadata file in which the multi-view zoom-switch information is stored. The MPD file generated by the metadata-file generating unit 114 in this modification will be explained, referring to
As indicated on the fourth line, the eighth line, and the twelfth line in
As for schemeIdUri of EssentialProperty indicated in
For example, POS-100.txt indicated in value on the fourth line in
Moreover, POS-200.txt indicated in value on the eighth line in
Moreover, POS-300.txt indicated in value on the twelfth line in
While the example in which the access information is stored in AdaptationSet has been explained in
<3-3. Operation Example>
As above, the metadata file generated by the metadata-file generating unit 114 in the present embodiment has been explained. Subsequently, an operation example according to the present embodiment will be explained.
As illustrated in
Processing related to generation of the multi-view zoom-switch information explained with reference to
As illustrated in
Subsequently, the processing unit 310 acquires information of a transmission band (S406), and selects a Representation that can be transmitted in the bitrate range of the transmission path (S408). Furthermore, the processing unit 310 acquires an MP4 file constituting the Representation selected at step S408 from the distribution server 200 (S410). The processing unit 310 then starts decoding of an elementary stream included in the MP4 file acquired at step S410 (S412).
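Step S408 only states that a Representation transmittable within the bitrate range of the transmission path is selected. As an illustration, the following sketch shows one common selection policy, taking the highest-bandwidth Representation that fits the measured band; the Representation records are simplified to dictionaries carrying the bandwidth attribute of an MPD.

```python
def select_representation(representations: list, available_bps: int) -> dict:
    # S408: choose the highest-bandwidth Representation that fits the
    # transmission band; fall back to the lowest-bandwidth one otherwise.
    fitting = [r for r in representations if r["bandwidth"] <= available_bps]
    if fitting:
        return max(fitting, key=lambda r: r["bandwidth"])
    return min(representations, key=lambda r: r["bandwidth"])


# Example usage, with bandwidths in bits per second.
reps = [{"id": "low", "bandwidth": 1_000_000},
        {"id": "high", "bandwidth": 8_000_000}]
print(select_representation(reps, 5_000_000)["id"])  # "low"
```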
As above, the first embodiment has been explained. While an example in which streaming distribution is performed by MPEG-DASH has been explained in the first embodiment described above, hereinafter, an example in which a content file is provided through a storage device instead of streaming distribution will be explained as a second embodiment. Moreover, in the present embodiment, the multi-view zoom-switch information described above is stored in a content file.
<4-1. Configuration Example>
Functional Configuration Example of Generating Device
As illustrated in
The generating unit 610 performs processing related to an image and an audio, and generates a content file. As illustrated in
The content-file generating unit 613 generates a content file based on information provided from the image-stream encoding unit 611 and the audio-stream encoding unit 612. A content file generated by the content-file generating unit 613 according to the present embodiment may be an MP4 file (ISOBMFF file) similarly to the first embodiment described above.
However, the content-file generating unit 613 according to the present embodiment stores the multi-view zoom-switch information in a header of the content file. Moreover, the content-file generating unit 613 according to the present embodiment may store the multi-view zoom-switch information in the header in association with each viewpoint included in plural switchable viewpoints (viewpoints of a multi-view content). A storage example of the multi-view zoom-switch information in a header of a content file will be described later.
The MP4 file generated by the content-file generating unit 613 is output and stored in the storage device 700 illustrated in
The control unit 620 is a functional component that controls the entire processing performed by the generating device 600 in a centralized manner. Note that what is controlled by the control unit 620 is not particularly limited. For example, the control unit 620 may control processing generally performed by a general-purpose computer, a PC, a tablet PC, and the like.
The communication unit 630 performs various kinds of communications. For example, the communication unit 630 transmits an MP4 file generated by the generating unit 610 to the storage device 700. What is communicated by the communication unit 630 is not limited to these.
The storage unit 640 is a functional component that stores various kinds of information. For example, the storage unit 640 stores the multi-view zoom-switch information, a multi-view image signal, an audio object signal, an MP4 file, and the like, or stores a program or a parameter used by the respective functional components of the generating device 600, and the like. What is stored by the storage unit 640 is not limited to these.
(Functional Configuration Example of Reproducing Device)
Moreover, as illustrated in
The processing unit 810 is a functional component that performs processing related to reproduction of a content. The processing unit 810 may perform, for example, processing related to the viewpoint switch explained with reference to
The image processing unit 820 acquires an MP4 file stored in the storage device 700, and performs image processing. As illustrated in
The audio processing unit 830 acquires an MP4 file stored in the storage device 700, and performs audio processing. As illustrated in
The control unit 840 is a functional component that controls the entire processing performed by the reproducing device 800 in a centralized manner. For example, the control unit 840 may control various kinds of processing based on an input made by a user using an input unit (not illustrated), such as a mouse and a keyboard. What is controlled by the control unit 840 is not particularly limited. For example, the control unit 840 may control processing generally performed by a general-purpose computer, a PC, a tablet PC, and the like.
The communication unit 850 performs various kinds of communications. Moreover, the communication unit 850 also functions as a receiving unit, and receives an MP4 file and the like from the storage device 700. What is communicated by the communication unit 850 is not limited to these.
The storage unit 860 is a functional component that stores various kinds of information. For example, the storage unit 860 stores an MP4 file and the like acquired from the storage device 700, or stores a program or a parameter used by the respective functional components of the reproducing device 800, and the like. What is stored by the storage unit 860 is not limited to these.
As above, the generating device 600 and the reproducing device 800 according to the present embodiment have been explained. Although an example in which an MP4 file is provided through the storage device 700 has been explained above, it is not limited to the example. For example, the generating device 600 and the reproducing device 800 may be connected to each other directly or through a communication network, and an MP4 file may be transmitted from the generating device 600 to the reproducing device 800, to be stored in the storage unit 860 of the reproducing device 800.
<4-2. Storage Example of Multi-View Zoom-Switch Information in Content File>
As above, the configuration example of the present embodiment has been explained. Subsequently, a storage example of the multi-view zoom-switch information in a header of a content file generated by the content-file generating unit 613 in the present embodiment will be explained.
As described above, the content file generated by the content-file generating unit 613 in the present embodiment may be an MP4 file. When the MP4 file is an ISOBMFF file, a standard of which is defined by ISO/IEC 14496-12, a moov box (system layer metadata) is included in the MP4 file as a header of the MP4 file.
(Example of Storing in udta Box)
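The concrete box layout is shown in the figure; as a generic illustration, the following sketch applies the ISOBMFF box framing of ISO/IEC 14496-12 (a 32-bit size including the 8-byte header, followed by a 4-character type code) to place a payload inside a udta box. The four-character code mvzs is made up for this example.

```python
import struct


def make_box(box_type: bytes, payload: bytes) -> bytes:
    # An ISOBMFF box: 32-bit size (including this 8-byte header),
    # a 4-character type code, then the payload.
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


# Hypothetical box type carrying the serialized multi-view zoom-switch
# information as static metadata for a video track.
zoom_switch_payload = b"<serialized multi-view zoom-switch information>"
udta = make_box(b"udta", make_box(b"mvzs", zoom_switch_payload))
```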
(Example of Storing as Metadata Track)
Although an example in which the multi-view zoom-switch information is stored in a udta box as static metadata with respect to a video track has been explained above, the present embodiment is not limited thereto. For example, when the multi-view zoom-switch information changes according to a reproduction time, it is difficult to store it in a udta box.
Therefore, when the multi-view zoom-switch information changes according to a reproduction time, a new metadata track that indicates the multi-view zoom-switch information may be defined by using a track, which is a structure having a time axis. A definition method of a metadata track in ISOBMFF is described in ISO/IEC 14496-12, and the metadata track according to the present example may be defined conforming to ISO/IEC 14496-12. This example will be explained, referring to
In the present example, the content-file generating unit 613 stores the multi-view zoom-switch information in a mdat box as a timed metadata track.
For example, in the example illustrated in
Furthermore, in the present example, the content-file generating unit 613 can store the multi-view zoom-switch information also in a moov box.
In the present example, the content-file generating unit 613 may define sample as illustrated in
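Purely as an illustration of the idea of a timed-metadata sample (the actual sample syntax is the one defined in the figure), the following sketch serializes the switch-destination entries of one sample, reusing the structures from the earlier sketches; the byte layout is invented for this example.

```python
import struct


def pack_destination(dest: SwitchDestination) -> bytes:
    # Hypothetical layout: four 32-bit region coordinates, a 32-bit float
    # threshold, then a NUL-terminated destination viewpoint ID.
    region = struct.pack(">4I", dest.region.left, dest.region.top,
                         dest.region.right, dest.region.bottom)
    threshold = struct.pack(">f", dest.max_magnification)
    return region + threshold + dest.viewpoint_id.encode("utf-8") + b"\x00"


def pack_sample(info: ZoomSwitchInfo) -> bytes:
    # One sample holds the multi-view zoom-switch information valid for its
    # time range; timing itself is described by the track in the moov box.
    count = struct.pack(">B", len(info.destinations))
    return count + b"".join(pack_destination(d) for d in info.destinations)
```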
<4-3. Operation Example>
As above, the content file generated by the content-file generating unit 613 in the present embodiment has been explained. Subsequently, an operation example according to the present embodiment will be explained.
As illustrated in
Prior to the processing illustrated in
As illustrated in
The processing unit 810 then starts decoding of an elementary stream included in the MP4 file acquired at step S602.
As above, embodiments of the present disclosure have been explained. Finally, a hardware configuration of the information processing device according to embodiments of the present disclosure will be explained, referring to
As illustrated in
The CPU 901 functions as an arithmetic processing device and a control device, and controls overall operation in the information processing device 900 in accordance with various kinds of programs. Moreover, the CPU 901 may be a microprocessor. The ROM 902 stores a program, arithmetic parameters, and the like used by the CPU 901. The RAM 903 temporarily stores a program used during execution by the CPU 901, parameters that vary as appropriate during the execution, and the like. The CPU 901 can form, for example, the generating unit 110, the control unit 120, the control unit 220, the processing unit 310, the control unit 340, the generating unit 610, the control unit 620, the processing unit 810, and the control unit 840.
The CPU 901, the ROM 902, and the RAM 903 are connected to one another through the host bus 904a including a CPU bus, or the like. The host bus 904a is connected to the external bus 904b, such as a peripheral component interconnect/interface (PCI) bus, through the bridge 904. The host bus 904a, the bridge 904, and the external bus 904b do not necessarily need to be formed separately, and functions of these components may be implemented in a single bus.
The input device 906 is implemented by a device to which information is input by a user, such as a mouse, a keyboard, a touch panel, a button, a microphone, a switch, and a lever, for example. Moreover, the input device 906 may be a remote control device that uses, for example, infrared rays or other radio waves, or may be an externally connected device, such as a mobile phone or a PDA, supporting operation of the information processing device 900. Furthermore, the input device 906 may include an input control circuit or the like that generates an input signal based on information input by a user using the input means described above, and outputs it to the CPU 901. A user of the information processing device 900 can input various kinds of data or instruct a processing action to the information processing device 900 by operating this input device 906.
The output device 907 is formed with a device capable of notifying a user of acquired information visually or aurally. Such devices include display devices, such as a CRT display device, a liquid crystal display device, a plasma display device, an EL display device, and a lamp, sound output devices, such as a speaker and headphones, a printer device, and the like. The output device 907 outputs results obtained by various kinds of processing performed by the information processing device 900. Specifically, the display device visually displays results obtained by various kinds of processing performed by the information processing device 900 in various forms, such as text, images, tables, and graphs. On the other hand, the sound output device converts an audio signal composed of reproduced sound data, acoustic data, or the like into an analog signal and outputs it aurally.
The storage device 908 is a device for data storage formed as one example of a storage unit of the information processing device 900. The storage device 908 is implemented by, for example, a magnetic storage device, such as an HDD, a semiconductor storage device, an optical storage device, a magneto-optical storage device, or the like. The storage device 908 may include a recording medium, a recording device that records data on a recording medium, a reader device that reads data from a recording medium, a deletion device that deletes data recorded on a recording medium, and the like. This storage device 908 stores a program executed by the CPU 901, various kinds of data, various kinds of data acquired externally, and the like. The storage device 908 described above can form, for example, the storage unit 140, the storage unit 240, the storage unit 360, the storage unit 640, and the storage unit 860.
The drive 909 is a reader/writer for a recording medium, and is built into or externally attached to the information processing device 900. The drive 909 reads information recorded on an inserted removable recording medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs it to the RAM 903. Moreover, the drive 909 can also write information to a removable recording medium.
The connecting port 911 is an interface connected to an external device, and is a connecting port to an external device to which data can be transmitted through, for example, a universal serial bus (USB), or the like.
The communication device 913 is a communication interface that is formed with a communication device or the like to connect to the network 920. The communication device 913 is, for example, a communication card for a wired or wireless local area network (LAN), long term evolution (LTE), Bluetooth (registered trademark), or wireless USB (WUSB). Furthermore, the communication device 913 may be a router for optical communication, a router for asymmetric digital subscriber line (ADSL), a modem for various kinds of communications, or the like. This communication device 913 can communicate signals and the like with the Internet or other communication devices according to a predetermined protocol, such as TCP/IP. The communication device 913 can form, for example, the communication unit 130, the communication unit 230, the communication unit 350, the communication unit 630, and the communication unit 850.
The sensor 915 includes various kinds of sensors, such as an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, a sound sensor, a range sensor, and a force sensor. The sensor 915 acquires information relating to the state of the information processing device 900 itself, such as the posture and moving speed of the information processing device 900, and information relating to the peripheral environment of the information processing device 900, such as brightness and noise around the information processing device 900. Furthermore, the sensor 915 may include a GPS sensor that measures the latitude, longitude, and altitude of the device by receiving a GPS signal.
The network 920 is a wired or wireless transmission path of information transmitted from a device connected to the network 920. For example, the network 920 may include a public circuit network, such as the Internet, a telephone line network, a satellite communication network, various kinds of local area networks (LAN) including Ethernet (registered trademark), a wide area network (WAN), and the like. Moreover, the network 920 may include a dedicated line network, such as Internet protocol-virtual private network (IP-VPN).
As above, one example of the hardware configuration that can implement the functions of the information processing device 900 according to the embodiments of the present disclosure has been described. The respective components described above may be implemented by using general-purpose members, or may be implemented by hardware specialized for the functions of the respective components. Therefore, the hardware configuration to be applied can be changed as appropriate according to the technical level at the time the embodiments of the present disclosure are carried out.
A computer program to implement the respective functions of the information processing device 900 according to the embodiments of the present disclosure as described above can be created and installed in a PC or the like. Moreover, a computer-readable recording medium in which such a computer program is stored can also be provided. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Furthermore, the computer program described above may be distributed, for example, through a network, without using a recording medium.
As explained above, according to the respective embodiments of the present disclosure, by using the multi-view zoom viewpoint switching information (viewpoint switch information) to perform a viewpoint switch among plural viewpoints for reproduction of a content, a sense of awkwardness given to a user can be reduced both visually and aurally. For example, as described above, it is possible to display a display image in which the direction and size of a subject match between before and after a viewpoint switch, based on the multi-view zoom viewpoint switching information. Furthermore, as described above, it is possible to reduce a sense of awkwardness given to a user by performing a position correction of an audio object at a viewpoint switch based on the multi-view zoom viewpoint switching information.
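As a purely illustrative sketch of such a position correction (the function, its parameters, and the two-step model below are hypothetical assumptions for explanation, not the correction procedure actually specified by the present disclosure), the azimuth of an audio object could be re-expressed for the post-switch viewpoint and then rescaled for the zoom difference:

```python
import math

def correct_object_azimuth(object_azimuth_deg: float,
                           camera_before_deg: float,
                           camera_after_deg: float,
                           fov_reference_deg: float,
                           fov_after_deg: float) -> float:
    """Sketch of an audio-object azimuth correction at a viewpoint switch.

    object_azimuth_deg: object azimuth authored relative to the pre-switch camera.
    camera_before_deg / camera_after_deg: shooting directions (azimuths) of the
        cameras before and after the switch (cf. shooting direction information).
    fov_reference_deg: angle of view of the screen referred to when the object
        positions were determined (cf. reference angle-of-view information).
    fov_after_deg: angle of view of the post-switch camera.
    """
    # Step 1: the object stays fixed in the world while the camera changes,
    # so re-express its direction relative to the post-switch camera.
    world_deg = camera_before_deg + object_azimuth_deg
    relative_deg = (world_deg - camera_after_deg + 180.0) % 360.0 - 180.0

    # Step 2: rescale for the zoom difference with a perspective-correct
    # tangent mapping (valid for |relative_deg| < 90 degrees), so that an
    # object at the edge of the reference angle of view stays at the edge
    # of the post-switch angle of view.
    scale = (math.tan(math.radians(fov_after_deg) / 2.0)
             / math.tan(math.radians(fov_reference_deg) / 2.0))
    return math.degrees(math.atan(math.tan(math.radians(relative_deg)) * scale))
```

Under this sketch, for example, an object authored at 10 degrees for a camera facing 0 degrees would, after a switch to a camera facing 30 degrees with the same angle of view, be rendered at -20 degrees, so that it remains anchored to the same world direction.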
As above, exemplary embodiments of the present disclosure have been explained in detail with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is obvious that those having ordinary knowledge in the technical field of the present disclosure can conceive of various alterations and modifications within the scope of the technical ideas described in the claims, and it is understood that these naturally belong to the technical scope of the present disclosure.
For example, in the first embodiment, an example in which the multi-view zoom-switch information is stored in a metadata file has been explained, but the present technique is not limited to this example. For example, even when streaming distribution is performed by MPEG-DASH as in the first embodiment described above, the multi-view zoom-switch information may be stored in a header of an MP4 file, as explained in the second embodiment, in place of or in addition to an MPD file. In particular, when the multi-view zoom-switch information varies according to a reproduction time, it is difficult to store the multi-view zoom-switch information in an MPD file. Therefore, even when streaming distribution is performed by MPEG-DASH, the multi-view zoom-switch information may be stored in an mdat box as a timed metadata track, as in the example explained with reference to
Whether the multi-view zoom-switch information varies according to a reproduction time can be determined by, for example, a content creator. Accordingly, where to store the multi-view zoom-switch information may be determined by an operation of a content creator, or based on information given by the content creator.
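Purely as a sketch of how this branching could look in code (the scheme URI, the value encoding, and the function below are hypothetical assumptions for illustration; the present disclosure only states where the information may be stored, not how it is encoded), static switch information could be written into the MPD, while time-varying information would instead be carried as a timed metadata track:

```python
import xml.etree.ElementTree as ET

def store_switch_info(adaptation_set: ET.Element,
                      switch_info: dict,
                      time_varying: bool) -> str:
    """Decide where the multi-view zoom-switch information goes.

    If the information does not vary with the reproduction time, it can be
    written into the MPD, here as a SupplementalProperty descriptor on the
    AdaptationSet (the scheme URI is a hypothetical placeholder).  If it
    does vary, the MPD is a poor fit, and the information should instead be
    carried as a timed metadata track whose samples live in the mdat box.
    """
    if time_varying:
        # Samples would go into the mdat box of the MP4 file as a
        # timed metadata track, signalled outside the MPD.
        return "timed-metadata-track"
    prop = ET.SubElement(adaptation_set, "SupplementalProperty")
    prop.set("schemeIdUri", "urn:example:multi-view-zoom-switch")  # hypothetical
    prop.set("value", ",".join(f"{k}:{v}" for k, v in switch_info.items()))
    return "mpd"
```

The time_varying flag in this sketch would be set from the operation of the content creator, or based on information given by the content creator, as stated above.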
Moreover, the effects described in the present specification are merely examples, and are not limiting. That is, the technique according to the present disclosure can produce other effects apparent to those skilled in the art from the description of the present specification, together with the effects described above or instead of the effects described above.
The following configurations also belong to the technical scope of the present disclosure.
(1)
An information processing device comprising
a metadata-file generating unit that generates a metadata file including viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
(2)
The information processing device according to (1), wherein
the metadata file is a media presentation description (MPD) file.
(3)
The information processing device according to (2), wherein
the viewpoint switch information is stored in AdaptationSet in the MPD file.
(4)
The information processing device according to (2), wherein
the viewpoint switch information is stored in Period in the MPD file, associated with AdaptationSet in the MPD file.
(5)
The information processing device according to (1), wherein
the metadata-file generating unit further generates a media presentation description (MPD) file including access information to access the metadata file.
(6)
The information processing device according to (5), wherein
the access information is stored in AdaptationSet in the MPD file.
(7)
The information processing device according to (5), wherein
the access information is stored in Period in the MPD file, associated with AdaptationSet in the MPD file.
(8)
The information processing device according to any one of (1) to (7), wherein
the viewpoint switch information is stored in the metadata file, associated with each viewpoint included in the plurality of viewpoints.
(9)
The information processing device according to (8), wherein
the viewpoint switch information includes switch-destination viewpoint information related to a switch destination viewpoint switchable from a viewpoint associated with the viewpoint switch information.
(10)
The information processing device according to (9), wherein
the viewpoint switch information includes threshold information relating to a threshold for a switch to the switch destination viewpoint from a viewpoint associated with the viewpoint switch information.
(11)
The information processing device according to any one of (8) to (10), wherein
the viewpoint switch information includes shooting-related information of an image relevant to a viewpoint associated with the viewpoint switch information.
(12)
The information processing device according to (11), wherein
the shooting-related information includes shooting position information relating to a position of a camera that has taken the image.
(13)
The information processing device according to (11) or (12), wherein
the shooting-related information includes shooting direction information relating to a direction of a camera that has taken the image.
(14)
The information processing device according to any one of (11) to (13), wherein
the shooting-related information includes shooting angle-of-view information relating to an angle of view of a camera that has taken the image.
(15)
The information processing device according to any one of (8) to (14), wherein
the viewpoint switch information includes reference angle-of-view information relating to an angle of view of a screen referred to when position information of an audio object relevant to a viewpoint that is associated with the viewpoint switch information has been determined.
(16)
An information processing method that is performed by an information processing device, the method comprising
generating a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
(17)
A program that causes a computer to implement a function of
generating a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among a plurality of viewpoints.
(18)
An information processing device that includes a metadata-file acquiring unit that acquires a metadata file including viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among plural viewpoints.
(19)
The information processing device according to (18) described above in which the metadata file is a media presentation description (MPD) file.
(20)
The information processing device according to (19) described above in which the viewpoint switch information is stored in AdaptationSet in the MPD file.
(21)
The information processing device according to (19) described above in which the viewpoint switch information is stored in Period in the MPD file, associated with AdaptationSet in the MPD file.
(22)
The information processing device according to (18) described above in which the metadata-file acquiring unit further acquires a media presentation description (MPD) file including access information to access the metadata file.
(23)
The information processing device according to (22) described above in which the access information is stored in AdaptationSet in the MPD file.
(24)
The information processing device according to (22) described above in which the access information is stored in Period in the MPD file, associated with AdaptationSet in the MPD file.
(25)
The information processing device according to any one of (18) to (24) described above in which the viewpoint switch information is stored in the metadata file, associated with each viewpoint included in the plural viewpoints.
(26)
The information processing device according to (25) described above in which the viewpoint switch information includes switch-destination viewpoint information relating to a switch destination viewpoint switchable from a viewpoint associated with the viewpoint switch information.
(27)
The information processing device according to (26) described above in which the viewpoint switch information includes threshold information relating to a threshold for a switch to the switch destination viewpoint from a viewpoint associated with the viewpoint switch information.
(28)
The information processing device according to any one of (25) to (27) described above in which the viewpoint switch information includes shooting-related information of an image relevant to a viewpoint associated with the viewpoint switch information.
(29)
The information processing device according to (28) described above in which the shooting-related information includes shooting position information relating to a position of a camera that has taken the image.
(30)
The information processing device according to (28) or (29) described above in which the shooting-related information includes shooting direction information relating to a direction of a camera that has taken the image.
(31)
The information processing device according to any one of (28) to (30) described above in which the shooting-related information includes shooting angle-of-view information relating to an angle of view of a camera that has taken the image.
(32)
The information processing device according to any one of (25) to (31) described above in which the viewpoint switch information includes reference angle-of-view information relating to an angle of view of a screen referred to when position information of an audio object relevant to a viewpoint that is associated with the viewpoint switch information has been determined.
(33)
An information processing method that is performed by an information processing device, the method including acquiring a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among plural viewpoints.
(34)
A program that causes a computer to implement a function of acquiring a metadata file that includes viewpoint switch information to perform a position correction of an audio object at a viewpoint switch among plural viewpoints.
Number | Date | Country | Kind
---|---|---|---
2018-065014 | Mar 2018 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2018/048002 | 12/27/2018 | WO | 00