This disclosure relates to a sound signal description method, sound signal production equipment, and sound signal reproduction equipment, all of which are capable of representing information of sound signals with use of metadata for sound reproduction through multichannel speakers.
Various sound systems, such as the 2-channel sound system, the 5.1-channel sound system, and “3-dimensional multichannel stereophonic sound systems” beyond the 5.1-channel sound system, are used for program production. Describing these various sound systems in a common description format provides flexibility, allowing the systems to be applied to next-generation sound systems across various sound application scenarios. ITU-R, an international standardization body for broadcasting including sound, has defined requirements for an advanced multichannel sound system in an ITU-R Recommendation. (Refer to Non Patent Literature 1.)
As the common description format for describing the various sound systems, advanced study has been conducted on “sound signals to compose a single-layered sound field.” However, in some cases of sound program production, the format of “sound signals to compose a multi-layered sound field” can be used to facilitate rendering, conversion, and switching of received sound signals according to the receiver's environment or demands in program exchange or home reproduction. For example, the receiver in program exchange or at home does not always employ an image display of the same size as that used in program production, and the sound signal needs to be converted according to such a video reproduction environment of the receiver. Furthermore, language switching for program reproduction, and relocation of the reproduction position of a narration signal, are sometimes required according to the needs of the receiver. Conventionally, however, no study has been conducted on a description method for the “sound signals to compose a multi-layered sound field.”
It could therefore be helpful to provide a sound signal description method corresponding to the format of the “sound signals to compose a multi-layered sound field”, as well as a sound signal production equipment and a sound signal reproduction equipment which correspond to the sound signal description method.
One of the disclosed aspects therefore provides a sound signal description method for describing a multi-layered sound field, comprising: the number of sound field layers of the multi-layered sound field; a type of each sound field layer of the multi-layered sound field; and language information.
It is preferable that the type of each sound field layer of the multi-layered sound field indicates the sound elements of the program, such as international sound, which consists of all the sound program elements except for the commentary/dialogue elements, or commentary/dialogue sound in a particular language.
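The three items above (the number of layers, the type of each layer, and language information) can be sketched as a simple data structure. The following Python sketch is illustrative only; the field names and the layer-type values are assumptions for clarity, not part of the disclosed description format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class LayerType(Enum):
    # International sound: all program elements except commentary/dialogue.
    INTERNATIONAL_SOUND = "international"
    # Commentary or dialogue sound in a particular language.
    COMMENTARY = "commentary"
    DIALOGUE = "dialogue"

@dataclass
class SoundFieldLayer:
    layer_type: LayerType
    language: Optional[str] = None  # e.g. "ja", "en"; None for international sound

@dataclass
class MultiLayerDescriptor:
    layers: List[SoundFieldLayer]

    @property
    def number_of_layers(self) -> int:
        return len(self.layers)

descriptor = MultiLayerDescriptor([
    SoundFieldLayer(LayerType.INTERNATIONAL_SOUND),
    SoundFieldLayer(LayerType.COMMENTARY, language="ja"),
    SoundFieldLayer(LayerType.COMMENTARY, language="en"),
])
```

A receiver holding such a descriptor can enumerate the available languages and select one layer to add to the international sound.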
Furthermore, another one of the disclosed aspects provides a sound signal description method for describing a multi-layered sound field, comprising: the number of sound field layers of the multi-layered sound field; and a video link identifier indicating, for each sound field layer of the multi-layered sound field, whether the sound field layer is linked to video.
Moreover, yet another one of the disclosed aspects provides a sound signal production equipment that produces a sound signal according to a sound signal description method for describing a multi-layered sound field, comprising: a metadata addition unit that produces metadata including the number of sound field layers of the multi-layered sound field, a type of each sound field layer of the multi-layered sound field, and language information; a coding unit that produces the sound signal according to the sound signal description method based on an input sound signal and the metadata; and a multiplexer that multiplexes the produced sound signal into a bit stream.
Moreover, yet another one of the disclosed aspects provides a sound signal reproduction equipment that reproduces a sound signal according to a sound signal description method for describing a multi-layered sound field, comprising: an environment information input unit that inputs reproduction environment information and user demand information; and a rendering reproduction unit that converts the sound signal according to the number of sound field layers of the multi-layered sound field, a type of each sound field layer of the multi-layered sound field, and language information included in the sound signal and according to the reproduction environment information and user demand information, and reproduces the converted sound signal.
The type of each sound field layer of the multi-layered sound field indicates which one of international sound and a particular language the sound field layer comprises, the international sound being used irrespective of language, and the particular language being switched by the environment information input unit. The rendering reproduction unit preferably adds the sound signal of the particular language to the international sound and reproduces the added sound.
Moreover, yet another one of the disclosed aspects provides a sound signal production equipment that produces a sound signal according to a sound signal description method for describing a multi-layered sound field, comprising: a metadata addition unit that produces metadata including the number of sound field layers of the multi-layered sound field and a video link identifier indicating, for each sound field layer of the multi-layered sound field, whether the sound field layer is linked to video; a coding unit that produces the sound signal according to the sound signal description method based on an input sound signal and the metadata; and a multiplexer that multiplexes the produced sound signal into a bit stream.
Moreover, yet another one of the disclosed aspects provides a sound signal reproduction equipment that reproduces a sound signal according to a sound signal description method for describing a multi-layered sound field, comprising: an environment information input unit that inputs reproduction environment information and user demand information; and a rendering reproduction unit that converts the sound signal according to the number of sound field layers of the multi-layered sound field and a video link identifier included in the sound signal and according to the reproduction environment information and user demand information, and reproduces the converted sound signal. The video link identifier indicates, for each sound field layer of the multi-layered sound field, whether the sound field layer is linked to video.
When the video link identifier indicates that the sound field layer is linked to video, the rendering reproduction unit preferably renders the sound signal of the sound field layer based on video display information input by the environment information input unit.
The disclosed sound signal description method, sound signal production equipment, and sound signal reproduction equipment make it possible to describe the “sound signals to compose a multi-layered sound field” and to produce and reproduce a sound program using such sound signals.
Embodiments of our methods and equipment will be described in detail below with reference to the drawings.
We extend a description method (referred to below as the “Basic sound field descriptor”) for describing “sound signals to compose a single-layered sound field” to a description method (referred to below as the “Extended sound field descriptor”) for describing “sound signals to compose a multi-layered sound field.” Regarding the Basic sound field descriptor, we filed a Korean Patent Application (10-2012-0112984); the Basic sound field descriptor is reviewed below to aid understanding of this disclosure.
In order to describe multichannel sound signals to compose a single-layered sound field, it is necessary to describe which reproduction position each channel corresponds to. The described information is called a descriptor, and it is described as metadata in the header of the corresponding multichannel sound signal or in the headers of the individual sound channels constituting the multichannel signal.
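The channel-to-position description above can be pictured as a small header structure. The sketch below is an illustration, not the actual descriptor syntax; the channel labels follow a conventional 5-channel front/surround layout, and the azimuth and elevation angles are assumed example values.

```python
# Illustrative header for a single-layered sound field: each channel label
# maps to its described reproduction position (angles in degrees, azimuth
# positive to the listener's left).
header = {
    "sound_field": "5.1",
    "channels": {
        "L":  {"azimuth":   30.0, "elevation": 0.0},
        "R":  {"azimuth":  -30.0, "elevation": 0.0},
        "C":  {"azimuth":    0.0, "elevation": 0.0},
        "Ls": {"azimuth":  110.0, "elevation": 0.0},
        "Rs": {"azimuth": -110.0, "elevation": 0.0},
    },
}

def position_of(header, label):
    """Return the described reproduction position for a channel label."""
    channel = header["channels"][label]
    return channel["azimuth"], channel["elevation"]
```

A renderer reads this mapping to decide which loudspeaker (or which panned position) each channel should be reproduced from.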
Table 1 illustrates terms and definitions of the Basic sound field descriptor. The Basic sound field descriptor is employed for production and exchange of complete mix programs (i.e. programs including all sound required for reproduction) with multichannel sound, for example.
The Sound Essence descriptor includes a descriptor of a program, a descriptor (name) of the Sound-field, and other relevant descriptors.
As shown in
The Sound Channel descriptor includes the Channel label descriptor and/or Channel Position descriptor.
The following describes the descriptors in the Basic sound field descriptor. Note that some of the descriptors overlap with each other in anticipation of different program exchange scenarios. However, a program producer or the like is able to appropriately choose necessary descriptors for each program exchange scenario.
The Basic sound field descriptor includes (A) Sound Essence descriptors, (B) Sound-field configuration descriptors, and (C) Sound Channel descriptors.
Table 2 shows (A) Sound Essence descriptors in the Basic sound field descriptor.
Table 3 shows (B) Sound-field configuration descriptors in the Basic sound field descriptor.
Table 4 shows (C) Sound Channel descriptors in the Basic sound field descriptor.
Table 5 shows C.1 Channel label descriptors, which are descriptors of the Channel label data included in the Sound Channel descriptors.
Table 6 shows C.2 Channel position descriptors, which are descriptors of the Channel position data included in the Sound Channel descriptors.
We extend the Basic sound field descriptor, which is the description method for the “sound signals to compose a single-layered sound field” as mentioned above, to the Extended sound field descriptor, which is the description method for the “sound signals to compose a multi-layered sound field.”
Table 7 illustrates terms and definitions of the Extended sound field descriptor.
The Sound Essence descriptor includes the descriptor of the program, the descriptor (name) of the Sound-field, and the other relevant descriptors.
As shown in
The Sound Channel descriptor includes the Channel label descriptor and/or the Channel Position descriptor.
Table 8 shows (A) Sound Essence descriptors in the Extended sound field descriptor.
Table 9 shows A.2 Sound-field descriptors in the Extended sound field descriptor.
Regarding (B) Sound-field configuration descriptors and (C) Sound Channel descriptors in the Extended sound field descriptor, these descriptors are the same as those of the Basic sound field descriptor, and a description thereof is omitted.
The mixing unit 11 mixes sound signals (Sound Sources 1-M) output from a “production system for sound signals to compose a multi-layered sound field” and outputs, to the coding unit 13, sound signals to compose the multi-layered sound field including Spatial anchor, Commentary, Dialogue, and Object signals.
The metadata addition unit 12 produces the metadata to be described as the Extended sound field descriptor of the multi-layered sound field including Spatial anchor, Commentary, Dialogue, and Object signals, and outputs the produced metadata to the coding unit 13.
Based on the mixed sound signals received from the mixing unit 11 and the metadata received from the metadata addition unit 12, the coding unit 13 produces the sound signals according to the Extended sound field descriptor, encodes the produced sound signals, and outputs the encoded sound signals to the multiplexer 14.
The multiplexer 14 receives, from the coding unit 13, the encoded sound signals according to the Extended sound field descriptor, and multiplexes the received sound signals into a bit stream in order to convey the multiplexed sound signal to sound signal reproduction equipment via broadcast or transmission. The multiplexer 14 transmits the multiplexed bit stream to remote places such as homes via radio waves, IP circuits, and the like.
The monitoring unit 15 is used for checking contents of the sound signals and the metadata.
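The production flow through the units 11 to 14 can be illustrated schematically. In the hedged Python sketch below, the per-layer signals and the metadata are plain Python objects and JSON serves as a stand-in container for the bit stream; actual equipment would apply real audio coding, and all function and field names are assumptions.

```python
import json

def encode(layer_signals, metadata):
    # Coding unit 13 (schematic): pair the mixed per-layer signals with the
    # Extended sound field descriptor metadata. Real equipment would also
    # bit-rate-compress the audio here.
    return {"metadata": metadata, "layers": layer_signals}

def multiplex(encoded):
    # Multiplexer 14 (schematic): serialize everything into a single bit
    # stream for broadcast or transmission.
    return json.dumps(encoded).encode("utf-8")

metadata = {
    "number_of_layers": 2,
    "layer_types": ["Spatial anchor", "Commentary"],
    "languages": [None, "ja"],
}
layer_signals = {"Spatial anchor": [0.25, 0.5], "Commentary": [0.0, 0.5]}
bitstream = multiplex(encode(layer_signals, metadata))
```

Because the descriptor metadata travels in the same bit stream as the layered signals, the receiving side can recover both without out-of-band information.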
The demultiplexer 21 receives, via broadcast or transmission, the sound signal according to the Extended sound field descriptor that has been multiplexed into the bit stream, and demultiplexes the received sound signal into the respective sound signals of the sound field layers and the metadata. The demultiplexer 21 also outputs the demultiplexed sound signals and metadata to the decoding unit 22.
The decoding unit 22 decodes the encoded sound signals and metadata received from the demultiplexer 21 and outputs, to the rendering reproduction unit 23, the decoded Spatial anchor, Commentary, Dialogue, and Object signals together with the metadata.
Based on the Extended sound field descriptor, the rendering reproduction unit 23 either reproduces the original sound signals as they are, or renders (e.g. down-mixes) the sound signals based on the reproduction environment (e.g. the number of speaker channels and the display size) before reproducing them. That is to say, the rendering reproduction unit 23 processes (e.g. switches, converts, and renders) the sound signals based on the Extended sound field descriptor in a sound reproduction environment different from the environment during program production.
The environment information input unit 24 displays to a user the metadata information described as the Extended sound field descriptor, receives user inputs of the reproduction environment information and user demand information (e.g. language selection for the multiplexed sound, the speaker configuration, and the display size), and outputs the reproduction environment information and user demand information to the rendering reproduction unit 23.
The monitoring unit 25 is used for checking a result of reproduction performed by the rendering reproduction unit 23, as well as program viewing.
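The receiving flow through the demultiplexer 21 and decoding unit 22 can likewise be illustrated schematically. In this hedged Python sketch the bit stream is a JSON container (an assumption for illustration, not the actual coding), and demultiplexing simply splits it back into the per-layer signals and the descriptor metadata for the rendering reproduction unit.

```python
import json

def demultiplex(bitstream):
    # Demultiplexer 21 (schematic): split the received bit stream back into
    # the per-layer sound signals and the Extended sound field descriptor
    # metadata.
    decoded = json.loads(bitstream.decode("utf-8"))
    return decoded["layers"], decoded["metadata"]

# A bit stream as an illustrative production side might have multiplexed it:
bitstream = json.dumps({
    "metadata": {"number_of_layers": 2,
                 "layer_types": ["Spatial anchor", "Commentary"],
                 "languages": [None, "ja"]},
    "layers": {"Spatial anchor": [0.25, 0.5], "Commentary": [0.0, 0.5]},
}).encode("utf-8")

layer_signals, metadata = demultiplex(bitstream)
```

The metadata recovered here is exactly what the environment information input unit 24 presents to the user for language and layer selection.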
The following describes specific usage embodiments of the sound signal production equipment and the sound signal reproduction equipment. For example, the disclosed sound signal production equipment and sound signal reproduction equipment make it possible to easily control narration language switching and narration reproduction position relocation in accordance with the home reproduction environment and user demand. Furthermore, in a reproduction environment whose video display has a different size from that used in production, the disclosed equipment makes it possible to adjust the sound image position in the sound field layer of the “video/sound linked sound source”, which requires the sound image position to be linked to the video, to the video display and to perform reproduction, while maintaining high quality sound providing as much of the sense of presence as was produced.
As an example of program production using the Extended sound field descriptor, i.e., the format of the “sound signals to compose a multi-layered sound field”, suppose a case where not only the sound signals of Japanese or Korean narrations and dialogues but also the sound signals of various languages such as English are produced. In the above example, the sound signal production system uses the format of the “sound signals to compose a multi-layered sound field” including the sound field layer of the international sound (Spatial anchor) used irrespective of language, and the sound field layers (Commentary, Dialogue) of the narrations and dialogues of particular languages.
In this example, the metadata addition unit 12 adds the metadata shown in Table 10 to the header of the corresponding multichannel sound format signal or to the headers of the individual sound channels constituting the multichannel signal, according to the Extended sound field descriptor.
The user inputs the information of the reproduction system, such as the speaker arrangement information, and the user demand, such as the narration sound position to be reproduced, and controls the sound signals (e.g. the user arbitrarily adjusts the reproduction position). For example, in the home reproduction environment, the sound signals can be reproduced under control of the desired narration language and narration reproduction position while high quality sound providing as much of the sense of presence as was produced is maintained.
In order to achieve the above function, the user at the receiving side inputs, through the environment information input unit 24, the information of the desired narration sound (e.g. the narration language that the user demands to reproduce and the narration reproduction position) and the information of the reproduction system (e.g. speaker arrangement information). The rendering reproduction unit 23 switches to the sound signal of the designated “narration language” layer from among the produced narration languages described in the metadata. The rendering reproduction unit 23 is also fed the desired narration reproduction position, the speaker arrangement information, and the sound signal of the produced “narration language” layer; it relocates the switched sound signal so that reproduction is performed from the designated narration reproduction position, and renders the signal so that the sound quality providing as much of the sense of presence as was produced is achieved. Subsequently, the rendering reproduction unit 23 adds the international sound used irrespective of language to the rendered signal and reproduces the signal.
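The language switching described above can be sketched minimally: select the layer whose language matches the user's demand, then add it, sample by sample, to the international sound used irrespective of language. The layer names, structure, and sample values below are assumptions for illustration, and the addition is a schematic stand-in for the rendering reproduction unit's mixing.

```python
def reproduce_with_language(layers, requested_language):
    # Schematic rendering reproduction: pick the commentary layer of the
    # requested language and add it to the international sound layer.
    international = layers["Spatial anchor"]["samples"]
    commentary = next(
        layer["samples"] for layer in layers.values()
        if layer["language"] == requested_language
    )
    return [i + c for i, c in zip(international, commentary)]

layers = {
    "Spatial anchor": {"language": None, "samples": [0.25, 0.5, 0.25]},
    "Commentary-ja":  {"language": "ja", "samples": [0.0, 0.5, 0.0]},
    "Commentary-en":  {"language": "en", "samples": [0.25, 0.0, 0.25]},
}
output = reproduce_with_language(layers, "en")
```

Because the international sound carries everything except narration, switching languages changes only the added commentary layer, not the rest of the program sound.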
As an example of program production using the Extended sound field descriptor, i.e., the format of the “sound signals to compose a multi-layered sound field”, suppose a case where the “sound requiring the link between video and sound positions” and the “sound directly irrespective of the video position” are separately produced and recorded. Sound signals include not only the “sound requiring the link between video and sound positions” (e.g. the dialogue of an actor and sound emitted from an object on the screen) but also the “sound directly irrespective of the video position” (e.g. sound effects for enhancing the sense of presence of an entire program). In the above example, the sound signal production system uses the format of the “sound signals to compose a multi-layered sound field” including the sound field layers of the “sound requiring the link between video and sound positions” and the “sound directly irrespective of the video position.”
In this example, the metadata addition unit 12 adds the metadata shown in Table 11 to the header of the corresponding multichannel sound format signal or to the headers of the individual sound channels constituting the multichannel signal, according to the Extended sound field descriptor.
In the reproduction environment with the video display having the different size than the size according to the production conditions as shown in
In order to achieve the above function, the user at the receiving side inputs, through the environment information input unit 24, the information of the reproduction system (e.g. speaker arrangement and video display information). When the conditions for the video display and the speaker arrangement during production are the same as those at the receiving side, the rendering reproduction unit 23 neither converts nor renders the received sound signals. In this case, the rendering reproduction unit 23 adds the “sound requiring the link between video and sound positions” and the “sound directly irrespective of the video position” and reproduces the added sound. On the other hand, when either the video display conditions or the speaker arrangement conditions differ, the rendering reproduction unit 23 converts the received sound signals by rendering or down-mixing so that the sound quality providing as much of the sense of presence as was produced is achieved, and reproduces the added sound signals. When the video display size is different but the speaker arrangement is the same, the rendering reproduction unit 23 renders the sound signals of the layer of the “sound requiring the link between video and sound positions” so that the width of the sound image equals the width of the video display. The rendering reproduction unit 23 then adds the rendered “sound requiring the link between video and sound positions” and the unconverted, un-rendered “sound directly irrespective of the video position” and reproduces the added sound.
Here, the rendering processing, i.e., the processing for matching the width of the sound image of the “sound requiring the link between video and sound positions” to the video display size, can be easily performed by using the sound field position information of Azimuth angle and Elevation angle included in the Spatial position data defined in the Channel position data.
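The width-matching rendering described above can be sketched numerically. The Python sketch below assumes, purely as an illustration and not as the disclosed method, that azimuths in the video-linked layer scale linearly with the ratio of the horizontal half-angles subtended by the production and receiving displays; all function names, display sizes, and viewing distances are hypothetical.

```python
import math

def display_half_angle(width_m, viewing_distance_m):
    """Half of the horizontal viewing angle subtended by the display, in degrees."""
    return math.degrees(math.atan((width_m / 2) / viewing_distance_m))

def rescale_azimuths(channel_positions, production_half_angle, receiver_half_angle):
    # Narrow (or widen) the sound image of the video-linked layer so that
    # its width matches the receiving-side display; elevation is kept as-is.
    scale = receiver_half_angle / production_half_angle
    return {label: {"azimuth": pos["azimuth"] * scale,
                    "elevation": pos["elevation"]}
            for label, pos in channel_positions.items()}

# Production used a large screen; the home display subtends a smaller angle.
prod = display_half_angle(width_m=6.0, viewing_distance_m=5.0)   # about 31 deg
home = display_half_angle(width_m=1.2, viewing_distance_m=2.0)   # about 17 deg
positions = {"L": {"azimuth": 30.0, "elevation": 0.0},
             "R": {"azimuth": -30.0, "elevation": 0.0}}
narrowed = rescale_azimuths(positions, prod, home)
```

Only the layer flagged as video-linked would pass through such a rescaling; the layer irrespective of the video position is reproduced with its positions unchanged.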
Thus, according to the above embodiment, the Extended sound field descriptor includes the number of sound field layers, the type of each sound field layer, and the language information. With the above structure, the sound signal description method corresponding to the format of the “sound signals to compose a multi-layered sound field” is achieved.
Furthermore, it is preferable that the type of each sound field layer indicates which one of international sound and a particular language the sound field layer comprises, the international sound being used irrespective of language. With the above structure, in the home reproduction environment, for example, the sound signals can be reproduced under control in terms of the desired narration language and narration reproduction position while the high quality sound providing as much of the sense of presence as was produced is maintained.
Moreover, according to the above embodiment, the Extended sound field descriptor includes the number of sound field layers and a video link identifier indicating, for each sound field layer, whether the sound field layer is linked to video. With the above structure, in a reproduction environment whose video display has a different size from that used in production, for example, the sound image position in the sound field layer of the “video/sound linked sound source”, which requires the link between video and sound image positions, can be adjusted to the video display and reproduced, while the high quality sound providing as much of the sense of presence as was produced is maintained.
Moreover, with the sound signal production equipment and the sound signal reproduction equipment according to the above embodiments, the sound signal described by the Extended sound field descriptor can be produced and reproduced. Note that the disclosed equipment also includes, in its scope, any equipment that transmits the sound signal described by the Extended sound field descriptor to remote places such as homes via radio waves, IP circuits, and the like, any equipment that stores and records in a recording medium the sound signal described by the Extended sound field descriptor, and a recording medium in which the sound signal described by the Extended sound field descriptor is stored and recorded.
The sound signal production equipment according to one of the embodiments produces the metadata including the number of sound field layers, the type of each sound field layer, and the language information, produces the sound signal according to the Extended sound field descriptor based on an input sound signal and the metadata, and multiplexes the sound signal into the bit stream. Furthermore, the sound signal reproduction equipment according to one of the embodiments converts the sound signal according to the number of sound field layers, the type of each sound field layer, and the language information included in the sound signal and according to the reproduction environment information and user demand information, and reproduces the converted sound signal. The above structure makes it possible to produce and view a program using the “sound signals to compose a multi-layered sound field.” In particular, the sound signal reproduction equipment adds, to the international sound, the sound signal of the particular language that has been switched by the user, and reproduces the added sound. The above structure allows the user to arbitrarily carry out an operation such as language selection with use of the received metadata, thereby making it possible to switch and relocate the appropriate narration language and narration reproduction position, while the high quality sound providing as much of the sense of presence as was produced is maintained.
Moreover, the sound signal production equipment according to one of the embodiments produces the metadata including the number of sound field layers and a video link identifier indicating, for each sound field layer, whether the sound field layer is linked to video, produces the sound signal according to the Extended sound field descriptor based on the input sound signal and the metadata, and multiplexes the sound signal into the bit stream. Moreover, the sound signal reproduction equipment according to one of the embodiments converts the sound signal according to the video link identifier, which indicates, for each sound field layer, whether the sound field layer is linked to video, and according to the reproduction environment information of the user, and reproduces the converted sound signal. The above structure makes it possible to produce and view the program using the “sound signals to compose a multi-layered sound field.” In particular, when the video link identifier indicates that the sound field layer is linked to video, the rendering reproduction unit renders the sound signal of the sound field layer based on information about the video display of the user, and reproduces the rendered sound signal. By inputting the information of the reproduction system (e.g. the video display) of the user and by using the information of the video display during production described in the metadata, the above structure makes it possible to render and convert the sound image position in the sound field layer of the “video/sound linked sound source”, which requires the link between video and sound image positions, so that the sound image position is adjusted to the video display, while the high quality sound providing as much of the sense of presence as was produced is maintained.
While our methods and equipment have been described based on the drawings and embodiments, it should be noted that a person skilled in the art can readily make various modifications and changes in accordance with the disclosure. As such, it should also be noted that the modifications and changes are within the scope of the disclosure. For example, the function or the like included in each element, each means, and each step is subject to rearrangement, and several means and steps can be combined into a single means or step or they can be divided.
We make it possible to describe “sound signals to compose a multi-layered sound field” and to produce and view/listen to a program using such sound signals. As a result, interoperability between different next-generation sound systems is achieved, and even in a sound reproduction environment different from the environment during program production, switching, conversion, and rendering of the sound signals are facilitated.
Number | Date | Country | Kind |
---|---|---|---|
2013-010544 | Jan 2013 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2013/007390 | 12/16/2013 | WO | 00 |