The present application relates generally to the processing of audio in a captured scene and, more particularly, to the adjustment of spatially localizable audio in a captured scene that includes an image, where the particular spatially localizable audio being adjusted is associated with an object from the captured scene that is selected by a user.
As the computing power of personal computers and/or handheld electronic devices increases, virtual reality and augmented reality applications are becoming more mainstream and more generally available to the average consumer. While virtual reality applications may attempt to create a substitute for the real world with a simulated world, augmented reality attempts to alter one's perception of the real world through an addition, an alteration, or a subtraction of elements from a real world experience.
While most augmented reality experiences focus extensively on addressing the visual aspects of reality, the present inventors recognize that an ability to make adjustments that affect the other senses, such as sound, smell, taste and/or touch, can further enhance the experience. However, effectively addressing the other senses often requires an ability to spatially isolate perceived aspects of those senses and to associate them with objects and/or spaces that are visually being presented to the user. For example, when visually adding, altering, and/or removing an object from a scene, a failure to similarly add, alter, and/or remove other aspects of the object, such as any sound being produced by the object, can result in the intended change to reality having a less than desired immersive effect. While it can be relatively straightforward to alter the visual aspects of a scene and/or elements within a scene, the pairing and corresponding adjustment of the perceived portion of the audio with the affected visual elements or aspects can sometimes be less straightforward, and can be further complicated by an augmented reality application that attempts to modify the user's experience, at the user's direction, in real time.
The present inventors have recognized that in order to enhance an augmented reality experience, it would be beneficial to be able to identify and address spatially localizable audio aspects of an experience in addition to the visual aspects of an experience, and to match the particular spatially localizable audio aspects and any changes thereto with the visual aspects being perceived and selected for adjustment by the user.
The present application provides a method for processing audio in a captured scene including an image and spatially localizable audio. The method includes capturing a scene including image information and spatially localizable audio information. The captured image information of the scene is then presented to a user via an image reproduction module. An object in the presented image information, which is a source of spatially localizable audio information, is then selected, and the spatially localizable audio information in the direction of the selected object is isolated. The isolated spatially localizable audio information is then altered.
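By way of illustration only, the overall flow of the method could resemble the following Python sketch, in which capture_scene, present_image, get_user_selection, direction_of, isolate_audio, and alter_audio are hypothetical helpers standing in for the modules described herein, rather than elements of any particular embodiment.

```python
# Illustrative sketch of the described method; every helper called below is a
# hypothetical stand-in for the capture, presentation, selection, isolation,
# and alteration modules discussed in this application.
def process_captured_scene(device):
    image, audio = capture_scene(device)        # image + spatially localizable audio
    present_image(device, image)                # present captured image to the user
    obj = get_user_selection(device, image)     # user selects an object in the image
    direction = direction_of(obj)               # direction of that object in the scene
    isolated = isolate_audio(audio, direction)  # isolate audio from that direction
    return alter_audio(isolated)                # alter the isolated audio
```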
In at least some instances, altering the isolated spatially localizable audio information includes adjusting characteristics of the isolated spatially localizable audio information, where in some instances adjusting the characteristics of the isolated spatially localizable audio information can include altering the apparent location of origin of the isolated spatially localizable audio information.
In at least some further instances, altering the isolated spatially localizable audio information includes removing the isolated spatially localizable audio information from the captured audio, and replacing the removed isolated spatially localizable audio information with updated spatially localizable audio information.
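One way such a remove-and-replace operation could be realized, sketched below purely for illustration, is to subtract the isolated stream from the captured mixture and mix in the replacement. The sketch assumes the isolated signal is time-aligned with its contribution to the mixture, and the function and argument names are illustrative only.

```python
import numpy as np

def remove_and_replace(mixture, isolated, replacement):
    """Subtract an isolated source from the captured mix, then add new audio.

    mixture, isolated, replacement: 1-D float sample arrays at a common
    sample rate, with `isolated` time-aligned to its contribution within
    `mixture` (an assumption of this sketch, not a general guarantee).
    """
    residual = mixture.copy()
    n = min(len(residual), len(isolated))
    residual[:n] -= isolated[:n]        # remove the selected source
    m = min(len(residual), len(replacement))
    residual[:m] += replacement[:m]     # insert the updated audio
    return residual
```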
In at least some still further instances, the method further includes altering an appearance of the selected object in the presented image information.
The present application further provides a device for processing audio in a captured scene including an image and spatially localizable audio. The device includes an image capture module for receiving image information, a spatially localizable audio capture module for receiving spatially localizable audio information, and a storage module for storing at least some of the received image information and received spatially localizable audio information. The device further includes an image reproduction module for presenting captured image information to a user, and a user interface for receiving a selection from the user, which corresponds to an object in the captured image information presented to the user. The device still further includes a controller, which includes an object direction identification module for determining a direction of the selected object within the captured scene information, a spatially localizable audio information isolation module for isolating the spatially localizable audio information within the captured scene information in the direction of the selected object, and a spatially localizable audio information alteration module for altering the isolated spatially localizable audio information.
These and other objects, features, and advantages of the present application are evident from the following description of one or more preferred embodiments, with reference to the accompanying drawings.
While the present application is susceptible of embodiment in various forms, there is shown in the drawings and will hereinafter be described presently preferred embodiments with the understanding that the present disclosure is to be considered an exemplification and is not intended to be limited to the specific embodiments illustrated.
In the illustrated embodiment, the device corresponding to a radio frequency telephone includes a display 102 which covers a large portion of the front face. In at least some instances, the display 102 can incorporate a touch sensitive matrix that can help facilitate the detection of one or more user inputs relative to at least some portions of the display, including an interaction with visual elements being presented to the user via the display 102. In some instances, the visual elements could correspond to objects with which the user can interact. In other instances, the visual elements can form part of a visual representation of a keyboard including one or more virtual keys and/or one or more buttons with which the user can interact and/or select for a simulated actuation. In addition to one or more virtual user actuatable buttons or keys, the device 100 can include one or more physical user actuatable buttons 104. In the particular embodiment illustrated, the device has three such buttons located along the right side of the device.
The exemplary device 100, illustrated in
While in the particular embodiment shown, a single speaker 106 and a single microphone 108 are illustrated, the device 100 could include more than one of each, to enable spatially localizable information to be captured and/or encoded in the audio to be played back and perceived by the user. It is further possible that the device could be used with a peripheral and/or an accessory, which can be used to supplement the included image and audio capture and/or playback capabilities.
In addition to and/or as an alternative to the serial bus port 206, a connector port could take still further forms. For example, an interface could be present on the back surface of the device which includes pins or pads arranged in a predetermined pattern for interfacing with another device, and which could be used to supply data and/or power signals. It is also possible that additional devices could interface or interact with a main device through a less physical connection that may incorporate one or more forms of wireless communications, such as radio frequency, infra-red (IR), near field communication (NFC), etc.
In an augmented reality scene, a virtual character may be added, and an existing entity may be changed and/or removed. The changes could include alterations to the visual aspects of elements captured in the scene, as well as other aspects associated with other senses including audio aspects. For example, the sounds that the bird or the dog may be making could be altered. In some instances, the dog could be made to sound more like a bird, and the bird could be made to sound more like a dog. In other instances, the augmented reality scene could be altered to convert the sounds the dog and the bird are making to appear to be more like the language of a person. Alternatively and/or additionally, the tone and/or the intensity of the animal sounds could be altered to create or enhance the emotions appearing to be conveyed. For example, the sound coming from a particular animal could be amplified with respect to the surroundings and other characters, so that the user/observer is able to focus more on the behavior of the particular animal. Still further, a change in the environmental surroundings, real or virtual, could be accompanied by changes to the animal sounds, by adding equalization and/or reverb.
A virtual conversation involving the user 302 with another entity included in the scene and/or added to the scene could be created as part of an augmented reality application which is being executed on the device 100. In some instances, a virtual conversation between the user and a virtual character could be used to support the addition of services, such as the services of a virtual guide or narrator. The added and/or altered aspects of the scene could be included in the information being presented to the user 302 via the device 100 which is also capturing the original scene, such as via the display 102 of the device 100.
The exemplary device further includes a spatially localizable audio capture module 506, which in at least some instances can include a microphone array 508 including a plurality of spatially distinct audio capture elements. The ability to spatially localize captured audio enables the captured audio to be isolated and/or associated with various areas in a captured image, which can then be correspondingly associated with items, elements and characters contained within an image. In at least some instances, the identified spatially distinct audio corresponds to various streams of audio that are each received from a particular direction, where the nature and arrangement of the audio capture elements within a microphone array can be used to help determine the ability to spatially differentiate between the various sources of received audio. In at least some instances, the microphone array 508 can be included as part of a peripheral that can attach to the device 100 via one or more ports, which can include a universal serial bus port, such as port 206.
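For a source that is far from the array, the relative arrival delay at each capture element can be estimated from the array geometry. The following sketch assumes a planar (far-field) wavefront and a two-dimensional array layout; the coordinate convention is an illustrative choice rather than a feature of any particular embodiment.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second, approximate at room temperature

def arrival_delays(mic_positions, azimuth_rad):
    """Relative plane-wave arrival delays, in seconds, for a far-field source.

    mic_positions: (M, 2) array of microphone x/y coordinates in meters.
    azimuth_rad: direction toward the source, in radians, in the array plane.
    """
    toward_source = np.array([np.cos(azimuth_rad), np.sin(azimuth_rad)])
    # Microphones with a larger projection toward the source hear it earlier,
    # hence the negative sign before normalization.
    delays = -(mic_positions @ toward_source) / SPEED_OF_SOUND
    return delays - delays.min()  # normalize so the earliest arrival is zero
```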
Once captured, the received image information 510 and received spatially localizable audio information 512 can be maintained in a storage module 514. Once maintained in the storage module 514, the captured image information 510 and audio information 512 can be modified and/or adjusted so as to alter and/or augment the information that is subsequently presented to the user and/or one or more other people as part of the augmented scene. The storage module 514 could include one or more forms of volatile and/or non-volatile memory, including conventional ROM, EPROM, RAM, or EEPROM. The possible additional data storage capabilities may also include one or more forms of auxiliary storage, which is either fixed or removable, such as a hard drive, a floppy drive, or a memory stick. One skilled in the art will appreciate that still further forms of storage elements could be used in connection with the processing of audio in a captured scene without departing from the teachings of the present disclosure. The storage module can additionally include one or more sets of prestored instructions 516, which could be used in connection with a microprocessor that could form all or parts of a controller in the management of the desired functioning of the device 100 and/or one or more applications being executed on the device.
Correspondingly, adjustments of the captured information are generally managed under the control of a controller 518, which can be associated with one or more microprocessors. In the same or other instances, the controller can incorporate state machines and/or logic circuitry, which can be used to implement, at least partially, various modules and/or functionality associated with the controller 518. In some instances, all or parts of storage module 514 could also be incorporated as part of the controller 518.
In the illustrated embodiment, the controller 518 includes an object direction identification module 520, which can be used to determine a selected object and a corresponding direction of the selected object within the scene relative to the user 302 and the device 100. The selection is generally managed using a user selection module 522 of the user interface 524, which can be included as part of the device 100. In some instances, the user selection module 522 is incorporated as part of a touch sensitive display 528, which is also capable of visually presenting captured scene information to the user 302 as part of an image reproduction module 526 of the user interface 524. A display 530 that does not incorporate touch sensitive capability could also be used for visually presenting captured scene information to the user. However, in such instances, an alternative form of accepting input from the user for purposes of user selection may be used.
As an alternative and/or in addition to using a touch sensitive display 528 for purposes of receiving a user selection from the user 302, the user selection module 522 can include one or more of a cursor control device 532, a gesture detection module 534, or a microphone 536. The cursor control device 532 can include one or more of a joystick, a mouse, a track pad, a track ball or a track point, each of which could be used to move a cursor relative to an image being presented via a display. When a selection is indicated, the position of the cursor may highlight and/or coincide with an associated area or element in the image being displayed, which allows the corresponding area or element to be selected.
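For illustration, a touch or cursor position on the displayed image could be mapped to a bearing within the captured scene using a simple pinhole-camera model; the field-of-view values below are illustrative defaults rather than parameters of any particular device.

```python
import math

def touch_to_direction(x_px, y_px, width_px, height_px,
                       hfov_deg=60.0, vfov_deg=45.0):
    """Map a touch point on the displayed image to a bearing in the scene.

    Assumes the display shows the camera's full field of view under a
    pinhole model. Returns (azimuth, elevation) in degrees relative to the
    camera axis; positive azimuth is rightward, positive elevation upward.
    """
    half_w = math.tan(math.radians(hfov_deg / 2.0))
    half_h = math.tan(math.radians(vfov_deg / 2.0))
    nx = 2.0 * x_px / width_px - 1.0    # -1 at left edge, +1 at right edge
    ny = 1.0 - 2.0 * y_px / height_px   # -1 at bottom edge, +1 at top edge
    azimuth = math.degrees(math.atan(nx * half_w))
    elevation = math.degrees(math.atan(ny * half_h))
    return azimuth, elevation
```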
A gesture detection module 534 could be used to detect movements of the user 302 and/or a pointer controlled by the user relative to the device 100, where such movements could have one or more predesignated meanings that allow the controller 518 to identify elements or areas in the image information and better manage any adjustments to the captured scene. In some instances, the gesture detection module 534 could be used in conjunction with a touch sensitive display 528 and/or a related set of sensors. For example, the gesture detection module could be used to detect a scratching relative to an area or element being visually presented to the user. The scratching might be used to indicate a user's desire to delete an object associated with the corresponding area or element being scratched. Alternatively, the gesture detection module could be used to detect an object selection gesture, such as a circling gesture, which could be used to identify a selection of an object.
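A circling selection gesture could be recognized, for example, by testing whether a touch path closes on itself; the heuristic and its pixel thresholds below are illustrative assumptions, not features of any particular gesture detection module.

```python
import math

def is_circle_gesture(points, closure_tol_px=40.0, min_path_px=200.0):
    """Heuristic test for a circling selection gesture.

    points: list of (x, y) touch samples in pixels.
    Returns True when the traced path is long enough and ends near where
    it began, which is taken here as a closed, circle-like stroke.
    """
    if len(points) < 8:
        return False
    path_len = sum(math.dist(points[i], points[i + 1])
                   for i in range(len(points) - 1))
    closure = math.dist(points[0], points[-1])
    return closure < closure_tol_px and path_len > min_path_px
```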
A microphone 536 could still further alternatively and/or additionally be used to provide a detectable audible description from the user, which might assist in the selection of an area or element to be affected by a desired subsequent augmentation. Language parsing could be used to determine the meaning of the detected audible description, and the determined meaning might then be paired with a corresponding visual context determined to be contained in the captured image information being presented to the user.
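By way of a simplified illustration, the determined meaning of an audible description could be paired with labeled regions of the presented image using keyword overlap; real language parsing would be considerably more involved, and the names below are hypothetical.

```python
def match_description_to_object(description, labeled_regions):
    """Pick the labeled on-screen region best matching a spoken description.

    labeled_regions: dict mapping text labels (e.g. "brown dog") to screen
    regions. Matching here is simple bag-of-words overlap, standing in for
    real language parsing paired with visual context.
    """
    words = set(description.lower().split())
    def overlap(label):
        return len(words & set(label.lower().split()))
    best = max(labeled_regions, key=overlap, default=None)
    return labeled_regions[best] if best and overlap(best) > 0 else None
```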
Once a direction for the object and/or area to be affected has been determined, the controller 518, including a spatially localizable audio information isolation module 538, can then identify audio associated with the identified object and/or area with the assistance of the spatially localizable audio capture module 506. The identified spatially localized audio associated with the area or object of interest can then be altered using a spatially localizable audio information alteration module 540, which is included as part of the controller 518. In some instances, in addition to altering the identified spatially localized audio associated with a particular area or object, it may be desirable to also alter the corresponding visual appearance of the same. Such an alteration could be managed using a corresponding appearance alteration module 542. The captured scene, which has been augmented and/or altered, could then be presented to the user 302 and/or others. For example, the augmented/altered version of the captured scene could be presented to the user 302 using the display 102 and one or more audio transducers 544, which can sometimes take the form of one or more speakers. In some instances, the one or more audio transducers 544 will include speaker 106, which is illustrated in
In at least some instances, the device 100 will also include wireless communication capabilities. Where the device 100 includes wireless communication capabilities, the device will generally include a wireless communication interface 546, which is coupled to an antenna 548. The wireless communication interface 546 can further include one or more of a transmitter 550 and a receiver 552, which can sometimes take the form of a transceiver 554. While at least some of the illustrated embodiments of the present application can incorporate wireless communication capabilities, such capabilities are not essential.
By incorporating wireless communication capabilities, one may be able to distribute at least some of the processing associated with any alteration of the audio in a captured scene, including the offloading of all or parts of the processing to another device, such as a central server that could be part of the wireless communication network infrastructure. Furthermore, the microphone array could incorporate microphones from other nearby devices, which may be communicatively coupled to the device 100 via the wireless communication interface 546. It may still further be possible to offload and/or distribute other aspects of the present application making use of wireless communication capabilities without departing from the teachings of the present application.
By controlling the weighting and the relative delays of the various microphone inputs before combining, one can form a beam pattern that can then be used to enhance and/or diminish the audio received from different directions. The corresponding beam pattern can then be directed appropriately toward different areas of the captured scene, so as to help isolate a particular portion of the audio. The process of combining and beam forming can be performed in either the time or the frequency domain. Other alternatives are also possible. For example, it may be possible to extract the voice of the talker and/or the audio to be isolated out of a scene by using conventional noise-suppression techniques that need not rely on beam forming. Alternatively, blind source separation, independent component analysis, and other techniques for computational auditory scene analysis can separate the components of the audio stream, and allow them to be associated with the objects in the view-finder.
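A minimal time-domain delay-and-sum beamformer along these lines could resemble the following sketch, which assumes equal weights by default and rounds delays to whole samples for simplicity; a practical implementation would use fractional delays and avoid the wrap-around behavior of np.roll.

```python
import numpy as np

def delay_and_sum(channels, delays_s, sample_rate, weights=None):
    """Steer a beam by delaying, weighting, and summing microphone channels.

    channels: (M, N) array, one row of N samples per microphone.
    delays_s: per-microphone arrival delays in seconds, e.g. computed from
    the array geometry for the direction of the selected object.
    """
    m, _ = channels.shape
    weights = np.full(m, 1.0 / m) if weights is None else np.asarray(weights)
    out = np.zeros(channels.shape[1])
    for ch, delay, w in zip(channels, delays_s, weights):
        shift = int(round(delay * sample_rate))  # integer-sample approximation
        out += w * np.roll(ch, -shift)           # advance later arrivals to align
    return out
```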
By steering a beam in the determined direction of a particular element and/or area, the audio from that element and/or area can be highlighted and correspondingly isolated. Once isolated, the audio associated with the elements or areas in the corresponding direction can be morphed and/or altered as desired by an audio modification module 608. For example, level adjustments can be made to all or parts of the isolated audio, and audio effects that affect various characteristics of the isolated audio can be added. Examples of such adjustments can include adding reverberation, spectral enhancements, pitch shifting and/or time scale changes. It is further possible to remove the isolated audio and replace the same with different audio information. The replacement audio could include synthesized or other recorded sounds. In some instances, the recorded sounds being used for addition and/or replacement may come from a database. For example, audio from a database having verbal content could be added in such a way that it is associated with an object, such as a tree 306 or a dog 310, or a virtual character.
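For instance, a level adjustment and a crude pitch shift could be applied to the isolated audio as sketched below; resampling by linear interpolation also changes duration, so a practical audio modification module would more likely use a phase vocoder or similar technique to shift pitch independently of time scale.

```python
import numpy as np

def adjust_level_and_pitch(audio, gain_db=6.0, pitch_ratio=1.5):
    """Apply a gain change and a simple resampling-based pitch shift.

    pitch_ratio > 1 raises the pitch (and shortens the audio), since the
    signal is read back faster; this couples pitch and time scale, which
    is the stated limitation of this sketch.
    """
    gained = audio * 10.0 ** (gain_db / 20.0)
    read_idx = np.arange(0.0, len(gained) - 1, pitch_ratio)
    return np.interp(read_idx, np.arange(len(gained)), gained)
```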
In some instances, the replacement audio could be based upon determined characteristics of the audio that was being removed. For example, the verbal content of the isolated audio associated with a person 304 in a captured scene could be identified, converted into another language, and then reinserted into the scene. In another instance, the isolated audio information associated with one of the elements from the captured scene, such as a bird 308, could be altered to more closely correspond to audio information associated with another element from the captured scene, such as a dog 310, or vice versa. In such an instance, some of the characteristics of the original audio, such as audio pitch, could be preserved.
In still other instances, the adjustments to the audio information could track and/or correspond to adjustments being made to the visual information within a captured scene. For example, a person 304 in a scene could be made to look more like a ghost, where corresponding changes to the audio information could include the addition of an amount of reverb to make the same sound more ghost-like. It is further possible to alter the isolated audio so as to make it sound like it came from another point within the captured scene, where the location of the visual representation of the apparent source within the captured scene could also be adjusted. In such an instance, the audio could include an adjusted volume level and time delay to account for the change in location, as well as adjusted reverb.
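The volume and delay adjustments accompanying such a change of apparent location could be estimated from the old and new source distances; the sketch below uses free-field assumptions (inverse-distance level falloff, propagation at the speed of sound) and leaves the corresponding reverb adjustment out.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def relocate_source(audio, sample_rate, old_dist_m, new_dist_m):
    """Rescale and re-delay isolated audio for a new apparent distance.

    Level is scaled by the inverse-distance ratio, and the added
    propagation delay is the change in distance over the speed of sound;
    negative delays (a closer new position) are clamped to zero here for
    simplicity.
    """
    gain = old_dist_m / new_dist_m
    extra_delay_s = (new_dist_m - old_dist_m) / SPEED_OF_SOUND
    pad = max(0, int(round(extra_delay_s * sample_rate)))
    return np.concatenate([np.zeros(pad), gain * np.asarray(audio)])
```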
While the preferred embodiments have been illustrated and described, it is to be understood that the application is not so limited. Numerous modifications, changes, variations, substitutions and equivalents will occur to those skilled in the art without departing from the spirit and scope of the present application as defined by the appended claims.