1. Technical Field
Embodiments described herein relate generally to a method, non-transitory computer-readable storage medium, and system for audio-assisted optical focus setting adjustment in an image-capturing device. More particularly, embodiments of the present disclosure relate to a method, non-transitory computer-readable storage medium, and system for adjusting the optical focus setting of the image-capturing device to focus on a speaking person, based on audio from the speaking person.
2. Background
In a conference room or another environment with multiple people in attendance, several speakers may be seated at different locations around the room, and it is often difficult to determine where the current speaker is located. Especially in situations in which captured images of the conference room are viewed remotely, remote viewers may not have the same breadth and depth of experience as in-person attendees because they may be unable to ascertain which person is speaking.
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings.
Overview
According to one aspect of the present disclosure, an image-capturing device includes a receiver that receives distance and angular direction information that specifies an audio source position from a microphone array. The image-capturing device also includes a controller that determines whether to change an initial focal plane to a subsequent focal plane within a field of view of an image frame based on a detected change in the audio source position. The image-capturing device further includes a focus adjuster that adjusts an optical focus setting to change from the initial focal plane to the subsequent focal plane within the field of view to focus on at least one object-of-interest located at the audio source position, based on a position determination by the controller.
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings, and will herein be described in detail, specific examples with the understanding that the present disclosure is to be considered as an exemplification of the principles and is not intended to limit the invention to the specific examples shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.
The terms “a” or “an”, as used herein, are defined as one or more than one. The term “plurality”, as used herein, is defined as two or more than two. The term “another”, as used herein, is defined as at least a second or more. The terms “including” and/or “having”, as used herein, are defined as comprising (i.e., open language). The term “program” or “computer program” or similar terms, as used herein, is defined as a sequence of instructions designed for execution on circuitry of a computer system, whether in a single chassis or distributed amongst several devices. A “program”, or “computer program”, may include a subroutine, a program module, a script, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, source code, object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment”, “an implementation”, “an example” or similar terms means that a particular feature, structure, or characteristic described in connection with the example is included in at least one example of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more examples without limitation.
The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
Due to camera limitations, all participants at a first endpoint may be visible within an image frame, but they may not all fit within a region-of-interest specified by a current optical focus setting of an image-capturing device. For example, one participant may be located in a first focal plane of the camera, while another participant might be located in a different focal plane. To overcome this limitation, audio data sourced by a relevant target, e.g., a current speaker, is obtained and used to change the optical focus setting of the image-capturing device to a new optical focus setting that focuses on the relevant target. Thus, a viewer at another endpoint would see a focused image of the person speaking at the first endpoint, and then later a focused image of a second person at the first endpoint when that second person is the primary speaker.
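By way of non-limiting illustration, the following Python sketch shows one way in which a reported distance and angular direction are mappable to a pan angle and an optical focus distance. The geometry assumes the microphone array sits on the camera's optical axis at a known offset; all identifiers are hypothetical, and the sketch is illustrative rather than a definitive implementation.

    import math

    def focus_parameters(angle_deg, distance_m, array_offset_m=0.0):
        # angle_deg: horizontal angle of the audio source relative to the
        #     optical axis; distance_m: reported distance from the array.
        # array_offset_m: distance of the array in front of the camera
        #     along the optical axis (zero if co-located).
        # Project the audio source into camera coordinates.
        x = distance_m * math.sin(math.radians(angle_deg))
        z = distance_m * math.cos(math.radians(angle_deg)) + array_offset_m
        pan_deg = math.degrees(math.atan2(x, z))  # pan that centers the source
        focus_distance_m = math.hypot(x, z)       # focal plane at the source
        return pan_deg, focus_distance_m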
The above-described mappings are stored in storage 106 in the image-capturing device 100. These mappings specify a correspondence between the location, which is specified with respect to a room layout, and at a minimum, an indication of whether a face was previously detected at the location. The mappings are not limited to only specifying a correspondence with the indication; for example, an image of the detected face is storable in addition to or in place of the indication.
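As a non-limiting sketch (Python, with hypothetical identifiers), each mapping entry is keyable by a quantized room location and records, at a minimum, the face-detection indication, with an optional stored face image:

    from dataclasses import dataclass
    from typing import Optional, Tuple
    import numpy as np

    @dataclass
    class LocationRecord:
        face_detected: bool                      # indication: face seen here before
        face_image: Optional[np.ndarray] = None  # optional stored face crop

    def location_key(angle_deg: float, distance_m: float,
                     angle_step: float = 5.0,
                     dist_step: float = 0.5) -> Tuple[int, int]:
        # Quantize the reported position so that nearby readings of the
        # same speaker share a single entry in the mapping.
        return (round(angle_deg / angle_step), round(distance_m / dist_step))

    mappings: dict = {}  # location key -> LocationRecord, held in storage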
In one non-limiting example, the controller 104 determines that the pan-tilt-zoom setting must be changed and controls a pan-tilt-zoom controller 110 in the image-capturing device 100 to adjust this setting. The pan-tilt-zoom controller 110 changes the pan-tilt-zoom setting so as to include the audio source, e.g., the person, which is the source of the audio picked up by the microphone array, in a field of view (or image frame) of the image-capturing device. The controller 104 also determines that the optical focus setting must be changed and controls a focus adjuster 108 in the image-capturing device 100 to adjust this setting. The focus adjuster 108 adjusts the optical focus setting in order to focus on the audio source, e.g., the person, which is the source of the audio picked up by the microphone array.
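A minimal sketch of this determination logic follows, reusing focus_parameters from the earlier sketch; the camera object, its attributes, and the tolerance thresholds are hypothetical, the thresholds serving to prevent needless readjustment when the audio source has not meaningfully moved:

    def on_audio_position(camera, angle_deg, distance_m):
        # Determine whether the pan-tilt-zoom setting and the optical
        # focus setting must change for the reported source position.
        pan_deg, focus_m = focus_parameters(angle_deg, distance_m)
        if abs(pan_deg - camera.pan_deg) > camera.pan_tolerance_deg:
            camera.ptz.set_pan(pan_deg)                  # pan-tilt-zoom controller
        if abs(focus_m - camera.focus_m) > camera.focus_tolerance_m:
            camera.focus_adjuster.set_distance(focus_m)  # focus adjuster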
It should be noted that an image-capturing device implementing the speaker-assisted focusing method is not limited to the configuration shown in the drawings.
The image-capturing device 100 is implementable by one or more of the following including, but not limited to: a video camera, a cell phone, a digital still camera, a desktop computer, a laptop, and a touch screen device. The receiver 102, the controller 104, the focus adjuster 108, and the pan-tilt-zoom controller 110 are implementable or controllable by one or more of the following including, but not limited to: circuitry, a computer, and a programmable processor. Other examples of hardware and hardware/software combinations upon which these elements are implemented and by which these elements are controlled are described below. The storage 106 is implementable by, for example, a Random Access Memory (RAM). Other examples of storage are described below.
The video camera 202 uses this information to change its optical focus setting via a focus adjuster, for example, by adjusting an optical focus distance. Objects in a focal plane corresponding to an adjusted optical focus distance are “in focus” or “focused on.” These objects are objects-of-interest. The field of view 208 includes everything visible to the video camera 202 (i.e., everything “seen” by the video camera 202).
Instead of only one object-of-interest, multiple objects-of-interest are selectable; for example, two speakers engaged in a conversation may both be placed in focus.
Face Detection
In one non-limiting example, additional determinations are made prior to changing the field of view or the region-of-interest to include the object-of-interest. In some instances, the speaker's voice may reflect off surfaces in the room in which the video camera and microphone array are situated. To confirm that the picked-up audio corresponds to a speaker and not a reflection of the voice, a face detection process is performed. In addition to the field of view, region-of-interest, and object-of-interest determinations made above, a determination is made as to whether a face is detected at the location indicated by the microphone array. Detecting a face at the location confirms the existence of a speaker, rather than an audio reflection, and increases the accuracy of the speaker-assisted focusing system and method. As described above, facial detection is an exemplary detection methodology that is supplementable or replaceable with a detection process that detects a desired audio source, e.g., a person, using, for example, silhouettes, partial faces, upper bodies, and gaits.
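As a non-limiting sketch, such a face detection process is performable with a stock detector, for example OpenCV's Haar cascade; the region-of-interest argument is assumed to be the image region onto which the audio source position projects:

    import cv2

    _cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def face_at_location(frame_bgr, roi):
        # roi = (x, y, w, h): image region corresponding to the location
        # indicated by the microphone array. A detected face confirms a
        # speaker rather than an audio reflection.
        x, y, w, h = roi
        gray = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return len(faces) > 0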
Storing Speaker Location and Face Detection Mappings
In another non-limiting example, the video camera, or an external storage device, stores a predetermined number of mappings between locations in the room layout, i.e., speaker positions obtained based on information from the microphone array, and indications of detected faces. For example, when a speaker begins speaking and turns their head such that their face is not detectable, the video camera uses the mappings to “remember” that the microphone array previously indicated the location as a speaker position and that a face was previously detected at that location. Irrespective of the fact that a face cannot currently be detected, a speaker is determined to be likely at that location, instead of, for example, an audio reflection.
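One non-limiting way to bound the number of stored mappings is a least-recently-used cache, sketched below; the capacity and the eviction policy are illustrative choices rather than requirements:

    from collections import OrderedDict

    class SpeakerPositionCache:
        # Holds at most `capacity` location -> LocationRecord mappings,
        # evicting the least recently used entry when full.
        def __init__(self, capacity=32):
            self.capacity = capacity
            self._entries = OrderedDict()

        def remember(self, key, record):
            self._entries[key] = record
            self._entries.move_to_end(key)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # discard oldest entry

        def recall(self, key):
            # A hit means a face was previously detected at this position,
            # so a speaker, not a reflection, is likely there now.
            return self._entries.get(key)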
Facial and Speech Recognition
In another non-limiting example, subsequent to or in place of performing facial detection, the video camera or an external device performs facial recognition. Captured or detected faces are compared with pre-stored facial images in a database accessible by the video camera. In still another non-limiting example, the picked-up audio is used to perform speech recognition using pre-stored speech sequences in the database accessible by the video camera. These exemplary and additional levels of processing provide enhanced accuracy to the speaker-assisted focusing method. In yet another non-limiting example, identity information corresponding to the recognized face is displayed on the display screen, either along with or in place of the object-of-interest. For example, a corporate or government-issued identification photograph could be displayed on the display screen.
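By way of non-limiting illustration, facial recognition against the database is implementable as a nearest-neighbor search over face embeddings; the sketch assumes L2-normalized embedding vectors produced by any face-embedding model, and the similarity threshold is hypothetical:

    import numpy as np

    def identify(face_embedding, database, threshold=0.6):
        # database: iterable of (identity, embedding) pairs with
        # L2-normalized embeddings. Returns the best-matching identity,
        # or None when no stored face clears the threshold.
        best_id, best_sim = None, threshold
        for identity, stored in database:
            sim = float(np.dot(face_embedding, stored))  # cosine similarity
            if sim > best_sim:
                best_id, best_sim = identity, sim
        return best_id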
Profile Information
In one non-limiting example, the portion of the database searched by the video camera to find a matching face or speech sequence is constrained to conference attendees that are registered for a predetermined combination of date, time, and room location. Constraining the search in this way reduces the processing resources required to recognize faces or speech.
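A minimal sketch of this constraint follows; the profile object and its registration predicate are hypothetical:

    def candidate_profiles(database, date, time_slot, room):
        # Search only the profiles registered for this date, time, and
        # room, reducing the resources required for recognition.
        return [profile for profile in database
                if profile.is_registered(date, time_slot, room)]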
Gesture Detection
In one non-limiting embodiment, the region-of-interest is set so as to include a speaker that is currently speaking and is subsequently changed based on detecting gestures of the speaker. As a non-limiting example, the initial region-of-interest may focus on the speaker's face, and the subsequent region-of-interest may focus on a whiteboard upon which the speaker is writing; changing the region-of-interest to include the text written on the whiteboard could be triggered by any of the following, but not limited to: an arm motion, a hand motion, a mark made by a marker, and movement of an identifying tag (e.g., a radio frequency identifier tag) attached to the marker. As another non-limiting example, the speaker may be a lecturer using a laser pointer to designate certain areas on an overhead projector; changing the region-of-interest to include the area designated by the laser pointer could be triggered by any of the following, but not limited to: detection of a frequency associated with the laser pointer and detection of a color associated with the laser pointer. One such color-based trigger is sketched below.
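As a non-limiting sketch of the color-based trigger, a laser-pointer dot is detectable by thresholding in HSV color space; the bounds below are illustrative and depend on the pointer used:

    import cv2
    import numpy as np

    def laser_pointer_roi(frame_bgr, half=40):
        # Threshold on a bright, saturated red; returns a region-of-interest
        # centered on the detected dot, or None if no dot is visible.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, np.array((0, 120, 220)), np.array((10, 255, 255)))
        points = cv2.findNonZero(mask)
        if points is None:
            return None
        x, y = points.mean(axis=0).ravel().astype(int)
        return (x - half, y - half, 2 * half, 2 * half)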
Blurring Filter
In one non-limiting embodiment, one or more objects, excluding the objects-of-interest, are shown as being out of focus or “blurred” using, for example, a blurring filter. For example, two speakers that are engaged in a conversation may be shown in focus, while remaining attendees are blurred to prevent distraction. In another non-limiting embodiment, the portion of the object-of-interest that is not in the region-of-interest, for example, the user's body below the head, is not blurred.
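A minimal sketch of this step, blurring the whole frame with a Gaussian blurring filter and then restoring the regions-of-interest, follows; the kernel size is an illustrative choice:

    import cv2

    def blur_except(frame_bgr, rois, ksize=(31, 31)):
        # Blur everything, then copy the sharp objects-of-interest back
        # so only the excluded objects appear out of focus.
        out = cv2.GaussianBlur(frame_bgr, ksize, 0)
        for x, y, w, h in rois:
            out[y:y + h, x:x + w] = frame_bgr[y:y + h, x:x + w]
        return out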
Application Environments
While the above-described examples have been set forth with respect to focusing on speakers in an indoor room, tracking other objects-of-interest, for example, vehicles, sports players, and animals, each of which produces audio, is envisioned. Further, the present invention is not limited to being implemented indoors; the strength and accuracy of the microphone array, and optionally, attendant sensors, allow the present invention to be implemented in a variety of applications, including outdoor applications.
In a non-limiting example, the users 206a, 206b, 206c, 206d, 206e, 206f, 206g, 206h, 206i, 206j, 206k, and 206l are conference speakers or attendees that take turns speaking. In another non-limiting example, the users 206a, 206b, 206c, 206d, 206e, 206f, 206g, 206h, 206i, 206j, 206k, and 206l are distance learning students participating in and asking questions of a remotely located professor. In yet another non-limiting example, the users 206a, 206b, 206c, 206d, 206e, 206f, 206g, 206h, 206i, 206j, 206k, and 206l are talk show guests that ask questions of interviewees. In still another non-limiting example, the users 206a, 206b, 206c, 206d, 206e, 206f, 206g, 206h, 206i, 206j, 206k, and 206l are actors in a television show, e.g., a reality show.
Adjusting Frame Margins
In a non-limiting embodiment, image frame margins are dynamically adjusted based on a speaker position so as to frame the speaker, within the image frame, in a specified manner. The frame margins are adjusted to communicate the speaker's location within a room and to whom the speaker is speaking by shifting the speaker left or right in the image frame by a specified amount, which depends on a distance between the speaker and a predefined central axis.
In another non-limiting embodiment, the image frame margins are dynamically adjusted based on the direction that the speaker faces. The orientation of the speaker's head affects the horizontal framing of the speaker in the image frame; if a speaker looks away from the predefined central axis, the speaker is positioned in the image frame such that the frame margins include more space in front of the speaker's face.
In one non-limiting embodiment, the frame margins are automatically adjusted according to cinematic composition rules; this advantageously reduces the cognitive load on viewers, more closely conforms to viewers' expectations formed by television and film productions, and improves the overall quality of experience. In a non-limiting example, the composition rules may capture context associated with a whiteboard when a speaker addresses a video camera, while still tracking the speaker.
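As a non-limiting sketch, one common composition rule, placing the speaker on a rule-of-thirds line with lead room in front of the face, reduces the margin adjustment described above to a simple horizontal offset; the identifiers are hypothetical:

    def framing_offset(frame_w, speaker_x, facing_right, third=1.0 / 3.0):
        # A speaker facing right is placed on the left third line (and
        # vice versa), leaving lead room in the direction of the face.
        target_x = frame_w * (third if facing_right else 1.0 - third)
        return int(target_x - speaker_x)  # horizontal shift for the margins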
According to one example, the CPU 1002 loads a program stored in the recording portion 1016 into the RAM 1006 via the input-output interface 1010 and the bus 1008, and then executes the program to provide one or a combination of the functions of the video camera 202 and the microphone array 204, such as the determination processing.
Those skilled in the art will recognize, upon consideration of the above teachings, that certain of the above examples, for example using the video camera 202 and the microphone array 204, are based upon use of a programmed processor. However, examples of the present disclosure are not limited to such examples, since other examples could be implemented using hardware component equivalents such as special purpose hardware and/or dedicated processors. Similarly, general purpose computers, microprocessor based computers, micro-controllers, optical computers, analog computers, dedicated processors, application specific circuits and/or dedicated hard wired logic may be used to construct alternative equivalent examples.
Those skilled in the art will appreciate, upon consideration of the above teachings, that the operations and processes, such as those performed by the video camera 202 and the microphone array 204, and associated data used to implement certain of the examples described above can be implemented using disc storage as well as other forms of storage such as non-transitory storage devices including, for example, Read Only Memory (ROM) devices, Random Access Memory (RAM) devices, network memory devices, optical storage elements, magnetic storage elements, magneto-optical storage elements, flash memory, core memory and/or other equivalent volatile and non-volatile storage technologies without departing from certain examples of the present disclosure. The term non-transitory does not suggest that information cannot be lost by virtue of removal of power or other actions. Such alternative storage devices should be considered equivalents.
Certain examples described herein are or may be implemented using one or more programmed processors executing programming instructions that are broadly described above in flow chart form and that can be stored on any suitable electronic or computer-readable storage medium. However, those skilled in the art will appreciate, upon consideration of the present disclosure, that the processes described above can be implemented in any number of variations and in many suitable programming languages without departing from examples of the present disclosure. For example, the order of certain operations carried out can often be varied, additional operations can be added, or operations can be deleted without departing from certain examples of the disclosure. Such variations are contemplated and considered equivalent.
While certain illustrative examples have been described, it is evident that many alternatives, modifications, permutations and variations will become apparent to those skilled in the art in light of the foregoing description.