Adding a speech balloon (speech bubble, dialog balloon, word balloon, thought balloon, etc.) to an image of an object (e.g., person, place, or thing) is a popular pastime. There are web applications that enable users to upload images (e.g., photographs) and manually add speech balloons to them. In one photo tagging application, users can add quotes through speech balloons to photographs within an existing photo album. Certain devices (e.g., cameras, mobile telephones, etc.) use cameras and microphones to record images and/or video clips. However, other than using the web applications described above, these devices are unable to create speech balloons for the images and/or video clips captured by the devices.
According to one aspect, a method may include capturing, by a device, an image of an object; recording, in a memory of the device, audio associated with the object; determining, by a processor of the device and when the object is a person, a location of the person's head in the captured image; translating, by the processor, the audio into text; creating, by the processor, a speech balloon that includes the text; and positioning, by the processor, the speech balloon adjacent to the location of the person's head in the captured image to create a final image.
Additionally, the method may further include displaying the final image on a display of the device, and storing the final image in the memory of the device.
Additionally, the method may further include recording, when the object is an animal, audio provided by a user of the device, determining a location of the animal's head in the captured image, translating the audio provided by the user into text, creating a speech balloon that includes the text translated from the audio provided by the user, and positioning the speech balloon, that includes the text translated from the audio provided by the user, adjacent to the location of the animal's head in the captured image to create an image.
Additionally, the method may further include recording, when the object is an inanimate object, audio provided by a user of the device, translating the audio provided by the user into user-provided text, and associating the user-provided text with the captured image to create a user-defined image.
Additionally, the method may further include analyzing, when the object includes multiple persons, video of the multiple persons to determine mouth movements of each person; comparing the audio to the mouth movements of each person to determine portions of the audio that are associated with each person; translating the audio portions, associated with each person, into text portions; creating, for each person, a speech balloon that includes a text portion associated with each person; determining a location of each person's head based on the captured image; and positioning each speech balloon with a corresponding location of each person's head to create a final multiple person image.
Additionally, the method may further include analyzing the audio to determine portions of the audio that are associated with each person.
Additionally the audio may be provided in a first language and translating the audio into text may further include translating the audio into text provided in a second language that is different than the first language.
Additionally, the method may further include capturing a plurality of images of the object; creating a plurality of speech balloons, where each of the plurality of speech balloons includes a portion of the text; and associating each of the plurality of speech balloons with a corresponding one of the plurality of images to create a time-ordered image.
Additionally, the method may further include recording audio provided by a user of the device; translating the audio provided by the user into user-provided text; creating a thought balloon that includes the user-provided text; and positioning the thought balloon adjacent to the location of the person's head in the captured image to create a thought balloon image.
Additionally, the device may include at least one of a radiotelephone, a personal communications system (PCS) terminal, a camera, a video camera with camera capabilities, binoculars, or video glasses.
According to another aspect, a device may include a memory to store a plurality of instructions, and a processor to execute instructions in the memory to capture an image of an object, record audio associated with the object, determine, when the object is a person, a location of the person's head in the captured image, translate the audio into text, create a speech balloon that includes the text, position the speech balloon adjacent to the location of the person's head in the captured image to create a final image, and display the final image on a display of the device.
Additionally, the processor may further execute instructions in the memory to store the final image in the memory.
Additionally, the processor may further execute instructions in the memory to record, when the object is an animal, audio provided by a user of the device, determine a location of the animal's head in the captured image, translate the audio provided by the user into text, create a speech balloon that includes the text translated from the audio provided by the user, and position the speech balloon, that includes the text translated from the audio provided by the user, adjacent to the location of the animal's head in the captured image to create an image.
Additionally, the processor may further execute instructions in the memory to record, when the object is an inanimate object, audio provided by a user of the device, translate the audio provided by the user into user-provided text, and associate the user-provided text with the captured image to create a user-defined image.
Additionally, the processor may further execute instructions in the memory to analyze, when the object includes multiple persons, video of the multiple persons to determine mouth movements of each person, compare the audio to the mouth movements of each person to determine portions of the audio that are associated with each person, translate the audio portions, associated with each person, into text portions, create, for each person, a speech balloon that includes a text portion associated with each person, determine a location of each person's head based on the captured image, and position each speech balloon with a corresponding location of each person's head to create a final multiple person image.
Additionally, the processor may further execute instructions in the memory to analyze the audio to determine portions of the audio that are associated with each person.
Additionally, the audio may be provided in a first language and, when translating the audio into text, the processor may further execute instructions in the memory to translate the audio into text provided in a second language that is different than the first language.
Additionally, the processor may further execute instructions in the memory to capture a plurality of images of the object, create a plurality of speech balloons, where each of the plurality of speech balloons includes a portion of the text, and associate each of the plurality of speech balloons with a corresponding one of the plurality of images to create a time-ordered image.
Additionally, the processor may further execute instructions in the memory to record audio provided by a user of the device, translate the audio provided by the user into user-provided text, create a thought balloon that includes the user-provided text, and position the thought balloon adjacent to the location of the person's head in the captured image to create a thought balloon image.
According to yet another aspect, a device may include means for capturing an image of an object; means for recording audio associated with the object; means for determining, when the object is a person, a location of the person's head in the captured image; means for translating the audio into text; means for creating a speech balloon that includes the text; means for positioning the speech balloon adjacent to the location of the person's head in the captured image to create a final image; means for displaying the final image; and means for storing the final image.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate one or more implementations described herein and, together with the description, explain these implementations. In the drawings:
The following detailed description refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.
Systems and/or methods described herein may provide a device that performs voice-controlled image editing. For example, an exemplary arrangement as shown in
Device 110 may capture an image 140 of subjects 120/130 and may record audio associated with subjects 120/130 when image 140 is captured by device 110. Device 110 may capture and analyze video of subjects 120/130 to determine mouth movements of first subject 120 and mouth movements of second subject 130, and may compare the recorded audio to the mouth movements to determine portions of the audio that are associated with first subject 120 and second subject 130. Device 110 may translate the audio portions into text portions associated with each of subjects 120/130, may create a first speech balloon 150 that includes text associated with first subject 120, and may create a second speech balloon 160 that includes text associated with second subject 130. Device 110 may determine locations of the heads of subjects 120/130, may position first speech balloon 150 with the location of first subject's 120 head, and may position second speech balloon 160 with the location of second subject's 130 head to create a final version of image 140. Device 110 may also display and/or store the final version of image 140.
The description that follows describes a device. As used herein, a “device” may include a radiotelephone; a personal communications system (PCS) terminal that may combine a cellular radiotelephone with data processing, facsimile, and data communications capabilities; a personal digital assistant (PDA) that can include a radiotelephone, pager, Internet/intranet access, web browser, organizer, calendar, a Doppler receiver, and/or global positioning system (GPS) receiver; a laptop; a GPS device; a personal computer; a camera (e.g., contemporary camera or digital camera); a video camera (e.g., a camcorder with camera capabilities); binoculars; a telescope; and/or any other device capable of utilizing a camera.
As used herein, a “camera” may include a device that may capture and store images and/or video. For example, a digital camera may be an electronic device that may capture and store images and/or video electronically instead of using photographic film as in contemporary cameras. A digital camera may be multifunctional, with some devices capable of recording sound and/or video, as well as images.
Lens 220 may include a mechanically, electrically, and/or electromechanically controlled assembly of lens(es) whose focal length may be changed, as opposed to a prime lens, which may have a fixed focal length. Lens 220 may include “zoom lenses” that may be described by the ratio of their longest and shortest focal lengths. Lens 220 may work in conjunction with an autofocus system (not shown) that may enable lens 220 to obtain the correct focus on a subject, instead of requiring a user of device 200 to manually adjust the focus. The autofocus system may rely on one or more autofocus sensors (not shown) to determine the correct focus. The autofocus system may permit manual selection of the sensor(s), and may offer automatic selection of the autofocus sensor(s) using algorithms which attempt to discern the location of the subject. The data collected from the autofocus sensors may be used to control an electromechanical system that may adjust the focus of the optical system.
Flash unit 230 may include any type of flash unit used in cameras. For example, in one implementation, flash unit 230 may include a light-emitting diode (LED)-based flash unit (e.g., a flash unit with one or more LEDs). In other implementations, flash unit 230 may include a flash unit built into device 200; a flash unit separate from device 200; an electronic xenon flash lamp (e.g., a tube filled with xenon gas, where electricity of high voltage is discharged to generate an electrical arc that emits a short flash of light); a microflash (e.g., a special, high-voltage flash unit designed to discharge a flash of light with a sub-microsecond duration); etc.
Viewfinder 240 may include a window that a user of device 200 may look through to view and/or focus on a subject. For example, viewfinder 240 may include an optical viewfinder (e.g., a reversed telescope); an electronic viewfinder (e.g., a cathode ray tube (CRT), liquid crystal display (LCD), or an organic light-emitting diode (OLED) based display that may be used as a viewfinder and/or to replay previously captured material); or a combination of the aforementioned.
Button 250 may include a mechanical or electromechanical button that may be used to capture an image of the subject by device 200. If the user of device 200 engages button 250, device 200 may engage lens 220 (and the autofocus system) and flash unit 230 in order to capture an image of the subject with device 200.
Although
Display 330 may provide visual information to the user. For example, display 330 may provide information regarding incoming or outgoing calls, media, games, phone books, the current time, etc. In another example, display 330 may provide an electronic viewfinder, e.g., a cathode ray tube (CRT), liquid crystal display (LCD), or an organic light-emitting diode (OLED) based display that a user of device 300 may look through to view and/or focus on a subject and/or to replay previously captured material.
Control buttons 340 may permit the user to interact with device 300 to cause device 300 to perform one or more operations. For example, control buttons 340 may be used to capture an image of the subject by device 300 in a similar manner as button 250 of device 200. Keypad 350 may include a standard telephone keypad. Microphone 360 may receive audible information from the user and/or a subject to be captured by device 300.
As shown in
Although
Processing unit 410 may include one or more processors, microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or the like. Processing unit 410 may control operation of device 200/300 and its components.
Memory 420 may include a random access memory (RAM), a read only memory (ROM), and/or another type of memory to store data and instructions that may be used by processing unit 410.
User interface 430 may include mechanisms for inputting information to device 200/300 and/or for outputting information from device 200/300. Examples of input and output mechanisms might include a speaker (e.g., speaker 320) to receive electrical signals and output audio signals; a camera lens (e.g., lens 220 or camera lens 370) to receive image and/or video signals and output electrical signals; a microphone (e.g., microphones 360 or 390) to receive audio signals and output electrical signals; buttons (e.g., a joystick, button 250, control buttons 340, or keys of keypad 350) to permit data and control commands to be input into device 200/300; a display (e.g., display 330) to output visual information (e.g., image and/or video information received from camera lens 370); and/or a vibrator to cause device 200/300 to vibrate.
Communication interface 440 may include, for example, a transmitter that may convert baseband signals from processing unit 410 to radio frequency (RF) signals and/or a receiver that may convert RF signals to baseband signals. Alternatively, communication interface 440 may include a transceiver to perform functions of both a transmitter and a receiver. Communication interface 440 may connect to antenna assembly 450 for transmission and/or reception of the RF signals.
Antenna assembly 450 may include one or more antennas to transmit and/or receive RF signals over the air. Antenna assembly 450 may, for example, receive RF signals from communication interface 440 and transmit them over the air and receive RF signals over the air and provide them to communication interface 440. In one implementation, for example, communication interface 440 may communicate with a network (e.g., a local area network (LAN), a wide area network (WAN), a telephone network, such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, or a combination of networks).
As described herein, device 200/300 may perform certain operations in response to processing unit 410 executing software instructions contained in a computer-readable medium, such as memory 420. A computer-readable medium may be defined as a physical or logical memory device. A logical memory device may include memory space within a single physical memory device or spread across multiple physical memory devices. The software instructions may be read into memory 420 from another computer-readable medium or from another device via communication interface 440. The software instructions contained in memory 420 may cause processing unit 410 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
Although
Device 200/300 may translate recorded audio 510 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 510 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 510 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440). The speech recognition software may include any software that converts spoken words to machine-readable input (e.g., text). Examples of speech recognition software may include “Voice on the Go,” “Vorero” provided by Asahi Kasei, “WebSphere Voice Server” provided by IBM, “Microsoft Speech Server,” etc.
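By way of illustration only, the following sketch shows one way the translation of recorded audio into text could be performed in software. The description does not prescribe a particular engine or programming environment; the Python "SpeechRecognition" package, the web recognizer it wraps, and the file name "recorded_audio.wav" are assumptions made for this example.

```python
# Illustrative sketch only: assumes the third-party "SpeechRecognition" package
# and a hypothetical WAV file containing recorded audio 510.
import speech_recognition as sr

def audio_clip_to_text(wav_path: str) -> str:
    """Translate a recorded audio clip into text with a speech recognition engine."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)          # read the whole clip
    try:
        return recognizer.recognize_google(audio)  # send to a remote recognizer
    except sr.UnknownValueError:
        return ""                                  # speech could not be understood

if __name__ == "__main__":
    print(audio_clip_to_text("recorded_audio.wav"))
```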
Device 200/300 may use face detection software to determine a location of first subject's 120 head in captured image 520. In one implementation, face detection may be performed on captured image 520 with face detection software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, face detection may be performed on captured image 520 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440). The face detection software may include any face detection technology that determines locations and sizes of faces in images, detects facial features, and ignores anything else (e.g., buildings, trees, bodies, etc.).
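For illustration, a head in a captured image could be located with an off-the-shelf detector. The sketch below assumes OpenCV's bundled Haar cascade and a hypothetical image file name; the description itself does not name a face detection library.

```python
# Illustrative sketch only: assumes OpenCV and a hypothetical captured image file.
import cv2

def locate_heads(image_path: str):
    """Return (x, y, w, h) rectangles for faces found in the captured image."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # Faces are returned as rectangles; anything else in the image (buildings,
    # trees, bodies, etc.) is simply not reported by the detector.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if __name__ == "__main__":
    for (x, y, w, h) in locate_heads("captured_image.jpg"):
        print(f"head at x={x}, y={y}, width={w}, height={h}")
```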
Device 200/300 may create a speech balloon 530 that includes the translated text of recorded audio 510. Based on the determined location of first subject's 120 head in captured image 520, device 200/300 may position speech balloon 530 adjacent to first subject's 120 head in captured image 520. In one implementation, the user of device 200/300 may manually re-position speech balloon 530 in relation to captured image 520, and/or may manually edit text provided in speech balloon 530. Device 200/300 may combine the positioned speech balloon 530 and captured image 520 of first subject 120 to form a final image 540. Device 200/300 may display image 540 (e.g., via display 330) and/or may store image 540 (e.g., in memory 420).
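A minimal sketch of creating a speech balloon and positioning it adjacent to a detected head might look as follows. The Pillow imaging library, the balloon geometry, and the file names are assumptions made for this example, not part of the described implementation.

```python
# Illustrative sketch only: draws a simple balloon with Pillow next to a detected
# head rectangle and saves the combined result as a "final image".
from PIL import Image, ImageDraw

def add_speech_balloon(image_path: str, text: str, head_box, out_path: str) -> None:
    x, y, w, h = head_box                       # head rectangle from face detection
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)

    # Place the balloon above and to the right of the head.
    bx0, by0 = x + w, max(0, y - 60)
    bx1, by1 = bx0 + 220, by0 + 50
    draw.ellipse((bx0, by0, bx1, by1), fill="white", outline="black")
    # Tail pointing back toward the speaker's head.
    draw.polygon([(bx0 + 20, by1 - 5), (bx0 + 45, by1 - 5), (x + w, y + h // 2)],
                 fill="white", outline="black")
    draw.text((bx0 + 15, by0 + 18), text, fill="black")

    img.save(out_path)                           # the "final image"

if __name__ == "__main__":
    add_speech_balloon("captured_image.jpg", "Hi!", (120, 80, 90, 110), "final_image.jpg")
```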
Although
Audio to text translator 600 may include any hardware or combination of hardware and software that may receive recorded audio 510 (e.g., from first subject 120), and may translate recorded audio 510 (e.g., the audio clip) into text 630 (e.g., of recorded audio 510) using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 510 with speech recognition software provided in device 200/300 (e.g., via audio to text translator 600). In another implementation, speech recognition may be performed on recorded audio 510 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440). Audio to text translator 600 may provide text 630 to image/speech balloon generator 620.
Image analyzer 610 may include any hardware or combination of hardware and software that may receive captured image 520 (e.g., of first subject 120), and may use face detection software to determine a location 640 of first subject's 120 head in captured image 520. In one implementation, face detection may be performed on captured image 520 with face detection software provided in device 200/300 (e.g., via image analyzer 610). In another implementation, face detection may be performed on captured image 520 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440). Image analyzer 610 may provide location 640 of first subject's 120 head in captured image 520 to image/speech balloon generator 620.
Image/speech balloon generator 620 may include any hardware or combination of hardware and software that may receive text 630 from audio to text translator 600, may receive location 640 from image analyzer 610, and may create speech balloon 530 that includes text 630. Based on location 640, image/speech balloon generator 620 may position speech balloon 530 adjacent to first subject's 120 head in captured image 520. Image/speech balloon generator 620 may combine the positioned speech balloon 530 and captured image 520 of first subject 120 to generate final image 540.
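Tying the three components together, a hypothetical pipeline might pass the translated text and the detected head location to a balloon generator. The sketch below assumes the helper functions from the preceding examples (audio_clip_to_text, locate_heads, add_speech_balloon) are defined in the same module.

```python
# Illustrative sketch only: a minimal composition of the three components
# (audio to text translator 600, image analyzer 610, image/speech balloon
# generator 620), reusing the hypothetical helpers sketched above.
def generate_final_image(wav_path: str, image_path: str, out_path: str) -> None:
    text = audio_clip_to_text(wav_path)        # audio to text translator (600)
    heads = locate_heads(image_path)           # image analyzer (610)
    if len(heads) == 0:
        raise ValueError("no head detected in the captured image")
    # image/speech balloon generator (620): combine text 630 and location 640
    add_speech_balloon(image_path, text, tuple(heads[0]), out_path)

generate_final_image("recorded_audio.wav", "captured_image.jpg", "final_image.jpg")
```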
Although
If more than a single person (e.g., subjects 120/130) is present in image 720 captured by device 200/300 and subjects 120/130 are both speaking, device 200/300 may need to identify which portions of recorded audio 710 are attributable to each of subjects 120/130. In order to achieve this, in one implementation, device 200/300 may analyze video (or multiple captured images) of subjects 120/130 to determine mouth movements of subjects 120/130, and may compare recorded audio 710 to the mouth movements to determine which portions of recorded audio 710 are attributable to each of subjects 120/130. In another implementation, device 200/300 may analyze recorded audio 710 to determine differences in voices of subjects 120/130, and may use this information to determine which portions of recorded audio 710 are attributable to each of subjects 120/130. In still another implementation, device 200/300 may include one or more directional microphones that may be used to determine which portions of recorded audio 710 are attributable to each of subjects 120/130. In still a further implementation, device 200/300 may utilize a combination of the aforementioned techniques to determine which portions of recorded audio 710 are attributable to each of subjects 120/130.
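One possible, purely illustrative way to compare recorded audio to mouth movements is to attribute each voiced frame to the subject whose mouth is moving the most at that moment. The per-frame mouth-opening measurements and audio energy values below are hypothetical inputs (e.g., from a facial landmark detector and a simple energy detector), not quantities specified by the description.

```python
# Illustrative sketch only: attributes each voiced video frame to the subject whose
# mouth movement is largest at that frame.
import numpy as np

def attribute_audio_frames(audio_energy, mouth_openings, energy_threshold=0.1):
    """Return, per frame, the index of the speaking subject, or -1 if silent."""
    audio_energy = np.asarray(audio_energy)                  # shape: (frames,)
    mouth_openings = np.asarray(mouth_openings)              # shape: (subjects, frames)
    # Frame-to-frame change in mouth opening approximates mouth movement.
    movement = np.abs(np.diff(mouth_openings, axis=1, prepend=mouth_openings[:, :1]))
    speaker = movement.argmax(axis=0)                        # most active mouth per frame
    speaker[audio_energy < energy_threshold] = -1            # no speech in quiet frames
    return speaker

# Example: subject 0 speaks during the first three frames, subject 1 afterwards,
# and the last frame is silent.
energy = [0.8, 0.9, 0.7, 0.6, 0.8, 0.05]
mouths = [[0.2, 0.6, 0.1, 0.1, 0.1, 0.1],    # first subject 120
          [0.1, 0.1, 0.1, 0.5, 0.9, 0.1]]    # second subject 130
print(attribute_audio_frames(energy, mouths))   # -> [ 0  0  0  1  1 -1]
```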
Device 200/300 may translate recorded audio 710 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 710 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 710 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440). Device 200/300 may create a speech balloon 730 that includes the translated text of the portion of recorded audio 710 that is attributable to first subject 120, and may create a speech balloon 740 that includes the translated text of the portion of recorded audio 710 that is attributable to second subject 130.
Device 200/300 may use face detection software to determine a location of each subject's 120/130 head in captured image 720. In one implementation, face detection may be performed on captured image 720 with face detection software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, face detection may be performed on captured image 720 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Based on the determined location of first subject's 120 head in captured image 720, device 200/300 may position speech balloon 730 adjacent to first subject's 120 head in captured image 720. Based on the determined location of second subject's 130 head in captured image 720, device 200/300 may position speech balloon 740 adjacent to second subject's 130 head in captured image 720. Device 200/300 may arrange speech balloons 730/740 according to a time order that the text provided in speech balloons 730/740 is spoken by subjects 120/130. For example, if first subject 120 spoke the text “How's it going today?” (e.g., provided in speech balloon 730) before second subject 130 spoke the text “Good. How are you?” (e.g., provided in speech balloon 740), then device 200/300 may arrange speech balloon 730 to the left (or on top) of speech balloon 740 in order to show the correct time order.
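A minimal sketch of this time ordering might sort the balloons by the time their text was spoken and assign left-to-right slots accordingly; the start times below are hypothetical.

```python
# Illustrative sketch only: earlier speech gets the lower slot index, i.e. is
# placed further to the left (or on top) in the final image.
def order_balloons(balloons):
    """Sort (start_time, text) balloons and assign left-to-right slots."""
    ordered = sorted(balloons, key=lambda b: b[0])
    return [(slot, text) for slot, (_, text) in enumerate(ordered)]

balloons = [(3.2, "Good. How are you?"),       # second subject 130
            (1.0, "How's it going today?")]    # first subject 120
print(order_balloons(balloons))
# -> [(0, "How's it going today?"), (1, 'Good. How are you?')]
```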
In one implementation, the user of device 200/300 may manually re-position speech balloons 730/740 in relation to captured image 720, and/or may manually edit text provided in speech balloons 730/740. Device 200/300 may combine the positioned speech balloons 730/740 and captured image 720 of subjects 120/130 to form a final image 750. Device 200/300 may display image 750 (e.g., via display 330) and/or may store image 750 (e.g., in memory 420).
Although
Audio to text translator 600 may receive recorded audio 710 (e.g., from subjects 120/130), and may translate recorded audio 710 (e.g., the audio clip) into text 800 (e.g., of recorded audio 710) associated with first subject 120 and text 810 (e.g., of recorded audio 710) associated with second subject 130. Audio to text translator 600 may provide text 800 and text 810 to image/speech balloon generator 620.
Image analyzer 610 may receive recorded audio 710 and video 820 of subjects 120/130, may analyze video 820 to determine mouth movements of subjects 120/130, and may compare recorded audio 710 to the mouth movements to determine which portions of recorded audio 710 are attributable to each of subjects 120/130. Image analyzer 610 may analyze recorded audio 710 to determine differences in voices of subjects 120/130, and may use this information to determine which portions of recorded audio 710 are attributable to each of subjects 120/130. Image analyzer 610 may use face detection software to determine locations of subjects' 120/130 heads in captured image 720, and may combine the head location information with the determined portions of recorded audio 710 attributable to each of subjects 120/130 to produce audio/first subject match information 830 and audio/second subject match information 840. Image analyzer 610 may provide information 830 and 840 to image/speech balloon generator 620.
Image/speech balloon generator 620 may receive text 800/810 from audio to text translator 600, and may receive information 830/840 from image analyzer 610. Image/speech balloon generator 620 may position speech balloon 730 adjacent to first subject's 120 head in captured image 720, based on the determined location of first subject's 120 head in captured image 720. Image/speech balloon generator 620 may position speech balloon 740 adjacent to second subject's 130 head in captured image 720, based on the determined location of second subject's 130 head in captured image 720. Image/speech balloon generator 620 may combine the positioned speech balloons 730/740 and captured image 720 of subjects 120/130 to form final image 750.
Although
Device 200/300 may translate recorded audio 930 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 930 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 930 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Device 200/300 may use face detection software to determine a location of animal's 910 head in captured image 940. In one implementation, face detection may be performed on captured image 940 with face detection software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, face detection may be performed on captured image 940 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Device 200/300 may create a speech balloon 950 that includes the translated text of recorded audio 930. Based on the determined location of animal's 910 head in captured image 940, device 200/300 may position speech balloon 950 adjacent to animal's 910 head in captured image 940. In one implementation, user 920 may manually re-position speech balloon 950 in relation to captured image 940, and/or may manually edit text provided in speech balloon 950. Device 200/300 may combine the positioned speech balloon 950 and captured image 940 of animal 910 to form a final image 960. Device 200/300 may display image 960 (e.g., via display 330) and/or may store image 960 (e.g., in memory 420).
Although
Device 200/300 may translate recorded audio 1030 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 1030 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 1030 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440). Device 200/300 may use face detection software to determine a location of a head in captured image 1040. However, since object 1010 does not have a head, device 200/300 may not detect a head in captured image 1040.
If no head is detected in captured image 1040, device 200/300 may create a title 1050 (e.g., for captured image 1040) that includes the translated text of recorded audio 1030. Device 200/300 may position title 1050 adjacent to object 1010 in captured image 1040 (e.g., as a title). In one implementation, user 1020 may manually re-position title 1050 in relation to captured image 1040, and/or may manually edit text provided in title 1050. Device 200/300 may combine the positioned title 1050 and captured image 1040 of object 1010 to form a final image 1060. Device 200/300 may display image 1060 (e.g., via display 330) and/or may store image 1060 (e.g., in memory 420).
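For illustration, the fallback from a speech balloon to a title could be a simple conditional on the face detector's output. OpenCV and the file names are assumptions made for this sketch.

```python
# Illustrative sketch only: place the translated text as a title when no head is
# detected (e.g., an inanimate object); otherwise place it next to the first head.
import cv2

def caption_or_balloon_text(image_path: str, text: str, out_path: str) -> None:
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        # No head detected: associate the text with the image as a title.
        cv2.putText(image, text, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    else:
        # A head was found: place the text adjacent to it, as in the earlier examples.
        x, y, w, h = faces[0]
        cv2.putText(image, text, (x + w, max(20, y)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 0), 2)
    cv2.imwrite(out_path, image)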
Although
Device 200/300 may attempt to identify which portions of recorded audio 1110 are attributable to each of subjects 120/130. In one implementation, device 200/300 may analyze video (or multiple captured images) of subjects 120/130 to determine mouth movements of subjects 120/130 and may compare recorded audio 1110 to the mouth movements to determine which portions of recorded audio 1110 are attributable to each of subjects 120/130. In another implementation, device 200/300 may analyze recorded audio 1110 to determine differences in voices of subjects 120/130, and may use this information to determine which portions of recorded audio 1110 are attributable to each of subjects 120/130. In still another implementation, device 200/300 may utilize a combination of the aforementioned techniques to determine which portions of recorded audio 1110 are attributable to each of subjects 120/130.
Device 200/300 may translate recorded audio 1110 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 1110 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 1110 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440). If device 200/300 is unable to identify which portions of recorded audio 1110 are attributable to each of subjects 120/130, device 200/300 may create a subtitle 1130 that includes the translated text of recorded audio 1110. Subtitle 1130 may also be provided even if device 200/300 is able to identify which portions of recorded audio 1110 are attributable to each of subjects 120/130. Subtitle 1130 may display the translated text of recorded audio 1110 without the need to identify which portions of recorded audio 1110 are attributable to each of subjects 120/130. Subtitle 1130 may provide real-time translation of audio 1110 and may be used with video glasses (e.g., described below in connection with
If device 200/300 is unable to identify which portions of recorded audio 1110 are attributable to each of subjects 120/130, device 200/300 may position subtitle 1130 adjacent to (e.g., below) subjects 120/130 in captured image 1120. In one implementation, the user of device 200/300 may manually re-position subtitle 1130 in relation to captured image 1120, and/or may manually edit text provided in subtitle 1130. Device 200/300 may combine the positioned subtitle 1130 and captured image 1120 of subjects 120/130 to form a final image 1140. Device 200/300 may display image 1140 (e.g., via display 330) and/or may store image 1140 (e.g., in memory 420).
Although
Device 200/300 may translate recorded audio 1240 (e.g., the audio clip) into text using speech recognition software. In one implementation, speech recognition may be performed on recorded audio 1240 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition may be performed on recorded audio 1240 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Device 200/300 may use face detection software to determine a location of subject's 1210 head in captured image 1250. In one implementation, face detection may be performed on captured image 1250 with face detection software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, face detection may be performed on captured image 1250 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Device 200/300 may create a thought balloon 1260 (e.g., based on voice command 1230) that includes the translated text of recorded audio 1240. Based on the determined location of subject's 1210 head in captured image 1250, device 200/300 may position thought balloon 1260 adjacent to subject's 1210 head in captured image 1250. In one implementation, user 1220 may manually re-position thought balloon 1260 in relation to captured image 1250, and/or may manually edit text provided in thought balloon 1260. Device 200/300 may combine the positioned thought balloon 1260 and captured image 1250 of subject 1210 to form a final image 1270. Device 200/300 may display image 1270 (e.g., via display 330) and/or may store image 1270 (e.g., in memory 420).
Although
Device 200/300 may translate recorded audio 1310 (e.g., the audio clip) into text, in a second language (e.g., English), using speech recognition software. In one implementation, speech recognition and language translation may be performed on recorded audio 1310 with speech recognition software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, speech recognition and language translation may be performed on recorded audio 1310 with speech recognition software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
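A sketch of this two-step process (recognition in a first language followed by translation into a second language) is shown below. The SpeechRecognition package and the Spanish language code are assumptions, and translate_to_english is a hypothetical placeholder standing in for whatever machine-translation service the device, or a device communicating with it, would actually provide.

```python
# Illustrative sketch only: recognize speech in a first language (Spanish assumed)
# and render the balloon text in a second language (English).
import speech_recognition as sr

def translate_to_english(text: str) -> str:
    # Hypothetical placeholder: call a machine-translation service here.
    return text

def balloon_text_in_second_language(wav_path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    spanish_text = recognizer.recognize_google(audio, language="es-ES")
    return translate_to_english(spanish_text)
```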
Device 200/300 may use face detection software to determine a location of first subject's 120 head in captured image 1320. In one implementation, face detection may be performed on captured image 1320 with face detection software provided in device 200/300 (e.g., via processing unit 410 and memory 420 of device 200/300). In another implementation, face detection may be performed on captured image 1320 with face detection software provided on a device communicating with device 200/300 (e.g., via communication interface 440).
Device 200/300 may create a speech balloon 1330, in the second language (e.g., English), that includes the translated text (e.g., “Barcelona? It costs 20 Euro. Hurry the train is leaving!”) of recorded audio 1310. Based on the determined location of first subject's 120 head in captured image 1320, device 200/300 may position speech balloon 1330 adjacent to first subject's 120 head in captured image 1320. In one implementation, the user of device 200/300 may manually re-position speech balloon 1330 in relation to captured image 1320, and/or may manually edit text provided in speech balloon 1330. Device 200/300 may combine the positioned speech balloon 1330 and captured image 1320 of first subject 120 to form a final image 1340. Device 200/300 may display image 1340 (e.g., via display 330) and/or may store image 1340 (e.g., in memory 420).
There may be some delay when interpreting and translating recorded audio 1310 before speech balloon 1330 (or a subtitle) is displayed by device 200/300. Such a delay may be diminished by displaying portions of recorded audio 1310 as they are translated (e.g., rather than waiting for a complete translation of recorded audio 1310). For example, device 200/300 may display a word of recorded audio 1310 as soon as it is interpreted (and translated), rather than waiting for a complete sentence or a portion of a sentence to be interpreted (and translated). In such an arrangement, device 200/300 may display words with almost no delay and the user may begin interpreting recorded audio 1310. When a complete sentence or a portion of a sentence has been interpreted (and translated) by device 200/300, device 200/300 may rearrange the words to display a grammatically correct sentence or portion of a sentence. Device 200/300 may display interpreted (and translated) text in multiple lines, and may scroll upward or fade out previous lines of text as new recorded audio 1310 is received, interpreted, and displayed by device 200/300.
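The incremental display described above could be sketched as follows; the word stream and the finalize step are hypothetical stand-ins for the incremental recognizer/translator output and the grammatical rearrangement.

```python
# Illustrative sketch only: show translated words as soon as they arrive, then
# replace them with the finalized sentence once a sentence boundary is seen.
def display_incrementally(word_stream, finalize):
    pending = []
    for word in word_stream:
        pending.append(word)
        print("partial:", " ".join(pending))          # words shown with almost no delay
        if word.endswith((".", "?", "!")):            # sentence boundary reached
            print("final:  ", finalize(pending))      # rearranged, grammatical sentence
            pending = []

# Example: words arrive one at a time; finalize() simply joins them here.
display_incrementally(["Hurry,", "the", "train", "is", "leaving!"],
                      lambda words: " ".join(words))
```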
Although
Video glasses 1410 may translate recorded audio 1420 (e.g., the audio clip) into text, in a second language (e.g., English), using speech recognition software. In one implementation, speech recognition and language translation may be performed on recorded audio 1420 with speech recognition software provided in video glasses 1410. In another implementation, speech recognition and language translation may be performed on recorded audio 1420 with speech recognition software provided on a device communicating with video glasses 1410.
Video glasses 1410 may use face detection software to determine a location of first subject's 120 head. In one implementation, face detection may be performed on captured image 1430 with face detection software provided in video glasses 1410. In another implementation, face detection may be performed on captured image 1430 with face detection software provided on a device communicating with video glasses 1410.
Video glasses 1410 may create a speech balloon 1440, in the second language (e.g., English), that includes the translated text (e.g., “The meeting will begin with a short presentation about . . . ”) of recorded audio 1420. Based on the determined location of first subject's 120 head, video glasses 1410 may position speech balloon 1440 adjacent to first subject's 120 head. Video glasses 1410 may display speech balloon 1440 (e.g., on the lenses) adjacent to first subject's 120 head. Video glasses 1410 may automatically update the position of speech balloon 1440, with respect to first subject 120, if first subject 120 or the user wearing video glasses 1410 moves. Such an arrangement may enable the user wearing video glasses 1410 to obtain language translations on the fly. Video glasses 1410 may display and capture real-time video (e.g., for a deaf person watching a play). For example, in one implementation, video glasses 1410 may display speech balloon 1440 (or subtitles) on otherwise transparent glasses. In another implementation, video glasses 1410 may display real-time video of subject 120 along with speech balloon 1440 (or subtitles).
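A rough sketch of re-anchoring the balloon text on every video frame is shown below. OpenCV, the camera index, and the on-screen preview window are assumptions made for this example; actual video glasses would render onto their lenses rather than into a window.

```python
# Illustrative sketch only: update the balloon position frame by frame as the
# subject (or the wearer of the glasses) moves.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
capture = cv2.VideoCapture(0)                 # glasses-mounted camera (assumed index 0)

while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    for (x, y, w, h) in faces:
        # Re-anchor the balloon text next to the head in every frame.
        cv2.putText(frame, "The meeting will begin...", (x + w, max(20, y - 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 2)
    cv2.imshow("video glasses preview", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press 'q' to stop
        break

capture.release()
cv2.destroyAllWindows()
```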
Although
Although
As illustrated in
As further shown in
Returning to
As shown in
As further shown in
Returning to
As further shown in
As shown in
As further shown in
Returning to
Systems and/or methods described herein may provide a device that performs voice-controlled image editing.
The foregoing description of implementations provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.
For example, while a series of blocks has been described with regard to
It will be apparent that aspects, as described herein, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement these aspects is not limiting of the invention. Thus, the operation and behavior of these aspects were described without reference to the specific software code—it being understood that software and control hardware may be designed to implement these aspects based on the description herein.
Further, certain portions of the invention may be implemented as “logic” that performs one or more functions. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, or a combination of hardware and software.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the invention. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification.
It should be emphasized that the term “comprises/comprising” when used herein is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.
No element, act, or instruction used in the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.