When recording media such as audio and video, users of a media recording system may wish to remember specific moments in a recording by tagging those moments with comments, searchable metadata, or other tags based upon the content of the recording. Many current technologies, such as audio and video editing software, allow users to add such tags to recorded media manually after the content has been recorded.
Various embodiments are disclosed herein that relate to the automatic tagging of content, such that contextual tags are added to the content without manual user intervention. For example, one disclosed embodiment provides a computing device comprising a processor and memory having instructions executable by the processor to receive input data comprising one or more of depth data, video data, and directional audio data, identify a content-based input signal in the input data, and apply one or more filters to the input signal to determine whether the input signal comprises a recognized input. Further, if the input signal comprises a recognized input, then the instructions are executable to tag the input data with a contextual tag associated with the recognized input and to record the contextual tag with the input data to form recorded tagged data.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
As mentioned above, current methods for tagging recorded content with contextual tags involve manual user steps to locate frames or series of frames of image data, audio data, etc. for tagging, and to specify a tag that is to be applied at the selected frame or frames. Such steps involve time and effort on the part of a user, and therefore may be unsatisfactory for use environments in which content is viewed soon after recording, and/or where a user does not wish to perform such manual steps.
Accordingly, various embodiments are disclosed herein that relate to the automatic generation of contextual tags for recorded media. The embodiments disclosed herein may be used, for example, in a computing device environment where user actions are captured via a user interface comprising an image sensor, such as a depth-sensing camera and/or a conventional camera (e.g. a video camera) that allows images to be recorded for playback. The embodiments disclosed herein also may be used with a user interface comprising a directional microphone system. Contextual tags may be generated as image (and, in some embodiments, audio) data is collected and recorded, and therefore may be available for use and playback immediately after recording, without involving any additional manual user steps to generate the tags after recording. While described herein in the context of tagging data as the data is received from an input device, it will be understood that the embodiments disclosed herein also may be used with suitable pre-recorded data.
As described in more detail below, the input device 106 may comprise various sensors configured to provide input data to the computing device 102. Examples of sensors that may be included in the input device 106 include, but are not limited to, a depth-sensing camera, a video camera, and/or a directional audio input device such as a directional microphone array. In embodiments that comprise a depth-sensing camera, the computing device 102 may be configured to locate persons in image data acquired from the depth-sensing camera, and to track motions of the identified persons to determine whether any motions correspond to recognized inputs. The identification of a recognized input may trigger the automatic addition of tags associated with the recognized input to the recorded content. Likewise, in embodiments that comprise a directional microphone, the computing device 102 may be configured to associate speech input with a person in the image data via directional audio data. The computing device 102 may then record the input data and the contextual tag or tags to form recorded tagged data. The contextual tags may then be displayed during playback of the recorded tagged data, used to search for a desired segment in the recorded tagged data, or used in any other suitable manner.
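By way of non-limiting illustration, the following sketch outlines one possible shape of such an automatic tagging pipeline, in which per-frame sensor data is passed through a set of content filters and any recognized input yields a contextual tag recorded alongside the data. The names and structures shown (InputFrame, ContextualTag, tag_stream, and the filter objects) are hypothetical and are not drawn from any particular implementation.

```python
# Hypothetical sketch of the automatic tagging pipeline described above.
from dataclasses import dataclass


@dataclass
class InputFrame:
    """One time-step of input data from the sensors of input device 106."""
    timestamp: float
    depth: object = None           # depth image, e.g. a 2-D array of distances
    video: object = None           # visible-light image
    audio_direction: float = None  # arrival angle reported by the mic array
    audio_samples: object = None   # raw audio for this time-step


@dataclass
class ContextualTag:
    """A contextual tag recorded together with the input data."""
    timestamp: float
    label: str             # e.g. "kick", "high five", or a speech transcript
    person_id: int = None  # person the tag is associated with, if any
    display: bool = True   # False for search-only metadata


def tag_stream(frames, filters):
    """Yield (frame, tags) pairs; tags are generated without user intervention."""
    for frame in frames:
        tags = []
        for f in filters:
            recognized = f.match(frame)  # each filter inspects the content
            if recognized is not None:
                tags.append(ContextualTag(frame.timestamp, recognized))
        yield frame, tags                # recorded together as tagged data
```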
Prior to discussing embodiments of automatically generating contextual tags for recorded data, an example use environment is first described.
Computing device 102 is illustrated as comprising a logic subsystem 310 and a data-holding subsystem 312. Logic subsystem 310 may include one or more physical devices configured to execute one or more instructions. For example, the logic subsystem may be configured to execute one or more instructions that are part of one or more programs, routines, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more devices, or otherwise arrive at a desired result. The logic subsystem may include one or more processors that are configured to execute software instructions. Additionally or alternatively, the logic subsystem may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. The logic subsystem may optionally include individual components that are distributed throughout two or more devices, which may be remotely located in some embodiments.
Data-holding subsystem 312 may include one or more physical devices, which may be non-transitory, and which are configured to hold data and/or instructions executable by the logic subsystem to implement the herein described methods and processes. When such methods and processes are implemented, the state of data-holding subsystem 312 may be transformed (e.g., to hold different data). Data-holding subsystem 312 may include removable media and/or built-in devices. Data-holding subsystem 312 may include optical memory devices, semiconductor memory devices, and/or magnetic memory devices, among others. Data-holding subsystem 312 may include devices with one or more of the following characteristics: volatile, nonvolatile, dynamic, static, read/write, read-only, random access, sequential access, location addressable, file addressable, and content addressable. In some embodiments, logic subsystem 310 and data-holding subsystem 312 may be integrated into one or more common devices, such as an application specific integrated circuit or a system on a chip.
Display 104 may be used to present a visual representation of data held by data-holding subsystem 312. As the herein described methods and processes change the data held by the data-holding subsystem 312, and thus transform the state of the data-holding subsystem 312, the state of the display 104 may likewise be transformed to visually represent changes in the underlying data. The display 104 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic subsystem 310 and/or data-holding subsystem 312 in a shared enclosure, or, as depicted in
The depicted input device 106 comprises a depth sensor 320, such as a depth-sensing camera, an image sensor 322, such as a video camera, and a directional microphone array 324. Inputs received from the depth sensor 320 allow the computing device 102 to locate any persons in the field of view of the depth sensor 320, and also to track the motions of any such persons over time. The image sensor 322 is configured to capture visible images within the same field of view as, or a field of view overlapping that of, the depth sensor 320, to allow the matching of depth data with visible image data recorded for playback.
The directional microphone array 324 allows a direction from which a speech input is received to be determined, and therefore may be used in combination with other inputs (e.g. from the depth sensor 320 and/or the image sensor 322) to associate a received speech input with a particular person identified in depth data and/or image data. This may allow a contextual tag that is generated based upon a speech input to be associated with a particular user, as described in more detail below. It will be appreciated that the particular input devices shown in
Method 400 next comprises, at 410, identifying a content-based user input signal in the input data, wherein the term “content-based” signifies that the input signal is found within the content represented by the input data. Examples of such input signals include gestures and speech inputs made by a user. One example embodiment illustrating the identification of user input signals in input data is shown at 412-418. First, at 412, one or more persons are identified in depth data and/or other image data. Then, at 414, motions of each identified person are tracked. Further, at 416, one or more speech inputs may be identified in the directional audio input. Then, at 418, a person from whom a speech input is received is identified, and the speech inputs are associated with the identified person.
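As a non-limiting sketch of steps 412 and 414, the following code accumulates a per-person history of poses from successive depth frames; the locate_persons routine stands in for whatever person-location facility the depth camera pipeline provides and is assumed rather than defined here.

```python
# Hypothetical sketch of steps 412-414: persons identified in each depth frame
# are tracked over time so their motions can later be matched against filters.
from collections import defaultdict


def track_motions(depth_frames, locate_persons):
    """Return a per-person history of poses, keyed by person id."""
    histories = defaultdict(list)
    for frame in depth_frames:
        # locate_persons is assumed to return {person_id: pose} for the frame,
        # and each frame is assumed to carry a timestamp.
        for person_id, pose in locate_persons(frame).items():
            histories[person_id].append((frame.timestamp, pose))
    return histories
```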
Any suitable method may be used to identify a user input signal within input data. For example, motions of a person may be identified in depth data via techniques such as skeletal tracking, limb analysis, and background reduction or removal. Further, facial recognition methods, skeletal recognition methods, or the like may be used to more specifically identify the persons identified in the depth data. Likewise, a speech input signal may be identified, for example, by using directional audio information to isolate a speech input received from a particular direction (e.g. via nonlinear noise reduction techniques based upon the directional information), and also to associate the location from which the audio signal was received with a user being skeletally tracked. Further, the volume of a user's speech also may be tracked via the directional audio data. It will be understood that these specific examples of the identification of user inputs are presented for the purpose of example, and are not intended to be limiting in any manner. For example, other embodiments may comprise identifying only motion inputs (to the exclusion of audio inputs).
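The association of a speech input with a tracked person (steps 416 and 418) may, for example, reduce to comparing the arrival direction reported by the directional microphone array with the bearing of each tracked person, as in the following sketch; the angle convention and error threshold are illustrative assumptions.

```python
# Hypothetical sketch: attribute a speech input to the tracked person whose
# bearing best matches the direction reported by the microphone array.
def associate_speech(audio_direction, person_bearings, max_error=15.0):
    """Return the id of the person nearest the audio direction, or None."""
    best_id, best_error = None, max_error
    for person_id, bearing in person_bearings.items():
        # wrap-around difference between two angles, in degrees
        error = abs((audio_direction - bearing + 180.0) % 360.0 - 180.0)
        if error < best_error:
            best_id, best_error = person_id, error
    return best_id
```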
Method 400 next comprises, at 420, determining whether the identified user input is a recognized input. This may comprise, for example, applying one or more filters to motions identified in the input data via skeletal tracking to determine whether the motions are recognized motions, as illustrated at 422. If multiple persons are identified in the depth data and/or image data, then 422 may comprise determining whether each person performed a recognized motion.
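A non-limiting sketch of one such motion filter follows; here a tracked motion is reduced to a sequence of feature vectors and compared against a stored template, whereas a practical system might instead use a trained classifier or a distance measure such as dynamic time warping.

```python
# Hypothetical sketch of step 422: decide whether a tracked motion matches a
# stored template closely enough to count as a recognized input.
def motion_matches(motion, template, threshold=0.5):
    """motion, template: equal-length sequences of numeric feature vectors."""
    if len(motion) != len(template):
        return False
    # mean per-frame Euclidean distance between observed and template vectors
    total = 0.0
    for observed, expected in zip(motion, template):
        total += sum((o - e) ** 2 for o, e in zip(observed, expected)) ** 0.5
    return total / len(motion) < threshold
```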
Additionally, if it is determined that two or more persons performed recognized motions within a predetermined time relative to one another (e.g. wherein the motions are temporally overlapping or occur within a predefined temporal proximity), then method 400 may comprise, at 424, applying one or more group motion filters to determine whether the identified individual motions taken together comprise a recognized group motion. An example of this is illustrated in
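By way of illustration, the following sketch groups individually recognized motions that fall within a predefined temporal window and passes each group through a set of group motion filters (for example, two temporally proximate “arm raise” motions might together be recognized as a “high five”); the function names and the one-second window are illustrative assumptions only.

```python
# Hypothetical sketch of step 424: recognize group motions from individually
# recognized motions that occur within a predefined temporal proximity.
def find_group_motions(recognized, group_filters, window=1.0):
    """recognized: list of (timestamp, person_id, label) sorted by timestamp."""
    groups = []
    for i, (t0, p0, label0) in enumerate(recognized):
        cluster = [(p0, label0)]
        for t1, p1, label1 in recognized[i + 1:]:
            if t1 - t0 > window:  # outside the temporal proximity
                break
            if p1 != p0:          # another person moved at about the same time
                cluster.append((p1, label1))
        if len(cluster) > 1:
            for group_filter in group_filters:
                group_label = group_filter(cluster)  # returns a label or None
                if group_label is not None:
                    groups.append((t0, group_label))
    return groups
```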
Next, method 400 comprises, at 432, tagging the input data with a contextual tag associated with the recognized input, and recording the tagged data to form recorded tagged data. For example, where the recognized input is a recognized motion input, then the contextual tag may be related to the identified motion, as indicated at 434. Such a tag may comprise text commentary to be displayed during playback of a video image of the motion, or may comprise searchable metadata that is not displayed during playback. As an example of searchable metadata that is not displayed during playback, if a user performs a kick motion, a metadata tag identifying the motion as a kick may be applied to the input data. Then, a user later may easily locate the kick by performing a metadata search for segments identified by “kick” metadata tags. Further, where facial recognition methods are used to identify users located in the depth and/or image data, the contextual tag may comprise metadata identifying each user in a frame of image data (e.g. as determined via facial recognition). This may enable playback of the recording with names of the users in a recorded scene displayed during playback. Such tags may be added to each frame of image data, or may be added to the image data in any other suitable manner.
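As a non-limiting sketch of step 432 and of the metadata search it enables, the following code records each tag with its timestamp and then locates the moments labelled, for example, “kick”, without any manual review of the footage; the field names shown are illustrative.

```python
# Hypothetical sketch: record contextual tags with a recording, then search them.
def apply_tag(recording, timestamp, label, person_id=None, display=False):
    """Append a contextual tag to the recording's tag track."""
    recording.setdefault("tags", []).append(
        {"timestamp": timestamp, "label": label,
         "person_id": person_id, "display": display})


def search_tags(recording, query):
    """Return timestamps of all segments tagged with the queried label."""
    return [t["timestamp"] for t in recording.get("tags", [])
            if t["label"] == query]


# Example: a kick recognized at t = 12.4 s becomes searchable metadata.
recording = {}
apply_tag(recording, 12.4, "kick", person_id=1, display=False)
assert search_tags(recording, "kick") == [12.4]
```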
Likewise, a group motion-related tag may be added in response to a recognized group motion, as indicated at 436. One example of a group motion-related tag is shown in
Further, a speech-related tag may be applied for a recognized speech input, as indicated at 438. Such a speech-related tag may comprise, for example, text or audio versions of recognized words or phrases, metadata associating a received speech input with an identity of a user from whom the speech was received, or any other suitable information related to the content of the speech input. Further, the speech-related tag also may comprise metadata regarding a volume of the speech input, and/or any other suitable information related to audio presentation of the speech input during playback.
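As an illustrative, purely hypothetical example, a speech-related tag might carry the recognized text, the identity of the speaker determined from the directional audio data, and the measured volume, so that playback can indicate who said what and how loudly:

```python
# Hypothetical sketch of a speech-related tag (step 438).
def make_speech_tag(timestamp, text, person_id, volume_db):
    return {
        "timestamp": timestamp,
        "label": text,            # e.g. a recognized phrase
        "person_id": person_id,   # speaker identified via the microphone array
        "volume_db": volume_db,   # may guide audio presentation during playback
        "kind": "speech",
    }
```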
In this manner, a computing device that is recording an image of a scene may tag the recording with comments based upon what is occurring in the scene, thereby allowing playback of the scene with running commentary that is meaningful to the recorded scene. Further, metadata tags also may be automatically added to the recording to allow users to quickly search for specific moments in the recording.
Further, in some embodiments, a video and directional audio recording of users may be tagged with sufficient metadata to allow an animated version of the input data to be produced from the input data. This is illustrated at 440 in
It is to be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, or in some cases omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.