Various systems and techniques exist for capturing audiovisual data of a region and reviewing the data at a later time. For example, closed-circuit cameras and other security systems often are connected to recording systems that allow an operator to review any audio and/or video captured by the security system at a later date. Typically, the operator reviews such stored information by viewing the data at a normal speed, i.e., the speed at which any events captured by the security camera occurred originally. In some cases, an operator may be able to review captured data at a higher rate, for example, by fast-forwarding through a recorded video. Such techniques may allow for faster review of captured data.
According to an embodiment of the disclosed subject matter, a method of audio summarization includes obtaining a user preference indicating a sound signature of interest to a user, generating one or more designated audio segments of interest from an input audio stream based on the user preference, generating an event score for each of the designated audio segments of interest, the event score indicating the probability that an audio event associated with the sound signature occurs within the audio segment, and generating a summarized output audio stream by applying the event score to each of the designated audio segments of interest to emphasize sounds corresponding to the sound signature of interest to the user over sounds that do not correspond to the sound signature of interest to the user.
According to an embodiment of the disclosed subject matter, an apparatus for audio summarization includes a memory and a processor communicably coupled to the memory. In an embodiment, the processor is configured to execute instructions to obtain a user preference indicating a sound signature of interest to a user, to generate one or more designated audio segments of interest from an input audio stream based on the user preference, to generate an event score for each of the designated audio segments of interest, the event score indicating the probability that an audio event associated with the sound signature occurs within the audio segment, and to generate a summarized output audio stream by applying the event score to each of the designated audio segments of interest to emphasize sounds corresponding to the sound signature of interest to the user over sounds that do not correspond to the sound signature of interest to the user.
According to an embodiment of the disclosed subject matter, an apparatus for audio summarization includes an audio summarizer, which includes an audio marker configured to obtain a user preference indicating a sound signature of interest to a user and to generate one or more designated audio segments of interest from an input audio stream based on the user preference, and an audio compiler configured to generate an event score for each of the designated audio segments of interest, the event score indicating the probability that an audio event associated with the sound signature occurs within the audio segment, and to generate a summarized output audio stream by applying the event score to each of the designated audio segments of interest to emphasize sounds corresponding to the sound signature of interest to the user over sounds that do not correspond to the sound signature of interest to the user.
According to an embodiment of the disclosed subject matter, means for audio summarization are provided, which include means for obtaining a user preference indicating a sound signature of interest to a user, means for generating one or more designated audio segments of interest from an input audio stream based on the user preference, means for generating an event score for each of the designated audio segments of interest, the event score indicating the probability that an audio event associated with the sound signature occurs within the audio segment, and means for generating a summarized output audio stream by applying the event score to each of the designated audio segments of interest to emphasize sounds corresponding to the sound signature of interest to the user over sounds that do not correspond to the sound signature of interest to the user.
Additional features, advantages, and embodiments of the disclosed subject matter may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary and the following detailed description are illustrative and are intended to provide further explanation without limiting the scope of the claims.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate embodiments of the disclosed subject matter and together with the detailed description serve to explain the principles of embodiments of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and various ways in which it may be practiced.
Conventional techniques for reviewing captured audio and/or video may be quite time-intensive, since even in a fast-forward mode a user may be required to “fast-forward” through a relatively large amount of data before identifying a segment of audio and/or video that is of interest. For example, where a user only wishes to review the period of time during which a particular sound was captured by an audio device, the user may be required to review a relatively large amount of captured audio to find the desired sound. Thus, according to implementations disclosed herein, it may be desirable to create summaries of sound events over a given timespan from pre-recorded audio or audio-video data. It also may be desirable to present an audio or audio-video event within a shorter timespan than the entire timespan of the audio or audio-video event based on the need or desire of a specific user. For example, it may be desirable to present the user with enhanced relevant portions of an audio or audio-video event while eliminating or suppressing noises and other artifacts in a sound stream that are not included in the user's list of desirable types of sounds.
The presently-disclosed subject matter relates to methods and apparatus for creating summaries of sound events from an audio or audio-video recording. For example, summaries of sound events based on an audio or audio-video recording may be created based on the needs of a particular user, and such summaries of sound events may have a shorter timespan than the actual timespan of the audio or audio-video recording. As a specific example, upon receiving a user preference indicating a sound signature of interest, one or more designated audio segments of interest may be generated from an input audio stream based on the user preference. An event score may be generated to indicate the probability that an audio event associated with the sound signature occurs within the audio segment, and a summarized output audio stream may be generated by applying the event score to each of the designated audio segments of interest to emphasize sounds corresponding to the sound signature of interest to the user over sounds that do not correspond to the sound signature of interest to the user.
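By way of illustration, the overall flow described above may be expressed as a minimal Python sketch. The helper names below (detect_segments, score_segment, apply_emphasis) are hypothetical placeholders for the components described in the remainder of this disclosure, not required elements of any embodiment.

```python
# Minimal sketch of the summarization flow described above. All helper
# names are hypothetical placeholders, not elements of the disclosure.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

@dataclass
class Segment:
    start: int          # start sample index in the input audio stream
    end: int            # end sample index (exclusive)
    score: float = 0.0  # probability that the event of interest occurs here

def summarize(audio: np.ndarray,
              detect_segments: Callable[[np.ndarray], List[Segment]],
              score_segment: Callable[[np.ndarray], float],
              apply_emphasis: Callable[[np.ndarray, float], np.ndarray]) -> np.ndarray:
    """Generate a summarized output stream from a raw input stream."""
    # Designate audio segments of interest based on the user preference.
    segments = detect_segments(audio)
    # Generate an event score for each designated segment.
    for seg in segments:
        seg.score = score_segment(audio[seg.start:seg.end])
    # Apply the event scores to emphasize sounds matching the signature
    # of interest over sounds that do not, then assemble the output.
    parts = [apply_emphasis(audio[s.start:s.end], s.score) for s in segments]
    return np.concatenate(parts) if parts else np.zeros(0, dtype=audio.dtype)
```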
Referring to
In some embodiments, the recording devices 102, 104 and 106 may have cloud-recording capability, and audio summaries may be generated by the servers 112 according to the specific requirements or desires of each individual user. That is, the recording devices 102, 104, 106 may be able to record audio or audio-video data directly to a cloud-based storage or processing system. For example, an audio or audio-video recording may be summarized based on a particular sound signature pursuant to the requirement or specification of a particular user. As a specific example, a user may select or provide a sound signature, such as the sound of a child crying. In other examples, various types of sound signatures may include the sound of human speech, the sounds made by pets, such as dog barks and cat meows, the sounds associated with unauthorized entries, such as sounds of glass breaking or a door slamming, or sounds characteristic of a given location or environment. The sound signature may be identified by the user via a selection of an existing audio file, by the user providing a copy of the audio file, or the like. Alternatively or in addition, a system such as a smart home system may provide the user with one or more sound signatures that have been identified by the system, and allow the user to identify one or more of the sound signatures as being of interest to the user. In some implementations, potential sound signatures may be automatically identified by the system, such as where a smart home system has identified known sounds such as glass breaking, a pet noise, a child crying or talking, or the like. In another example, an audio or audio-video recording may be summarized based on the identity of the speaker. For example, a smart home system as disclosed herein may store a voiceprint or other user-specific sound signature of a user that is known to the smart home system. The user-specific sound signature may be used as the sound signature as disclosed herein. In various implementations, sound signatures associated with particular sources, for example, a specific sound signature associated with crying, laughing or speech of a particular child or a specific sound signature associated with a particular pet, may be identified by the user or by the system. In yet another example, an audio or audio-video recording may be summarized based on the location of the sound source. In other examples, summaries of audio or audio-video recordings may be generated based on various requirements of various users.
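By way of illustration, one plausible way to compare a stored sound signature against a window of recorded audio is a spectral similarity measure, as in the minimal Python sketch below. The fingerprinting and cosine-similarity approach is an assumption made for illustration; the disclosure does not prescribe any particular matching algorithm.

```python
# Illustrative sketch of matching audio against a stored sound signature
# using averaged magnitude spectra and cosine similarity. This is one
# plausible technique, not the method required by the disclosure.
import numpy as np

def spectral_fingerprint(audio: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Average magnitude spectrum over non-overlapping frames.
    Assumes the input contains at least n_fft samples."""
    n_frames = len(audio) // n_fft
    frames = audio[:n_frames * n_fft].reshape(n_frames, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def signature_similarity(window: np.ndarray, signature: np.ndarray) -> float:
    """Cosine similarity of fingerprints; near 1.0 suggests a match."""
    a = spectral_fingerprint(window)
    b = spectral_fingerprint(signature)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```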
In the example of the network illustrated in
In the example shown in
The summarizer 202 may transmit the summarized audio recordings to a network, such as a smart home system, a local system or a cloud network 108, which may store one or more copies of such recordings in the audio storage 204. In addition, the summarizer 202 may transmit, directly or through the network, a summarized audio recording, that is, a shortened version of the raw audio recording produced by enhancing one or more segments of the raw audio recording or suppressing one or more other segments of the raw audio recording based on the user preferences, to the user device 114. In an embodiment, only the audio data in an audio-video recording is summarized, and a shortened audio-video clip is provided by processing only the audio portion of the audio-video recording based on the user preferences. In an embodiment, upon summarization of the audio data, segments of raw video data corresponding to retained segments of audio data are retained, whereas segments of raw video data corresponding to segments of raw audio data suppressed or discarded by the audio summarization process are suppressed or discarded.
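By way of illustration, retaining only the video that accompanies retained audio may be sketched as follows. The interval representation and frame rate are assumptions made for illustration only.

```python
# Sketch of keeping only video frames whose timestamps fall within audio
# intervals retained by the summarization process, as described above.
from typing import List, Tuple

def retain_video_frames(n_frames: int, fps: float,
                        kept_audio_intervals: List[Tuple[float, float]]) -> List[int]:
    """Return indices of video frames to retain; all others are discarded."""
    kept = []
    for i in range(n_frames):
        t = i / fps  # timestamp of frame i in seconds
        if any(start <= t < end for start, end in kept_audio_intervals):
            kept.append(i)
    return kept

# Example: retain video for audio kept between 3.0-5.0 s and 9.5-12.0 s.
frames_to_keep = retain_video_frames(n_frames=900, fps=30.0,
                                     kept_audio_intervals=[(3.0, 5.0), (9.5, 12.0)])
```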
Referring to
In an embodiment, the audio enhancer 304 may be designed to make the audio more presentable based on the preference of a specific user. For example, a user may want to hear a conversation that is louder and crisper than what is present in the raw audio, and the audio enhancer 304 may create a richer audio quality experience by enhancing the relevant portions of the tagged raw audio data. Audio enhancement may be achieved by suppressing noise and other artifacts in the sound stream that are not included in the user's list of sound events. Other audio enhancement techniques, for example, frequency domain based techniques that suppress or mask out irrelevant or undesirable sound features, or audio signals with undesirable types of signatures, may also be incorporated. More generally, portions of the audio that are related to a sound signature selected by a user may be emphasized or enhanced, while portions of the audio that detract from or are unrelated to a sound signature selected by a user may be deemphasized, removed, or the like. As a specific example, if a user has indicated interest in a particular speaker's voice, all other voices identified in the audio may be removed, reduced in volume, or the like, so as to emphasize the desired speaker's voice in the audio. As another example, the audio may be played in temporal or chronological order, with a certain type of sound emphasized or enhanced based on the event score for the user's preferred detector.
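By way of illustration, the emphasis and suppression step may be sketched as a gain applied per sample, as below. The background attenuation factor is an illustrative choice, not a value taken from this disclosure.

```python
# Minimal sketch of emphasizing tagged audio over untagged audio: tagged
# samples keep full gain, untagged samples are attenuated rather than
# removed. A practical implementation might also smooth gain transitions
# to avoid audible clicks.
import numpy as np

def emphasize(audio: np.ndarray, tagged_mask: np.ndarray,
              background_gain: float = 0.2) -> np.ndarray:
    """tagged_mask is a boolean array with one entry per audio sample."""
    gains = np.where(tagged_mask, 1.0, background_gain)
    return audio * gains
```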
Alternatively, the tagged portions of the audio data may be passed directly to the audio compiler 306 without enhancement by the audio enhancer 304. In an embodiment, the audio compiler 306 receives the tagged portions of the audio data, with or without enhancement by the audio enhancer 304, and arranges the tagged portions of the audio data in a manner that is presentable and comprehensible to the user as a summarized output audio data stream in a relatively short amount of time compared to the entire length of the raw audio data stream.
In the example illustrated in
In an embodiment, the user preferences may include more than one user-specified selection to activate more than one of the selectors 406a, 406b, . . . 406g, thereby activating more than one of the detectors 408a, 408b, . . . 408g. For example, a user may wish to detect pet sounds and baby cries by activating the pet sound detector 408e and the baby cry detector 408f but not activating the detectors for other types of sounds. In that situation, the user-specified selection input 402 may activate two of the selectors 406e and 406f, and the pet sound detector 408e and the baby cry detector 408f may send positive signals to the audio tagger 410 to tag only portions of the raw audio data stream that include sounds associated with pet sounds or baby cries.
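By way of illustration, the selector, detector, and tagger arrangement may be sketched as follows, assuming each detector is a function that returns True when its sound type is present in a frame. The detector names shown are hypothetical labels for illustration.

```python
# Sketch of the selector/detector/tagger arrangement: the user's
# selections activate a subset of detectors, and a frame is tagged when
# any active detector fires.
from typing import Callable, Dict, List

import numpy as np

def tag_frames(frames: List[np.ndarray],
               detectors: Dict[str, Callable[[np.ndarray], bool]],
               selected: List[str]) -> List[bool]:
    active = {name: fn for name, fn in detectors.items() if name in selected}
    return [any(fn(frame) for fn in active.values()) for frame in frames]

# Example: activate only the pet-sound and baby-cry detectors.
# tags = tag_frames(frames, detectors, selected=["pet_sound", "baby_cry"])
```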
In addition to the examples of sound detectors illustrated in
In an embodiment, one or more of the detectors 408a, 408b, . . . 408g in
In various embodiments, methods are provided to generate summarized output audio or audio-video streams depending on the level of complexity desired by the application.
In an embodiment, tagged audio data may be concatenated and played out at the normal speed, or alternatively, at an increased speed, for example, at 1.5 times or 2 times the normal speed of play. In an embodiment, the speed of play may be variable, that is, adaptive to the probability of the tagged events. For example, referring to
In an embodiment, the playing speed of tagged audio data in a given frame may be set in inverse proportion to the event score for the frame in block 508. A specific playing speed may be assigned to each of the tagged audio frames based on the event score for each tagged audio frame in block 510. Thus, the playing speeds may be different for different tagged audio frames. In this embodiment, a tagged audio frame having a lower event score is played at a higher speed, whereas a tagged audio frame having a higher event score is played at a lower speed. In other words, frames that contain no audio events or relatively few audio events desired to be heard by the user are played at a higher speed over a shorter period of time, whereas frames that contain a large number of audio events desired to be heard by the user are played at a normal speed over a longer period of time. By playing audio frames with high event scores at a normal speed and playing audio frames with low event scores at a faster speed, the audio events that have high probabilities of containing sound signatures of interest as indicated by the user preferences are emphasized over audio events that have low probabilities of containing sound signatures of interest. In block 512, a tagged audio frame is resampled to the playing speed assigned to that particular tagged audio frame.
After a tagged audio frame is resampled to its assigned playing speed in block 512, a determination is made in block 514 as to whether one or more additional tagged audio frames are being passed to the audio compiler. If no more frames are detected in block 514, then the process ends in block 516. If one or more frames are detected in block 514, then the process repeats by reading additional tagged audio frames in block 504.
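By way of illustration, the score-dependent playback speed and resampling described in blocks 508-516 may be sketched as follows. The linear score-to-speed mapping and the interpolation-based resampler are illustrative assumptions; the disclosure requires only that higher event scores yield slower playback.

```python
# Sketch of adaptive-speed playback: each tagged frame's playing speed
# varies inversely with its event score, and the frame is resampled to
# that speed before compilation.
import numpy as np

def speed_for_score(score: float, v_min: float = 1.0, v_max: float = 3.0) -> float:
    """score in [0, 1]; a score of 1.0 yields normal speed, 0.0 the fastest."""
    return v_max - (v_max - v_min) * score

def resample_to_speed(frame: np.ndarray, speed: float) -> np.ndarray:
    """Time-compress a frame by the given speed factor (speed >= 1)."""
    n_out = max(1, int(len(frame) / speed))
    positions = np.linspace(0.0, len(frame) - 1, n_out)
    return np.interp(positions, np.arange(len(frame)), frame)

def compile_summary(tagged_frames, scores):
    """Resample every tagged frame to its assigned speed and concatenate."""
    out = [resample_to_speed(f, speed_for_score(s))
           for f, s in zip(tagged_frames, scores)]
    return np.concatenate(out)
```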
In an embodiment, the playing speed of a particular tagged audio frame may depend on the type of sound detected and tagged by the audio marker 302 as shown in
In an embodiment, instead of varying the speed of play of tagged audio data, the tagged audio data may be concatenated and divided into shorter clips of approximately equal lengths. For example, an audio recording containing tagged audio data having a total length of one minute may be divided into six clips of ten seconds each. Each of the clips need not have exactly the same length. For example, some of the clips may have a length of nine seconds while some of the other clips may have a length of eleven seconds without seriously affecting the listening experience of the user. All of these shorter clips of tagged audio data may be played concurrently. The volume of each clip may be gradually increased and then decreased one by one, for example. The volume of a given audio clip may be increased by an amount that is loud enough to move that audio clip into the foreground, but not loud enough to mask out the other clips in the background.
In an embodiment, by increasing the volume of one clip while decreasing the volumes of other clips and repeating the process for each of the clips successively, the discrimination capability of a human brain may be utilized to track sounds even after they move from the foreground to the background. Thus, audio clips that have high probabilities of containing sound signatures of interest may be emphasized over audio clips that have low probabilities of containing sound signatures of interest. Moreover, if multiple loudspeakers are provided, the human brain may be able to discriminate the sounds more effectively by playing clips that are similar to one another from different loudspeakers.
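By way of illustration, the concurrent playback with a rotating foreground clip may be sketched as follows. The envelope shape and the foreground and background gain levels are illustrative assumptions.

```python
# Sketch of concurrent clip playback: all clips are mixed together, and
# a gain envelope raises each clip into the foreground in turn, loud
# enough to lead but not loud enough to mask the background clips.
import numpy as np

def mix_with_rotating_foreground(clips, fg_gain=1.0, bg_gain=0.3):
    n = max(len(c) for c in clips)
    slice_len = n // len(clips)  # time slice during which each clip leads
    mix = np.zeros(n)
    for i, clip in enumerate(clips):
        gains = np.full(n, bg_gain)
        # Ramp this clip up to the foreground level during its slice and
        # back down, so a listener can track it among the other clips.
        start, end = i * slice_len, min((i + 1) * slice_len, n)
        ramp = np.hanning(max(end - start, 2)) * (fg_gain - bg_gain) + bg_gain
        gains[start:end] = ramp[:end - start]
        mix += np.pad(clip, (0, n - len(clip))) * gains
    return mix
```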
In an embodiment, the starting time of each of the tagged clips may be adjusted in such a manner that little or no overlap occurs between the tagged clips covering events that have high event scores. When the clips are being played out, the sound volume may be increased only for portions of the clips that have high event scores. Other techniques may also be applied to minimize overlaps between audio clips which include events that have high event scores. For example, the length of each of the clips may be adjusted to minimize the overlap of high scoring events between the clips. In another example, the user may be allowed to intervene or to override automatic playing of the clips to enable a particular clip in the foreground that sounds interesting to continue playing.
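By way of illustration, one simple strategy for staggering clip start times so that high-scoring events do not coincide is the greedy search sketched below; the disclosure leaves the particular optimization technique open.

```python
# Sketch of staggering clip start offsets so that the high-scoring event
# in each clip is separated from those of the other clips by at least a
# minimum gap. A greedy search is used purely for illustration.
from typing import List

def choose_offsets(event_times: List[float], min_gap: float = 1.0) -> List[float]:
    """event_times[i] is the high-score event time (s) within clip i.
    Returns a start delay for each clip so events land min_gap apart."""
    offsets, placed = [], []
    for t in event_times:
        offset = 0.0
        # Delay this clip until its event is clear of already-placed events.
        while any(abs((t + offset) - e) < min_gap for e in placed):
            offset += min_gap
        offsets.append(offset)
        placed.append(t + offset)
    return offsets
```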
In an embodiment, if the audio is part of an audio-video stream provided by a recording device, the video portion of the audio-video stream may be utilized to help guide the user on what is being heard. For example, the video portion of the audio-video stream may provide additional context to the sound that is being heard. In an embodiment, the tagged audio-video data may be concatenated and then divided into shorter clips. These shorter clips of tagged audio-video data may be played out simultaneously. The volume of the audio portion of each clip may be gradually increased and then decreased one by one, for example.
In an embodiment, the volume of the audio portion of a given clip may be increased by an amount that is loud enough to move that clip into the foreground, but not loud enough to mask out the other clips in the background. At the same time, the corresponding video portion of each tagged audio-video clip may be enhanced and faded in a manner that matches the increase and decrease in the volume of the audio portion. The increase and decrease of sound volume and the enhancement and fading of the corresponding video may be repeated for each of the clips successively.
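By way of illustration, matching the video fade to the audio gain envelope may be sketched as scaling each video frame by the same per-instant factor used for the clip's audio volume. The frame representation (float arrays normalized to [0, 1]) is an assumption made for illustration.

```python
# Sketch of fading video in step with the audio gain envelope: each
# frame is scaled toward black by the gain applied to the clip's audio
# at the same instant.
import numpy as np

def fade_video_to_gain(frames: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """frames: (n, height, width, 3) float array; gains: n per-frame values."""
    return frames * gains[:, None, None, None]
```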
In an embodiment, the tagged audio-video clips may be aligned such that the high scoring events have little or no overlap between the clips. For example, overlaps between tagged audio-video clips may be minimized by varying the starting time or the length of each clip. Moreover, both the audio and video portions of audio-video clips may be enhanced over the high scoring event. In some implementations, it may be easier to detect certain types of sounds by sound detectors, such as detectors 408a, 408b, . . . 408g in the audio marker 302 of
Summarized audio or audio-video data may be presented to the user in various manners. For example, the summarized audio or audio-video data may be stored in a storage, for example, the storage 204 in the network as shown in
Embodiments of the presently disclosed subject matter may be implemented in and used with a variety of component and network architectures. For example, the bank of servers 112 as shown in
The bus 21 allows data communication between the central processor 24 and one or more memory components, which may include RAM, ROM, and other memory, as previously noted. Typically, RAM is the main memory into which an operating system and application programs are loaded. A ROM or flash memory component can contain, among other code, the Basic Input/Output System (BIOS) which controls basic hardware operation such as the interaction with peripheral components. Applications resident with the computer 20 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed storage 23), an optical drive, floppy disk, or other storage medium.
The fixed storage 23 may be integral with the computer 20 or may be separate and accessed through other interfaces. The network interface 29 may provide a direct connection to a remote server via a wired or wireless connection. The network interface 29 may provide such connection using any suitable technique and protocol as will be readily understood by one of skill in the art, including digital cellular telephone, Wi-Fi, Bluetooth®, near-field, and the like. For example, the network interface 29 may allow the computer to communicate with other computers via one or more local, wide-area, or other communication networks, as described in further detail below.
Many other devices or components (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras and so on). Conversely, all of the components shown in
More generally, various embodiments of the presently disclosed subject matter may include or be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments also may be embodied in the form of a computer program product having computer program code containing instructions embodied in non-transitory or tangible media, such as floppy diskettes, CD-ROMs, hard drives, USB (universal serial bus) drives, or any other machine readable storage medium, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. Embodiments also may be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
In some configurations, a set of computer-readable instructions stored on a computer-readable storage medium may be implemented by a general-purpose processor, which may transform the general-purpose processor or a device containing the general-purpose processor into a special-purpose device configured to implement or carry out the instructions. Embodiments may be implemented using hardware that may include a processor, such as a general purpose microprocessor or an Application Specific Integrated Circuit (ASIC) that embodies all or part of the techniques according to embodiments of the disclosed subject matter in hardware or firmware. The processor may be coupled to memory, such as RAM, ROM, flash memory, a hard disk or any other device capable of storing electronic information. The memory may store instructions adapted to be executed by the processor to perform the techniques according to embodiments of the disclosed subject matter.
In some embodiments, the recording devices 102, 104 and 106 as shown in
In general, a “sensor” as disclosed herein may include multiple sensors or sub-sensors, such as where a position sensor includes both a global positioning sensor (GPS) as well as a wireless network sensor, which provides data that can be correlated with known wireless networks to obtain location information. Multiple sensors may be arranged in a single physical housing, such as where a single device includes movement, temperature, magnetic, or other sensors. Such a housing also may be referred to as a sensor or a sensor device. For clarity, sensors are described with respect to the particular functions they perform or the particular physical hardware used, when such specification is necessary for understanding of the embodiments disclosed herein.
A sensor may include hardware in addition to the specific physical sensor that obtains information about the environment.
Sensors as disclosed herein may operate within a communication network, such as a conventional wireless network, or a sensor-specific network through which sensors may communicate with one another or with dedicated other devices. In some configurations one or more sensors may provide information to one or more other sensors, to a central controller, or to any other device capable of communicating on a network with the one or more sensors. A central controller may be general- or special-purpose. For example, one type of central controller is a home automation network that collects and analyzes data from one or more sensors within the home. Another example of a central controller is a special-purpose controller that is dedicated to a subset of functions, such as a security controller that collects and analyzes sensor data primarily or exclusively as it relates to various security considerations for a location. A central controller may be located locally with respect to the sensors with which it communicates and from which it obtains sensor data, such as in the case where it is positioned within a home that includes a home automation or sensor network. Alternatively or in addition, a central controller as disclosed herein may be remote from the sensors, such as where the central controller is implemented as a cloud-based system that communicates with multiple sensors, which may be located at multiple locations and may be local or remote with respect to one another.
The sensor network shown in
The smart home environment can control and/or be coupled to devices outside of the structure. For example, one or more of the sensors 71, 72 may be located outside the structure, for example, at one or more distances from the structure (e.g., sensors 71, 72 may be disposed outside the structure, at points along a land perimeter on which the structure is located, and the like). One or more of the devices in the smart home environment need not physically be within the structure. For example, the controller 73 which may receive input from the sensors 71, 72 may be located outside of the structure.
The structure of the smart-home environment may include a plurality of rooms, separated at least partly from each other via walls. The walls can include interior walls or exterior walls. Each room can further include a floor and a ceiling. Devices of the smart-home environment, such as the sensors 71, 72, may be mounted on, integrated with and/or supported by a wall, floor, or ceiling of the structure.
The smart-home environment including the sensor network shown in
A user can interact with one or more of the network-connected smart devices (e.g., via the network 70). For example, a user can communicate with one or more of the network-connected smart devices using a computer (e.g., a desktop computer, laptop computer, tablet, or the like) or other portable electronic device (e.g., a smartphone, a tablet, a key FOB, and the like). A webpage or application can be configured to receive communications from the user and control the one or more of the network-connected smart devices based on the communications and/or to present information about the device's operation to the user. For example, the user can view the status of, and can arm or disarm, the security system of the home.
One or more users can control one or more of the network-connected smart devices in the smart-home environment using a network-connected computer or portable electronic device. In some examples, some or all of the users (e.g., individuals who live in the home) can register their mobile device and/or key FOBs with the smart-home environment (e.g., with the controller 73). Such registration can be made at a central server (e.g., the controller 73 and/or the remote system 74) to authenticate the user and/or the electronic device as being associated with the smart-home environment, and to provide permission to the user to use the electronic device to control the network-connected smart devices and the security system of the smart-home environment. A user can use their registered electronic device to remotely control the network-connected smart devices and security system of the smart-home environment, such as when the occupant is at work or on vacation. The user may also use their registered electronic device to control the network-connected smart devices when the user is located inside the smart-home environment.
Moreover, the smart-home environment may make inferences about which individuals live in the home and are therefore users and which electronic devices are associated with those individuals. As such, the smart-home environment may “learn” who is a user (e.g., an authorized user) and permit the electronic devices associated with those individuals to control the network-connected smart devices of the smart-home environment, in some embodiments including sensors used by or within the smart-home environment. Various types of notices and other information may be provided to users via messages sent to one or more user electronic devices. For example, the messages can be sent via email, short message service (SMS), multimedia messaging service (MMS), unstructured supplementary service data (USSD), as well as any other type of messaging services or communication protocols.
A smart-home environment may include communication with devices outside of the smart-home environment but within a proximate geographical range of the home. For example, the smart-home environment may communicate information through the communication network or directly to a central server or cloud-computing system regarding detected movement or presence of people, animals, and any other objects, and receive back commands for controlling the lighting accordingly.
The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit embodiments of the disclosed subject matter to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of embodiments of the disclosed subject matter and their practical applications, to thereby enable others skilled in the art to utilize those embodiments as well as various embodiments with various modifications as may be suited to the particular use contemplated.