Video and other cameras are installed in many public and private places, e.g., to provide security or monitoring, and/or may otherwise be present in a location. The number of cameras has been increasing dramatically in recent years. Traditionally, a security guard or other personnel may have monitored in real time, e.g., on a set of display screens, the respective feed from each of a plurality of cameras. Increasingly, automated ways to monitor and otherwise consume video and/or other image data may be required.
Some cameras have network or other connections to provide feeds to a central location. Techniques based on the detection of motion in a segment of video data have been provided to identify through automated processing a subject that may be of interest. For example, bounding boxes have been used to detect an object moving through a static scene in a segment of video. However, such techniques may be imprecise, identifying a box or other area much larger than the actual subject of interest, and the inaccuracy of such techniques may increase as the speed of movement increases. Also, a non-human animal or a piece of paper or other debris blowing through a scene may be detected by such techniques, when only a human subject may be of interest.
Techniques to highlight a subject of interest in a segment of video, such as by drawing a box or other solid line around a subject of interest, have been provided, but the quality and usefulness of such highlighting have been limited by the low level of accuracy and precision with which subjects of interest have been able to be identified through the motion-based techniques mentioned above.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Segmentation-based techniques to identify and/or highlight a subject of interest in a portion of video are disclosed. In various embodiments, visual content (e.g., a single image, successive frames of video, etc.) is sent to a cloud-based or other remote service. The service processes each image/frame to identify one or more subjects of interest. A mask layer to highlight the subject(s) of interest is generated and provided to a rendering site. The rendering site uses the original visual content and the mask layer to generate and display a modified visual content (e.g., modified image or video) in which the subject(s) of interest is/are highlighted. For example, a subject of interest may be highlighted by showing an outline of the subject, displaying the subject in a distinctive color or shading, selectively blurring content immediately and/or otherwise around the subject of interest, etc.
In various embodiments, video data generated by video camera 102 is processed internally, for example by an agent or other code running on a processor included in video camera 102, to process at least a subset of frames comprising the video content at least in part by making for each such frame a call across the Internet 108 and/or one or more other networks to a remote segmentation service 110. A copy of the video frame is cached, e.g., at video camera 102 and/or at client system 104, awaiting further processing based at least in part on a response received from the remote service with respect to the frame. Segmentation service 110 processes each frame (or single image) in a manner determined at least in part by configuration data 112. For example, configuration data 112 may include, for a user associated with client system 104, data indicating how video/image content associated with that user is to be processed. Examples include without limitation which types of objects are desired to be identified and highlighted in video associated with the user, a manner in which objects of interest are to be highlighted (e.g., selective blurring, etc.), etc.
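The per-frame flow described above — cache a copy of the frame locally, dispatch a request to the remote service, and match the eventual response back to the cached frame — can be sketched as follows. This is a minimal illustration only: the service URL, the frame-ID scheme, and the payload fields are assumptions for the sketch and are not specified by the disclosure.

```python
import hashlib
import json

# Assumed endpoint for the remote segmentation service (illustrative only).
SERVICE_URL = "https://segmentation.example.com/v1/segment"

class FrameDispatcher:
    """Caches each frame locally while a segmentation request is in flight."""

    def __init__(self):
        self._cache = {}  # frame_id -> raw frame bytes

    def prepare_request(self, frame_bytes, user_id):
        # Derive a stable frame ID so the service's response can be
        # matched back to the locally cached copy of the frame.
        frame_id = hashlib.sha256(frame_bytes).hexdigest()[:16]
        self._cache[frame_id] = frame_bytes
        payload = {
            "frame_id": frame_id,
            "user_id": user_id,  # lets the service apply per-user config
            "size": len(frame_bytes),
        }
        return SERVICE_URL, json.dumps(payload)

    def pop_cached_frame(self, frame_id):
        # Called when the mask layer (or likelihood map) for frame_id arrives.
        return self._cache.pop(frame_id)
```

In practice the agent would transmit the frame itself (or a compressed encoding of it) and handle timeouts and retries; those details are omitted here.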
In the example shown, segmentation service 110 performs segmentation, i.e., identifies objects of interest within frames of video content or other images, at least in part by calling a pixel labeling network 114. Pixel labeling network 114 may comprise a multi-layer neural network configured to relatively quickly compute for each pixel comprising a video frame a probability that the pixel is associated with an object of interest. For example, for each pixel, a probability that the pixel displays a part of a human body may be computed. In various embodiments, training data 116 may be used to train the neural network 114 to determine accurately and quickly a probability that a pixel is associated with an object of interest.
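The per-pixel output described above can be illustrated with a small sketch that converts raw per-pixel scores (logits, as such a network might emit) into probabilities via the logistic function. The use of logits and a logistic activation is an assumption for illustration; the disclosure does not specify the network's internal architecture.

```python
import math

def pixel_probabilities(score_map):
    """Map raw per-pixel scores (logits) to probabilities that each
    pixel is associated with an object of interest, e.g., part of a
    human body, using the logistic (sigmoid) function."""
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in score_map]
```

A strongly negative score maps to a probability near 0, a strongly positive score to a probability near 1, and a zero score to exactly 0.5.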
In various embodiments, probabilities received by segmentation service 110 from the pixel labeling network 114 may be used to determine for a frame of video content (or other image) a likelihood map indicating the coordinates within the frame that are likely, based on the pixel-level probabilities, to be associated with an object of interest, such as a person or a portion thereof. The likelihood map is used in various embodiments to generate and return to client system 104 a mask layer to be combined with or otherwise applied to the original frame to generate a modified frame in which the detected object(s) of interest is/are highlighted. In some embodiments, the likelihood map is returned to client system 104 and client code running on client system 104 generates the mask layer.
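One simple way to derive a mask layer from a likelihood map, as the client code (or the service) might do, is to threshold the per-pixel likelihoods into a binary mask. The 0.5 threshold here is an illustrative assumption; any thresholding or smoothing scheme could be substituted.

```python
def mask_from_likelihood(likelihood_map, threshold=0.5):
    """Produce a binary mask layer: 1 where the pixel is likely part
    of an object of interest, 0 elsewhere. The default threshold of
    0.5 is an assumption for illustration."""
    return [[1 if p >= threshold else 0 for p in row]
            for row in likelihood_map]
```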
In various embodiments, a sequence of video frames to which associated mask layers have been applied may be rendered via display device 106 to provide a display video in which the object(s) of interest is/are highlighted, e.g., as they move (or not) through a scene. In various embodiments, the background/scene may be static (e.g., stationary video camera) or dynamic (e.g., panning video camera). Whether the object of interest (e.g., person) moves through successive frames or not, in various embodiments techniques disclosed herein enable an object of interest to be identified in successive frames and highlighted as configured and/or desired.
While some examples described herein involve successive frames of video content, in various embodiments techniques disclosed herein may be applied to images not comprising video content, such as a digital photo or other non-video image. The term “visual content data” is used herein to refer to both video content, e.g., comprising a sequence of frames each comprising a single image, as well as single, static images.
In the example shown, a segmentation mask layer 304 has been received that embodies data identifying four objects of interest and, for each, a corresponding outline/extent. In the example shown, four subjects having human form have been identified. Note that the statue has been identified as human even though it is inanimate. Also, differences in size/scale and differences in the speed at which objects of interest may be moving through the depicted scene have not affected the fidelity with which human figures have been identified. The original video frame 302 and the segmentation mask layer 304 are combined by a process or module 306 to produce a modified display frame 308. In this example, in the combined display frame 308 the objects of interest are shown in their original form and regions around them have been selectively blurred, as indicated by the dashed lines used to show non-human objects such as the pedestal and the car.
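The combining step performed by process or module 306 can be sketched as follows, using selective blurring as the highlighting mode: pixels inside the mask are passed through unchanged, while pixels outside the mask are replaced by a local average (a simple 3x3 box blur). The single-channel frame representation and the box-blur kernel are assumptions for this sketch; any blur or other highlighting treatment could be substituted per the configuration data.

```python
def box_blur(frame, y, x):
    """Average the 3x3 neighborhood around (y, x), clamped at the edges."""
    h, w = len(frame), len(frame[0])
    vals = [frame[i][j]
            for i in range(max(0, y - 1), min(h, y + 2))
            for j in range(max(0, x - 1), min(w, x + 2))]
    return sum(vals) / len(vals)

def combine(frame, mask):
    """Keep object-of-interest pixels intact; blur everything else."""
    return [[frame[y][x] if mask[y][x] else box_blur(frame, y, x)
             for x in range(len(frame[0]))]
            for y in range(len(frame))]
```

Applying this per frame, with the mask layer received for that frame, yields the modified display frames described below.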
In various embodiments, successive modified display frames, such as display frame 308, may be generated and displayed in sequence to provide a modified moving video content in which objects of interest are highlighted as disclosed herein, e.g., while such objects of interest move through a video scene depicting a real world location or set of locations.
In various embodiments, a cloud-based segmentation service as disclosed herein may be called and may return a mask layer that identifies portions of a frame of video or other image as being associated with an object of interest. In some embodiments, a local process (e.g., camera 102 and/or client 104 of
In various embodiments, techniques disclosed herein may be used to identify an object of interest in visual content data quickly and to generate and render modified visual content data in which such objects are highlighted in a desired manner.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.