Some security systems are able to capture videos of a person, analyze the person's movements, and generate an image or video dataset with associated metadata. To identify human actions captured by the system's security camera videos, a person needs to manually annotate the videos. This may be time consuming, especially where the positions and angles of the video cameras vary and might not provide adequate coverage. Multiple cameras may be used in a controlled environment; however, the variation in subjects, movements, and backgrounds may still be substantially limited. Another solution uses computer graphics as a dataset source. However, this approach is expensive, and the data may be proprietary. Conventional annotation tools may be used to review human actions in videos. However, such annotation tools are not intuitive to use and require much time for users to identify and annotate actions captured in the videos.
Implementations generally provide a 2-dimensional dataset from 2-dimensional and 3-dimensional computer vision techniques. In some implementations, a system includes one or more processors, and includes logic encoded in one or more non-transitory computer-readable storage media for execution by the one or more processors. When executed, the logic is operable to cause the one or more processors to perform operations including: obtaining a plurality of 2-dimensional (2D) videos of a subject performing at least one action; generating a 3-dimensional (3D) model based on the plurality of 2D videos; generating a 3D scene based on the 3D model; and generating a 2D dataset based on the 3D scene.
With further regard to the system, in some implementations, the plurality of 2D videos is synchronized. In some implementations, the plurality of 2D videos is obtained from a plurality of physical cameras that are positioned at arbitrary locations in a physical environment. In some implementations, the logic when executed is further operable to cause the one or more processors to perform operations including obtaining one or more annotations associated with the plurality of 2D videos. In some implementations, the logic when executed is further operable to cause the one or more processors to perform operations including: determining one or more model data modifications to 3D model data; and applying the one or more model data modifications to the 3D model data. In some implementations, the logic when executed is further operable to cause the one or more processors to perform operations including: determining one or more scene settings; generating one or more virtual cameras; and adding the one or more virtual cameras to the 3D scene based on the one or more scene settings. In some implementations, the logic when executed is further operable to cause the one or more processors to perform operations including: obtaining one or more annotations associated with the plurality of 2D videos; and applying the one or more annotations to the 2D dataset.
In some embodiments, a non-transitory computer-readable storage medium with program instructions thereon is provided. When executed by one or more processors, the instructions are operable to cause the one or more processors to perform operations including: obtaining a plurality of 2-dimensional (2D) videos of a subject performing at least one action; generating a 3-dimensional (3D) model based on the plurality of 2D videos; generating a 3D scene based on the 3D model; and generating a 2D dataset based on the 3D scene.
With further regard to the computer-readable storage medium, in some implementations, the plurality of 2D videos is synchronized. In some implementations, the plurality of 2D videos is obtained from a plurality of physical cameras that are positioned at arbitrary locations in a physical environment. In some implementations, the instructions when executed are further operable to cause the one or more processors to perform operations including obtaining one or more annotations associated with the plurality of 2D videos. In some implementations, the instructions when executed are further operable to cause the one or more processors to perform operations including: determining one or more model data modifications to 3D model data; and applying the one or more model data modifications to the 3D model data. In some implementations, the instructions when executed are further operable to cause the one or more processors to perform operations including: determining one or more scene settings; generating one or more virtual cameras; and adding the one or more virtual cameras to the 3D scene based on the one or more scene settings. In some implementations, the instructions when executed are further operable to cause the one or more processors to perform operations including: obtaining one or more annotations associated with the plurality of 2D videos; and applying the one or more annotations to the 2D dataset.
In some implementations, a method includes: obtaining a plurality of 2-dimensional (2D) videos of a subject performing at least one action; generating a 3-dimensional (3D) model based on the plurality of 2D videos; generating a 3D scene based on the 3D model; and generating a 2D dataset based on the 3D scene.
With further regard to the method, in some implementations, the plurality of 2D videos is synchronized. In some implementations, the plurality of 2D videos is obtained from a plurality of physical cameras that are positioned at arbitrary locations in a physical environment. In some implementations, the method further includes obtaining one or more annotations associated with the plurality of 2D videos. In some implementations, the method further includes: determining one or more model data modifications to 3D model data; and applying the one or more model data modifications to the 3D model data. In some implementations, the method further includes: determining one or more scene settings; generating one or more virtual cameras; and adding the one or more virtual cameras to the 3D scene based on the one or more scene settings.
A further understanding of the nature and the advantages of particular implementations disclosed herein may be realized by reference to the remaining portions of the specification and the attached drawings.
Embodiments described herein enable, facilitate, and manage the creation of a synthetic 2-dimensional (2D) dataset using 2D and 3-dimensional (3D) computer vision techniques. Embodiments combine existing 3D reconstruction techniques and computer vision techniques to generate the 2D dataset. Embodiments generate the 2D dataset with arbitrary points of view from multiple-angle 2D cameras. Embodiments also provide a user interface for video annotation tools.
In various embodiments, a system obtains 2D videos of a subject performing one or more actions. The system then generates a 3D model based on the 2D videos, and then generates a 3D scene based on the 3D model. The system then generates a 2D dataset based on the 3D scene. Although embodiments disclosed herein are described in the context of subjects being humans, these embodiments may also apply to other subjects such as animals, smart mechanical devices, etc. that may perform actions. The 2D dataset may be used for training in the context of machine learning or deep learning.
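By way of illustration only, the following Python sketch outlines the four stages of this pipeline as placeholder functions. The function names, signatures, and data types are hypothetical and are not prescribed by any particular embodiment; they simply show how the stages might be chained together.

```python
# Hypothetical outline of the pipeline described above; the function names
# and types are illustrative placeholders, not a prescribed API.
from typing import Any, List

def obtain_2d_videos(camera_ids: List[str]) -> List[Any]:
    """Collect synchronized 2D videos from the physical cameras."""
    raise NotImplementedError  # e.g., read video files or live camera streams

def generate_3d_model(videos: List[Any]) -> Any:
    """Reconstruct a 3D model (e.g., a point cloud or mesh) from the 2D videos."""
    raise NotImplementedError  # e.g., structure-from-motion reconstruction

def generate_3d_scene(model: Any, scene_settings: dict) -> Any:
    """Place the 3D model in a scene with backgrounds and virtual cameras."""
    raise NotImplementedError

def generate_2d_dataset(scene: Any, annotations: List[dict]) -> Any:
    """Render 2D videos from the virtual cameras and attach transformed annotations."""
    raise NotImplementedError

def run_pipeline(camera_ids, scene_settings, annotations):
    videos = obtain_2d_videos(camera_ids)
    model = generate_3d_model(videos)                 # block 204
    scene = generate_3d_scene(model, scene_settings)  # block 206
    return generate_2d_dataset(scene, annotations)    # block 208
```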
In various embodiments, a system obtains at least one video of at least one object or subject performing one or more actions. The system displays one or more portions of the video in a user interface. The system also displays annotation tracks in the user interface, where each annotation track is associated with one or more observed subjects and with at least one action in the video. In various embodiments, the system obtains one or more annotations associated with the video and based on the user interaction with the annotation tracks. Although embodiments disclosed herein are described in the context of subjects being humans, these embodiments may also apply to other objects such as animals, smart mechanical devices, etc. that may perform actions.
As shown, system 102 monitors the activity of a subject 108 in an activity area 110 using physical video cameras 112, 114, 116, and 118, which capture video of subject 108 at different angles. In various embodiments, physical video cameras 112, 114, 116, and 118 are positioned at arbitrary locations in order to capture multiple videos and/or still images at different points of view of the same subject. The terms cameras and video cameras may be used interchangeably.
Subject 108 may also be referred to as a person 108 or target user 108. In various embodiments, system 102 may utilize deep machine learning and computer vision techniques to detect and measure the body positions and movements of subject 108. As described in more detail herein, embodiments generate a 2D dataset with arbitrary points of view from 2D multiple-angle cameras. Embodiments combine existing 3D reconstruction techniques and computer vision techniques to generate the 2D dataset. Embodiments may be applied in various contexts such as for content creation for games or entertainment. For example, the system may capture players' 3D models in a game facility, where players use 3D models as their avatars in 3D game space. Embodiments may also expand annotations for 3D video and/or virtual reality content in addition to 2D video.
For ease of illustration,
While system 102 performs embodiments described herein, in other embodiments, any suitable component or combination of components associated with system 102 or any suitable processor or processors associated with system 102 may facilitate performing the embodiments described herein.
At block 204, the system generates a 3D model based on the 2D videos. Further example embodiments directed to generating a 3D model based on the 2D videos are described in more detail below.
At block 206, the system generates a 3D scene based on the 3D model. The following description provides example embodiments involved in the generation of a 3D scene, which is used for generating a 2D dataset.
Shown are physical video cameras 112, 114, 116, and 118, which are capturing videos of subject 108 at different angles. Also shown are virtual video cameras 302, 304, 306, 308, 310, 312, and 314. While 4 physical video cameras and 7 virtual video cameras are shown, the actual number of physical and virtual video cameras may vary and will depend on the particular implementation.
In various embodiments, the positions and angles of physical video cameras 112, 114, 116, and 118 are limited and might not provide adequate coverage of subject 108 due to their limited numbers. However, the system may generate as many virtual video cameras as needed to provide adequate coverage of subject 108. Also, the system may position the virtual cameras at many different locations and angles. As such, if subject 108 picks up an object, the existing physical video cameras and any number of virtual video cameras are available and positioned at different viewpoints to capture video of subject 108 performing the action. For example, if no physical video camera is in front of subject 108, the system may generate and add multiple virtual video cameras (e.g., virtual video cameras 310, 312, etc.) at various different positions to capture subject 108 from different angles. The system may generate a virtually unlimited number of virtual cameras, and then capture video footage of subject 108 performing different actions at different times and in different locations and positions in activity area 110. Further example embodiments directed to generating a 3D scene based on the 3D model are described in more detail below.
At block 208, the system generates a 2D dataset based on the 3D scene. In various embodiments, the system generates the 2D dataset with arbitrary points of view from multiple-angle 2D videos. As described in more detail below, the system combines existing 3D reconstruction techniques and computer vision techniques to generate the 2D dataset. Further example embodiments directed to generating a 2D dataset based on the 3D scene are described in more detail below.
Although the steps, operations, or computations may be presented in a specific order, the order may be changed in particular implementations. Other orderings of the steps are possible, depending on the particular implementation. In some particular implementations, multiple steps shown as sequential in this specification may be performed at the same time. Also, some implementations may not have all of the steps shown and/or may have other steps instead of, or in addition to, those shown herein.
At block 404, the system generates a 3D computer graphics (CG) model based on the 2D videos. In various embodiments, the 3D model may be stored on a main server or in the cloud, depending upon the particular implementation. In various embodiments, the system determines one or more model data modifications to the 3D model data of the 3D model. For example, in various embodiments, the system determines modifications to the 3D model based on scene settings. Example scene settings may include custom backgrounds, 3D objects, filters, changes to camera parameters (e.g., position, angle, or movement of a camera), etc.
The system then applies the one or more model data modifications to the 3D model data. In various embodiments, the 3D CG generation process may include 3D scene/object reconstruction from 2D images/videos. The 3D information may be described in 3D OBJ format or point cloud format. The pipeline may use available 3D reconstruction techniques (e.g., structure from motion, etc.).
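By way of illustration, the following Python sketch shows one way 3D points might be recovered from matched 2D keypoints in two synchronized, calibrated views and written out as a simple OBJ-style vertex list. It assumes the projection matrices and point correspondences are already available (a full pipeline would obtain them via structure from motion or a similar technique) and uses OpenCV's triangulation routine as a stand-in for whichever reconstruction technique a given implementation employs.

```python
# Minimal sketch: recover 3D points from matched 2D keypoints in two
# synchronized, calibrated views, then save them as a simple point cloud.
# P1 and P2 are 3x4 projection matrices; pts1/pts2 are matched pixel
# coordinates. These inputs are assumed to exist for this example.
import numpy as np
import cv2

def triangulate_points(P1, P2, pts1, pts2):
    """pts1, pts2: (N, 2) pixel coordinates of the same N points in each view."""
    pts1_t = pts1.T.astype(np.float64)  # OpenCV expects 2xN arrays
    pts2_t = pts2.T.astype(np.float64)
    points_4d = cv2.triangulatePoints(P1, P2, pts1_t, pts2_t)  # 4xN homogeneous
    return (points_4d[:3] / points_4d[3]).T                    # Nx3 Euclidean

def save_point_cloud_obj(points_3d, path):
    """Write the points as an OBJ-style vertex list."""
    with open(path, "w") as f:
        for x, y, z in points_3d:
            f.write(f"v {x} {y} {z}\n")

if __name__ == "__main__":
    # Toy projection matrices and correspondences for demonstration only.
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
    pts1 = np.array([[320.0, 240.0], [330.0, 250.0]])
    pts2 = np.array([[310.0, 240.0], [320.0, 250.0]])
    save_point_cloud_obj(triangulate_points(P1, P2, pts1, pts2), "subject_points.obj")
```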
At block 406, the system generates a 3D scene based on the 3D model, which may include 3D movie edits. As described above in connection with
In various embodiments, the system determines one or more scene settings (e.g., scene settings 408). In various embodiments, the system generates a 3D scene based on one or more of the scene settings. Example scene settings may include virtual camera settings (e.g., number of cameras, positions, angles, etc.), 3D background information, and other 3D models. Each scene setting describes how to modify or edit the 3D scene for the dataset. For example, a user may want to change or customize a background and/or remove unnecessary objects from the 3D scene. In various embodiments, a scene setting may include custom backgrounds, 3D objects, and filters to modify 3D scenes. In various embodiments, a scene setting may include the number of virtual cameras, identifiers for each virtual camera, and changes to camera parameters such as position, angle, and/or movement. In various embodiments, a scene setting may include camera settings (e.g., aperture, zooming, etc.).
In various embodiments, the system generates one or more virtual cameras. The system then adds the one or more virtual cameras to the 3D scene based on the scene settings. As described herein, the system uses the combination of physical video cameras and virtual video cameras to capture videos of a given subject from multiple different angles.
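As one hypothetical illustration of this step, the sketch below derives virtual camera poses from scene settings by placing cameras on a circle around the subject and orienting each toward the subject. The settings keys, the VirtualCamera structure, and the placement strategy are illustrative assumptions rather than a required design.

```python
# Hypothetical sketch: derive virtual camera poses from scene settings.
# Cameras are placed on a circle around the subject and aimed at it.
from dataclasses import dataclass
import numpy as np

@dataclass
class VirtualCamera:
    cam_id: str
    position: np.ndarray   # 3D position in the scene
    rotation: np.ndarray   # 3x3 world-to-camera rotation (look-at)

def look_at(position, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a rotation whose -z axis points from position toward target."""
    forward = target - position
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward])  # rows are the camera axes

def generate_virtual_cameras(scene_settings, subject_center):
    cameras = []
    n = scene_settings.get("num_virtual_cameras", 4)
    radius = scene_settings.get("camera_radius", 3.0)
    height = scene_settings.get("camera_height", 1.5)
    for i in range(n):
        angle = 2.0 * np.pi * i / n
        pos = subject_center + np.array(
            [radius * np.cos(angle), height, radius * np.sin(angle)])
        cameras.append(VirtualCamera(cam_id=f"virtual_{i}",
                                     position=pos,
                                     rotation=look_at(pos, subject_center)))
    return cameras

# Example usage with illustrative settings.
settings = {"num_virtual_cameras": 7, "camera_radius": 3.0, "camera_height": 1.5}
virtual_cams = generate_virtual_cameras(settings, subject_center=np.zeros(3))
```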
In various embodiments, the system generates the 3D scene and provides a 3D movie edit process to fix errors and to modify scenes and/or objects (e.g., add, remove, etc.) in the generated 3D CG data. In this process, the system also adds virtual cameras that are specified by the scene settings, as described herein. In various implementations, this process may be performed with any suitable 3D CG editor.
At block 410, the system generates a 2D video dataset based on the 3D scene. In various embodiments, the 2D dataset is a video dataset that includes 2D training data. The system may store the 2D dataset on a server or in the cloud, depending on the particular implementation.
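As a minimal illustration of how 2D frames might be produced from a virtual camera, the sketch below projects a reconstructed point cloud into an image using a simple pinhole model. An actual implementation would typically use a full 3D renderer (meshes, textures, backgrounds); the focal length, image size, and camera convention here are assumptions made only for the example.

```python
# Minimal sketch: rasterize a reconstructed point cloud into a 2D frame from
# a virtual camera using a pinhole model. Points in front of the camera lie
# along -z in the camera frame (look-at convention as in the earlier sketch).
import numpy as np

def render_point_cloud(points_3d, cam_position, cam_rotation,
                       focal=800.0, width=640, height=480):
    image = np.zeros((height, width), dtype=np.uint8)
    # Transform world points into the camera coordinate frame.
    cam_points = (cam_rotation @ (points_3d - cam_position).T).T
    for x, y, z in cam_points:
        if z >= -1e-6:
            continue  # skip points behind (or on) the camera plane
        u = int(focal * x / -z + width / 2)
        v = int(height / 2 - focal * y / -z)
        if 0 <= u < width and 0 <= v < height:
            image[v, u] = 255  # mark the projected point
    return image
```

Repeating such a projection for each frame and each virtual camera yields the 2D videos of the dataset; the same camera parameters can then be reused to transform the associated annotations.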
In various embodiments, the system obtains one or more annotations associated with the 2D videos, which are to be included in or applied to the 2D dataset. Such annotations may include various descriptions of subjects including objects in a video (e.g., labels of each subject and/or object, etc.). The particular types of annotations and content in the annotations may vary, and will depend on the particular implementation. The system applies the one or more annotations to the 2D dataset.
In various embodiments, the 2D dataset is based on annotations 412 and dataset settings 414. The 2D dataset may include generated 2D videos and generated annotations. The system utilizes dataset settings 414 for outputting one or more dataset configurations. A dataset configuration may include dataset file formats, filters, etc., for example.
Annotations 412 may include metadata that describes various aspects of synchronized 2D videos. For example, annotations may include actions, object types, object locations, time segments, etc. In some embodiments, an annotation may be created outside of the system. For example, an annotation may describe who or what is in the synchronized 2D videos. In another example, an annotation may describe human actions or object locations.
The system matches generated 2D videos and generated annotations with virtual camera settings for 2D video dataset generation. For example, if an input annotation contains the location (x, y) of an object/subject, the generated annotations may have coordinates transformed based on the virtual camera settings.
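The following sketch illustrates one way such a coordinate transformation might be performed, assuming the reconstruction step provides a 3D location for the annotated subject and using the same pinhole projection as the rendering sketch above. The annotation field names, focal length, and image size are hypothetical.

```python
# Hypothetical sketch: transform an annotation's 3D subject location into
# pixel coordinates of a virtual camera so that the generated annotation
# matches the generated 2D video. Field names are illustrative only.
import numpy as np

def transform_annotation(annotation, cam_position, cam_rotation,
                         focal=800.0, width=640, height=480):
    p_world = np.asarray(annotation["location_3d"], dtype=float)
    p_cam = cam_rotation @ (p_world - cam_position)  # into the camera frame
    if p_cam[2] >= 0:
        return None                                  # subject is behind this camera
    u = focal * p_cam[0] / -p_cam[2] + width / 2
    v = height / 2 - focal * p_cam[1] / -p_cam[2]
    return {**annotation, "location_2d": (u, v)}

# Example usage with an illustrative annotation record.
annotation = {"subject": "Person 2", "action": "picking up object",
              "location_3d": [0.5, 1.0, -4.0]}
transformed = transform_annotation(annotation, np.zeros(3), np.eye(3))
```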
In various embodiments, dataset settings 414 may specify one or more output dataset configurations. For example, users can specify dataset file format or filters to narrow down data. Example dataset settings may include the dataset file format, the output quality, etc.
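As a purely illustrative example, dataset settings might be expressed as a simple configuration structure such as the following; the keys and values are hypothetical placeholders rather than a required schema.

```python
# Illustrative dataset settings; keys and values are hypothetical placeholders.
dataset_settings = {
    "file_format": "mp4",          # output video container
    "annotation_format": "json",   # format of the generated annotation files
    "output_quality": "1080p",     # output resolution/quality
    "filters": {
        "actions": ["picking up object", "walking"],  # keep only these actions
        "min_segment_frames": 10,                     # drop very short segments
    },
}
```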
At block 416, the system outputs the 2D video dataset. In various embodiments, the 2D video dataset may include 2D videos generated from the virtual cameras, as well as annotation files transformed from the input annotation files to correspond to the generated 2D videos. For example, if the input annotation data includes subject positions in the input 2D videos, the 2D dataset has subject positions transformed to correspond to the virtual camera settings (e.g., position, angle, etc.).
In some embodiments, in order to train a deep learning model, the system processes both raw data and annotations. For example, there may be a 2D image of a person picking up an object. There may also be one or more annotations (e.g., “picking up object,” etc.). The system may use pairings (subject, action) to train a deep learning model. The deep learning model may be subsequently used with an input of a 2D image and may output “picking up object,” presuming the 2D image is of someone picking up an object.
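As a sketch of such training, the following code fine-tunes a stock image classifier on (2D frame, action label) pairs. It assumes PyTorch and torchvision are available and that frames are organized on disk by action label; the directory layout, label set, and hyperparameters are placeholders, not a prescribed training recipe.

```python
# Minimal sketch: fine-tune an image classifier on (2D frame, action label)
# pairs from the generated dataset. Assumes frames are organized as
# dataset_root/<action_label>/<frame>.jpg (an illustrative layout).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_data = datasets.ImageFolder("dataset_root", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```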
Although the steps, operations, or computations may be presented in a specific order, the order may be changed in particular implementations. Other orderings of the steps are possible, depending on the particular implementation. In some particular implementations, multiple steps shown as sequential in this specification may be performed at the same time. Also, some implementations may not have all of the steps shown and/or may have other steps instead of, or in addition to, those shown herein.
At block 508, the system applies a blender or video editor. In various embodiments, the system may apply the blender based on annotations 510 (e.g., labels, locations including object IDs, time segments, etc.), which are input to an annotation plugin 512. The system may also apply the blender based on scene settings 514. Example scene settings may include virtual camera settings (e.g., number of cameras, positions, angles, etc.), background, other 3D models, etc. At block 516, the system applies a blender plugin. At block 518, the system outputs the 2D video dataset.
Although the steps, operations, or computations may be presented in a specific order, the order may be changed in particular implementations. Other orderings of the steps are possible, depending on the particular implementation. In some particular implementations, multiple steps shown as sequential in this specification may be performed at the same time. Also, some implementations may not have all of the steps shown and/or may have other steps instead of, or in addition to, those shown herein.
At block 604, the system displays one or more portions of the at least one video in a user interface. Such portions may include subjects and/or objects that appear in the video. As described in more detail herein, the system tracks objects and actions in a video for reviewing and adding annotations, including time-based metadata.
In various embodiments, the system parses the at least one video into segments. In various embodiments, the system identifies one or more segments for each of the objects, and then associates each of the one or more segments with each of the corresponding objects. In various embodiments, the system also identifies one or more segments for each of the actions, and then associates each of the one or more segments with each of the corresponding actions. For example, in various embodiments, the system enables each object, including Person 1, Person 2, and Object 1, to be delineated from other objects. In some embodiments, the system may enable a user to place a bounding box around each object. In some embodiments, the system may automatically place a bounding box around each object, without user intervention, using any suitable object recognition techniques. In various embodiments, the system may use the bounding box to segment and group segments of particular objects, actions, etc.
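As one example of placing bounding boxes automatically, the sketch below uses OpenCV's stock HOG-based pedestrian detector as a stand-in for "any suitable object recognition technique" and records the detected boxes per frame; the detector choice and its parameters are assumptions made for illustration only.

```python
# Sketch: automatically place bounding boxes around people in each frame
# using OpenCV's built-in HOG pedestrian detector, and record the boxes
# per frame so segments can later be grouped per object.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(video_path):
    """Return {frame_index: [(x, y, w, h), ...]} for detected people."""
    capture = cv2.VideoCapture(video_path)
    boxes_per_frame = {}
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        boxes, _weights = hog.detectMultiScale(frame, winStride=(8, 8))
        boxes_per_frame[frame_index] = [tuple(box) for box in boxes]
        frame_index += 1
    capture.release()
    return boxes_per_frame
```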
In various embodiments, the system provides various controls (e.g., play, reverse, forward, etc.), which enable a user to navigate to different frames of the video. In some embodiments, the system may also display the particular frame (e.g., frame 35, etc.) and the number of frames in the video (e.g., 1,000 frames, etc.).
At block 606, the system displays annotation tracks in the user interface. In various embodiments, each annotation track of the annotation tracks is associated with one or more of the at least one object and the at least one action in the at least one video. Referring still to
In various embodiments, the system may generate and associate multiple annotation tracks with a given object. For example, as shown, Track 2 is associated with Person 2 performing Action A (e.g., Person 2 standing in front of Object 1, etc.). In another example, Track 3 is associated with Person 2 performing Action B (e.g., Person 2 picking up Object 1, etc.). The system may also enable a user to add or remove annotation tracks from section 704. The system may also enable a user to play, review, and add/remove/modify annotation tracks and annotations.
Also shown is a seek bar that enables a user to navigate to any given frame of an annotation track. For example, as shown, the seek bar is placed at Tracks 1, 2, and 4 at moments in the video where frames show objects. For example, the seek bar is placed at a location in the video timeline corresponding to Track 1, Person 1, Action A, which takes up 3 frames as indicated. The seek bar is also placed at a location in the video timeline corresponding to Track 2, Person 2, Action A, which takes up 10 frames as indicated. The seek bar is also placed at a location in the video timeline corresponding to Track 4, Object 1, which takes up 100 frames as indicated.
In various embodiments, the system may indicate how many annotation tracks exist on an entire, given video. As indicated herein, each annotation track shows annotations (metadata) describing objects including subjects, and associated actions. Each annotation track also shows such objects and the frames at which the metadata starts and ends. For an object that is a person, such annotations may include human actions such as “waving hands,” “walking,” etc., depending on the particular implementation. For an object that is inanimate, such annotations may include object status such as “turned-on,” “used,” “exists,” etc., depending on the particular implementation.
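One hypothetical way to represent such annotation tracks is sketched below, tying each track to an object label, an optional action label, and the start and end frames of the metadata. The field names, example values, and JSON layout are illustrative only.

```python
# Hypothetical representation of annotation tracks: each track ties an object
# (and optionally an action) to the frame range over which the metadata applies.
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class AnnotationTrack:
    track_id: int
    object_label: str            # e.g., "Person 1", "Object 1"
    action_label: Optional[str]  # e.g., "waving hands"; None for object status only
    start_frame: int             # frame at which the metadata starts
    end_frame: int               # frame at which the metadata ends

tracks = [
    AnnotationTrack(1, "Person 1", "Action A", start_frame=35, end_frame=37),
    AnnotationTrack(2, "Person 2", "Action A", start_frame=30, end_frame=39),
    AnnotationTrack(4, "Object 1", None, start_frame=0, end_frame=99),
]

with open("annotations.json", "w") as f:
    json.dump([asdict(t) for t in tracks], f, indent=2)
```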
The annotation tracks help in creating training data that may be used later for automatic detection of particular objects. For example, if the subject is waving his or her hands and is walking, the system enables the addition of both metadata in connection with hand waving and metadata in connection with walking. The system enables the metadata to be manipulated in user interface 700 using any suitable input device (e.g., touchpad, pointing device, keyboard, etc.). For example, a tool may support dragging and dropping a segment to modify its start and/or end locations.
In various embodiments, the system enables a user to selectively annotate one or more of the at least one object and the at least one action in the at least one video based on at least one corresponding annotation track. In various embodiments, the system enables a user to add annotations including metadata about each object. In various embodiments, while the system enables a user to annotate one or more of multiple videos of the same object to provide various annotations, the system associates the annotations with each object. In some embodiments, the system may automatically add and associate some annotations with each object without user intervention. For example, the system may determine and indicate a particular frame number or frame numbers that correspond to moments in the video where a particular object (e.g., Person 2, etc.) is performing a particular action (e.g., Action A, etc.) involving another object (e.g., Object 1).
In various embodiments, the system generates training data from the at least one video and the one or more annotations. In various embodiments, the one or more annotations include one or more of object information, localization information, and action information.
In some embodiments, annotations may include whether a particular object (e.g., Person 2) is a main target subject to be observed and tracked. Annotations may also include whether a particular object is being acted upon (e.g., Object 1). For example, the system may track Person 2 walking over to Object 1, picking up Object 1, and then handing Object 1 to Person 1 or placing Object 1 on a surface such as a table (not shown).
Embodiments reduce user effort in annotating time-based data (e.g., video, audio, etc.) with metadata. Embodiments utilize a track-based user interface, also referred to as an annotation track user interface (UI). The user interface facilitates a user in reviewing and annotating videos, which may include human actions and object status. Embodiments may be applied to editing annotations in virtual reality environments.
In various embodiments, the system may utilize machine learning in facilitating a user in reviewing a dataset and its annotations. In various embodiments, in addition to the various examples of annotations provided herein, annotations may also include time-based metadata (e.g., time stamps, frame numbers, beginning frames, ending frames associated with or linked to particular subjects and actions, etc.).
At block 608, the system obtains one or more annotations associated with the at least one video based on the annotation tracks. The system may store the annotations locally at the system, at the client device, or at another suitable and accessible storage location.
Although the steps, operations, or computations may be presented in a specific order, the order may be changed in particular implementations. Other orderings of the steps are possible, depending on the particular implementation. In some particular implementations, multiple steps shown as sequential in this specification may be performed at the same time. Also, some implementations may not have all of the steps shown and/or may have other steps instead of, or in addition to, those shown herein.
For ease of illustration,
While server device 804 of system 802 performs embodiments described herein, in other embodiments, any suitable component or combination of components associated with system 802 or any suitable processor or processors associated with system 802 may facilitate performing the embodiments described herein.
In the various embodiments described herein, a processor of system 802 and/or a processor of any client device 810, 820, 830, and 840 cause the elements described herein (e.g., information, etc.) to be displayed in a user interface on one or more display screens.
Computer system 900 also includes a software application 910, which may be stored on memory 906 or on any other suitable storage location or computer-readable medium. Software application 910 provides instructions that enable processor 902 to perform the implementations described herein and other functions. Software application 910 may also include an engine such as a network engine for performing various functions associated with one or more networks and network communications. The components of computer system 900 may be implemented by one or more processors or any combination of hardware devices, as well as any combination of hardware, software, firmware, etc.
For ease of illustration,
Although the description has been described with respect to particular embodiments thereof, these particular embodiments are merely illustrative, and not restrictive. Concepts illustrated in the examples may be applied to other examples and implementations.
In various implementations, software is encoded in one or more non-transitory computer-readable media for execution by one or more processors. The software when executed by one or more processors is operable to perform the implementations described herein and other functions.
Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java, assembly language, etc. Different programming techniques can be employed such as procedural or object-oriented. The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
Particular embodiments may be implemented in a non-transitory computer-readable storage medium (also referred to as a machine-readable storage medium) for use by or in connection with the instruction execution system, apparatus, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic when executed by one or more processors is operable to perform the implementations described herein and other functions. For example, a tangible medium such as a hardware storage device can be used to store the control logic, which can include executable instructions.
Particular embodiments may be implemented by using a programmable general purpose digital computer, and/or by using application specific integrated circuits, programmable logic devices, field programmable gate arrays, optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms. In general, the functions of particular embodiments can be achieved by any means as is known in the art. Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
A “processor” may include any suitable hardware and/or software system, mechanism, or component that processes data, signals or other information. A processor may include a system with a general-purpose central processing unit, multiple processing units, dedicated circuitry for achieving functionality, or other systems. Processing need not be limited to a geographic location, or have temporal limitations. For example, a processor may perform its functions in “real-time,” “offline,” in a “batch mode,” etc. Portions of processing may be performed at different times and at different locations, by different (or the same) processing systems. A computer may be any processor in communication with a memory. The memory may be any suitable data storage, memory and/or non-transitory computer-readable storage medium, including electronic storage devices such as random-access memory (RAM), read-only memory (ROM), magnetic storage device (hard disk drive or the like), flash, optical storage device (CD, DVD or the like), magnetic or optical disk, or other tangible media suitable for storing instructions (e.g., program or software instructions) for execution by the processor. For example, a tangible medium such as a hardware storage device can be used to store the control logic, which can include executable instructions. The instructions can also be contained in, and provided as, an electronic signal, for example in the form of software as a service (SaaS) delivered from a server (e.g., a distributed system and/or a cloud computing system).
It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. It is also within the spirit and scope to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
As used in the description herein and throughout the claims that follow, “a”, “an”, and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
Thus, while particular embodiments have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular embodiments will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.
This application is related to U.S. patent application Ser. No. ______, entitled “PROVIDING A USER INTERFACE FOR VIDEO ANNOTATION TOOLS,” filed ______ (Attorney Docket No. 020699-115300US/Client Reference No. 201806027.01), which is hereby incorporated by reference as if set forth in full in this application for all purposes.