The present invention is directed to approaches for using a machine learning model trained using deterministically generated labeled data.
Development of computer vision models can be hindered by a lack of sufficient training data for teaching the model to correctly classify the content of images. For example, training data sets for computer vision models that classify images are typically assembled as collections of pre-existing images that are labeled by humans to indicate the items depicted in the images. However, under this approach, expanding the training data set is difficult, both because the available images may not resemble the types of images that must be classified in practice and because the model must learn to determine where the item is in the image using relatively imprecise labels; that is, the labels typically do not indicate exactly which parts of the image contain the labeled item and which parts contain other content. For example, a conventional approach for obtaining labeled image training data is to pay humans hired via Taskrabbit or Mechanical Turk to label images, or to obtain human-labeled images via CAPTCHA-based authentication services. Additionally, conventional approaches do not permit generating targeted training data as needed in response to current conditions at the location where the machine learning model is used to process images.
Moreover, if the amount, quality, or labeling of the training data is insufficient, the accuracy of the resulting machine learning model will be unsatisfactory. In some circumstances, synthetic training data may be generated in order to assemble a sufficient training data set and then used to train the model. However, such a model may not generalize well to identifying the same content in real images.
Accordingly, there is a need for approaches that address these problems, and the present application discloses embodiments that address aspects of this need.
Embodiments are described for methods, systems, and computer-readable media for training a machine-learning model to convert real-domain images to synthetic-appearing images, wherein the machine-learning model is associated with a mounted camera device at a location, the location associated with a scene type. A first set of real-domain training images associated with the scene type is received, and a second set of synthetic-domain training images also associated with the scene type is generated or received. The machine-learning model may then be trained, using the first and second sets of training images, to generate respective synthetic-appearing images based on respective sample real-domain images, wherein the respective synthetic-appearing images have visual characteristics that are more similar to the visual characteristics of the synthetic-domain training images than to the visual characteristics of the real-domain training images.
Additional embodiments are described for methods, systems, and computer-readable media for using a machine-learning model to identify objects depicted in real-domain sample images, wherein the machine-learning model includes an object-recognition component and a real-to-synthetic image component, and wherein the machine-learning model is associated with a mounted camera device. One or more real-domain sample images may be generated by one or more image sensors of the mounted camera device, the one or more real-domain sample images depicting the view of the mounted camera device. At the mounted camera device, respective synthetic-appearing sample images may be generated by the real-to-synthetic image component based on the respective real-domain sample images. Next, at the mounted camera device, objects depicted in the synthetic-appearing sample images may be identified by the object-recognition component, wherein the object-recognition component was trained using a set of synthetic-domain image data. A report concerning the identified objects may then be prepared and provided.
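For illustration only, the on-device inference flow summarized above might be sketched as follows, assuming a PyTorch runtime and hypothetical helper names (capture_frames, real_to_synth, recognizer are assumptions made for this sketch rather than components defined by the embodiments):

    import torch

    def run_inference_cycle(camera, real_to_synth, recognizer):
        """Hypothetical sketch: convert captured real-domain frames to
        synthetic-appearing frames, identify objects in the converted frames,
        and return a report of the identified objects."""
        report = []
        with torch.no_grad():  # inference only; no gradients needed
            for frame in camera.capture_frames():        # real-domain sample images
                synthetic_like = real_to_synth(frame)    # real-to-synthetic image component
                detections = recognizer(synthetic_like)  # object-recognition component
                report.append({"objects": detections})
        return report

In practice, real_to_synth and recognizer would be the trained real-to-synthetic image component and object-recognition component loaded on the camera device.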
Embodiments of apparatuses, computer systems, computer-readable media, and methods for deploying systems for real-time image processing are described, including approaches for deterministically generating labeled data for training or validating machine learning models. For example, in certain embodiments, the approaches described herein may be used to generate targeted training data in real-time in response to conditions at the locations where the images awaiting inference (i.e., the “sample images” generated by image sensors at camera devices) are produced. Embodiments of the invention may be used to transform sample images or sample video into semantic meaning. In certain embodiments, audio data may additionally be incorporated into the determination of semantic meaning. For example, various scenarios may be imaged and, using the approaches described herein, the scenarios may be identified and responsive action (e.g., sending a notification containing a semantic description of the scenario) may be taken. For example, video of a possible terrorist leaving a possible explosive device in a train station may be identified and given a semantic description—e.g., a person placing a backpack at a particular location within the view of a camera. In another example, video of a car blocking a driveway may be converted to a semantic description—e.g., a specification of a range of time points associated with a type of vehicle positioned in front of the driveway and a second range of time points associated with a person exiting the vehicle. In another example, a count of water bottles in an image of people at a musical event may be obtained. In another example, events such as a car accident or a landslide may be inferred from a video stream of a roadway, leading to a responsive notification of the events. In another example, a system may prepare a semantic description including a count of customers entering and leaving a store, including how long each customer remained in the store and what each customer handled or gestured toward while inside the store.
In order for a system to convert sample image data into semantic descriptions of the sample image data, the system may first be trained to identify “targeted content,” that is, the content, circumstances, and events that the system is trained to identify and that may be represented by such semantic descriptions. As used herein, a “semantic description” is a specification concerning the meaning of the content depicted in the image data or an event involving the depicted content. Accordingly, in certain embodiments, the system is configured to generate image training data that depicts the targeted content or events that should be identifiable by the system. In particular, in certain embodiments, the image training data should depict a range of examples of the targeted content. For example, the examples may include variations in the context of the targeted content, such as depicting the targeted content in different types of weather if the sample images will be captured outdoors, depicting the targeted content at various orientations relative to the camera perspective, or depicting the targeted content in connection with prop items. In certain embodiments, certain variations in the context for the training data may be responsive to current or expected conditions at the location of the targeted content. For example, a deployed camera device may report the average brightness of the scene at its location; this average brightness may then be used to generate a set of image training data based on the reported brightness value, which may in turn be used to train or update the machine learning model used by the deployed camera device and accordingly improve identification of the targeted content at the current average brightness for the location. Conditions at the location of the targeted content may include, for example, weather (snow, rain, fog), brightness, and physical deformities in or changes to surrounding, largely static objects. For indoor settings, changes in conditions may include, for example, a retail store remodel, the introduction of holiday-specific decorations (Halloween, Christmas, and the like), or changes resulting from physical changes in the mounting location of the camera device.
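Purely as an illustrative sketch of how a reported condition might parameterize targeted training data generation (the field names and the build_generation_request helper below are assumptions, not elements of the disclosed system):

    def build_generation_request(location_id, avg_brightness, scene_type="outdoor"):
        """Hypothetical sketch: fold a camera-reported average brightness into
        the parameters used to generate fresh labeled training images."""
        brightness = max(0.0, min(1.0, avg_brightness))  # clamp to expected range
        return {
            "location": location_id,
            "scene_type": scene_type,
            "lighting": {"mean_brightness": brightness, "jitter": 0.1},
            "weather": ["clear", "rain", "fog"],  # contextual variations to cover
        }

    request = build_generation_request("station_platform_7", avg_brightness=0.32)
    # The request could then drive scene specification generation and retraining
    # of the local model used by that camera device.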
In certain embodiments, the image training data should depict examples of the targeted content as viewed from the expected perspective, and with the optical characteristics, of the device having image sensors used to capture the sample images (wherein the device may be a camera with one or more image sensors). For example, the image training data may depict content as viewed from the mounting height and particular perspective of each image sensor of the device. Additionally, the image training data may match the resolution and color profile of particular image sensors. These perspective and optical characteristics are discussed further below.
A scene specification outline (concerning targeted content) and a seed value may be provided as input 202 to prepare image training data for training a machine learning model to identify the targeted content in image data. The scene specification outline is a set of text commands defining a range of scenes, where certain of the scenes (1) include object(s) representing aspects of the targeted content (leading to positive examples of the targeted content) and certain of the scenes (2) do not include the object(s) representing the targeted content (leading to negative examples). The specified objects may be defined in terms of items in the asset database 208. In certain embodiments, the range of scenes is defined using a set of exemplar scenes. The scene definitions in the scene specification outline may be specified using a terse grammar. In certain embodiments, the range of scenes includes features, such as context-specific constraints, based on a camera device that will use the machine learning model to process sample data, including, for example, the scene topology (e.g., the types of object instances in the environment of the camera device), the mounting location and perspective of sensors of the camera device relative to the scene, and whether the camera device is moving or still.
In one example, a portion of an exemplar scene in a scene specification outline may include the following three text commands that define aspects of the scene:
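Purely as a hypothetical illustration (the grammar, object names, and parameters below are assumptions chosen for this example rather than the actual command syntax of any particular embodiment), such commands might resemble:

    scene platform_01 type=train_station lighting=indoor crowd=rush_hour
    actor person_01 age=20-40 position=near(platform_edge)
    prop backpack_01 color=black carried_by=person_01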
In this example, the backpack object may represent the targeted content (or an aspect of the targeted content, if, e.g., the targeted content is the event of a backpack being discarded by a person at a train station).
Objects may be defined to represent a broad variety of actors and props. For example, human objects may be specified as having a particular gender, age or age range, ethnicity, or articles of clothing associated with various colors; objects may additionally represent particular vehicles or accessories. Certain objects may be defined to be composed of other objects or to have complex labels for object components, such as definitions of the coordinates of human body joints, face positions, orientations, and expressions. For example, in order to train a machine learning model to identify a person wearing a backpack, the model may be trained using training data representing the person alone, the backpack alone, and the person wearing the backpack. Additionally, the granular portions of the training data (e.g., pixels) corresponding to the person and the backpack, respectively, may be specified.
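As a minimal, hypothetical sketch of how such a composite object definition might be represented (the field names and coordinate values are illustrative assumptions only):

    # Hypothetical sketch of a composite object: a person wearing a backpack,
    # with labeled sub-components and joint coordinates that can be carried
    # through to pixel-level labels in the rendered training images.
    person_with_backpack = {
        "id": "person_01",
        "attributes": {"age_range": (20, 40), "clothing": {"jacket": "blue"}},
        "components": [
            {"id": "backpack_01", "color": "black", "attached_to": "torso"},
        ],
        "joints": {  # approximate scene coordinates used for pose labels
            "head": (0.55, 1.71, 2.08),
            "left_shoulder": (0.41, 1.52, 2.10),
            "right_shoulder": (0.68, 1.53, 2.11),
        },
    }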
Objects may be defined using a library of environmental structures to serve as props or context, including weather, vegetation (e.g., trees, grasses, and shrubs, which may be placed as props to aid detection of a target object moving behind the prop object), and buildings. Robust use of prop objects and a thoughtful range of environments may aid in generating more realistic scenes, improving the machine learning model's capacity to identify target objects.
The scene specification outline and the seed value may be provided as input to a scene randomizer 204 (106). The scene randomizer generates an expanded set of scene specifications based on the scene specification outline and the seed value (108). Stated another way, a variety of scenes and associated objects may be procedurally created based on the scene specification outline. The scene randomizer populates the expanded set of scene specifications by generating different versions of the individual text commands, using the seed value (e.g., a number or string) to seed commands that generate semi-random output (e.g., commands drawn from a fuzzing library), which in turn parameterizes the different versions of the individual text commands. The scene randomizer may be context-aware; that is, it may generate versions of the individual text commands in which the range of versions depends on aspects of the scene, such that the type of variation generated is appropriate or plausible. The scene context may be maintained by the randomizer, which can allow plugins (e.g., small Python scripts loaded at runtime) to model attributes such as gravity and other physics, local weather, time of day, and the like. The plugins may implement functions that semi-randomly generate plausible positions, textures, rotations, and scales for various objects in the asset database. Plausible variations for scenes may be modeled using climate engines, physics engines, and the like. For example, if the scene is indoors, the scene randomizer may generate indoor props rather than outdoor props. If the scene is an outdoor rain scene, the scene randomizer may generate different types of rain and limit lighting to the lower light levels appropriate for a rain scene. In certain embodiments, the semi-random output may be, for example, numbers drawn from a distribution anchored by the parameters in the scene specification outline commands, such as a normal distribution having a mean set by a parameter from a scene specification outline command. In certain embodiments, the semi-random output is seeded by the seed value or a derivative seed value based on the seed value, and accordingly the same output is generated each time the same seed value is used. Stated another way, in certain embodiments, the seed value is used to deterministically produce the same text when operated on by a fuzzing library; if the seed is changed, new varieties of the same type of labeled data will be generated.
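A minimal sketch of this seeded, deterministic expansion, with Python's standard random module standing in for the fuzzing library (the command format and attribute names are illustrative assumptions):

    import random

    def expand_scene_outline(outline_commands, seed, versions_per_command=5):
        """Deterministically expand a scene specification outline into many
        concrete scene specifications; the same seed always yields the same
        expansion, so training data can be re-generated rather than stored."""
        rng = random.Random(seed)  # seeded: identical output for an identical seed
        expanded = []
        for command in outline_commands:
            for _ in range(versions_per_command):
                # Semi-random parameters anchored by the outline, e.g. a normal
                # distribution around a nominal brightness of 0.5.
                brightness = min(1.0, max(0.0, rng.gauss(0.5, 0.15)))
                rotation_deg = rng.uniform(0.0, 360.0)
                expanded.append(f"{command} brightness={brightness:.2f} "
                                f"rotation={rotation_deg:.1f}")
        return expanded

    # Identical seeds produce identical scene specifications:
    assert (expand_scene_outline(["actor person_01"], seed=42)
            == expand_scene_outline(["actor person_01"], seed=42))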
The series of scene specifications generated by the scene randomizer may be provided to one or more renderers 206 in order to generate a set of images corresponding to each scene specification (110). The rendered images may be based on the perspective and optical characteristics of each particular image sensor of the camera device that will be used to generate the sample images, as specified in the scene specifications. Each set of images collectively represents a single “snapshot” of the scene from the perspective of each image sensor, and accordingly each image of the set is associated with the same hypothetical time point in the scene. In certain embodiments, each image of the set is generated according to a separate scene specification. The optical characteristics may include, for example, the sensor's resolution, color detection profile, position relative to the other sensors of the camera device, lens properties (such as a wide-angle lens versus a regular lens), type of light detected (infrared, visible, etc.), focal length, aperture, and the like. For example, if the camera device generates four 4K images using its four image sensors, the set of images generated by the renderer may be four 4K images. The renderer may additionally use the assets from the asset database, as specified in the scene specifications, to render the set of images. In certain embodiments, the series of scene specifications may be apportioned among multiple renderers (e.g., a number N of renderers 206), such that rendering of the images may be executed in parallel. Each set of rendered images based on a single scene specification may be packaged into an object-labeled training bundle. The object-labeled training bundle includes the set of rendered images and a label indicating the presence or absence in the rendered scene of an object corresponding to the targeted content. The object-labeled training bundle may additionally specify the pixels in the set of rendered images that represent the object corresponding to the targeted content, and/or other metadata, such as a description of lighting conditions, the existence or location of prop items in the images, a time point if the object-labeled training bundle is a member of a time series, and the like. In certain embodiments, a scene specification outline may be used to define a series of moving objects representing targeted content that constitutes an event, and such an event may be represented in the image training data as a time series of object-labeled training bundles.
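For illustration, an object-labeled training bundle might be represented roughly as follows (a sketch only; the class and field names are assumptions rather than a format defined by the embodiments):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ObjectLabeledTrainingBundle:
        """One 'snapshot' of a scene: images rendered from each image sensor's
        perspective, plus a label and optional metadata."""
        images: list                            # one rendered image per image sensor
        label: bool                             # whether the targeted object is present
        object_pixels: Optional[list] = None    # per-image pixel masks for the object
        metadata: dict = field(default_factory=dict)  # lighting, props, time point, ...

    bundle = ObjectLabeledTrainingBundle(
        images=["sensor0.png", "sensor1.png", "sensor2.png", "sensor3.png"],
        label=True,
        metadata={"lighting": "overcast", "time_point": 17},
    )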
In certain embodiments, the renderer 206 uses a gaming engine, such as the Unreal Engine, Unity, Godot, or CryEngine, to render the scene specifications.
A fleet manager 204 may then stream the object-labeled training bundles, as they are generated, to one or more training instances 212 (112). In certain embodiments, there may be multiple training instances (e.g., a number M of training instances). Each training instance 212 may be, for example, a server, a virtual machine, or a cloud service container hosting a machine learning model to be trained, such as a convolutional neural network model including the associated weights. In certain embodiments, prior to training the machine learning model with a set of received object-labeled training bundles, the training instance 212 may initialize a new machine learning model, or the training instance may load a checkpoint from a previously trained model (e.g., a checkpoint may contain or identify a set of weights and biases learned by a neural network having the same structure as the neural network to be trained by the training instance). In certain embodiments, the fleet manager 204 may collect the object-labeled training bundles and dispatch them to a single training instance when a set number of bundles has been collected.
The training instance may train or update the machine learning model using each of the received object-labeled training bundles, such that the machine learning model is optimized to associate each bundle's image set with its appropriate label (114). In certain embodiments, the object-labeled training bundles are not retained after training by any component of machine learning training system 200, as the bundles can be re-generated as needed using the tersely defined scene specification outline and the seed value. This provides the advantage of permitting the use of large or high-resolution images for training the machine learning model, because there is no need to allocate a large storage space to maintain the training data in case it must later be adjusted or revisited, for example, in order to retrain a machine learning model or to determine why a particular machine learning model generated unexpected results when trained with that data.
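A compact sketch of this train-and-discard pattern is given below, assuming hypothetical render_bundles and train_on_bundle helpers (assumptions for illustration, not components defined by the embodiments):

    def train_without_retention(model, outline, seed, render_bundles, train_on_bundle):
        """Stream freshly rendered bundles into training and discard them.
        Because rendering is deterministic in (outline, seed), identical
        bundles can be regenerated later instead of being stored."""
        for bundle in render_bundles(outline, seed):  # deterministic bundle generator
            train_on_bundle(model, bundle)            # update weights on this bundle
            del bundle                                # nothing is retained afterward
        return model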
Camera device 300 may include one or more camera device processors 304. In certain embodiments, any of processors 304 may be a special-purpose processor for computing neural network inference calculations. In certain embodiments, processor 304 is a general-purpose processor. Processor 304 may be in communication with image sensors 302, a communication module 306, other sensors 308, a storage component 310, and a power system and/or battery 312. The power system/battery 312 may be in communication with one or more port(s) 314.
Camera device 300 may include one or more other sensors 308, such as a temperature sensor for monitoring thermal load or ambient temperature, an accelerometer, a microphone, or the like. Communication module 306 may include a cellular radio, Bluetooth radio, ZigBee radio, Near Field Communication (NFC) radio, wireless local area network (WLAN) radio, a subscriber identity module (SIM) card, GPS receiver, and antennas used by each for communicating data over various networks such as a telecommunications network or wireless local area network. Storage 310 may include one or more types of computer-readable media, such as RAM, optical storage devices, or flash memory, and may store an operating system, applications, communication procedures, and a machine-learning model for inference based on the data generated by image sensors 302 (e.g., a local machine-learning model). The power system/battery 312 may include a power management system, one or more power sources such as a battery and recharging system, AC, DC, a power status indicator, and the like. In certain embodiments, the components of camera device 300 may be enclosed in a single housing 316.
In certain embodiments, an updated neural network model may be provided to camera devices 300 on a scheduled basis. For example, if a camera device 300 uses a neural network model that is trained to count children, and the monitoring area 404 contains a large number of trick-or-treaters each Halloween, a specially trained neural network model trained to recognize children in costumes may be automatically provided to the camera device 300 to replace the ordinary local neural network model for the duration of Halloween.
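A trivial sketch of such calendar-driven model selection (the model names and date window below are illustrative assumptions, not values specified by the embodiments):

    import datetime

    def select_model_for(date):
        """Hypothetical sketch: choose which locally deployed model a camera
        device should use based on the calendar."""
        halloween_window = (datetime.date(date.year, 10, 24),
                            datetime.date(date.year, 11, 1))
        if halloween_window[0] <= date <= halloween_window[1]:
            return "child_counter_costumes"   # costume-aware model
        return "child_counter_standard"       # ordinary local model

    print(select_model_for(datetime.date(2018, 10, 31)))  # costume-aware model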
System 800 includes a bus 2506 or other communication mechanism for communicating information, and one or more processors 2504 coupled with the bus 2506 for processing information. Computer system 800 also includes a main memory 2502, such as a random access memory or other dynamic storage device, coupled to the bus 2506 for storing information and instructions to be executed by processor 2504. Main memory 2502 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 2504.
System 800 may include a read only memory 2508 or other static storage device coupled to the bus 2506 for storing static information and instructions for the processor 2504. A storage device 2510, which may be one or more of a hard disk, flash memory-based storage medium, magnetic tape or other magnetic storage medium, a compact disc (CD)-ROM, a digital versatile disk (DVD)-ROM, or other optical storage medium, or any other storage medium from which processor 2504 can read, is provided and coupled to the bus 2506 for storing information and instructions (e.g., operating systems, applications programs and the like).
Computer system 800 may be coupled via the bus 2506 to a display 2512 for displaying information to a computer user. An input device such as keyboard 2514, mouse 2516, or other input devices 2518 may be coupled to the bus 2506 for communicating information and command selections to the processor 2504. Communications/network components 2520 may include a network adapter (e.g., Ethernet card), cellular radio, Bluetooth radio, NFC radio, GPS receiver, and antennas used by each for communicating data over various networks, such as a telecommunications network or LAN.
The processes referred to herein may be implemented by processor 2504 executing appropriate sequences of computer-readable instructions contained in main memory 2502. Such instructions may be read into main memory 2502 from another computer-readable medium, such as storage device 2510, and execution of the sequences of instructions contained in the main memory 2502 causes the processor 2504 to perform the associated actions. In alternative embodiments, hard-wired circuitry or firmware-controlled processing units (e.g., field programmable gate arrays) may be used in place of or in combination with processor 2504 and its associated computer software instructions to implement the invention. The computer-readable instructions may be rendered in any computer language including, without limitation, Python, Objective C, C#, C/C++, Java, Javascript, assembly language, markup languages (e.g., HTML, XML), and the like. In general, all of the aforementioned terms are meant to encompass any series of logical steps performed in a sequence to accomplish a given purpose, which is the hallmark of any computer-executable application. Unless specifically stated otherwise, it should be appreciated that throughout the description of the present invention, use of terms such as “processing”, “computing”, “calculating”, “determining”, “displaying”, “receiving”, “transmitting” or the like, refer to the action and processes of an appropriately programmed computer system, such as computer system 800 or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within its registers and memories into other data similarly represented as physical quantities within its memories or registers or other such information storage, transmission or display devices.
A challenge arising from the practice of training machine learning models to recognize objects in images using synthetic-domain image data is that models trained on synthetic data may not generalize well to detecting the same objects depicted in real-domain image data. One possible approach for addressing this issue is to (1) generate synthetic-domain image data for training a given machine-learning model, (2) convert the synthetic-domain images 902 to real-appearing images 904 (e.g., process 900), for example by applying a content-dependent noise model to the image data (e.g., by using a Generative Adversarial Network (GAN) algorithm) or by hallucinating plausible artifacts into the synthetic-domain image data, and (3) train the machine learning model using the real-appearing image data, such that the machine-learning model will then perform well on real-domain sample images. However, converting synthetic-domain image data to real-appearing image data is difficult, as, for example, hallucinating details into detail-poor synthetic-domain images is a challenging task. An alternative approach is to instead rely on converting real-domain images 906 to synthetic-appearing images 908 (e.g., process 905). One example of the alternative approach is to (1) generate synthetic-domain image data for training the machine learning model, (2) train the machine-learning model using the synthetic-domain image data, and (3) convert real-domain sample images to synthetic-appearing images 908 (process 905) prior to (4) using the trained machine-learning model to infer the contents of the now-synthetic-appearing sample images. This alternative approach has the benefit that, as determined by the present inventors, converting real-domain images to synthetic-appearing images (905), by which detail is removed from the real-domain images, is an easier computational task than converting synthetic-domain images to real-appearing images (900). However, in general, domain transfer operations are computationally expensive (e.g., 10 to 50 GigaOps per image frame). For this reason, known image domain transfer implementations have typically relied on access to server or cloud computing both for training a model (where the model is, e.g., a GAN) and for domain transfer of sample images using a trained model. See, e.g., Isola et al., Image-to-Image Translation with Conditional Adversarial Networks, arXiv:1611.07004v2 (2017) and Zhu et al., Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks, arXiv:1703.10593v4 (2018).
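To put the quoted per-frame cost in perspective, a rough back-of-envelope calculation (assuming, purely for illustration, a 30 frame-per-second stream, which is not a figure taken from this disclosure):

    ops_per_frame = 50e9        # upper end of the quoted 10-50 GigaOps per frame
    frames_per_second = 30      # assumed frame rate for illustration
    print(ops_per_frame * frames_per_second / 1e12, "TeraOps/s")  # -> 1.5 TeraOps/s

A sustained load on that order helps explain why such implementations have depended on server or cloud resources rather than on edge hardware alone.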
The machine-learning model may then be trained using the first and second sets of training images (1006). For example, where the machine-learning model is a cycle GAN, generator networks may be trained to learn a mapping from the first set (real) to the second set (synthetic) and the reverse mapping, and accordingly to generate second-set-appearing images based on first-set images and first-set-appearing images based on second-set images, in combination with adversarial discriminator neural networks that are trained to distinguish between first-set images and first-set-appearing images, and between second-set images and second-set-appearing images, respectively. In such an embodiment, the generators may be trained to generate, based on training images, images that look similar to images from the opposite domain, while the discriminators may be trained to distinguish between transferred images and training images (e.g., to distinguish between synthetic-appearing and synthetic images, and between real-appearing and real images). When such a machine-learning model is sufficiently trained, a real-to-synthetic generator component of the model will be capable of generating a synthetic-appearing image 908 that is structurally based on an input real-domain image 906. All or a component of the trained model may then be provided to an edge device, such as a mounted camera device 300 (1008). For example, the real-to-synthetic generator component of the model may be provided to a mounted camera device for converting real images obtained by an image sensor of the camera device to corresponding synthetic-appearing images 908. The corresponding synthetic-appearing images may then be used for subsequent inference by an object-recognition machine-learning model at the camera device.
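A condensed sketch of such a cycle-consistent training step is shown below, assuming a PyTorch implementation; the tiny network definitions are placeholders (a practical implementation would use deeper, e.g. ResNet- or U-Net-style, networks) and the loss weights are illustrative assumptions rather than values specified by the embodiments:

    import torch
    import torch.nn as nn

    def make_generator():
        return nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def make_discriminator():
        return nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                             nn.Conv2d(32, 1, 3, stride=2, padding=1))

    G_rs, G_sr = make_generator(), make_generator()        # real->synthetic, synthetic->real
    D_s, D_r = make_discriminator(), make_discriminator()  # synthetic-side, real-side critics
    adv_loss, cycle_loss = nn.MSELoss(), nn.L1Loss()
    opt_G = torch.optim.Adam(list(G_rs.parameters()) + list(G_sr.parameters()), lr=2e-4)
    opt_D = torch.optim.Adam(list(D_s.parameters()) + list(D_r.parameters()), lr=2e-4)

    def training_step(real_imgs, synth_imgs, lambda_cycle=10.0):
        """One cycle-GAN update on a batch of real-domain and synthetic-domain images."""
        # Generators: fool the discriminators and preserve cycle consistency.
        fake_synth = G_rs(real_imgs)      # synthetic-appearing versions of real images
        fake_real = G_sr(synth_imgs)      # real-appearing versions of synthetic images
        pred_fs, pred_fr = D_s(fake_synth), D_r(fake_real)
        loss_G = (adv_loss(pred_fs, torch.ones_like(pred_fs))
                  + adv_loss(pred_fr, torch.ones_like(pred_fr))
                  + lambda_cycle * cycle_loss(G_sr(fake_synth), real_imgs)
                  + lambda_cycle * cycle_loss(G_rs(fake_real), synth_imgs))
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
        # Discriminators: separate training images from transferred images.
        pred_s, pred_r = D_s(synth_imgs), D_r(real_imgs)
        pred_fs_d, pred_fr_d = D_s(fake_synth.detach()), D_r(fake_real.detach())
        loss_D = (adv_loss(pred_s, torch.ones_like(pred_s))
                  + adv_loss(pred_fs_d, torch.zeros_like(pred_fs_d))
                  + adv_loss(pred_r, torch.ones_like(pred_r))
                  + adv_loss(pred_fr_d, torch.zeros_like(pred_fr_d)))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
        return loss_G.item(), loss_D.item()

After training, only the real-to-synthetic generator (G_rs in this sketch) would need to be exported to the camera device.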
While the preferred embodiments have been shown and described, it will be understood that there is no intent to limit the invention by such disclosure; rather, the invention is intended to cover all modifications and alternate constructions falling within the spirit and scope of the invention.
This is a NONPROVISIONAL of, claims priority to, and incorporates by reference U.S. Provisional Application No. 62/642,578, filed 13 Mar. 2018, and U.S. Provisional Application No. 62/674,497, filed 21 May 2018.