This U.S. patent application claims priority under 35 U.S.C. § 119 to: India Application No. 202221029279, filed on May 20, 2022. The entire contents of the aforementioned application are incorporated herein by reference.
The disclosure herein generally relates to the field of computer vision, and, more particularly, to methods and systems for autonomous task composition of vision pipelines using an algorithm selection framework.
With the evolution of the field of computer vision, many applications involve sensory data and artificial intelligence/machine learning (AI/ML) techniques, and solving a computer vision task effectively and efficiently is necessary. One important part of solving a vision task is to create a vision pipeline, in which the correct sequence of preprocessing steps and algorithms that are most suitable for executing the vision task must be identified. Creating a vision pipeline for different datasets to solve a computer vision task is a complex and time-consuming process. Conventionally, vision pipelines have been developed based on human intelligence, relying on experience, trial and error, or template-based approaches. However, human expert-based design is slow and effort-intensive, since the search space for choosing suitable algorithms for achieving a particular vision task is large. Further, in a few conventional systems, the data available to construct a vision workflow belongs to a fixed distribution, but building systems with such a constraint may lead to failures when these systems are deployed in the real world due to various uncertainties. Further, the core components at the system level to enable vision workflow composition are missing in the conventional systems.
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method for autonomous task composition of vision pipelines using an algorithm selection framework is provided. The method comprises: receiving, via one or more hardware processors, (i) a plurality of input data pertaining to one or more domains of one or more enterprises and (ii) a descriptive query from a user as input, wherein the plurality of input data comprises one or more input parameters, one or more domain requirements and corresponding solutions, and wherein the descriptive query describes a goal task to be executed on the plurality of input data; identifying, via the one or more hardware processors, a vision pipeline for execution of the goal task by inputting the descriptive query and one or more state attribute levels corresponding to the plurality of input data to a symbolic planner, wherein the symbolic planner dynamically composes one or more subtasks associated with the goal task and constructs a Directed Acyclic Graph (DAG) for each of the one or more subtasks using a parser; identifying, via the one or more hardware processors, a set of algorithms from a plurality of algorithms that are suitable to be executed at one or more stages of the vision pipeline for execution of the goal task using a transformers and Reinforcement Learning (RL) based autonomous pipeline composition framework, wherein the transformers and Reinforcement Learning based autonomous pipeline composition framework comprises a set of RL policies that are interlinked and resemble every step in the vision pipeline, and wherein each RL policy comprises: (i) a task specific module comprising the plurality of algorithms that performs a specific subtask from the one or more subtasks associated with the goal task; (ii) an embedding module comprising one or more neural networks corresponding to each algorithm in the plurality of algorithms comprised in the task specific module, wherein each fully connected neural network of the embedding module is configured to map the output of each algorithm in the set of algorithms to a specific embedding output dimensionality; and (iii) a transformer module comprising a key network and a query network, wherein the key network converts an embedding output of each of the set of algorithms into a key vector and the query network receives an aggregated output of the embedding output of each of the set of algorithms to generate a global query vector; and dynamically configuring, via the one or more hardware processors, the vision pipeline for execution of one or more goal tasks in one or more environment and system configurations.
In another aspect, a system for autonomous task composition of vision pipelines using an algorithm selection framework is provided. The system comprises a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive (i) a plurality of input data pertaining to one or more domains of one or more enterprises and (ii) a descriptive query from a user as input, wherein the plurality of input data comprises one or more input parameters, one or more domain requirements and corresponding solutions, and wherein the descriptive query describes a goal task to be executed on the plurality of input data; identify a vision pipeline for execution of the goal task by inputting the descriptive query and one or more state attribute levels corresponding to the plurality of input data to a symbolic planner, wherein the symbolic planner dynamically composes one or more subtasks associated with the goal task and constructs a Directed Acyclic Graph (DAG) for each of the one or more subtasks using a parser; identify a set of algorithms from a plurality of algorithms that are suitable to be executed at one or more stages of the vision pipeline for execution of the goal task using a transformers and Reinforcement Learning (RL) based autonomous pipeline composition framework, wherein the transformers and Reinforcement Learning based autonomous pipeline composition framework comprises a set of RL policies that are interlinked and resemble every step in the vision pipeline, and wherein each RL policy comprises: (i) a task specific module comprising the plurality of algorithms that performs a specific subtask from the one or more subtasks associated with the goal task; (ii) an embedding module comprising one or more neural networks corresponding to each algorithm in the plurality of algorithms comprised in the task specific module, wherein each fully connected neural network of the embedding module is configured to map the output of each algorithm in the set of algorithms to a specific embedding output dimensionality; and (iii) a transformer module comprising a key network and a query network, wherein the key network converts an embedding output of each of the set of algorithms into a key vector and the query network receives an aggregated output of the embedding output of each of the set of algorithms to generate a global query vector; and dynamically configure the vision pipeline for execution of one or more goal tasks in one or more environment and system configurations.
In yet another aspect, a non-transitory computer readable medium for autonomous task composition of vision pipelines using an algorithm selection framework is provided. The non-transitory computer readable medium comprises instructions which, when executed by one or more hardware processors, cause: receiving (i) a plurality of input data pertaining to one or more domains of one or more enterprises and (ii) a descriptive query from a user as input, wherein the plurality of input data comprises one or more input parameters, one or more domain requirements and corresponding solutions, and wherein the descriptive query describes a goal task to be executed on the plurality of input data; identifying a vision pipeline for execution of the goal task by inputting the descriptive query and one or more state attribute levels corresponding to the plurality of input data to a symbolic planner, wherein the symbolic planner dynamically composes one or more subtasks associated with the goal task and constructs a Directed Acyclic Graph (DAG) for each of the one or more subtasks using a parser; identifying a set of algorithms from a plurality of algorithms that are suitable to be executed at one or more stages of the vision pipeline for execution of the goal task using a transformers and Reinforcement Learning (RL) based autonomous pipeline composition framework, wherein the transformers and Reinforcement Learning based autonomous pipeline composition framework comprises a set of RL policies that are interlinked and resemble every step in the vision pipeline, and wherein each RL policy comprises: (i) a task specific module comprising the plurality of algorithms that performs a specific subtask from the one or more subtasks associated with the goal task; (ii) an embedding module comprising one or more neural networks corresponding to each algorithm in the plurality of algorithms comprised in the task specific module, wherein each fully connected neural network of the embedding module is configured to map the output of each algorithm in the set of algorithms to a specific embedding output dimensionality; and (iii) a transformer module comprising a key network and a query network, wherein the key network converts an embedding output of each of the set of algorithms into a key vector and the query network receives an aggregated output of the embedding output of each of the set of algorithms to generate a global query vector; and dynamically configuring the vision pipeline for execution of one or more goal tasks in one or more environment and system configurations.
In accordance with an embodiment of the present disclosure, the transformer module comprised in each RL policy computes a dot product of each key vector corresponding to each algorithm in the plurality of algorithms comprised in the task specific module and the global query vector to obtain a weighted score.
In accordance with an embodiment of the present disclosure, the weighted score is used to identify an algorithm from the set of algorithms to perform the specific subtask from the one or more subtasks associated with the goal task.
In accordance with an embodiment of the present disclosure, the symbolic planner dynamically composes the one or more subtasks associated with the goal task based on one or more user specified functionalities and corresponding metadata.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Vision pipeline creation for different datasets to solve a computer vision task is a complex and time-consuming process. One of the most important parts of solving a computer vision task is to identify the correct sequence of preprocessing steps and algorithms that are most suitable for restoring input data to a format that can be used for achieving a goal task. Preprocessing of data such as images and videos plays a vital role in the performance of the vision pipeline. Inappropriate choices of the sequence of preprocessing steps and algorithms can drastically hamper the performance of the goal task. A vision pipeline can have different workflows, and the algorithms to choose from are fairly large in number. For a given task, there can exist multiple such sequences of algorithmic configurations to choose from. Also, for the same task there can be multiple different workflows for different system and environment conditions. For example, an image corrupted by changing exposure and then adding noise can be restored both by performing exposure correction followed by denoising and by performing denoising followed by exposure correction. With such a diverse set of choices, the time, effort, and resources needed to build the vision pipeline increase exponentially. In many cases, the data available to construct a vision pipeline belongs to a fixed distribution, and hence building systems with such a constraint leads to failures when these systems are deployed in the real world due to various uncertainties. In cases where there is a need to optimize the memory, energy, and time of the entire vision pipeline, the right choice of algorithms at different stages of the vision workflow becomes increasingly more difficult and complex. Along with these difficulties, and due to the fast-moving nature of the field of computer vision, the pool of algorithms to choose from keeps expanding. At the same time, comparison of all algorithms based on intuition, too, can yield suboptimal solutions. Conventionally, vision pipelines are developed based on human intelligence, relying on experience, trial and error, or template-based approaches. Human expert-based design is slow, especially in cases when the image has undergone multiple forms of distortion. Thus, there is a need to automate these design choices to achieve good results rapidly.
As the present disclosure embarks on creating an automated system, there exists a gap in existing engineering frameworks to achieve the required goal. For example, a classification engineering platform is created to aid an expert in stitching together an end-to-end computer vision solution. Further, key elements, including the metadata and domain knowledge required by a server to stitch an end-to-end pipeline, are absent in the existing solutions.
Embodiments of the present disclosure provide systems and methods for autonomous task composition of vision pipelines using an algorithm selection framework. The framework leverages a transformer architecture along with deep reinforcement learning techniques to search an algorithmic space for unseen solution templates. In an embodiment, the present disclosure describes a two-stage process of identifying the vision pipeline for a particular task. At the first stage, a high-level vision pipeline comprising a plurality of tasks such as denoising, exposure correction, classification, object detection, and/or the like, forming a sequence, is put together to create the vision workflow. This is considered a sequence-to-sequence (seq2seq) decision-making problem. At the second stage, suitable algorithms for each high-level task are selected. Here, the high-level tasks may include but are not limited to denoising using a fast and flexible denoising network (FFDNet), exposure correction including gamma correction 0.5, classification using a residual network (Resnet-50), and/or the like. This is achieved by making algorithmic choices based on the representation power of the algorithms and improving the selection process over a training period with the help of Deep Reinforcement Learning. In the present disclosure, a high-level sequence of the vision pipeline is provided by a symbolic planner. Further, a graph search using a transformer architecture over an algorithmic space is performed on each component of the generated workflow. In order to make the overall system more robust, the weights of the embedding, key, and query networks of a visual transformer are updated with a Deep Reinforcement Learning framework that uses Proximal Policy Optimization (PPO) as the underlying algorithm.
In other words, after the sequence of steps is decided, a knowledge-based graph search is performed over the algorithmic space at every stage of the vision pipeline, and the algorithms and the corresponding parameters that are well suited to complete the vision pipeline for a given input are identified. As the method of the present disclosure retrieves algorithms dynamically, it reduces the level of human intervention for algorithm selection. Further, the system of the present disclosure exhibits an ability to adapt to unforeseen algorithms that can be introduced at any point in the search space, hence requiring little to no retraining of the framework.
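By way of a non-limiting illustration, the two-stage flow described above may be sketched as follows. This is a minimal sketch assuming hypothetical callables (`planner`, `policies`, `state_identifiers`) that are not defined in the present disclosure; it shows only how the symbolic planner's high-level sequence and the per-stage algorithm selection fit together.

```python
# Illustrative two-stage composition loop (all names are hypothetical): a
# symbolic planner yields a high-level subtask sequence, and a per-stage RL
# policy then picks a concrete algorithm for each subtask.

def compose_and_run_pipeline(image, query, planner, policies, state_identifiers):
    """Sketch of the two-stage flow described above.

    planner           -- callable: (query, state_attributes) -> ordered subtasks
    policies          -- dict: subtask name -> trained RL policy (stage selector)
    state_identifiers -- dict: attribute name -> classifier over the input image
    """
    # Stage 1: detect state attribute levels (e.g., noise level, exposure level)
    # and let the symbolic planner compose the high-level subtask sequence.
    state_attributes = {name: sai(image) for name, sai in state_identifiers.items()}
    subtasks = planner(query, state_attributes)
    # e.g., ["exposure_correction", "denoising", "classification"]

    # Stage 2: at each stage, that stage's RL policy scores its candidate
    # algorithms, and the highest-scoring one processes the running image.
    for subtask in subtasks:
        algorithm = policies[subtask].select_algorithm(image)  # argmax over weights
        image = algorithm(image)
    return image
```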
Referring now to the drawings, and more particularly to
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises a plurality of tasks, a goal task, one or more subtasks, a vision pipeline, a knowledge base, domain expert knowledge, one or more domain files, training problem files and corresponding solutions, and a plurality of algorithms. The database 108 further stores directed acyclic graphs for each of the one or more subtasks.
The database 108 further stores a set of RL policies, one or more architectures, one or more modules such as task specific module, embedding module, transformer module, one or more engines such as data management engine, data acquisition engine, data processing engine, inference and reasoning engine, and advisory generation engine.
The database 108 further comprises one or more networks such as one or more artificial intelligence networks and one or more neural network(s), which, when invoked and executed, perform corresponding steps/actions as required by the system 100 to perform the methodologies described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
Referring to the steps of the method, at step 202, the one or more hardware processors 104 are configured to receive (i) a plurality of input data pertaining to one or more domains of one or more enterprises and (ii) a descriptive query from a user as input, wherein the plurality of input data comprises one or more input parameters, one or more domain requirements and corresponding solutions, and wherein the descriptive query describes a goal task to be executed on the plurality of input data.
At step 204 of the method, the one or more hardware processors 104 are configured to identify a vision pipeline for execution of the goal task by inputting the descriptive query and the one or more state attribute levels corresponding to the plurality of input data to a symbolic planner, wherein the symbolic planner dynamically composes one or more subtasks associated with the goal task and constructs a Directed Acyclic Graph (DAG) for each of the one or more subtasks using a parser.
In an embodiment, the vision pipeline is considered as a sequential decision-making problem that can be modeled as a Markov Decision Process (MDP). An MDP is defined by $(S, A, P, R, \rho_0, \gamma)$, where $S$ is the state space (i.e., the image), $A$ is the action space (i.e., a set of algorithms), $P(s_{l+1} \mid s_l, a_l)$ specifies the state transition probability distribution (i.e., the image obtained after processing by an algorithm), $R(r_l \mid s_l, a_l)$ specifies a reward distribution (for example, validation accuracy for classification, reconstruction loss for preprocessing steps), $\rho_0(s_0)$ denotes the initial state distribution (i.e., the distorted image), and $\gamma \in (0, 1]$ denotes a discount factor. At each timestep, the RL policy selects an action independently according to its (corresponding or associated) state-conditioned policy $\pi_i(a_l \mid s_l; \theta)$, where $s_l$ denotes the state information available to the RL policy and $\theta$ denotes its parameters. The RL policy subsequently earns a reward $r_l$ sampled from $R$, and the environment undergoes a state transition $s_{l+1} \sim P(\cdot \mid s_l, a_l)$. In the present disclosure, the focus is on solving the algorithm selection task, wherein at each timestep the vision pipeline progresses one step further, and the RL policy attempts to maximize the rewards. More precisely, optimal policy parameters are found that solve $\theta^{*} = \arg\max_{\theta} J(\theta)$, where $J(\theta)$ is determined using equation (1) below:
$$J(\theta) = \mathbb{E}\left[\sum_{l=0}^{L} \gamma^{l} r_{l}\right] \qquad (1)$$
Here, $L$ denotes the length of the vision pipeline. By the policy gradient theorem, the gradient of $J$ with respect to the policy parameters $\theta$ is given by equation (2) below:
$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[\nabla_{\theta} \log \pi(a_{l} \mid s_{l})\left(Q^{\pi}(s_{l}, a_{l}) - b(s_{l}, a_{l})\right)\right] \qquad (2)$$
Here, $Q^{\pi}(s_{l}, a_{l})$ denotes the expected future reward, and $b(s_{l}, a_{l})$ is commonly known as a baseline function, which can be any function that depends on the state and the actions at length $l$. Often a learned value function is used as the baseline, but in the present disclosure a running mean of the validation accuracy from previous episodes is used as the baseline.
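For concreteness, a minimal sketch of an update based on equations (1) and (2) is given below, using the running mean of past episode rewards as the baseline $b$. The present disclosure trains its policies with PPO; this simpler REINFORCE-style estimator is shown only to make the gradient in equation (2) concrete, and the function name, discount value, and baseline update rate are assumptions.

```python
import torch

def reinforce_step(policy, optimizer, episode, baseline):
    """One policy update from a single episode (illustrative sketch).

    policy   -- torch.nn.Module mapping a state tensor to action logits
    episode  -- list of (state, action, reward) tuples of length L,
                where action is an integer tensor
    baseline -- running mean of episode rewards from previous episodes
    """
    gamma = 0.99                       # discount factor (illustrative value)
    returns, g = [], 0.0
    for _, _, r in reversed(episode):  # discounted return-to-go per timestep
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    loss = 0.0
    for (state, action, _), g in zip(episode, returns):
        log_prob = torch.distributions.Categorical(
            logits=policy(state)).log_prob(action)
        # Equation (2): score function weighted by (return - baseline);
        # the negative sign turns gradient ascent into a loss to minimize.
        loss = loss - log_prob * (g - baseline)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Update the running-mean baseline with this episode's total reward.
    return 0.9 * baseline + 0.1 * sum(r for _, _, r in episode)
```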
The step 204 is better understood by way of the following exemplary explanation.
In the present disclosure, two different neural networks are trained to identify the exposure and noise levels of the input image. The noise levels are divided into four categories: no-noise, low-noise, mid-noise, and high-noise; the exposure levels are divided into three categories: under-exposed, correctly-exposed, and over-exposed. The state identifiers complement the search algorithm by restricting the search to the set of algorithms that would be suited to address the distortion the input image has undergone. This helps in constructing a knowledge base that enables the method of the present disclosure to perform a guided search over the algorithm set. For example, if the state attribute identifier for exposure detects that the input image is underexposed, the search algorithm is restricted to perform its search over the algorithms that are eligible for correcting underexposed images. In this way, as domain knowledge is introduced while performing the search, it is ensured that convergence is achieved at a faster rate. The State Attribute Identifier (SAI) neural networks are trained in a supervised way, wherein they need to classify the distortion level in the input image. In the present disclosure, a known-in-the-art image dataset, CIFAR-10 (refer 'A. Krizhevsky, V. Nair, and G. Hinton, "Cifar-10 (canadian institute for advanced research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html'), is used; the images in the image dataset are distorted, and the distortion levels are associated with their respective labels. A residual network (Resnet-50) is used as the backbone of both the state identifiers. There are other ways to extract the state attribute levels of an image, but they require manual intervention at various points in the decision-making process.
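A minimal sketch of such a State Attribute Identifier, assuming a PyTorch/torchvision setup, is given below; the function name and the training procedure are illustrative, while the ResNet-50 backbone and the four noise / three exposure categories follow the description above.

```python
import torch.nn as nn
from torchvision import models

def build_sai(num_levels: int) -> nn.Module:
    """Build one State Attribute Identifier: a ResNet-50 backbone whose final
    fully connected layer is resized to the number of distortion levels."""
    model = models.resnet50(weights=None)  # backbone, per the description above
    model.fc = nn.Linear(model.fc.in_features, num_levels)
    return model

noise_sai = build_sai(4)     # no-noise / low-noise / mid-noise / high-noise
exposure_sai = build_sai(3)  # under-exposed / correctly-exposed / over-exposed
# Both identifiers are then trained as ordinary supervised classifiers on
# distorted CIFAR-10 images labeled with their distortion levels.
```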
In an embodiment, at step 206 of the method, the one or more hardware processors 104 are configured to identify a set of algorithms from a plurality of algorithms that are suitable to be executed at one or more stages of the vision pipeline for execution of the goal task using the transformers and Reinforcement Learning (RL) based autonomous pipeline composition framework, wherein the framework comprises a set of RL policies that are interlinked and resemble every step in the vision pipeline.
The step 206 is better understood by way of the following exemplary explanation.
The Transformers and Reinforcement Learning (RL) based autonomous pipeline composition framework (i.e., the Auto-TransRL approach) in the present disclosure connects a sequence of RL policies according to the high-level preprocessing sequence. Every RL policy comprises three modules, namely a Task Specific Module (TSM), an Embedding Module (EM), and a Transformer Module (TM). Every RL policy's TSM comprises a set of algorithms that serve a very specific purpose, such as edge detection, classification, exposure correction, and/or the like. Each TSM is followed by the EM. Each algorithm in the TSM is associated with an embedding network in the EM. The EM ensures that the outputs of all algorithms are mapped to the same output dimension. The EM is further followed by the TM. The TM consists of a key network K and a query network Q. The embedding networks in the EM and the key and query networks in the Transformer Module could be non-linear Multi-Layer Perceptrons (MLPs) or other forms of neural networks such as Convolutional Neural Networks, Graph Neural Networks, Recurrent Neural Networks, and/or the like. The query network takes as input a mean of all the algorithm embeddings (in the TSM) to generate a global query vector which, after a dot product with the key vectors, outputs a relative weight parameter corresponding to every algorithm in the TSM. In other words, scores are produced for each algorithm by taking a dot product between the key vectors, corresponding to every algorithm in the Task Specific Module, and the global query vector. These scores are further passed through a softmax layer to generate the relative weight parameters corresponding to each algorithm in the TSM. As the relative weight parameters generated by the TM are a measure of a similarity score between the mean of every algorithm's output and each algorithm's output, they act as a good metric to select an algorithm. Hence, the higher the value of the relative weight parameter, the better an algorithm is on average, because the values of the relative weight parameters are a direct measure of an algorithm's representation power. In an embodiment, individual policies are trained to select an algorithm that achieves a specific task using PPO in the vision pipeline, and classification accuracy is used as the reward signal for all the policies. In the present disclosure, three policies, for the exposure correction, denoising, and classification tasks, are trained. The relative weight parameters produced by the TM are used as the RL policy output. Within the RL policy, the networks in the Embedding Module and the Transformer Module are learned. All the algorithms in the Task Specific Module are pre-trained and are not updated during the training process. As a result, all the algorithms in the TSM convert an input image to a latent embedding that belongs to a fixed and learned distribution. Thus, the EM in conjunction with the TM learns to choose algorithms solely based on the representation power of every algorithm in the TSM. It is assumed that the latent embedding generated by each algorithm captures information about the distortions that have been made to the input image. This assumption is based on empirical evidence that the performance of algorithms suffers if the image is distorted in any manner. For example, the classification accuracy of a particular pretrained model on a distorted image dataset would be lower when compared to one with no distortions.
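A hedged sketch of the EM and TM of a single RL policy is given below, assuming each TSM algorithm's output is flattened to a vector (batch dimension omitted) and using linear networks; the class name, layer sizes, and embedding dimensionality are assumptions, while the per-algorithm embedding networks, the key/query dot product, the softmax, and the frozen TSM follow the description above.

```python
import torch
import torch.nn as nn

class AlgorithmSelector(nn.Module):
    """Sketch of one RL policy's Embedding Module (EM) and Transformer Module (TM)."""

    def __init__(self, algorithms, out_dims, embed_dim=128):
        super().__init__()
        self.algorithms = algorithms  # pretrained, frozen TSM algorithms
        # EM: one embedding network per algorithm, mapping that algorithm's
        # output dimensionality to the shared embedding dimension.
        self.embedders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, embed_dim), nn.ReLU(),
                           nn.Linear(embed_dim, embed_dim)) for d in out_dims])
        self.key_net = nn.Linear(embed_dim, embed_dim)    # TM key network K
        self.query_net = nn.Linear(embed_dim, embed_dim)  # TM query network Q

    def forward(self, image):
        with torch.no_grad():  # TSM algorithms are not updated during training
            outputs = [alg(image) for alg in self.algorithms]
        embeddings = torch.stack([emb(out) for emb, out in
                                  zip(self.embedders, outputs)])  # (N, embed_dim)
        keys = self.key_net(embeddings)                           # one key per algorithm
        query = self.query_net(embeddings.mean(dim=0))            # global query vector
        weights = torch.softmax(keys @ query, dim=0)              # relative weights
        return weights  # RL policy output; argmax (or sampling) picks the algorithm
```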
The symbolic planner dynamically composes one or more subtasks associated with the goal task and constructs a Directed Acyclic Graph (DAG) for each of the one or more subtasks using a parser as shown in
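Purely as an illustration of such a DAG, the planner's output may be represented as follows; the subtask names echo the exposure-correction/denoising/classification example used throughout, and the planning language and parser of the present disclosure are not reproduced here. A topological sort of the DAG then yields a valid execution order for the vision pipeline.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each subtask is a node; edges encode execution-order dependencies
# (each node maps to the set of subtasks that must run before it).
subtask_dag = {
    "exposure_correction": set(),
    "denoising": {"exposure_correction"},  # denoising follows exposure correction
    "classification": {"denoising"},       # the goal task runs last
}
pipeline_order = list(TopologicalSorter(subtask_dag).static_order())
# ['exposure_correction', 'denoising', 'classification']
```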
In conjunction with the Symbolic Planner, an Auto-TransRL approach to select the algorithms at every stage of the vision pipeline is used. As shown in
Referring back to
Referring to steps of
In the present disclosure, the performance of the disclosed system is evaluated on classification, exposure correction, and denoising tasks. The present disclosure provides a comparison of the average episode reward against different image distortions among the disclosed system, template-based approaches, and approaches that directly feed the input into a classifier without any preprocessing layers (Vanilla approaches). It is validated in the present disclosure that the disclosed system generalizes well to unseen algorithms when its performance and that of the other baselines are compared in partially known and unknown settings. In the partially known experimental configuration, four unseen algorithms are added along with the ones used during train time, and in the unknown experimental configuration, only unseen algorithms are used.
Comparison to Template-Based and Vanilla Approaches
To evaluate the effect of the disclosed system on adapting tendency and performance, the method of the present disclosure is compared to the template-based and Vanilla baseline approaches described above on a set of vision tasks.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined herein and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the present disclosure if they have similar elements that do not differ from the literal language of the embodiments or if they include equivalent elements with insubstantial differences from the literal language of the embodiments described herein.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer, like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be, e.g., hardware means, like an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Number | Date | Country | Kind |
---|---|---|---
202221029279 | May 20, 2022 | IN | national