This disclosure relates to methods and devices for distributed computing, such as for computing estimation output data based on obtained sensor data. More specifically, the solutions provided herein pertain to methods for managing a control function for distributed computation in a hierarchical system having a compute deployment including a plurality of compute nodes, in which machine learning is employed to optimize the system.
With the ever-increasing expansion of the Internet, the variety and number of devices that may be accessed is virtually limitless. Communication networks, usable for devices and users to interconnect, include wired systems as well as wireless systems, such as radio communication networks specified under the 3rd Generation Partnership Project, commonly referred to as 3GPP. While wireless communication was originally set up for person-to-person communication, there is presently a strong focus on the development of device-to-device (D2D) communication and machine type communications (MTC)/Narrowband Internet of Things (NB-IoT), both within 3GPP system development and in other models.
A commonly used term is the Internet of things (IoT), which is a network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and connectivity which enables these objects to connect and exchange data. It has been forecast that billions of IoT devices will surround us within the next few years, with a recent quote declaring that “By 2030, 500 billion devices and objects will be connected to the Internet.” Hence, one may safely assume that we will be surrounded by more and less capable sensing devices in our close vicinity.
Less capable lower cost IoT devices will typically be deployed at large scale at the network edge, with more capable devices typically being more rarely deployed or having the function of a higher network node. An edge device is a device which provides an entry point into enterprise or service provider core networks. Examples include routers, routing switches, integrated access devices (IADs), multiplexers, and a variety of metropolitan area network (MAN) and wide area network (WAN) access devices. Edge devices may also provide connections into carrier and service provider networks. In general, edge devices may be routers that provide authenticated access to faster, more efficient backbone and core networks. The edge devices will normally be interconnected “vertically” in a peer-to-peer fashion using WAN/LPWAN/BLE/WiFi communication technologies, or “laterally” in mesh, one-to-many, or one-to-one fashion using local communication technologies.
The trend is to make the edge device smarter, so e.g. edge routers often include Quality of Service (QoS) and multi-service functions to manage different types of traffic. However, computation resources may be more powerful in vertically connected compute nodes. As noted, in modern IoT systems, sensor data may be collected in the devices at the edge of the system. The computational power of these edge devices is constrained by limitations of resources such as memory, CPU and energy. In practice, the limitations mean that these devices need to make use of simplified computational models, e.g. simplified Deep Neural Networks. The simplified models are not in all situations sufficient to achieve a “good” (according to some application defined metric) computational result in the edge device itself. Therefore, edge devices have the option to offload computation to more capable devices, further from the edge. These devices may also be resource constrained, with an additional offload option to an even more capable device. This computational hierarchy typically terminates in a cloud server, rich in resources.
However, there still exists a need for improvement in the execution of computation in devices, where assistance may be required from other devices to fulfil a certain task. A reason why not all computations are done in the cloud is that there is a cost to offload, in terms of inter alia latency, bandwidth, power consumption, autonomy, privacy protection of data (e.g. computational cost of encryption), security etc. For this reason, it is important to make informed decisions in each compute node about when to offload computations. As an example, it would be valuable in wireless IoT systems in general to find means for limiting both the frequency and the magnitude of escalations, and for alleviating the need for complex device software for breaking down and aggregating compute tasks and results.
Based on the aforementioned limitations related to distributed computing, an overall objective is to obtain system improvement. However, most real-world applications are highly dynamic in nature, and it is thus extremely difficult to achieve near-optimal system operation with e.g. statically defined logic and threshold values. Herein, a solution is therefore offered in which system-wide optimization is carried out using a logical control plane, with input and output interface to each compute node, powered by Machine Learning to dynamically optimize distributed computation. The proposed solution is provided in the claims.
According to a first aspect, a method is provided for distributed computation in a hierarchical system having a compute deployment including a plurality of compute nodes, comprising
providing a control function communicatively connected to said compute nodes;
determining a cost function for the system, which cost function includes at least one first parameter associated with carrying out a compute task and at least one second parameter associated with escalating a compute task;
employing a machine learning mechanism in the control function to optimize said cost function; and
configuring said compute deployment based on the optimization of the cost function by the machine learning mechanism.
In one embodiment, the method comprises
receiving first metrics from one or more of said nodes associated with a compute task; and
determining one or more of said first and/or second parameters based on said metrics.
In one embodiment, configuring said compute deployment includes providing compute deployment data to at least one of said nodes.
In one embodiment, configuring said compute deployment includes adjusting a confidence level threshold in one or more of said nodes.
In one embodiment, configuring said compute deployment includes updating a computation model in one or more of said nodes.
In one embodiment, said cost function includes a weight associated to one or more of the first and/or second parameters.
In one embodiment, said first parameter is associated with carrying out a compute task in a node of the system and depends on at least one of confidence threshold values, confidence level of an estimation model output, power consumption, bandwidth utilization, latency, sensor data.
In one embodiment, said second parameter is associated with escalating a compute task between nodes in the system and depends on at least one of latency, bandwidth utilization, power consumption, autonomy, privacy protection, security.
In one embodiment, said machine learning mechanism includes a reinforcement algorithm, the method further comprising optimizing, based on the reinforcement algorithm, control function decisions over time so as to take action to improve a current compute deployment state based on an observed environment including metrics received from said plurality of nodes.
According to a second aspect, a computer program product is provided for managing distributed computation in a hierarchical system having a compute deployment including a plurality of compute nodes, configured to
determine a cost function for the system, which cost function includes at least one first parameter associated with carrying out a compute task and at least one second parameter associated with escalating a compute task;
employ a machine learning mechanism in the control function to optimize said cost function; and
configure said compute deployment based on the optimization of the cost function by the machine learning mechanism.
According to a third aspect, a hierarchical system is provided, comprising a compute deployment including a plurality of compute nodes, and a control function communicatively connected to said compute nodes, wherein said control function comprises a computer program product for managing distributed computation in the hierarchical system, configured to
determine a cost function for the system, which cost function includes at least one first parameter associated with carrying out a compute task and at least one second parameter associated with escalating a compute task;
employ a machine learning mechanism in the control function to optimize said cost function; and
configure said compute deployment based on the optimization of the cost function by the machine learning mechanism.
In one embodiment, the computer program product comprises at least control circuitry, which control circuitry includes a processing device and a data memory holding computer program code, wherein said processing device is configured to execute the computer program code such that the control circuitry is configured to carry out the mentioned steps.
Various embodiments will be described with reference to the drawings, in which
The invention will be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It will be understood that, when an element is referred to as being “connected” to another element, it can be directly connected to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” to another element, there are no intervening elements present. Like numbers refer to like elements throughout. It will furthermore be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present invention. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
Well-known functions or constructions may not be described in detail for brevity and/or clarity. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Embodiments of the invention are described herein with reference to schematic illustrations of idealized embodiments of the invention. As such, variations from the shapes and relative sizes of the illustrations as a result, for example, of manufacturing techniques and/or tolerances, are to be expected. Thus, embodiments of the invention should not be construed as limited to the particular shapes and relative sizes of regions illustrated herein but are to include deviations in shapes and/or relative sizes that result, for example, from different operational constraints and/or from manufacturing constraints. Thus, the elements illustrated in the figures are schematic in nature and their shapes are not intended to illustrate the actual shape of a region of a device and are not intended to limit the scope of the invention.
In the context of this disclosure, solutions are suggested for optimizing distributed computation in a hierarchical system having a compute deployment including a plurality of compute nodes. In such a system, a compute node may be a device for computing estimation output data, based on an estimation model. With increasing need and capability to push advanced computation to the edge of distributed systems, it will be an important and difficult discipline to decide when computation needs to be offloaded from the edge nodes by escalation. The proposed solutions provide a mechanism for dynamically and adaptively managing this process and keeping system behavior optimal over time.
Computation in a distributed system may typically involve obtaining sensor data, wherein a compute task is to be carried out based on that sensor data, such as a prediction or estimation. The sensor data may e.g. include a characterization of electromagnetic data, such as light intensity and spectral frequency at various points in an image plane, as obtained by an image sensor. The sensor data may alternatively, or additionally, include acoustic data, e.g. comprising magnitude and spectral characteristics over a period of time, meteorological data pertaining to e.g. wind, temperature and air pressure, seismological data, fluid flow data etc.
In a step S210, a compute node receives input data from a node at a lower level in the hierarchy. For an initial (lowest) node 100, such as an edge device, input is received from one or more attached sensors.
In a step S220, the node may execute a compute task, e.g. by executing a prediction model using the available computational model and resources in that node. The output is a classification decision. A key property of a prediction model is that a “confidence level” value is produced as the output of the executed prediction model. This may be a numerical measure of how certain the model is that the classification is correct.
In a step S230, the method selectively continues dependent on the determined certainty of the classification decision.
If the confidence level is below a threshold value, the node offloads the computation by sending 160 the original input data to a node higher up in the hierarchy in a step S240.
If the task has been escalated in step S240, a response may be received 170 from a higher node in a step S250, including a classification.
In a step S260, a classification has either been deemed certain (or not uncertain) in the node in step S230, or has been received from a higher node in step S250. That classification is thus either used in the node, or returned in a response to a lower node from which the compute task was escalated. Using the classification may include storing data or metadata related to the original input data.
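Steps S210 to S260 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Node` class, its `model` callable, and the `threshold` and `parent` attributes are hypothetical names, and the model is assumed to return a pair of classification and confidence level.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical model signature: input data -> (classification, confidence in [0, 1]).
Model = Callable[[object], Tuple[str, float]]


@dataclass
class Node:
    name: str
    model: Model
    threshold: float                 # confidence level threshold checked in S230
    parent: Optional["Node"] = None  # next node higher up in the hierarchy, if any

    def handle(self, data: object) -> Tuple[str, str]:
        """Return (classification, name of the node that produced it)."""
        label, confidence = self.model(data)      # S220: execute the compute task
        if confidence < self.threshold and self.parent is not None:
            # S240/S250: escalate the original input and take the response.
            return self.parent.handle(data)
        return label, self.name                   # S260: use or return the result


# Two-level example with stand-in models that return fixed outputs.
cloud = Node("cloud", lambda d: ("person", 0.99), threshold=0.0)
edge = Node("edge", lambda d: ("person", 0.55), threshold=0.8, parent=cloud)
result = edge.handle("image-bytes")
```

Here the edge node's confidence of 0.55 is below its threshold of 0.8, so the task is escalated and the classification comes from the higher node. Lowering the edge threshold to 0.5 would keep the task local.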
The device 300 may be an edge device 100 of a communication network, such as a WAN, comprising a number of further nodes 110 which have higher hierarchy in the network topology. The device 300 may further be configured to transmit data in uplink 160 and/or the downlink 170 to one or more network nodes of the distributed system. In various embodiments, the device 300 may include a network interface 306 operable to connect the device 300 in the uplink and/or a network interface 307 operable to connect the device 300 in the downlink. The network interfaces 306, 307 may also be different, configured to use different bearers of different communication technologies, such as ZigBee, BLE (Bluetooth Low Energy), WiFi, D2D LTE under 3GPP specifications, 3GPP LTE, MTC, NB-IoT, 5G New Radio (NR), and wired connection technologies.
In one embodiment, the control circuitry 303 is configured to control the device 300 to compute a first estimation score based on first input data obtained either by reception 160 from a lower node, or from a connected sensor 301. The estimation score may be computed using a local estimation model. In the context of this description, an estimation score can take various forms, from numbers, such as a probability factor, to strings to entire data structures. The estimation score may include or be associated with a value related to reliability or accuracy and may be related to a specific estimation task. In various scenarios, this computation may be carried out responsive to obtaining such an estimation task, e.g. to compute an estimation result. Such an estimation task may be a periodically scheduled reoccurring event. In other scenarios, the estimation task may be triggered by a request from another device or network node, or e.g. triggered by receiving first sensor data from the sensor 301. A system, compute node and method according to the embodiments provided herein can apply to sensing data of many sorts, such as image (e.g. object recognition), sound (e.g. event detection), multi-metric estimations, vibration, temperature or even data of less complexity. In the embodiments referred to herein, an estimation model may be one of many classical machine learning models, often referred to under the term “predictive modelling” or “machine learning”, using statistics to predict outcomes. Such models may be used to predict an event in the future but may equally be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects, after the crime has taken place. Hence, the more general term estimation model is used herein. Nearly any regression model can be used for prediction or estimation purposes. 
Broadly speaking, there are two classes of predictive models: parametric and non-parametric. A third class, semi-parametric models, includes features of both. Parametric models make specific assumptions with regard to one or more of the population parameters that characterize the underlying distribution(s), while non-parametric regressions make fewer assumptions than their parametric counterparts. Various examples of such models are known in the art, such as naive Bayes classifiers, a k-nearest neighbors algorithm, random forests etc., and the exact choice of estimation model is not decisive for the invention or any of the embodiments provided herein. In the context of the invention, the estimation model could be a specific design of a Deep Neural Network (DNN) acting as an “object detector”. DNNs are compute-intensive algorithms which may employ millions of parameters which are specifically tuned by “training” using large amounts of relevant and annotated data. When later deployed, this makes them able to “detect”, i.e. predict or estimate to a certain “score”, the content of new, un-labelled, input data such as sensor data. In this context, a score may be a measure of the DNN's certainty of a specific classification of the input data. Such an estimation model may be trained to detect objects very generally from e.g. input sensor data representing an image, but typical examples include detecting e.g. “suspect people” or a specific individual.
Continuous model adaptation, or “online learning”, in which such a model adapts and improves to suit its specific environment, is complex and can take various forms. In one example, a deployed model in a device 300 acting as a node 100 escalates its sensor data vertically to a more capable node 110, 120, 130 with a more complex estimation model. The more capable node can provide a “ground truth” estimation and at the same time use the escalated sensor data to re-train the edge device model in the device 300 with some of its recently collected inputs, thereby adjusting the estimation model of the less capable device 300 to its actual input.
The information received 406 in the control function from all nodes is fed into a Machine Learning (ML) mechanism of the control function, which is trained to optimize a cost function for the system. The cost function preferably relates to an overall system cost and balances the cost for escalation versus the cost for carrying out a computation task in a node. The cost function may thus include at least one first parameter associated with carrying out a compute task and at least one second parameter associated with escalating a compute task. The ML mechanism may be configured to optimize the cost function with respect to one or more cost parameters, e.g. the overall power consumption of the system, aggregated reliability value output, or the overall system latency. The control function may further be arranged to configure the compute deployment based on the machine learning mechanism output, which may involve sending 408 compute deployment data to one or more of the nodes of the system. The compute deployment data may include configuration data, such as a new set of confidence level threshold values that are communicated to the nodes for storing in a threshold mechanism 404. Other configuration data may include a change of compute responsibility (i.e. move a specific compute task to a more capable node in the system) or retraining of the neural network 402 function, such as by providing new or adjusted weight factors to an estimation model.
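The flow from received metrics to compute deployment data can be illustrated with a small sketch. It is an assumption-laden stand-in: a plain grid search over candidate thresholds takes the place of the ML mechanism, and the metric names (`confidences`, `local_cost`, `escalation_cost`, `error_weight`) and the per-task cost model are hypothetical, chosen only to show how a threshold could balance the cost of local computation against the cost of escalation.

```python
from typing import Dict, Sequence


def estimated_cost(confidences: Sequence[float], threshold: float,
                   local_cost: float, escalation_cost: float,
                   error_weight: float) -> float:
    """Average per-task cost a node would incur with the given threshold."""
    total = 0.0
    for c in confidences:
        if c < threshold:
            # Task is computed locally, found uncertain, then escalated.
            total += local_cost + escalation_cost
        else:
            # Task is kept local; a low confidence carries an expected error cost.
            total += local_cost + error_weight * (1.0 - c)
    return total / len(confidences)


def compute_deployment_data(metrics: Dict[str, dict],
                            candidates=(0.5, 0.6, 0.7, 0.8, 0.9)) -> Dict[str, float]:
    """Pick, per node, the candidate threshold with the lowest estimated cost."""
    deployment = {}
    for node, m in metrics.items():
        deployment[node] = min(
            candidates,
            key=lambda t: estimated_cost(m["confidences"], t, m["local_cost"],
                                         m["escalation_cost"], m["error_weight"]),
        )
    return deployment


# Metrics as they might be reported by one node (values are made up).
metrics = {"edge-1": {"confidences": [0.55, 0.65, 0.95], "local_cost": 1.0,
                      "escalation_cost": 5.0, "error_weight": 20.0}}
deployment = compute_deployment_data(metrics)
```

With these made-up values, errors are expensive relative to escalation, so the selected threshold escalates the two low-confidence tasks. A learned mechanism would replace the grid search but could consume and produce the same kind of data.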
In a preferred embodiment, a Reinforcement Learning algorithm is employed in the control function to continuously optimize its decisions over time. In an active Reinforcement Learning system the agent (here the control function) learns what actions to take (here the changes of compute deployment) to continuously improve its state (here current compute deployment), by observing the environment (here the metrics available from all the nodes) and receiving rewards if a certain property (here the system wide optimization) is improved. Reinforcement learning is as such a known concept.
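The agent, action, and reward roles described above can be illustrated with a deliberately simplified sketch. It uses an epsilon-greedy multi-armed bandit, which is a stateless special case of reinforcement learning and not necessarily the algorithm intended by the disclosure. The `ThresholdAgent` class, the candidate thresholds, and the `observed_cost` function are hypothetical; in a real system the cost would be derived from metrics reported by the nodes.

```python
import random


class ThresholdAgent:
    """Epsilon-greedy agent whose actions are candidate confidence thresholds."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = {a: 0.0 for a in self.actions}  # running mean reward per action
        self.count = {a: 0 for a in self.actions}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.value[a])

    def learn(self, action, reward):
        # Incremental update of the mean reward observed for this action.
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


def observed_cost(threshold):
    # Hypothetical, noise-free stand-in for the system-wide cost that the
    # control function would observe after deploying this threshold.
    return 1.0 + 10.0 * (threshold - 0.7) ** 2


agent = ThresholdAgent(actions=[0.5, 0.6, 0.7, 0.8, 0.9])
for _ in range(200):
    action = agent.choose()                            # change of compute deployment
    agent.learn(action, reward=-observed_cost(action)) # reward: lower cost is better
best = max(agent.actions, key=lambda a: agent.value[a])
```

Because every reward is negative and the estimates start at zero, untried thresholds look attractive at first, so each candidate is tried before the agent settles on the one with the lowest observed cost.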
For a simple and general cost function model we can define a linear relationship in a weighted sum manner between the “costs” and “advantages” with parameters representing cost entities for executing a task in a node and for escalating the task, as exemplified herein. Using a few of those parameters as an example, the global cost function could be:
In various embodiments, the actual model used in a system may be more refined and of higher order, and the cost function will typically be system-specific.
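A weighted-sum cost of the kind described can be sketched as follows. The parameter names, weights, and metric values are illustrative assumptions and do not reproduce the expression of the disclosure; the sketch only shows the linear form, summing weighted execution-related and escalation-related parameters over all nodes.

```python
from typing import Dict, Mapping


def global_cost(node_metrics: Mapping[str, Mapping[str, float]],
                weights: Mapping[str, float]) -> float:
    """Sum of weight * parameter value over every parameter of every node."""
    return sum(
        weights[name] * value
        for metrics in node_metrics.values()
        for name, value in metrics.items()
    )


# Hypothetical weights: "compute_*" relate to carrying out a task in a node,
# "escalation_*" relate to escalating a task to a higher node.
weights: Dict[str, float] = {
    "compute_power": 1.0,
    "compute_latency": 0.5,
    "escalation_latency": 2.0,
    "escalation_bandwidth": 0.25,
}

node_metrics = {
    "edge": {"compute_power": 2.0, "compute_latency": 4.0,
             "escalation_latency": 1.0, "escalation_bandwidth": 8.0},
    "cloud": {"compute_power": 5.0, "compute_latency": 1.0,
              "escalation_latency": 0.0, "escalation_bandwidth": 0.0},
}

total = global_cost(node_metrics, weights)
```

Changing a weight shifts the balance between local execution and escalation, which is the lever a control function could tune for a specific system.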
With reference to
a step S610 of determining a cost function for the system, which cost function includes at least one first parameter associated with carrying out a compute task and at least one second parameter associated with escalating a compute task;
a step S620 of employing a machine learning mechanism to optimize said cost function; and
a step S630 of configuring said compute deployment based on the optimization of said cost function by the machine learning mechanism.
One embodiment relates to a computer program product of a control function for managing distributed computation in a hierarchical system having a compute deployment including a plurality of compute nodes, configured to carry out the steps of
The cost function may include a weighted sum of said first and second parameters. In various embodiments, said cost function includes a first parameter associated with carrying out a compute task in a node of the system, related to at least one of reliability threshold values, confidence level of an estimation model output, power consumption, bandwidth utilization, request to response latency, sensor data. Furthermore, the cost function may include a second parameter associated with escalating a compute task between nodes in the system, related to at least one of latency, bandwidth, power consumption, autonomy, privacy protection, security.
With reference to
In general terms, the system, node and method as proposed herein will improve upon a state of the system by utilizing an overall cost function optimized in a control function, which takes input from all nodes of the system. This provides a benefit over the state-of-the-art procedure in which decisions and threshold setting are done in a purely hierarchical manner between nearest nodes. If overall optimizations are needed, then human interaction is necessary in state-of-the-art systems. The solutions proposed herein allow a control function to collect data from all nodes in the system and apply system-level Machine Learning as the means to achieve near-optimum system performance. By applying reinforcement learning over time, this could be accomplished without relying on human interaction.
Number | Date | Country | Kind |
---|---|---|---|
1850507-3 | Apr 2018 | SE | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SE2019/050297 | 4/1/2019 | WO | 00 |