Edge data centers (EDCs) are complex computing systems that are deployed in closer proximity to endpoint systems and devices than central data centers, and can therefore serve those endpoint systems and devices much more efficiently. In light of these benefits, EDCs are being increasingly deployed, leading to a massive number of EDCs that must be managed, maintained and monitored.
Certain examples are described in the following detailed description and in reference to the drawings, in which:
The proliferation of Artificial Intelligence (AI) and Machine Learning (ML) is resulting in massive amounts of data being generated, which in turn is overwhelming data networks. The impending widespread deployment of, for example, autonomous vehicles further adds to the burden on data networks. To alleviate the saturation of the data networks, edge data centers (EDCs) are being deployed.
EDCs can be as small as a single rack with integrated power, cooling, and monitoring and management infrastructure, or as large as several racks with their associated support infrastructure. The EDCs, in many cases, are remote. Keeping these EDCs well maintained and running could require a massive labor force, which is expensive and burdensome on resources.
As described herein, in some embodiments, AI and ML based tools can be deployed for comprehensive monitoring, management, and predictive maintenance. For instance, these tools can cause workload to be shifted from failed or failing EDCs to functional or optimal EDCs. These failures can result from failed fans, compressors, condenser coils, power supplies, switches, refrigerant leaks, and others known to those of skill in the art. To reduce or eliminate these issues and/or their impact, remote monitoring and data analytics can be used.
Thus, as described herein, in some embodiments, EDCs can be monitored, maintained, and managed using predictive models such as dynamic Bayesian networks that allow for the identification of actual and potential anomalies and failures, and prediction of future anomalies or failures. Based on this information, corrective actions can be performed, for example, to prevent these anomalies or failures if they have not yet occurred, or remedy or minimize their impact.
The invention described in this disclosure addresses the problems described herein, and more. In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
The edge devices 101 can refer to and/or include any individual or group of physical objects (or “things”) configured to generate, transmit and/or receive data. In some embodiments, an edge device can refer to an individual object that produces, receives and/or transmits data, such as a sensor, device, or the like. Examples of sensors include sensors for measuring temperature, humidity, pressure, airflow, light, water, smoke, voltage, location, power, flow rates, doors, video, audio, speed, and many others known to those of skill in the art. Moreover, in some embodiments, an edge device can refer to an object that includes and/or is equipped or instrumented with multiple components capable of producing, receiving and/or transmitting data, such as sensors. For instance, an edge device can refer to a system, device, machine, person, building, infrastructure and the like, each of which contains a number of sensors or components that can produce, transmit and/or receive data. One such example, for purposes of illustration, is an automobile that includes a number of devices, sensors and the like that are capable of measuring or collecting data, and transmitting that data. For instance, an automobile can include radars, cameras, odometers, global positioning systems (GPS), and many others known to those of skill in the art. It should be understood that, in some embodiments, the edge devices 101 need not produce the data. That is, an edge device can be a computing device that can store, transmit and/or receive data (e.g., time series data, logs, etc.).
As described above, the edge devices 101 are equipped with communications capabilities that enable them to transmit and/or receive data. For instance, the edge devices 101 can be configured to communicate via wired or wireless networks—e.g., networks 103. The networks 103 can be or include multiple types of networks, including cellular networks, personal area networks (PAN), local area networks (LAN), wireless LANs (WLAN), campus area networks (CAN), metropolitan area networks (MAN), virtual private network (VPN), and others known to those of skill in the art. Moreover, the communications via the networks 103 can be performed using a number of different protocols such as ZigBee, Z-Wave, Bluetooth, BLE, 6LoWPAN, radio frequency identification (RFID), near field communication (NFC), cellular (e.g., 3G, 4G, 5G), Wi-Fi and others known to those of skill in the art.
The edge devices 101 are communicatively coupled with the EDCs 105. Although not illustrated in
An example EDC is described in further detail below with reference to
As discussed above, an EDC can include any number of systems (e.g., systems, devices, sensors, tools, components and the like) each of which can include or be made up of sub-systems. As described in further detail below—e.g., with reference to
The hardware 205-1 of the example EDC 205 includes, among other things, networking or input/output, memory, compute and/or storage elements or resources. In some embodiments, the hardware 205-1 can be referred to as a system, with each element (e.g., compute element, storage element, etc.) being referred to as a sub-system. The networking or input/output, memory, compute and/or storage elements or resources can be provided as part of one or more sub-systems (or components or devices). For example, the hardware 205-1 can include servers such as compute or storage servers. In some embodiments, the hardware 205-1 can include a management system that includes, among other things, processing, memory and communication elements (e.g., processor, memory, interfaces), and can be configured to provide management of or for the EDC 205.
As known to those of skill in the art, each server can be equipped with one or more processors, memory and interfaces (e.g., network, I/O). In some embodiments, the processors can be or include one or more microprocessors, microcontrollers, application specific integrated circuits (ASICs), central processing units (CPUs), graphics processing units (GPUs), systems on a chip (SoCs), quantum processors, chemical processors or other processors known to those of skill in the art, hardware state machines, hardware sequencers, digital signal processors (DSPs), field programmable gate arrays (FPGAs) or other programmable logic devices, gate, or transistor logic, or any combination thereof. In some embodiments, the memory can be volatile or non-volatile memory, and can include one or more devices such as read-only memory (ROM), Random Access Memory (RAM), electrically erasable programmable ROM (EEPROM), flash memory, registers or any other form of storage medium known to those of skill in the art. In some embodiments, the hardware and/or servers can include network interface controllers or the like for communicating with the edge devices 101 through the networks 103 and with the data center 109. In some embodiments, the networking hardware can include network switches or the like.
In some embodiments, the hardware 205-1 can include memory for storing software (e.g., machine-readable instructions), for example, to perform functions on or for the EDC 205. For instance, the software can be configured to provide remote management, real-time monitoring and control of the EDC 205 and/or its systems, devices, components, etc.
The PDU 205-3 is a device configured to distribute electric power, for example, to racks of computers and networking equipment located within a data center. The power supply 205-4 is a device that can provide battery backup when the electrical power fails or drops to an unacceptable voltage level. In some embodiments, the PDU 205-3 can be configured remotely by other computing devices.
Still with reference to the EDC 205 of
The cooling unit 315 can, in some embodiments, be referred to as a system that includes various sub-systems and sensors. For example, as shown in
Although not illustrated in
Returning to
In some embodiments, the data center 109 can refer to a fixed or movable site or location that includes or houses systems, sub-systems, devices, components, mechanisms, tools, instruments, and the like, which can interchangeably be referred to as “systems” and/or “sub-systems,” for purposes of simplicity. For instance, such systems or sub-systems of the data center 109 can include computing hardware such as servers, monitors, hard drives, disk drives, memories, mass storage devices, processors, micro-controllers, high-speed video cards, semiconductor devices, printed circuit boards (PCBs), power supplies and the like, as known to those of skill in the art. In some embodiments, all or portions of the computing hardware can be housed in any number of racks. The computing hardware of the data center 109 can be configured to execute a variety of operations, including computing, storage, switching, routing, cloud solutions, management and others. In some embodiments, the systems and/or sub-systems of the data center 109 can be configured to communicate among each other and with external systems or sub-systems via wired or wireless communications means.
Moreover, the systems and sub-systems of the data center 109 can include cooling distribution units (CDUs), cooling towers, power components (e.g., uninterruptible power supplies (UPSs)) that provide and/or control the power and cooling within the data center, temperature components, fluid control components, chillers, heat exchangers (“HXs” or “HEXs”), computer room air handlers (CRAHs), humidification and dehumidification systems, blowers, pumps, valves, generators, transformers, switchgear and the like, as known to those of skill in the art.
It should be understood that the distributed computing architecture 100 of
As described above, the number and complexity of EDCs can cause their management, maintenance and monitoring to be burdensome. In some embodiments, machine learning (ML) systems and techniques can be deployed to provide such management, maintenance and monitoring of EDCs. For purposes of simplicity, in some embodiments, the processes of monitoring, managing and maintaining EDCs can be referred to collectively as “managing” (or “management”). The ML systems and techniques for managing EDCs can be provided, in some embodiments, by a related central data center (e.g., data center 109 for the EDCs 105). Example aspects of such EDC management will now be described with reference to
Data sources, EDCs and data centers can be communicatively coupled and/or dependent on one another. One example of such dependencies can be an EDC that is configured to (i) collect data from a data source, and (ii) receive and apply pre-trained ML models from a data center.
The computing architecture 400 can include any number of data sources, EDCs, data centers, and other systems, devices, components. In the illustrated example, the computing architecture 400 can include data sources d1, d2, . . . , dN (collectively “401”); EDCs edc1, edc2, . . . , edcM (collectively “405”); and data centers dc1, dc2, . . . , dcO (collectively “409”). At least a portion of the data sources 401, EDCs 405, and data centers 409, represented as nodes in
As described above, each data source, EDC and/or data center can include any number of systems, sub-systems, sensors, and the like. For example, as shown in the expanded illustration of edcM in
For example, the sysA of
Using ML systems and techniques such as those described herein, dependencies such as those represented in
At step 550, data is collected. The data that is collected can be any type of data that may be used, for example, for (1) structural learning; and (2) parameter learning or estimation, as described in further detail below with reference to steps 552 and 554. Examples of the data collected at step 550 include (i) graphical representations of data centers, EDCs or a distributed computing architecture; and (ii) historical data of variables associated with data centers, EDCs, and/or a distributed computing architecture. Graphical representations can be or refer to graphs, trees, hierarchical charts, tables, flow charts, and/or any diagram that indicates relationships between variables. Variables can refer to any type of data, including states (e.g., discrete data) and observations (e.g., continuous variables). In some embodiments, variables and/or their graphical representations can correspond to systems, sub-systems, and the like. In some embodiments, such graphical representations can be retrieved or obtained from other computing devices and/or memories.
In some embodiments, a graphical representation collected at step 550 can be a system architecture diagram showing relationships and/or dependencies between, for example, its data sources, EDCs, data centers, and the like (e.g., analogous to
Moreover, at step 550, historical data can be obtained from a memory or a computing device, for instance. The historical data can be or include data collected over a period of time and that is relevant to the data sources, EDCs, data centers or otherwise of a distributed computing architecture. The historical data can include records of data of any type, including observation data and/or state data, among others. Table 2 below is a listing of an example dataset of historical data associated with an EDC.
The historical data shown in Table 2 includes n′ number of records, which as used herein refers to an associated set (e.g., row) of the historical data. For example, in some embodiments, a set of data can be associated by time—meaning that the values correspond to measurements of a same given time period. In some embodiments, the variables shown in Table 2 correspond to the variables shown in Table 1.
Table 3 below is a listing of an example dataset of historical data associated with multiple EDCs:
The historical data of Table 3 includes n″ number of records or sets. The EDC with which a row or set of the historical data of Table 3 is associated can be identified using the Record_ID, which for purposes of illustration includes an identifier of the EDC with which that data is associated. For example, the data of the row with index 3 is associated with one EDC (edcB), while the data of the row with index 4 is associated with another EDC (edcA). As in Table 2, a set of data can be associated by time—meaning that the values correspond to measurements of a same given time period. In some embodiments, the variables shown in Table 3 can correspond to the variables shown in Tables 1 and/or 2.
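For purposes of illustration only, the following is a minimal sketch of how historical records such as those of Tables 2 and 3 might be held in memory for downstream structure and parameter learning; it assumes the pandas library, and the column names and values are hypothetical rather than taken from the tables above.

```python
# Hypothetical historical records: Record_ID encodes the EDC that produced
# each row, and each row groups measurements from the same time period.
import pandas as pd

history = pd.DataFrame([
    {"Record_ID": "edcA_001", "inlet_temp_c": 24.1, "fan_speed_rpm": 3200, "fan_failed": 0},
    {"Record_ID": "edcB_001", "inlet_temp_c": 31.7, "fan_speed_rpm":  900, "fan_failed": 1},
    {"Record_ID": "edcA_002", "inlet_temp_c": 24.4, "fan_speed_rpm": 3150, "fan_failed": 0},
])

# Rows for a single EDC can be selected by the EDC identifier in Record_ID.
edc_a_history = history[history["Record_ID"].str.startswith("edcA")]
print(edc_a_history)
```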
As described in further detail below, the graphical and/or historical data that is obtained at step 550 can be used in turn to build the structure of the model (e.g., Bayesian network) and/or estimate the parameters of the model.
At step 552, a model structure is generated, forming a probabilistic framework that represents a distributed computing architecture as well as each individual data source, EDC and/or data center. Of course, in some embodiments, one or more model structures can be generated. The model structure can be or refer to a graph (e.g., a directed acyclic graph (DAG)), including its nodes (e.g., which represent variables) and edges (e.g., representing relationships or dependencies among the variables). As mentioned, in some embodiments, the nodes can represent variables of any kind (e.g., states, observations, latent variables, unknown parameters, hypotheses, and the like).
In some embodiments, a model structure can be generated or learned from graphical representations and historical data associated with an entire distributed architecture and individual data centers, data sources, EDCs, and the like. Building the model structure can be performed by employing, for example, expert knowledge leveraged through computing devices. That is, the graphical representations and historical data, which can illustrate relationships among variables, can be used at step 552 to generate the model structure. On the other hand, in some embodiments, structural learning can be performed using ML techniques known to those of skill in the art, including constraint-based algorithms and/or score-based algorithms.
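For purposes of illustration only, the following is a minimal sketch of representing a model structure as a directed acyclic graph, assuming the networkx library; the variable identifiers and edges shown are hypothetical placeholders, not the structure of any particular EDC.

```python
# A learned or expert-supplied model structure represented as a DAG.
import networkx as nx

structure = nx.DiGraph()
structure.add_edges_from([
    ("v100_inlet_temp", "v110_fan_failed"),    # observation -> anomaly state
    ("v101_fan_speed", "v110_fan_failed"),
    ("v110_fan_failed", "v115_edc_degraded"),  # anomaly -> higher-level state
])

# A Bayesian network structure must be acyclic.
assert nx.is_directed_acyclic_graph(structure)

# The parents of a node define which conditional distribution it requires,
# e.g., p(v110_fan_failed | v100_inlet_temp, v101_fan_speed).
print(list(structure.predecessors("v110_fan_failed")))
```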
The Bayesian network 700 includes nodes 730-1 to 730-11 (collectively “730”). Each of the nodes 730 represents a variable corresponding to the variable identifiers (IDs) illustrated therein. In some embodiments, the variable IDs shown in
A fully defined and deployable Bayesian network model includes the probability distribution of every node therein. Yet, because some distributions (e.g., conditional distributions) include parameters that are unknown (e.g., the probability distribution for a node conditional upon that node's parents), the values of those parameters must be obtained or estimated. At step 554 of
At step 854, N number of observation sets y are generated from among the dataset Y. The dataset Y can be historical data corresponding to, for example, a single EDC (e.g., Table 2) or to an architecture or group of related EDCs (e.g., Table 3). An observation set refers to one subset or group of values for the variables in the dataset. For example, in Tables 2 and 3, an observation set y can refer to a single row of data (e.g., values) for the variables (e.g., variables v100 to v119). For purposes of illustration, in one example, observation sets y can refer to the data in indices 1 (corresponding to a single EDC) and 50 (corresponding to edcA_010 from among many EDCs) of Tables 2 and 3, respectively:
It should be understood that the number of observation sets N that are generated can refer to all or a portion of the subsets or groupings of data (e.g., n′ in Table 2, n″ in Table 3) in the datasets.
In turn, at step 856, a probability is calculated for each observation set generated at step 854. That is, starting with j=1 through j=N, and incrementing at each iteration, the probability of the observation set y_j given the parameter set θ_i is calculated (e.g., p(y_j | θ_i)). At step 858, the likelihood (e.g., log likelihood) of the parameter set θ_i given all of the observation sets y_1 . . . N is calculated (e.g., ℓ(θ_i; y_1 . . . N)). As known to those of skill in the art, the likelihood calculated at step 858 can be based on the probabilities of each observation set calculated at step 856—e.g., the sum of the log probabilities (or, equivalently, the product of the probabilities) calculated at step 856.
At step 860, a determination is made as to whether the likelihood estimated at step 858 is the maximum or optimal likelihood—indicating that the selected or generated θ_i includes the parameter values that are most likely responsible for generating, or most likely explain, the observed data (e.g., observation sets y_1 . . . N). If it is determined at step 860 that the likelihood calculated at step 858 is optimal or maximum, in turn at step 866, the parameter set θ_i is deemed to be and/or assigned as the model parameter set Θ_model (e.g., Θ_model = θ_i).
On the other hand, if the likelihood calculated at step 858 is not deemed to be the maximum or optimal likelihood then, in turn, at step 862 the iteration index i is incremented (i = i + 1), and a new parameter set θ_i is generated at step 864. The newly generated parameter set θ_i is a new proposed set of parameter values that are based on the previous parameter set (e.g., θ_i−1). For example, the previous parameter set can be used as the mean of a multi-variate Gaussian with a pre-defined covariance matrix. In this way, successive parameter sets cause the likelihood (e.g., the log likelihood of step 858) to converge toward its maximum value.
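For purposes of illustration only, the following is a minimal sketch of the iterative likelihood maximization of steps 854 through 866, assuming numpy and scipy. A single Gaussian observation model stands in for the conditional distributions of a real Bayesian network, and a simple keep-if-better acceptance rule stands in for the convergence behavior described above; both are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Hypothetical observation sets y_1..N (e.g., one measured variable per record).
observations = rng.normal(loc=42.0, scale=3.0, size=200)

def log_likelihood(theta, y):
    # Log likelihood of a parameter set theta = (mean, log_std) given all
    # observation sets: the sum of per-observation log probabilities.
    mean, log_std = theta
    return norm.logpdf(y, loc=mean, scale=np.exp(log_std)).sum()

theta = np.array([0.0, 0.0])          # initial parameter set theta_1
best = log_likelihood(theta, observations)
cov = np.eye(2) * 0.25                # pre-defined covariance for proposals

for i in range(5000):
    # New parameter set drawn from a Gaussian centered at the previous set.
    proposal = rng.multivariate_normal(theta, cov)
    score = log_likelihood(proposal, observations)
    if score > best:                  # keep proposals that improve the likelihood
        theta, best = proposal, score

theta_model = theta                   # analogous to assigning Theta_model at step 866
print(theta_model, best)
```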
It should be understood that, in some embodiments, other machine learning (ML) techniques known to those of skill in the art, including deep learning and neural networks, can also or alternatively be used to build and deploy the models described herein.
In some embodiments, the models described herein, such as the model that is generated using the process 500 of
For example, in Table 4, the data at index 1 includes values of variables measured and/or collected at time period t−5. As described above with reference to step 552 of
The parameters of the dynamic Bayesian network 900 can also be estimated as described above with reference to step 554 and the process 800 of
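For purposes of illustration only, the following is a minimal sketch of unrolling a two-slice dynamic Bayesian network across time, assuming the networkx library; the node names and the choice of which states persist across time slices are hypothetical.

```python
# Unroll a dynamic Bayesian network: each time slice repeats the intra-slice
# structure, and inter-slice edges connect a node at time t-1 to the
# corresponding node at time t.
import networkx as nx

intra_edges = [("fan_speed", "fan_failed"), ("fan_failed", "edc_degraded")]
persistent = ["fan_failed", "edc_degraded"]   # states carried across time

dbn = nx.DiGraph()
for t in range(3):                            # unroll three time slices
    for src, dst in intra_edges:
        dbn.add_edge(f"{src}@t{t}", f"{dst}@t{t}")
    if t > 0:
        for node in persistent:               # temporal dependencies
            dbn.add_edge(f"{node}@t{t-1}", f"{node}@t{t}")

print(sorted(dbn.edges()))
```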
In some embodiments, a Markov chain can be used to represent the dependencies within each iteration of the MCMC. To do so, the dependencies are constructed or generated such that each iteration consists of or represents multiple time steps of behavior (e.g., normal state, anomalous state).
Once a model has been generated as described above with reference to example process 500, the model can be deployed to perform various functions that can be used, for example, to optimize management of EDCs. It should be understood that, although not described herein, building a model can include alternative or additional steps such as pre-processing, validation, testing and others known to those of skill in the art.
In some embodiments, the process 1100 can be executed or performed by one or more management computing systems or devices. For example, these management computing systems or devices can be part of a central data center or the like with which multiple EDCs being managed are associated. Moreover, it should be understood that, for purposes of illustration only, (i) the process 1100 is described with reference to a “current” time period t, relative to a real-time or substantially real-time system and process; (ii) the model that is deployed is a dynamic Bayesian network that models relationships and dependencies of systems (e.g., EDCs) collectively and individually, and across time; and (iii) prior to time t, data (e.g., historical data) can be collected (e.g., using sensors of the EDCs), transmitted and/or stored (e.g., in a memory of the management system).
At time t, in step 1150 of the process 1100, data (hereinafter referred to as data_t) is obtained. The obtained data_t can be collected, aggregated and/or transmitted by the EDCs to the management system. That data_t can therefore include data collected by and/or corresponding to multiple systems (e.g., EDCs). The data_t, which is labeled or otherwise associated with the time t (e.g., as shown in Table 4 above), can be of any type, including observed data and/or state data. The data_t can function as potential evidence for predicting or identifying anomalies. In some embodiments, “anomalies” can be used interchangeably, for purposes of simplicity, to refer to failures, errors, discrepancies, problems, issues, and the like that indicate some level of abnormal behavior or function.
At step 1152, the model is run or deployed using input data that includes at least some or all of the data_t. As described above, the model can be a dynamic Bayesian network, such as that illustrated in
Moreover, in some embodiments, the data_t can include observed or measured data (e.g., observation data from sensors) or data that is otherwise collected or calculated. For instance, as shown in
Still with reference to step 1152 of
It should be understood that at least some of the nodes represent anomalies (referred to as “anomaly nodes” or “anomalous nodes”) or other states that may not be known or measurable prior to the deployment of the model. Nonetheless, the model can generate model outputs for those nodes—e.g., anomalous nodes. For instance, the input data that is fed into the model may not include a value for the state of a node at time t, such as whether one or more fans have failed in an EDC. Still, the model can determine the probabilities for the potential values true and false of the variable “One or more fans have failed,” represented by an anomaly node in the graph.
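For purposes of illustration only, the following is a minimal sketch of the inference at step 1152 on a hypothetical three-node network (high_temp → fan_failed → fan_alarm), in which the probability of the unobserved anomaly node is obtained by enumerating the joint distribution given the observed evidence; the structure and probability values are hypothetical.

```python
# Hypothetical conditional probability tables.
def p_high_temp(ht):
    return 0.2 if ht else 0.8

def p_fan_failed(ff, ht):
    # p(fan_failed | high_temp)
    return (0.6 if ff else 0.4) if ht else (0.05 if ff else 0.95)

def p_fan_alarm(al, ff):
    # p(fan_alarm | fan_failed)
    return (0.9 if al else 0.1) if ff else (0.1 if al else 0.9)

def joint(ht, ff, al):
    return p_high_temp(ht) * p_fan_failed(ff, ht) * p_fan_alarm(al, ff)

# Evidence observed at time t: the fan alarm fired. Query the unobserved
# anomaly node: p(fan_failed=True | fan_alarm=True), summing out high_temp.
num = sum(joint(ht, True, True) for ht in (True, False))
den = sum(joint(ht, ff, True) for ht in (True, False) for ff in (True, False))
print(f"p(fan_failed | fan_alarm) = {num / den:.3f}")   # ~0.632
```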
Accordingly, in turn, at step 1154, the model outputs obtained at step 1152 are used to determine whether, based on existing data, anomalies or potential anomalies have occurred. Anomalies refer to actual anomalous behavior that has been detected and/or confirmed; and potential anomalies refer to behavior that has been deemed to probably or possibly exist or have occurred. Detecting anomalies is now described in further detail with reference, for purposes of illustration, to the models 700 of
Anomalies can be represented as nodes in a model in the form of anomalous nodes. For example, the nodes in
At step 1154, anomalies are detected based on, among other things, the model outputs, which include or indicate the probable values of nodes, including anomalous nodes, given known data (e.g., . . . , data_t−1, data_t, etc.). The known data can be data from an immediately preceding time instance such as time t−1. However, it should be understood that, by virtue of the function of the dynamic Bayesian network, the data from the immediately preceding time t−1 can incorporate outputs and data from time instances before time t−1. Determining anomalies or potential anomalies at step 1154 therefore includes identifying nodes having probability values that are equal to 100% or within a certain threshold of 100% (e.g., within 0.1%, 0.5%, 1%, 2%, 5%, 10%, etc.), indicating a sufficiently high probability of the occurrence or existence of an anomaly. Thus, with reference to
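For purposes of illustration only, the following is a minimal sketch of the thresholding described above; the node names, probabilities and threshold value are hypothetical.

```python
# Model outputs mapping each anomaly node to the probability of its "true"
# (anomalous) value given the known data.
model_outputs = {
    "fan_failed":        0.97,
    "refrigerant_leak":  0.04,
    "power_supply_down": 0.62,
}

THRESHOLD = 0.95   # e.g., equal to 100% or within 5% of it

detected = [node for node, p in model_outputs.items() if p >= THRESHOLD]
print(detected)    # -> ['fan_failed']
```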
In turn, if anomalies (potential or actual) are not detected at step 1154, the process iterates at the next time instance, starting at step 1150 with t+1 being assigned as the new time t. This indicates that the architecture and individual systems therein are operating normally and/or without anomalous behavior. On the other hand, if at step 1154, one or more anomalies or potential anomalies are detected, the anomalies are diagnosed at step 1156 and corrective measures are taken at step 1158.
That is, at step 1156, diagnosing the detected anomaly can include pinpointing the one or more causes of that anomaly or failure. To do so, the model can be traversed starting from the anomaly node, identifying each path extending away from the anomaly node. For instance, with reference to
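For purposes of illustration only, the following is a minimal sketch of such a diagnosis, assuming the networkx library: candidate causal paths ending at the detected anomaly node are enumerated and ranked using the probabilities the model assigned to the nodes along each path. The structure, probability values, and weakest-link scoring are hypothetical.

```python
import networkx as nx

# Hypothetical sub-graph of the model, with causes pointing toward effects.
g = nx.DiGraph([
    ("compressor_fault", "cooling_degraded"),
    ("refrigerant_leak", "cooling_degraded"),
    ("cooling_degraded", "edc_overheated"),
])
node_prob = {"compressor_fault": 0.91, "refrigerant_leak": 0.08,
             "cooling_degraded": 0.88, "edc_overheated": 0.97}

anomaly = "edc_overheated"
candidate_paths = []
for root in (n for n in g if g.in_degree(n) == 0):        # possible root causes
    for path in nx.all_simple_paths(g, source=root, target=anomaly):
        score = min(node_prob[n] for n in path)           # weakest-link score
        candidate_paths.append((score, path))

# Most probable causal path first.
for score, path in sorted(candidate_paths, reverse=True):
    print(f"{score:.2f}", " -> ".join(path))
```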
In some embodiments, the diagnosing of step 1156 can also include identifying which nodes, for which anomalies have not yet been detected, are likely to fail. Predicting future failures can be based on the model outputs that indicate the probability of each state of each node. For instance, if a value is not yet beyond a probability threshold that triggers the existence of an anomaly, that node can be further analyzed to determine the likelihood of a failure (e.g., high probability) at a subsequent time period after t given the known data. Moreover, based on the identification of the most probable path that caused the failure, as described above, the path can be further analyzed and leveraged to determine the next node likely to fail within that path.
In turn, at step 1158, one or more corrective actions can be performed to remedy or attempt to remedy the identified anomaly and/or causes of the anomaly. It should be understood that many corrective measures known to those of skill in the art can be triggered, based on the information identified in the diagnosis of step 1156. For purposes of illustration, examples include migrating a workload to another data center, turning on additional cooling resources, rebooting, and/or transmitting notifications to other computing systems or devices. It should be understood that the corrective actions can be performed not merely to address existing anomalies but also to address issues based on predicted anomalies or anomalous behavior. In turn, the management, monitoring and maintenance process iterates at a next time instance starting at step 1150.
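For purposes of illustration only, the following is a minimal sketch of dispatching corrective actions at step 1158; the action names and the mapping from diagnosed anomalies to actions are hypothetical placeholders for whatever remediation interfaces a given management system exposes.

```python
def migrate_workload(edc):
    print(f"migrating workloads off {edc}")

def boost_cooling(edc):
    print(f"enabling additional cooling in {edc}")

def notify_operators(edc):
    print(f"notifying operators about {edc}")

# Hypothetical mapping from a diagnosed anomaly to an ordered list of actions.
ACTIONS = {
    "edc_overheated": [boost_cooling, migrate_workload, notify_operators],
    "fan_failed": [boost_cooling, notify_operators],
}

def remediate(anomaly, edc="edcA_010"):
    for action in ACTIONS.get(anomaly, [notify_operators]):
        action(edc)

remediate("edc_overheated")
```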
It should be understood that the detecting, remedying, and attempting to correct anomalies can be performed for individual systems (e.g., EDCs) as well as a collection of systems (e.g., the architecture of
It should also be understood that, in some embodiments, the models described herein can be deployed on-demand, in addition to or as an alternative to their deployment as part of a continuous management, maintenance and monitoring process.