This disclosure relates generally to software processing, and, more particularly, to methods, systems, and apparatus for mapping active assurance intents to resource orchestration and life cycle management.
Edge environments (e.g., an Edge, Fog, multi-access edge computing (MEC), or Internet of Things (IoT) network) enable workload execution (e.g., execution of one or more computing tasks, execution of a machine learning model using input data, etc.) near endpoint devices that request an execution of the workload. Edge environments may include infrastructure, such as an edge platform, that is connected to cloud infrastructure, endpoint devices, and/or additional edge infrastructure via networks such as the Internet. Edge platforms may be closer in proximity to endpoint devices than cloud infrastructure, such as centralized servers.
In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts. The figures are not to scale. Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc., are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name.
As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
As used herein, “programmable circuitry” is defined to include (i) one or more special purpose electrical circuits (e.g., an application specific circuit (ASIC)) structured to perform specific operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmable with instructions to perform specific function(s) and/or operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of programmable circuitry include programmable microprocessors such as Central Processor Units (CPUs) that may execute first instructions to perform one or more operations and/or functions, Field Programmable Gate Arrays (FPGAs) that may be programmed with second instructions to cause configuration and/or structuring of the FPGAs to instantiate one or more operations and/or functions corresponding to the first instructions, Graphics Processor Units (GPUs) that may execute first instructions to perform one or more operations and/or functions, Digital Signal Processors (DSPs) that may execute first instructions to perform one or more operations and/or functions, XPUs, Network Processing Units (NPUs), one or more microcontrollers that may execute first instructions to perform one or more operations and/or functions, and/or integrated circuits such as Application Specific Integrated Circuits (ASICs).
For example, an XPU may be implemented by a heterogeneous computing system including multiple types of programmable circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more NPUs, one or more DSPs, etc., and/or any combination(s) thereof) and orchestration technology (e.g., application programming interface(s) (API(s))) that may assign computing task(s) to whichever one(s) of the multiple types of programmable circuitry is/are suited and available to perform the computing task(s).
As used herein, integrated circuit/circuitry is defined as one or more semiconductor packages containing one or more circuit elements such as transistors, capacitors, inductors, resistors, current paths, diodes, etc. For example, an integrated circuit may be implemented as one or more of an ASIC, an FPGA, a chip, a microchip, programmable circuitry, a semiconductor substrate coupling multiple circuit elements, a system on chip (SoC), etc.
Edge computing, at a general level, refers to the transition of compute and storage resources closer to endpoint devices (e.g., consumer computing devices, user equipment, etc.) to reduce total cost of ownership, reduce application latency, improve service capabilities, and improve compliance with data privacy or security requirements. Edge computing may, in some scenarios, provide a cloud-like distributed service that offers orchestration and management for applications among many types of storage and compute resources. As a result, some implementations of edge computing have been referred to as the “edge cloud” or the “fog,” as powerful computing resources previously available only in large remote data centers are moved closer to endpoints and made available for use by consumers at the “edge” of the network.
Edge computing use cases in mobile network settings have been developed for integration with multi-access edge computing (MEC) approaches, also known as “mobile edge computing.” MEC approaches are designed to allow application developers and content providers to access computing capabilities and an information technology (IT) service environment in dynamic mobile network settings at the edge of the network. Edge computing, MEC, and related technologies attempt to provide reduced latency, improved responsiveness, and more available computing power than offered in traditional cloud network services and wide area network connections. However, the integration of mobility and dynamically launched services to some mobile use and device processing use cases has led to limitations and concerns with orchestration, functional coordination, and resource management, especially in complex mobility settings where many participants (e.g., devices, hosts, tenants, service providers, operators, etc.) are involved.
In a similar manner, Internet of Things (IoT) networks and devices are designed to offer a distributed compute arrangement from a variety of endpoints. IoT devices can be physical or virtualized objects that may communicate on a network, and can include sensors, actuators, and/or other input/output components, which may be used to collect data or perform actions in a real-world environment. For example, IoT devices can include low-powered endpoint devices that are embedded or attached to everyday things, such as buildings, vehicles, packages, etc., to provide an additional level of artificial sensory perception of those things. In recent years, IoT devices have become more popular and thus applications using these devices have proliferated.
In some examples, an edge environment can include an enterprise edge in which communication with and/or communication within the enterprise edge can be facilitated via wireless and/or wired connectivity. The deployment of various Edge, Fog, MEC, and IoT networks, devices, and services have introduced a number of advanced use cases and scenarios occurring at and towards the edge of the network. However, these advanced use cases have also introduced a number of corresponding technical challenges relating to security, processing and network resources, service availability and efficiency, among many other issues. One such challenge is in relation to Edge, Fog, MEC, and IoT networks, devices, and services executing workloads on behalf of endpoint devices.
The present techniques and configurations may be utilized in connection with many aspects of current networking systems, but are provided with reference to Edge Cloud, IoT, MEC, and other distributed computing deployments. The following systems and techniques may be implemented in, or augment, a variety of distributed, virtualized, or managed edge computing systems. These include environments in which network services are implemented or managed using MEC, fourth generation (4G) or fifth generation (5G) wireless network configurations; or in wired network configurations involving fiber, copper, and other connections. Further, aspects of processing by the respective computing components may involve computational elements which are in geographical proximity of a user equipment or other endpoint locations, such as a smartphone, vehicular communication component, IoT device, etc. Further, the presently disclosed techniques may relate to other Edge/MEC/IoT network communication standards and configurations, and other intermediate processing entities and architectures.
Edge computing is a developing paradigm where computing is performed at or closer to the “edge” of a network, typically through the use of computing platforms implemented at base stations, gateways, network routers, or other devices which are much closer to end point devices producing and consuming the data. For example, edge gateway servers may be equipped with pools of memory and storage resources to perform computation in real-time for low latency use-cases (e.g., autonomous driving or video surveillance) for connected client devices. As another example, base stations may be augmented with compute and acceleration resources to directly process service workloads for connected user equipment, without further communicating data via backhaul networks. As another example, central office network management hardware may be replaced with computing hardware that performs virtualized network functions and offers compute resources for the execution of services and consumer functions for connected devices.
Edge environments include networks and/or portions of networks that are located between a cloud environment and an endpoint environment. Edge environments enable computations of workloads at edges of a network. For example, an endpoint device may request a nearby base station to compute a workload rather than a central server in a cloud environment. Edge environments include edge platforms, which include pools of memory, storage resources, and/or processing resources. Edge platforms perform computations, such as an execution of a workload, on behalf of other edge platforms and/or edge nodes. Edge environments facilitate connections between producers (e.g., workload executors, edge platforms) and consumers (e.g., other edge platforms, endpoint devices).
Because edge platforms may be closer in proximity to endpoint devices than centralized servers in cloud environments, edge platforms enable computations of workloads with a lower latency (e.g., response time) than cloud environments. Edge platforms may also enable a localized execution of a workload based on geographic locations or network topographies. For example, an endpoint device may require a workload to be executed in a first geographic area, but a centralized server may be located in a second geographic area. The endpoint device can request a workload execution by an edge platform located in the first geographic area to comply with corporate or regulatory restrictions.
Examples of workloads to be executed in an edge environment include autonomous driving computations, video surveillance monitoring, machine learning model executions, and real time data analytics. Additional examples of workloads include delivering and/or encoding media streams, measuring advertisement impression rates, object detection in media streams, speech analytics, asset and/or inventory management, and augmented reality processing.
Edge platforms enable both the execution of workloads and a return of a result of an executed workload to endpoint devices with a response time lower than the response time of a server in a cloud environment. For example, if an edge platform is located closer to an endpoint device on a network than a cloud server, the edge platform may respond to workload execution requests from the endpoint device faster than the cloud server. An endpoint device may request an execution of a time-constrained workload from an edge platform rather than a cloud server.
In addition, edge platforms enable the distribution and decentralization of workload executions. For example, an endpoint device may request a first workload execution and a second workload execution. In some examples, a cloud server may respond to both workload execution requests. With an edge environment, however, a first edge platform may execute the first workload execution request, and a second edge platform may execute the second workload execution request.
To meet the low-latency and high-bandwidth demands of endpoint devices, orchestration in edge clouds is performed on the basis of timely information about the utilization of many resources (e.g., hardware resources, software resources, virtual hardware and/or software resources, etc.), and the efficiency with which those resources are able to meet the demands placed on them. Such timely information is generally referred to as telemetry (e.g., telemetry data, telemetry information, etc.).
Telemetry can be generated from a plurality of sources including each hardware component or portion thereof, virtual machines (VMs), operating systems (OSes), applications, and orchestrators. Telemetry can be used by orchestrators, schedulers, etc., to determine a quantity, quantities, and/or type of computation tasks to be scheduled for execution at which resource or portion(s) thereof, and an expected time to completion of such computation tasks based on historical and/or current (e.g., instant or near-instant) telemetry. For example, a core of a multi-core central processing unit (CPU) can generate over a thousand different varieties of information every fraction of a second using a performance monitoring unit (PMU) sampling the core and/or, more generally, the multi-core CPU. Periodically aggregating and processing all such telemetry in a given edge platform, edge node, etc., can be an arduous and cumbersome process. Prioritizing salient features of interest and extracting such salient features from telemetry to identify current or future problems, stressors, etc., associated with a resource is difficult. Furthermore, identifying a different resource to offload workloads from a burdened resource is a complex undertaking.
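As one illustrative, non-limiting sketch of prioritizing salient features of interest in telemetry, the following compares recent samples of each metric against a historical baseline and flags metrics that deviate sharply; the metric names, sample layout, and deviation threshold are hypothetical and not part of any particular platform's interface:

```python
# Hypothetical sketch: flag telemetry metrics whose recent mean deviates
# sharply from a historical baseline, as a crude "salient feature" filter.
# All names and the threshold are illustrative assumptions.
from statistics import mean

def salient_features(history, recent, threshold=0.5):
    """Return metric names whose recent average deviates from the
    historical average by more than `threshold` (relative change)."""
    flagged = []
    for metric, past_samples in history.items():
        baseline = mean(past_samples)
        current = mean(recent.get(metric, past_samples))
        if baseline and abs(current - baseline) / abs(baseline) > threshold:
            flagged.append(metric)
    return flagged
```

In practice, such a filter would run over PMU-style counters sampled at a high rate, so that only the flagged metrics need to be aggregated and forwarded rather than the full telemetry stream.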
Some edge environments desire to obtain telemetry data associated with resources executing a variety of functions or services, such as data processing or video analytics functions (e.g., machine vision, image processing for autonomous vehicles, facial recognition, visual object detection, etc.). However, many high-throughput workloads, including one or more video analytics functions, may execute for less than a millisecond (or other relatively small time duration). Such edge environments do not have distributed monitoring software or hardware solutions, or a combination thereof, that are capable of monitoring such highly-granular stateless functions that are executed on a platform (e.g., a resource platform, a hardware platform, a software platform, a virtualized platform, etc.).
Many edge environments include a diversity of components for resource management and orchestration. Such edge environments may employ static orchestration when deciding on placement of services and workloads at specific edge platforms and perform service level agreement monitoring of the applications and/or services in an any-cost framework. An any-cost framework includes orchestration components that manage resources and services at an edge platform but do not consider the computational costs associated with the orchestration components. Additionally, an any-cost framework includes orchestration components that are not responsive to the availability of computational resources and power to perform operations associated with those orchestration resources. Thus, edge environments may include orchestration resources that are inelastic and consume resources of an edge platform in a non-proportionate manner with respect to the resources and power that they manage. Additionally, edge environments may not include orchestration components that can be executed at an accelerator. The any-cost framework of existing components is a vulnerability (e.g., a glass jaw) of most edge environments. Orchestration components in most edge environments focus on optimizing resource utilization(s) of services and/or applications executing at an edge platform and meeting application and/or workload service level agreements (SLAs).
In today's orchestration solutions, much of the focus is around requesting the correct quantity of resources (e.g., number of vCPUs), or abstracting hardware capabilities (e.g., Resource Director Technology (RDT), Running Average Power Limit (RAPL), Hardware Controlled Performance (HWP)) to facilitate their use by Quality of Service (QoS) software. However, issues with such imperative approaches include (1) unwanted vendor lock-in as the communications service providers (CSPs) decide what to expose and how, (2) declaration of incorrect information leading to sub-optimal performance, and (3) limited to no awareness by applications and workload cohorts of critical details (e.g., where a Xeon® versus an Atom® has a significant performance impact, and where other cores/threads can unintentionally produce hidden interferences in shared resources like core-to-uncore queues, which cannot be easily controlled only through RDT). Furthermore, as applications transform from a monolithic to a microservices style, customers' burden of selecting the right cost versus responsiveness versus throughput becomes complicated and is made even more difficult as memory and computation become heterogeneous. It becomes essential to unburden users from the responsibility of having to detail how various desired assurances are to be met, and instead, to focus directly on resource-mapping, monitoring, evaluating, and controlling outcomes for the assurances that need to be met.
In some examples, customers need a way to map assurance intents to service orchestration and resource orchestration, which includes reservation of resources for ‘on-demand’ dynamic service assurance probes (e.g., assuring the operation of a 5G core and actively assessing root cause issues using both passive and active assurance methods). In some examples, customers need a method to evaluate intent-based assurance effectiveness and generate an alert when an assurance intent is not met. For example, when workloads are deployed, the monitoring, orchestration, and analytics stacks can be in many different failed states, preventing assurance. Failed states can be identified as (1) not deployed, (2) failed, (3) unreachable, (4) unable to support or respond to active probes in a timely manner, (5) platform telemetry unavailable, and/or (6) orchestration system telemetry interface (e.g., cluster metrics) not available. Some methods have focused on deploying a dedicated platform to contain dynamic probes as an attempt to guarantee platform availability for probes to be deployed in the future. Other methods have focused on using Kubernetes or another container or application orchestration engine to deploy active probes to a platform providing a service. However, dedicating an entire server to probes that may be deployed in the future is wasteful of resources in a datacenter/cloud deployment, and extra resources are not available at the edge of the network. Furthermore, considering dedicated servers, software and probes need to change as more advanced platforms enter deployment (e.g., a solution with active probes on dedicated servers of a first type may need to be reworked considerably when a server of a second type is an improved choice with significant performance and acceleration options). In some examples, Kubernetes may not have compute resources available to deploy on-demand probes when required.
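The enumerated failed states above can be illustrated by a simple classifier that inspects the reported status of the monitoring/orchestration stack; the status dictionary layout, field names, and thresholds below are hypothetical and used only for illustration:

```python
# Illustrative sketch: map the reported status of an assurance stack onto the
# failed states enumerated above. Field names and thresholds are assumptions.
FAILED_STATES = [
    ("not_deployed",       lambda s: not s.get("deployed", False)),
    ("failed",             lambda s: s.get("health") == "failed"),
    ("unreachable",        lambda s: not s.get("reachable", False)),
    ("probe_unresponsive", lambda s: s.get("probe_latency_ms", 0) > s.get("probe_budget_ms", 1000)),
    ("platform_telemetry_unavailable", lambda s: not s.get("platform_telemetry", False)),
    ("cluster_metrics_unavailable",    lambda s: not s.get("cluster_metrics", False)),
]

def classify_assurance_state(status):
    """Return the list of failed states that currently apply; empty if healthy."""
    return [name for name, predicate in FAILED_STATES if predicate(status)]
```

An alert would then be generated whenever the returned list is non-empty, since any of these states prevents the assurance intent from being met.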
Additionally, an impact of responsiveness versus resource/traffic overhead of using on-demand probes in the cluster is another factor that needs consideration.
Example methods and apparatus disclosed herein facilitate forced reservation for active probes and introduce a new workflow to perform a series of automated checks. In at least some examples disclosed herein, a new workflow is introduced to prioritize deploying an active probe by forcefully freeing up capacity through a combination of forced scaling down of deployed workload capacity (e.g., apart from the workload under test), temporarily evicting other workloads, and/or adding capacity to the workload under test to possibly deploy the active probe with a sidecar pattern. In some examples, policy governance can be used to decide whether a permanent reserve or forceful deploy pattern can be used. Furthermore, in at least some examples disclosed herein, a new workflow to perform a series of automated checks is introduced based on a predefined policy, which defines the required assurance capabilities including: (1) platform collectors deployed and active, (2) platform collector reachable, (3) monitoring system deployed, (4) monitoring system accessible, (5) reserved space for active probes available, (6) Kubernetes (K8S) cluster accessible, (7) K8S ingress load balancer available, (8) cluster telemetry service available and reachable, (9) monitoring and analytics system platform fault count within tolerance, and/or (10) software-defined networking (SDN) system available. For example, monitoring and automatic checks can be performed using network schemes (e.g., infrastructure processing units (IPUs) and switches) to have more complex triggering rules. For example, a scale-out application may be acceptable if certain services fail or have a transient failure. However, a high risk of application failure can occur if both services have transient connectivity failure at the same time. Therefore, network schemes (e.g., IPUs and switches) can be programmed to monitor such a multi-modal dependency.
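The forced-reservation workflow described above may be sketched as follows; the cluster representation, policy fields, and returned outcome labels are hypothetical illustrations rather than a required implementation:

```python
# Hypothetical sketch of the forced-reservation workflow: when no free capacity
# exists for an active probe, forcefully free capacity by scaling down workloads
# other than the workload under test, temporarily evicting evictable workloads,
# or falling back to a sidecar deployment, subject to policy governance.
def place_active_probe(cluster, probe_demand, policy):
    if cluster["free_capacity"] >= probe_demand:
        return "deploy_in_reserved_space"
    if policy.get("allow_forced_deploy", False):
        # Scale down deployed workload capacity, apart from the workload under test.
        for w in cluster["workloads"]:
            if not w["under_test"] and w["replicas"] > w["min_replicas"]:
                w["replicas"] -= 1
                cluster["free_capacity"] += w["per_replica_capacity"]
                if cluster["free_capacity"] >= probe_demand:
                    return "deploy_after_scale_down"
        # Fall back: temporarily evict an evictable workload.
        for w in cluster["workloads"]:
            if not w["under_test"] and w.get("evictable", False):
                cluster["free_capacity"] += w["replicas"] * w["per_replica_capacity"]
                w["replicas"] = 0
                if cluster["free_capacity"] >= probe_demand:
                    return "deploy_after_eviction"
    if policy.get("allow_sidecar", False):
        # Add capacity to the workload under test and attach the probe as a sidecar.
        return "deploy_as_sidecar"
    return "cannot_deploy"
```

Policy governance decides which branches are permitted, e.g. whether a permanent reserve, a forceful deploy pattern, or a sidecar pattern can be used for a given workload under test.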
Additionally, risk management computation can be complex and associated with context- and intention-dependent weights assigned to different events, outages, and service level objectives (SLOs). An intention-based orchestration policy can dynamically automate and prioritize the allocation of a risk budget by taking into account various known and emerging predictors (e.g., factors, observations) and mitigating risk by calling into pre-planned resource allocation, configuration scaling, task migration, and resource sequestration policies. It can also raise alerts as and when such a dynamically managed risk budget crosses thresholds and requires human attention. For example, reactive site reliability engineering (SRE) risk management can be brought under the rubric of intent-based orchestration.
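A minimal sketch of such a weighted risk budget is shown below; the event names, weights, and budget threshold are hypothetical, and a real policy would derive the weights from context and intent rather than a static table:

```python
# Illustrative sketch: accumulate context/intention-dependent weights over
# observed events, outages, and SLO misses, and alert when the consumed
# risk budget crosses a threshold. All names and values are assumptions.
def risk_budget_consumed(events, weights):
    """Sum the weight of each observed event against the risk budget."""
    return sum(weights.get(e, 0.0) for e in events)

def needs_human_attention(events, weights, budget):
    """True when the dynamically managed risk budget is exceeded."""
    return risk_budget_consumed(events, weights) > budget
```
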
Example methods and apparatus disclosed herein additionally or alternatively facilitate mapping a risk intent to deployment methods and mitigations. In at least some examples disclosed herein, generation of risk mitigations includes expressing the risk as a probability of occurrence and an impact, expressed as a cost value, with impact to end users and the cost to repair used as inputs. The risk assessment component builds models of risks over time, based on observed risk occurrence, impact, and mean time to repair, and produces a model for each layer of the stack. Risk models are produced for faults in the orchestration layer, infrastructure layer, service orchestration layer, monitoring and analytics layer, etc. In at least some examples disclosed herein, the service to be deployed provides an intent-based risk tolerance profile/descriptor that includes (1) allowable outage time, (2) time to repair, (3) cost to repair, (4) allowable number of users to be impacted, and/or (5) degradation allowable on application-specific SLOs. For example, applying risk mitigations includes considering the risk profile and distributing risk to each layer of the stack and to specific resources. In at least some examples, highly reliable resources are matched to risk intents that have the largest impact on cost and the lowest tolerance to outage time. The effectiveness of mitigations is monitored and evaluated over time. Given that not all risk assessments are sufficiently accurate for high confidence, at least some examples disclosed herein collect data on faults and numbers of interactions that are impacted by faults so that divergence of actual risk (e.g., as measured by a cost function of impacts) from projected risk can be used in retraining the risk assessments and for focusing postmortem analyses and adapting escalations. The risk model disclosed herein for generating risk mitigations can be continually updated using this approach.
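The scoring and matching described above can be sketched as follows, under the assumption (hypothetical field names included) that risk is scored as probability of occurrence times impact cost and that the most reliable resources are greedily paired with the highest-risk, lowest-outage-tolerance intents:

```python
# Hypothetical sketch of the risk model above: score each risk intent as
# probability x impact cost, then match the most reliable resources to the
# intents with the largest cost impact and lowest outage tolerance.
def risk_score(probability, impact_cost):
    return probability * impact_cost

def match_resources(intents, resources):
    """Greedily pair highest-risk intents with the most reliable resources."""
    ranked_intents = sorted(
        intents,
        key=lambda i: (risk_score(i["p"], i["impact_cost"]), -i["allowable_outage_s"]),
        reverse=True)
    ranked_resources = sorted(resources, key=lambda r: r["reliability"], reverse=True)
    return {i["name"]: r["name"] for i, r in zip(ranked_intents, ranked_resources)}
```

A production risk assessment component would replace the static probabilities with models learned from observed occurrence, impact, and mean time to repair, per layer of the stack.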
In at least some examples disclosed herein, risk hierarchy and risk relationships can be built into the models supporting an up leveling of risk from lower layers of the stack to higher layers. This modeling allows for an impact (e.g., blast radius) to be associated with certain risks.
In at least some examples disclosed herein, risk mitigations are part of the risk model and the mitigations can be expressed as intents (e.g., user desires to mitigate high impact risks, application of automatic remediations, and notification of human operators when remediations do not work). Mitigations can include adding more capacity on failure conditions, among others (e.g., 1+N or 1:1 protection schemes, path rerouting to alternate sites, etc.). Risk model updates (e.g., reputation) can be part of an attestation architecture such that trust can be not only established, but also validated. In at least some examples disclosed herein, declarative means of specifying extended telemetry are provided to assess whether those mitigations help, and to what extent. For example, mitigations may take some time to work and may produce temporary but acceptable setbacks (e.g., more latency, less throughput, etc.) before they produce improvements.
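Assessing whether a mitigation helped, while tolerating a temporary setback, may be sketched as follows; the telemetry series, settle window, and metric choice are illustrative assumptions:

```python
# Illustrative sketch: judge mitigation effectiveness from extended telemetry
# by comparing the metric before the mitigation with the metric after a
# "settle" window, so a temporary acceptable setback is not counted against it.
from statistics import mean

def mitigation_helped(latency_series, applied_at, settle=2):
    """True if mean latency after the settle window improves on the
    pre-mitigation mean; window sizes are hypothetical."""
    before = latency_series[:applied_at]
    after = latency_series[applied_at + settle:]
    return mean(after) < mean(before)
```
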
Individual platforms or devices of the edge computing system 200 are located at a particular layer corresponding to example layers 220, 230, 240, 250, and/or 260. For example, the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f are located at an example endpoint layer 220, while the edge gateway platforms 212a, 212b, 212c are located at an example edge devices layer 230 (local level) of the edge computing system 200. Additionally, the edge aggregation platforms 222a, 222b (and/or fog platform(s) 224, if arranged or operated with or among a fog networking configuration 226) are located at an example network access layer 240 (an intermediate level). Fog computing (or “fogging”) generally refers to extensions of cloud computing to the edge of an enterprise's network or to the ability to manage transactions across the cloud/edge landscape, typically in a coordinated distributed or multi-node network. Some forms of fog computing provide the deployment of compute, storage, and networking services between end devices and cloud computing data centers, on behalf of the cloud computing locations. Some forms of fog computing also provide the ability to manage the workload/workflow level services, in terms of the overall transaction, by pushing certain workloads to the edge or to the cloud based on the ability to fulfill the overall service level agreement.
In the example of
Consistent with the examples provided herein, a client compute platform (e.g., one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f) may be implemented as any type of endpoint component, device, appliance, or other thing capable of communicating as a producer or consumer of data. For example, a client compute platform can include a mobile phone, a laptop computer, a desktop computer, a processor platform in an autonomous vehicle, etc. In additional or alternative examples, a client compute platform can include a camera, a sensor, etc. Further, the labels “platform,” “node,” and/or “device” as used in the edge computing system 200 do not necessarily mean that such platform, node, and/or device operates in a client or slave role; rather, any of the platforms, nodes, and/or devices in the edge computing system 200 refer to individual entities, platforms, nodes, devices, and/or subsystems which include discrete and/or connected hardware and/or software configurations to facilitate and/or use the edge cloud 210.
As such, the edge cloud 210 is formed from network components and functional features operated by and within the edge gateway platforms 212a, 212b, 212c and the edge aggregation platforms 222a, 222b of layers 230, 240, respectively. The edge cloud 210 may be implemented as any type of network that provides edge computing and/or storage resources which are proximately located to radio access network (RAN) capable endpoint devices (e.g., mobile computing devices, IoT devices, smart devices, etc.), which are shown in
In some examples, the edge gateway platforms 212a, 212b, 212c and the edge aggregation platforms 222a, 222b cooperate to provide various edge services and security to the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f. Furthermore, because a client compute platform (e.g., one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f) may be stationary or mobile, a respective edge gateway platform 212a, 212b, 212c may cooperate with other edge gateway platforms to propagate presently provided edge services, relevant service data, and security as the corresponding client compute platforms 202a, 202b, 202c, 202d, 202e, 202f move about a region. To do so, the edge gateway platforms 212a, 212b, 212c and/or edge aggregation platforms 222a, 222b may support multiple tenancy and multiple tenant configurations, in which services from (or hosted for) multiple service providers, owners, and multiple consumers may be supported and coordinated across single or multiple compute device(s) in a cluster of compute devices.
Additionally, edge platforms and/or orchestration components thereof may consider several factors when orchestrating services and/or applications in an edge environment. These factors can include next-generation central office smart network functions virtualization and service management, improving performance per watt at an edge platform and/or of orchestration components to overcome the limitation of power at edge platforms, reducing power consumption of orchestration components and/or an edge platform, improving hardware utilization to increase management and orchestration efficiency, providing physical and/or end to end security, providing individual tenant quality of service and/or service level agreement satisfaction, improving network equipment-building system compliance level for each use case and tenant business model, pooling acceleration components, and billing and metering policies to improve an edge environment.
A “service” is a broad term often applied to various contexts, but in general, it refers to a relationship between two entities where one entity offers and performs work for the benefit of another. However, the services delivered from one entity to another may be performed with certain guidelines, which ensure trust between the entities and manage the transaction according to the contract terms and conditions set forth at the beginning, during, and end of the service. One type of service that may be offered in an edge environment hierarchy is Silicon Level Services. For instance, Software Defined Silicon (SDSi)-type hardware provides the ability to ensure low level adherence to transactions, through the ability to intra-scale, manage and assure the delivery of operational service level agreements. Use of SDSi and similar hardware controls provide the capability to associate features and resources within a system to a specific tenant and manage the individual title (rights) to those resources. Use of such features is among one way to dynamically “bring” the compute resources to the workload.
For example, an operational level agreement and/or service level agreement could define “transactional throughput” or “timeliness”—in the case of SDSi, the system and/or resource can sign up to guarantee specific service level specifications (SLS) and objectives (SLO) of a service level agreement (SLA). For example, SLOs can correspond to particular key performance indicators (KPIs) (e.g., frames per second, floating point operations per second, latency goals, etc.) of an application (e.g., service, workload, etc.) and an SLA can correspond to a platform level agreement to satisfy a particular SLO (e.g., one gigabyte of memory for 10 frames per second). SDSi hardware also provides the ability for the infrastructure and resource owner to empower the silicon component (e.g., components of a composed system that produce metric telemetry) to access and manage (add/remove) product features and freely scale hardware capabilities and utilization up and down. Furthermore, SDSi hardware can provide deterministic feature assignments on a per-tenant basis. In some examples, SDSi hardware also provides the capability to tie deterministic orchestration and service management to the dynamic (or subscription based) activation of features without the need to interrupt running services or client operations, or to reset or reboot the system.
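For illustration, the relationship between SLOs (per-KPI targets) and an SLA can be sketched as a check of measured KPI values against targets. The following Python sketch is illustrative only; the names (`Slo`, `check_sla`) and example KPI values are hypothetical and do not correspond to any SDSi interface.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    kpi: str                       # e.g., "frames_per_second"
    target: float                  # required value for the KPI
    higher_is_better: bool = True  # False for metrics such as latency

def check_sla(slos, measured):
    """Return the SLOs violated by the measured KPI values; a missing
    measurement counts as a violation."""
    violations = []
    for slo in slos:
        value = measured.get(slo.kpi)
        if value is None:
            violations.append(slo)
        elif slo.higher_is_better and value < slo.target:
            violations.append(slo)
        elif not slo.higher_is_better and value > slo.target:
            violations.append(slo)
    return violations

# Illustrative SLA: at least 10 frames per second, at most 50 ms latency.
slos = [Slo("frames_per_second", 10.0),
        Slo("latency_ms", 50.0, higher_is_better=False)]
violated = check_sla(slos, {"frames_per_second": 12.5, "latency_ms": 61.0})
```

Under these example measurements, only the latency SLO is violated, so only it would drive a corrective platform action.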
At a lower layer, SDSi can provide services and guarantees to systems to ensure active adherence to contractually agreed-to service level specifications that a single resource has to provide within the system. Additionally, SDSi provides the ability to manage the contractual rights (title), usage and associated financials of one or more tenants on a per component, or even silicon level feature (e.g., SKU features). Silicon level features may be associated with compute, storage or network capabilities, performance, determinism or even features for security, encryption, acceleration, etc. These capabilities ensure not only that the tenant can achieve a specific service level agreement, but also assist with management and data collection, and assure the transaction and the contractual agreement at the lowest manageable component level.
At a higher layer in the services hierarchy, Resource Level Services includes systems and/or resources which provide (in complete or through composition) the ability to meet workload demands by either acquiring and enabling system level features via SDSi, or through the composition of individually addressable resources (compute, storage and network). At yet a higher layer of the services hierarchy, Workflow Level Services is horizontal, since service-chains may have workflow level requirements. Workflows describe dependencies between workloads in order to deliver specific service level objectives and requirements to the end-to-end service. These services may include features and functions like high-availability, redundancy, recovery, fault tolerance or load-leveling. Workflow services define dependencies and relationships between resources and systems, describe requirements on associated networks and storage, as well as describe transaction level requirements and associated contracts in order to assure the end-to-end service. Workflow Level Services are usually measured in Service Level Objectives (SLOs) and have mandatory and expected service requirements.
In the example illustrated in
In other examples, one or more of the orchestrator controller circuitry 302, the capability controller circuitry 304, the telemetry controller circuitry 306, the EP database 308, and the resource(s) controller circuitry 310 is/are separate devices included in an edge environment. Further, one or more of the orchestrator controller circuitry 302, the capability controller circuitry 304, the telemetry controller circuitry 306, the EP database 308, and the resource(s) controller circuitry 310 can be included in an edge device layer (e.g., the edge device layer 330), a network access layer (e.g., the network access layer 340), a core network layer (e.g., the core network layer 350), and/or a cloud data center layer (e.g., the cloud data center layer 360). For example, the orchestrator controller circuitry 302 can be included in an edge devices layer (e.g., the edge devices layer 230), or the resource(s) controller circuitry 310 can be included in a network access layer (e.g., the network access layer 240), a core network layer (e.g., the core network layer 250), and/or a cloud data center layer (e.g., the cloud data center layer 260).
In some examples, in response to a request to execute a workload from a client compute platform (e.g., one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f), the orchestrator controller circuitry 302 communicates with at least one of the resource(s) controller circuitry 310 and the client compute platform (e.g., one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f) to create a contract (e.g., a workload contract) associated with a description of the workload to be executed. The client compute platform (e.g., one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f) provides a task associated with the contract and the description of the workload to the orchestrator controller circuitry 302, and the orchestrator controller circuitry 302 schedules the task to be executed at the edge platform. The task can include the contract and the description of the workload to be executed. In some examples, the task includes requests to acquire and/or otherwise allocate resources used to execute the workload.
In some examples, the orchestrator controller circuitry 302 maintains records and/or logs of actions occurring in an endpoint layer (e.g., the endpoint layer 220), an edge device layer (e.g., the edge device layer 230), a network access layer (e.g., the network access layer 240), a core network layer (e.g., the core network layer 250), and/or a cloud data center layer (e.g., the cloud data center layer 260) of an edge environment. For example, the resource(s) controller circuitry 310 can notify the orchestrator controller circuitry 302 of receipt of a workload description. The orchestrator controller circuitry 302 and/or the resource(s) controller circuitry 310 provide records of actions and/or allocations of resources to the orchestrator controller circuitry 302. For example, the orchestrator controller circuitry 302 maintains and/or stores a record of receiving a request to execute a workload (e.g., a contract request provided by one of the client compute platforms 202a, 202b, 202c, 202d, 202e, 202f). In some examples, the orchestrator controller circuitry 302 accesses a task and provides and/or assigns the task to one or more of the resource(s) controller circuitry 310 to execute or complete. The resource(s) controller circuitry 310 executes a workload based on a description of the workload included in the task.
In some examples, the orchestrator controller circuitry 302 can be configured to calibrate the power consumption and utilization of the orchestrator controller circuitry 302 (e.g., ones of the resource(s) allocated to the orchestrator 302) and adapt orchestration based on available or predicted power, thermal, and/or resource settings (e.g., budgets). For example, the orchestrator controller circuitry 302 may receive from a client compute platform, with a workload, configuration settings for resource(s) allocated to the orchestrator controller circuitry 302. The orchestrator controller circuitry 302 is configured to adjust a frequency of monitoring and/or scheduling of monitoring data collections, to manage the consumption of resource(s) 310 by the orchestrator controller circuitry 302 (e.g., orchestration components) to comply with SLA objectives while efficiently orchestrating tasks. For example, the orchestrator controller circuitry 302 can adjust the frequency of monitoring telemetry data based on a priority (e.g., priority level) associated with resources at an edge platform (e.g., the edge platform circuitry 300).
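A minimal sketch of such monitoring-frequency adjustment, assuming a numeric priority level and a normalized power headroom as hypothetical inputs (neither is defined in this disclosure), might look like:

```python
def monitoring_interval_s(priority, base_interval_s=10.0, power_headroom=1.0):
    """Telemetry sampling interval for a resource: high-priority resources
    (priority 1) are sampled most often; a shrinking power budget
    (headroom normalized to 0..1) stretches the interval to reduce the
    orchestration components' own resource consumption."""
    interval = base_interval_s * priority
    if power_headroom < 0.5:      # low power budget: halve the sampling rate
        interval *= 2.0
    return interval
```

For example, a priority-1 resource would be sampled every 10 seconds with ample power headroom, while a priority-2 resource under a tight power budget would be sampled only every 40 seconds.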
In the example of
In the illustrated example of
In some examples, the capability controller circuitry 304 retrieves the capability data from the EP database 308. For example, when the orchestrator controller circuitry 302 receives a request to execute a workload, the orchestrator controller circuitry 302 identifies, by accessing the capability controller circuitry 304 and/or the EP database 308, whether the capabilities of the edge platform circuitry 300 include proper resource(s) to fulfill the workload task. For example, if the orchestrator controller circuitry 302 receives a request to execute a workload that requires a processor with two cores, the orchestrator controller circuitry 302 can access the capability controller circuitry 304 and/or the EP database 308 to determine whether the edge platform circuitry 300 includes the capability to process the requested workload.
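The capability lookup described above can be sketched as a comparison of workload requirements against stored capability data. The dictionary-based representation below is an illustrative assumption, not the schema of the EP database 308.

```python
def can_fulfill(workload_requirements, platform_capabilities):
    """True when the platform offers every required capability in at
    least the requested quantity; an absent capability counts as zero."""
    return all(platform_capabilities.get(name, 0) >= amount
               for name, amount in workload_requirements.items())

# Hypothetical capability data for one edge platform.
capabilities = {"cpu_cores": 8, "memory_gb": 32}
```

With these example values, a workload requiring two processor cores can be fulfilled, while one additionally requiring an FPGA slot cannot.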
In the example of
In the illustrated example of
In the illustrated example of
In the illustrated example of
In the example of
The orchestrator interface generator circuitry 402 controls communication (e.g., communications related to orchestration) with the edge platform circuitry 300 and/or remote edge platforms (e.g., near-edge platforms with respect to the edge platform circuitry 300, a next-tier, etc.). The orchestrator interface generator circuitry 402 is configured to determine whether the edge platform circuitry 300 has received telemetry data from a remote edge platform. For example, the orchestrator interface generator circuitry 402, and/or more generally, the orchestrator controller circuitry 302, can receive telemetry data from an edge platform that is geographically closer to a client compute platform than the edge platform circuitry 300. In response to determining that the edge platform circuitry 300 has received telemetry data from a remote edge platform, the orchestrator interface generator circuitry 402 transmits the telemetry data and/or any additional data (e.g., indication of granularity, configuration settings for remote edge platform orchestrator, etc.) to the resource management controller circuitry 404.
The resource management controller circuitry 404 is configured to manage resource consumption of resource(s) by orchestration components (e.g., the orchestrator interface generator circuitry 402, the resource management controller circuitry 404, the workload scheduler circuitry 406) and/or other components of the edge platform circuitry 300 (e.g., the capability controller circuitry 304, the telemetry controller circuitry 306, the EP database 308 and/or the resource(s) controller circuitry 310). For example, the resource management controller circuitry 404 monitors the utilization of power and/or various other resources by orchestration components and/or other components of an edge platform. Depending on the amount of resources that is available at the edge platform, and the estimated or pledged amount of each to the workloads executing at the edge platform, the resource management controller circuitry 404 may raise, lower, or transfer the work for telemetry and orchestration to a next near-edge tier.
To manage the resources at an edge platform (e.g., the edge platform circuitry 300), the resource management controller circuitry 404 requests, from an orchestrator at a remote edge platform and/or another computer, the orchestration results. Additionally or alternatively, the resource management controller circuitry 404 can manage resources at an edge platform based on KPIs associated with an application (e.g., a workload, service, etc.). In some examples, the resource management controller circuitry 404 and/or the orchestrator controller circuitry 302 can adjust resource allocation at the edge platform to meet given SLOs of an SLA for each service and/or workload executing at the edge platform. Additionally or alternatively, the resource management controller circuitry 404 estimates, based on the telemetry data collected by the orchestrator interface generator circuitry 402, the amount of resources to be utilized by various services, applications, and/or workloads assigned to the edge platform to meet the respective SLAs associated with each of the services, applications, and/or workloads. Based on the amount of resources estimated to be utilized, the resource management controller circuitry 404 determines what quantity of resources may be released from, or made available to, the orchestration components at the edge platform.
The workload scheduler circuitry 406 generally schedules one or more workloads, services, and/or applications to execute at an edge platform. In some examples, scheduling includes accessing a task received and/or otherwise obtained by the resource management controller circuitry 404 and providing the task to one or more of the resources at an edge platform to execute or complete. In some examples, scheduling includes selecting ones of workloads assigned to an edge platform to offload to a remote edge platform to be executed. The workload scheduler circuitry 406 accesses a result of the execution of the workload from one or more of the resources at the edge platform that executed the workload. The workload scheduler circuitry 406 provides the result to the device that requested the workload to be executed, such as a client compute platform and/or other edge platform. In some examples, the workload scheduler circuitry 406 is configured to determine whether a candidate schedule satisfies one or more SLAs associated with one or more workloads.
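A candidate-schedule check of the kind described can be sketched as follows; representing each workload by a single estimated latency checked against a per-workload latency SLA is a simplifying assumption made for illustration.

```python
def schedule_satisfies_slas(candidate, latency_slas):
    """candidate maps workload -> estimated latency (ms) under the
    candidate schedule; latency_slas maps workload -> maximum allowed
    latency (ms). A schedule is acceptable only when every workload with
    an SLA is scheduled and meets its limit."""
    return all(candidate.get(workload, float("inf")) <= limit
               for workload, limit in latency_slas.items())
```

A workload that the candidate schedule omits entirely is treated as having infinite latency, so it fails its SLA and the schedule is rejected.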
The assurance intent mapper circuitry 408 maps assurance intents and evaluates intent-based assurance effectiveness. For example, an assurance intent refers to a cluster level assurance focusing on policies such as a percentage of nodes in a cluster with a certain level, coverage of the orchestration control plane, high-availability (HA) deployment, etc. For example, a cluster level assurance intent represents a resource availability criterion to meet a target availability of the resource. A cluster (e.g., a Kubernetes cluster) represents a set of nodes that run containerized applications. In some examples, the cluster is a set of servers that are managed together and participate in workload management. The assurance intent mapper circuitry 408 maps assurance intents to service orchestration and resource orchestration (e.g., reservation of resources for on-demand dynamic service assurance probes). In examples disclosed herein, a probe can be used for monitoring and gathering of information about events affecting containers and/or validating the health of workloads (e.g., applications running on Kubernetes). In some examples, probes can be used to collect telemetry information (e.g., Kubernetes probes allow validating the state of pods running within a cluster, while the workloads run within the pods). For example, once appropriate passive monitoring and analytics stacks are selected and deployed, the assurance intent mapper circuitry 408 reserves compute resources. In some examples, the assurance intent mapper circuitry 408 reserves a single dedicated server (e.g., as part of an edge deployment) or reserves core(s) on a 5G core server for the co-deployment of an active probe on demand (e.g., at some point in the future). In some examples, the assurance intent mapper circuitry 408 reserves memory bandwidth, cache, Speed Select Technology (SST) cores, and/or interface bandwidth for the active probe.
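The reservation of compute resources for a future on-demand active probe can be sketched as bookkeeping over a node's free cores plus bandwidth and cache allotments. The `ProbeReservation` and `NodeResources` types below are hypothetical illustrations, not structures defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class ProbeReservation:
    cores: set             # core IDs held for the future active probe
    memory_bw_gbps: float  # reserved memory bandwidth
    cache_ways: int        # reserved cache ways

@dataclass
class NodeResources:
    free_cores: set

    def reserve_for_probe(self, n_cores, memory_bw_gbps, cache_ways):
        """Hold n_cores plus bandwidth/cache for an on-demand probe, or
        return None when the node cannot supply enough free cores."""
        if len(self.free_cores) < n_cores:
            return None
        chosen = set(sorted(self.free_cores)[:n_cores])
        self.free_cores -= chosen
        return ProbeReservation(chosen, memory_bw_gbps, cache_ways)

node = NodeResources(free_cores={0, 1, 2, 3})
reservation = node.reserve_for_probe(2, memory_bw_gbps=5.0, cache_ways=4)
```

Once the reservation is held, an orchestrator could later deploy the probe onto exactly those cores without contending with running workloads.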
In some examples, the assurance intent mapper circuitry 408 tracks active probes using artificial intelligence-based tracking to monitor active probe(s) and based on core availability provides the cores that are available and have the desired capabilities (e.g., 5G capable, etc.). In some examples, the assurance intent mapper circuitry 408 performs forced reservation for active probes to prioritize deploying the active probe by forcefully freeing up capacity. For example, the assurance intent mapper circuitry 408 performs forced reservation using a combination of forcibly scaling down deployed workload capacity (e.g., apart from the workload under testing), temporarily evicting other workloads, and/or adding capacity to the workload under test to deploy the active probe with a sidecar pattern (e.g., a single node pattern including an application container and a sidecar container). In some examples, the assurance intent mapper circuitry 408 uses policy governance to determine whether a permanent reserve or a forceful deploy pattern can be used for reservation of active probes. In some examples, the assurance intent mapper circuitry 408 uses a supervised tree-based machine learning model to determine when to perform freeing and/or scaling down of deployed workload capacity. However, any other type of machine learning model can be implemented for determining the type of forced reservation to perform. In some examples, the assurance intent mapper circuitry 408 uses a dataset for a given policy to assist in the freeing or scaling down of the deployed workload. As such, when the orchestrator controller circuitry 302 determines that reservation of resources is needed (e.g., via input from a monitoring and analytics stack), the orchestrator controller circuitry 302 deploys active probes to the compute resources reserved for the active probe using the assurance intent mapper circuitry 408.
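For illustration, a hand-written rule cascade can stand in for the supervised tree-based model when sketching the forced-reservation decision; the policy names, action names, and ordering below are assumptions, chosen so the least disruptive option is tried first.

```python
def forced_reservation_action(policy, cores_needed,
                              reclaimable_by_scale_down,
                              reclaimable_by_eviction):
    """Choose how to free capacity for an active probe. A permanently
    reserved pool is used when policy allows; otherwise scale-down is
    preferred over eviction, and the probe is deferred if neither frees
    enough cores."""
    if policy == "permanent_reserve":
        return "use_reserved_capacity"
    if reclaimable_by_scale_down >= cores_needed:
        return "scale_down_workloads"
    if reclaimable_by_scale_down + reclaimable_by_eviction >= cores_needed:
        return "evict_and_scale_down"
    return "defer_probe"
```

A trained model would replace the fixed thresholds with splits learned from the dataset of past freeing/scale-down outcomes described above.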
In examples disclosed herein, assurance domain policies include K8S deployment in high availability configurations, high availability K8S ingress load balancer configurations (e.g., external to K8S), enablement of storage availability resiliency schemes (RAID X, etc.), storage monitoring and analytics enablement, K8S auditing and tracing enablement, SDN/NMS availability policies (1:1 redundant switch, etc.), service mesh monitoring and analytics enablement (e.g., Cilium), network interface HA schemes (e.g., port redundancy, multi-path) enablement, IPU monitoring enablement, port flow telemetry enablement, open telemetry gateways enablement, etc. In some examples, automated assurance check policies include infra telemetry collectors deployment, infra telemetry collectors reachability, monitoring system deployment and reachability, analytics deployment and reachability, reserved space for active probes availability, K8S cluster accessibility, cluster telemetry (e.g., Kubernetes state metrics) API accessibility, monitoring and analytics system infra health, network management station (NMS)/software defined networking (SDN) system reachability and activity, open telemetry gateways activity and reachability, etc. In some examples, validation and periodic verifying that various software and hardware aspects of the system are within acceptable ranges is performed and/or excursions are predicted. In examples disclosed herein, cluster level assurance policies include percentages of nodes in a cluster with a certain level, coverage of the orchestration control plane, high-availability (HA) deployment, storage resiliency models, extraction of assurance capabilities/wellness from the underlying Infrastructure as a Service (IaaS) layer, assessment of active/passive assurance capabilities (e.g., with possibility of associating charging with capability), and/or cluster audit capability availability.
In examples disclosed herein, infrastructure network equipment to implement the assurance domain policies, automated assurance check policies, and/or cluster level assurance policies can vary. For example, various architectures (e.g., Intel® Tofino, Infrastructure Processing Units (IPUs), etc.) can be used to establish distributed/delegate monitoring, followed by more centralized and coordinated measurement entities (e.g., switches). A type of network of virtual channels or network flows can be defined to perform this aspect, with separation from remaining traffic and management by K8S (e.g., exposing capabilities to execute K8S plugins that are specific for a switch). In some examples, switches can include methods to register rules that identify situations that should not happen at the same time and that can be identified by monitoring of multiple KPIs from various platforms/resources/services. In some examples, switches can require the IPUs connecting a particular platform to monitor certain resources or services and trigger back an alarm (e.g., using alert generator circuitry 410) whenever a particular monitoring condition occurs (e.g., service not responding or resource not working properly). In some examples, switches receiving events relate them to a specific rule and, on rule assertion, may trigger a notification to the orchestrator controller circuitry 302.
As illustrated in
Training is performed using training data. In examples disclosed herein, the training data originates from previous freeing up or scaling down of deployed workload capacity to determine which resource reservation approach is effective for a given task (e.g., based on whether a permanent reserve or a forceful deploy pattern can be used for reservation of active probes). In some examples, the training data is labeled. In some examples, the training data is sub-divided such that a portion of the data is used for validation purposes.
Once training is complete, the resource reservation model(s) are stored in one or more databases (e.g., the database 446 of
In some examples, output of the deployed model(s) may be captured and provided as feedback. By analyzing the feedback, an accuracy of the deployed model(s) can be determined. If the feedback indicates that the accuracy of the deployed model(s) is less than a threshold or other criterion, training of an updated model can be triggered using the feedback and an updated training data set, hyperparameters, etc., to generate an updated, deployed model(s).
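The feedback loop described above can be sketched as an accuracy check over captured prediction/label pairs; the function name and the 0.9 default threshold are assumptions for illustration.

```python
def should_retrain(predictions, labels, accuracy_threshold=0.9):
    """Trigger retraining when the deployed model's accuracy, measured on
    captured feedback, falls below the acceptance threshold."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(predictions) < accuracy_threshold
```

When this returns true, the orchestration stack would assemble an updated training data set (and possibly new hyperparameters) and launch a new training run.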
As shown in
The first computing system 430 of
The alert generator circuitry 410 generates an alert when an assurance intent is not met. In some examples, the alert generator circuitry 410 performs a series of automated checks, based on a predefined policy, which defines the target (e.g., required, specified, etc.) assurance capabilities. In some examples, the target assurance capabilities include the following: (1) platform collectors deployed and active, (2) platform collector reachable, (3) monitoring system deployed, (4) monitoring system accessible, (5) reserved space for active probes available, (6) K8S cluster accessible, (7) K8S ingress load balancer available, (8) Kube-stats service available and reachable, (9) Monitoring & analytics system platform fault count within tolerance, and/or (10) SDN system available. In some examples, the alert generator circuitry 410 performs automated checks using artificial intelligence (e.g., using a supervised model). In some examples, the alert generator circuitry 410 calculates an overall score indicating the number of intents met. Based on the calculated score, the alert generator circuitry 410 generates an alert when a threshold number of intents is not reached (e.g., as compared to intents expressed for assurance by a service owner or resource owner). In some examples, the alert generator circuitry 410 identifies test results from cluster deployment health metrics.
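The overall score and threshold alert can be sketched as a count over automated check results. The check names below abbreviate items from the list above, and the threshold value is an assumption standing in for the intents expressed by a service or resource owner.

```python
def assurance_alert(check_results, min_intents_met):
    """check_results maps each automated check to pass/fail; the alert
    fires when the pass count (overall score) falls below the number of
    intents that must be met."""
    score = sum(check_results.values())
    return score, score < min_intents_met

# Hypothetical results of four of the automated checks.
checks = {"collectors_deployed": True,
          "collectors_reachable": True,
          "monitoring_deployed": False,
          "probe_space_available": True}
score, alert = assurance_alert(checks, min_intents_met=4)
```

With one failed check against a requirement of four met intents, the score is 3 and an alert is raised.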
The risk mapper circuitry 412 maps a risk intent to deployment methods and mitigations. In examples disclosed herein, risk mitigation generation includes expressing the risk as a probability of occurrence and an impact expressed as a cost value, with impact to end users and the cost to repair used as inputs. In some examples, the risk mapper circuitry 412 builds models of risks over time, based on observed risk occurrence, impact, and mean time to repair, and produces a model for each layer of a stack. Risk models are produced for faults in the orchestration layer, infrastructure layer, service orchestration layer, monitor and analytics layer, etc. In some examples, the service to be deployed provides an intent-based risk tolerance profile/descriptor that can include allowable outage time, time to repair, cost to repair, allowable number of users to be impacted, and degradation allowable on app-specific SLOs. In some examples, the risk mapper circuitry 412 considers the risk profile and distributes risk to each layer of the stack and to specific resources. For example, the risk mapper circuitry 412 matches highly reliable resources to risk intents that have the largest impact on cost and the lowest tolerance to outage time. In some examples, the risk mapper circuitry 412 monitors and evaluates the effectiveness of mitigations over time.
In some examples, the risk mapper circuitry 412 collects data on faults and numbers of interactions that are impacted by faults so that divergence of actual risk (e.g., as measured by a cost function of impacts) from projected risk can be used in retraining the risk assessments and for focusing postmortem analyses and adapting escalations, allowing generated risk models to be continually updated over time. For example, risk hierarchy and risk relationships can be built into the models supporting an up-leveling of risk from lower layers of the stack to higher layers, allowing for an impact to be associated with certain risks. In examples disclosed herein, risk mitigations (e.g., as part of the risk model) can be expressed as desired intents (e.g., mitigate high impact risks using automatic remediations and notify human operators when remediations do not work, etc.). In some examples, mitigations can include adding more capacity on failure conditions, 1+N protection switching, 1:1 protection schemes, path rerouting to alternate sites, etc. In some examples, risk model updates can be part of an attestation architecture, allowing the trust to not only be established but also validated. In some examples, extended telemetry can be used to assess whether certain mitigations are helpful and to what extent, since mitigations may take some time to work and may produce temporary but acceptable setbacks (e.g., more latency, less throughput) before improvements are achieved.
In examples disclosed herein, the risk mapper circuitry 412 receives an intent-based risk tolerance profile. For example, the service to be deployed provides an intent-based risk tolerance profile that includes allowable outage time, time to repair, cost to repair, allowable number of users to be impacted, etc. In some examples, the information includes regulatory risks for availability of services (e.g., associated with the Federal Communications Commission (FCC)) and/or reliability risks from analytics. The risk mapper circuitry 412 performs risk intent mapping by mapping the received allowable risk to a probability of occurrence in each domain (e.g., infra, software, switching, cluster, Kubernetes, etc.) expressed as a service risk profile. In some examples, the service risk profile is the probability of occurrence in each domain and impact, expressed as a cost (e.g., dollar value), with impact to end users and/or cost to repair as inputs. The risk mapper circuitry 412 generates domain-specific risk mitigations by assessing the service risk profile and producing risk mitigations for each risk domain. In some examples, the risk mapper circuitry 412 uses a reliability modeling component to build models of risks over time, based on observed risk occurrence, impact, and mean time to repair, and produces a model for each layer of the stack. Risk models are produced for faults in the orchestration layer, infrastructure layer, service orchestration layer, and monitor and analytics layer.
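The mapping from a risk tolerance to a per-domain service risk profile can be sketched as an expected cost (probability of occurrence times impact-plus-repair cost) compared against an allowable value. All names and numbers below are illustrative assumptions, not values from this disclosure.

```python
def service_risk_profile(tolerance, domain_risks):
    """domain_risks maps each domain to (probability_of_occurrence,
    impact_cost + cost_to_repair). The profile is the expected cost per
    domain; domains whose expected cost exceeds the allowable value need
    a mitigation."""
    profile = {d: p * cost for d, (p, cost) in domain_risks.items()}
    needs_mitigation = {d for d, expected in profile.items()
                        if expected > tolerance["allowable_expected_cost"]}
    return profile, needs_mitigation

# Hypothetical per-domain inputs: (probability, impact + repair cost).
domain_risks = {"infrastructure": (0.01, 10_000.0),
                "software": (0.05, 1_000.0)}
profile, flagged = service_risk_profile(
    {"allowable_expected_cost": 60.0}, domain_risks)
```

Here the infrastructure domain's expected cost (100) exceeds the allowable 60, so only that domain would be handed a domain-specific mitigation such as a highly reliable resource assignment.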
In some examples, risk mitigations include selecting resources using reliability ranges and reliability features (e.g., CPU reliability within acceptable range, memory reliability within acceptable range, etc.). Selected platforms can include CPU reliability, availability, and serviceability (RAS) features, memory RAS features, QAT RAS, IPU RAS, K8s cluster reliability features, high availability deployment configurations, multi-homed networking capabilities, etc. In examples disclosed herein, reliability functions can include functions such as SDN controllers for multi-pathing and/or load balancing for service resiliency. In some examples, the risk mapper circuitry 412 sends a trigger to the resource orchestrator to rebalance/reprovision workloads to operate around the identified sources of risk (e.g., under the governance of the hierarchical risk model).
As illustrated in
Once training is complete, the risk model(s) are stored in one or more databases (e.g., the database 466 of
As shown in
The second computing system 450 of
The risk assessment controller circuitry 414 performs risk assessment at a local level or at a cluster level. For example, the risk assessment controller circuitry 414 performs risk assessment at a local level by analyzing performance metrics and generating real time alerts when risk mitigations are removed and/or a platform is misconfigured. The risk assessment controller circuitry 414 generates alerts when a platform risk such as a reliability change occurs and/or triggers real-time risk mitigations after the risk has occurred. In examples disclosed herein, the risk assessment controller circuitry 414 additionally or alternatively analyzes cluster metrics and generates real-time alerts when cluster level risk mitigations are removed or misconfigured. The risk assessment controller circuitry 414 generates alerts when the cluster cannot support a risk such as a reliability change and triggers real-time risk mitigations after the risk has occurred for inter-cluster scheduling. In examples disclosed herein, reputation attestation accounts for resource risks that can evolve over time and may have different perspectives or experiences depending on who is using the resource itself. Services that are responsible for establishing the risk can be part of an attestation architecture as follows: (1) the assessments these services provide can be tracked in a blockchain to be traceable over time and attested, and (2) the reputation of those services for providing a real assessment can be monitored. For example, given that service A provides a high risk assessment based on the execution on resource B at a specific timestamp (e.g., 20 seconds and afterwards), executing multiple services A using resource B can be determined to provide low risk. Over time, the reputation of service A can be established as well.
In examples disclosed herein, hardware support for risk mitigation accounts for the platform having various resources that can have different risk mitigation software defined silicon (SDSi)-based configurations. In some examples, the risk mapper circuitry 412 maps different properties of the platform and identifies node architecture that may provide different features that could be used to mitigate risk. For example, a sub-NUMA cluster (SNC) to create independent compute domains within a CPU could allow for use of different types of interleaving to confine memory corruption to distinct domains, and a Compute Express Link (CXL) could be used to isolate different elements of the architecture. For example, each of these aspects has implications that need to be handled and matched with respect to the application or service key performance indicators (KPIs) when implementing risk mitigation. Furthermore, hardware support for risk variance can be performed such that the platform can reassess risk over time (e.g., on detection of changing error counts or frequencies such as from memory, I/O, or networking, etc.) to modify the perceived risk of the platform.
The orchestration database 416 stores telemetry data, workloads, models, schedules, SLAs, SLOs, KPIs, etc. The orchestration database 416 can be used to store any information associated with the orchestrator interface generator circuitry 402, resource management controller circuitry 404, workload scheduler circuitry 406, assurance intent mapper circuitry 408, alert generator circuitry 410, risk mapper circuitry 412, and/or risk assessment controller circuitry 414. The orchestration database 416 of the illustrated example of
In some examples, the apparatus includes means for generating an orchestrator interface. For example, the means for generating an orchestrator interface may be implemented by orchestrator interface generator circuitry 402. In some examples, the orchestrator interface generator circuitry 402 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for resource management. For example, the means for resource management may be implemented by resource management controller circuitry 404. In some examples, the resource management controller circuitry 404 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for scheduling a workload. For example, the means for scheduling a workload may be implemented by workload scheduler circuitry 406. In some examples, the workload scheduler circuitry 406 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for mapping an assurance intent. For example, the means for mapping an assurance intent may be implemented by assurance intent mapper circuitry 408. In some examples, the assurance intent mapper circuitry 408 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for generating an alert. For example, the means for generating an alert may be implemented by alert generator circuitry 410. In some examples, the alert generator circuitry 410 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for mapping a risk. For example, the means for mapping a risk may be implemented by risk mapper circuitry 412. In some examples, the risk mapper circuitry 412 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
In some examples, the apparatus includes means for assessing a risk. For example, the means for assessing a risk may be implemented by risk assessment controller circuitry 414. In some examples, the risk assessment controller circuitry 414 may be instantiated by programmable circuitry such as the example programmable circuitry 1512 of
While an example manner of implementing the orchestrator controller circuitry 302 of
Flowcharts representative of example machine readable instructions, which may be executed by programmable circuitry to implement and/or instantiate the orchestrator controller circuitry 302 of
The program may be embodied in instructions (e.g., software and/or firmware) stored on one or more non-transitory computer readable and/or machine readable storage medium such as cache memory, a magnetic-storage device or disk (e.g., a floppy disk, a Hard Disk Drive (HDD), etc.), an optical-storage device or disk (e.g., a Blu-ray disk, a Compact Disk (CD), a Digital Versatile Disk (DVD), etc.), a Redundant Array of Independent Disks (RAID), a register, ROM, a solid-state drive (SSD), SSD memory, non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, etc.), volatile memory (e.g., Random Access Memory (RAM) of any type, etc.), and/or any other storage device or storage disk. The instructions of the non-transitory computer readable and/or machine readable medium may program and/or be executed by programmable circuitry located in one or more hardware devices, but the entire program and/or parts thereof could alternatively be executed and/or instantiated by one or more hardware devices other than the programmable circuitry and/or embodied in dedicated hardware. The machine readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a human and/or machine user) or an intermediate client hardware device gateway (e.g., a radio access network (RAN)) that may facilitate communication between a server and an endpoint client hardware device. Similarly, the non-transitory computer readable storage medium may include one or more mediums. Further, although the example program is described with reference to the flowcharts illustrated in
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data (e.g., computer-readable data, machine-readable data, one or more bits (e.g., one or more computer-readable bits, one or more machine-readable bits, etc.), a bitstream (e.g., a computer-readable bitstream, a machine-readable bitstream, etc.), etc.) or a data structure (e.g., as portion(s) of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices, disks and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and/or stored on separate computing devices, wherein the parts when decrypted, decompressed, and/or combined form a set of computer-executable and/or machine executable instructions that implement one or more functions and/or operations that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by programmable circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the machine-readable instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable, computer readable and/or machine readable media, as used herein, may include instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s).
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example operations of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc., may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, or (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, and/or activities, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, and/or activities, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” object, as used herein, refers to one or more of that object. The terms “a” (or “an”), “one or more”, and “at least one” are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements, or actions may be implemented by, e.g., the same entity or object. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
In the example of
As previously described, probes can be used for monitoring and gathering of information about events affecting containers and/or collecting telemetry information. In some examples, the probes can be used to indicate whether a container is operating, whether an application running in the container is ready to accept requests, and/or whether an application running in the container has started, etc. In some examples, an ‘on-demand probe’ is an active probe that is inserted into an operational network at specific points (e.g., by the management system) to determine the root cause of an issue, where the term ‘on-demand’ indicates that probes can be inserted based on any number of conditions and at any point in the network for root cause analysis and/or troubleshooting (e.g., Extended Berkeley Packet Filter (EBPF) probes used for monitoring networking in a cloud environment). Furthermore, service assurance can rely on passive probing and/or active probing as measurement techniques for evaluating service performance. In some examples, passive probes monitor traffic flows and do not impact the services themselves (e.g., passive probes reading probe level statistics, etc.). For example, passive probes can be engineered into a given network to obtain detailed information at key points. Conversely, in some examples, active probes insert synthetic test traffic into a network and observe how the network and/or a service responds, allowing the active probe to measure service performance. Active probes can be used for generating real-time performance data on specific services. In some examples, active probes can be used in services with performance-based SLAs to ensure fulfillment of service agreements.
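A minimal active probe in the sense described above can be sketched as follows. The `send_request` callable stands in for real synthetic test traffic injected into the network, and the latency SLO value is a hypothetical parameter chosen for the example.

```python
import time

def active_probe(send_request, latency_slo_s: float = 0.2) -> dict:
    """Inject synthetic test traffic via `send_request` and measure the
    response. `send_request` is any callable returning an HTTP-style
    status code; in a real deployment it would issue a request into the
    network. The SLO threshold is an illustrative assumption."""
    start = time.monotonic()
    try:
        status = send_request()
    except Exception:
        status = None  # treat transport failures as an unhealthy service
    latency = time.monotonic() - start
    return {
        "status": status,
        "latency_s": latency,
        "slo_met": status == 200 and latency <= latency_slo_s,
    }

# Example with a stand-in for real synthetic traffic:
result = active_probe(lambda: 200)
```

A passive probe, by contrast, would only read existing traffic statistics and would not call `send_request` at all, which is why it does not impact the services being measured.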
In some examples, the assurance intent mapper circuitry 408 reserves compute resources based on the resource reservation model(s) and/or policy governance. For example, the assurance intent mapper circuitry 408 can reserve resources (e.g., a dedicated server, memory bandwidth, cache, interface bandwidth, etc.) for the co-deployment of an active probe on demand. In some examples, the assurance intent mapper circuitry 408 determines whether to forcefully free up capacity and/or whether to perform freeing or scaling down of deployed workload capacity in accordance with the trained resource reservation model described in connection with
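The admission decision described above, namely whether an on-demand probe can be co-deployed as-is, requires forcefully freeing capacity, or must be rejected, can be sketched as follows. The function name, the CPU-core units, and the three-way outcome are assumptions for illustration; the disclosure itself delegates this decision to the trained resource reservation model.

```python
def plan_probe_admission(free_cpu: float, reserved_cpu: float,
                         scalable_cpu: float, probe_cpu: float) -> str:
    """Decide how to admit an on-demand active probe.

    free_cpu: spare, unreserved capacity; reserved_cpu: capacity held
    back for probe co-deployment; scalable_cpu: capacity reclaimable by
    scaling deployed workloads down. Names and units are illustrative.
    """
    if free_cpu >= probe_cpu or reserved_cpu >= probe_cpu:
        return "deploy"                  # fits without touching workloads
    if free_cpu + scalable_cpu >= probe_cpu:
        return "scale_down_then_deploy"  # forcefully free up capacity
    return "reject"
```

For example, a probe needing 2 cores deploys directly when 4 cores are free, triggers a scale-down when only 1 core is free but 3 are reclaimable, and is rejected when neither condition holds.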
In the example of
In the example of
In examples disclosed herein, the risk mapper circuitry 412 applies risk mitigations (e.g., prior to service deployment) (block 1345). Subsequently, the orchestrator interface generator circuitry 402 configures the infrastructure (e.g., by configuring or deploying monitors with reliability KPIs to support SLO monitoring with global and cross-domain risk contexts, etc.) (block 1350). For example, the workload scheduler circuitry 406 deploys workloads (block 1355) and configures or deploys monitors and/or probes (block 1360), resulting in the monitoring of configured key performance indicators (KPIs) (e.g., frames per second, floating point operations per second, latency goals, etc.) of an application (e.g., service, workload, etc.) (block 1365). In the example of
In the example of
Expressing risk includes the rates at which risk inputs can be applied (e.g., frequency). In some examples, risk is dimensionless, but still indicates a need to “understand” and/or transparently communicate the risk. Risk normalization across various stakeholders (e.g., service owners, resource owners) can be beneficial. Risk assessments can change over time (e.g., exhibit dynamicity). In some examples, a risk hierarchy can be developed, up-leveling the risk from a K8S-centric cluster view into generically termed aggregation zones and data centers. In examples disclosed herein, hardware risk mitigation features (e.g., sub-NUMA clusters (SNCs)) include mapping of different properties of the platform and node architecture that may provide different features that could be used to mitigate risk. For example, an SNC creates two localization domains within a processor by mapping addresses from a first memory controller into the half of the last level cache (LLC) slices closer to the first memory controller and mapping addresses from a second memory controller into the LLC slices in the other half. Through this address-mapping mechanism, processes running on cores in one of the SNC domains using memory from the memory controller in the same SNC domain observe lower LLC and memory latency compared to accesses mapped to locations outside the same SNC domain. For example, SNC can be used to create independent compute domains within a central processing unit (CPU), with different types of interleaving to confine memory corruption to separate domains. In some examples, a Compute Express Link (CXL) can be used to isolate different elements of the architecture. Each of those “knobs” has implications that need to be handled and matched with respect to the application or service KPIs when implementing risk mitigation.
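The risk normalization across stakeholders mentioned above can be sketched with a simple min-max rescaling. The choice of min-max scaling (as opposed to, e.g., z-scores or rank normalization) and the stakeholder names are assumptions for this example only; the disclosure does not mandate a particular normalization scheme.

```python
def normalize_risk(raw_scores: dict) -> dict:
    """Min-max normalize per-stakeholder risk scores onto a common
    [0, 1] scale so that dimensionless risk values reported by
    different stakeholders become comparable. Illustrative sketch."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores agree
    return {k: (v - lo) / span for k, v in raw_scores.items()}

# Hypothetical stakeholder scores on incompatible internal scales:
normalized = normalize_risk(
    {"service_owner": 3.0, "resource_owner": 9.0, "operator": 6.0}
)
```

Because assessments change over time, such a normalization would be recomputed as new scores arrive, and the normalized values could then be rolled up the risk hierarchy from clusters into aggregation zones and data centers.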
In some examples, a federated or distributed compute-based architecture may contain different types of hardware nodes with different sets of risk mitigation features versus performance glass-jaws. In examples disclosed herein, this information can be captured as part of an orchestration manifest and provided to the user in different ways to express the “intent” of a risk. Fast closed-loop controller models (e.g., Intel® Resource Director Technology Dynamic Resource Controller (DRC)) could be extended to support hardware risk mitigation by dynamically switching on/off parts of the system and supporting moving workloads away from riskier components (e.g., for a memory controller reporting higher error counts, a DRC can be used to dynamically disable channels in that memory controller or the entire controller).
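One iteration of such a fast closed loop can be sketched as follows. The channel naming scheme and the error threshold are hypothetical, and real DRC-style mitigation would act through hardware interfaces rather than Python state; this only illustrates the control decision.

```python
def control_step(channel_errors: dict, error_threshold: int,
                 enabled: set) -> set:
    """One closed-loop iteration: keep only the memory channels whose
    reported error counts stay at or below the threshold, mimicking the
    DRC-style mitigation of disabling channels in a controller that
    reports elevated errors. Names and threshold are illustrative."""
    return {ch for ch in enabled if channel_errors.get(ch, 0) <= error_threshold}

# Example: channel "mc0/ch1" reports elevated errors and is taken offline.
enabled = control_step(
    {"mc0/ch0": 2, "mc0/ch1": 57},
    error_threshold=10,
    enabled={"mc0/ch0", "mc0/ch1"},
)
```

Run periodically against fresh telemetry, this kind of step also supports the platform-level risk reassessment described earlier, since the set of surviving channels changes the perceived risk of the node.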
The programmable circuitry platform 1500 of the illustrated example includes programmable circuitry 1512. The programmable circuitry 1512 of the illustrated example is hardware. For example, the programmable circuitry 1512 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The programmable circuitry 1512 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the programmable circuitry 1512 implements the orchestrator interface generator circuitry 402, the resource management controller circuitry 404, the workload scheduler circuitry 406, the assurance intent mapper circuitry 408, the alert generator circuitry 410, the risk mapper circuitry 412, and/or the risk assessment controller circuitry 414.
The programmable circuitry 1512 of the illustrated example includes a local memory 1513 (e.g., a cache, registers, etc.). The programmable circuitry 1512 of the illustrated example is in communication with a main memory including a volatile memory 1514 and a non-volatile memory 1516 by a bus 1518. The volatile memory 1514 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 1516 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1514, 1516 of the illustrated example is controlled by a memory controller 1517. In some examples, the memory controller 1517 may be implemented by one or more integrated circuits, logic circuits, microcontrollers from any desired family or manufacturer, or any other type of circuitry to manage the flow of data going to and from the main memory 1514, 1516.
The programmable circuitry platform 1500 of the illustrated example also includes interface circuitry 1520. The interface circuitry 1520 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.
In the illustrated example, one or more input devices 1522 are connected to the interface circuitry 1520. The input device(s) 1522 permit(s) a user (e.g., a human user, a machine user, etc.) to enter data and/or commands into the programmable circuitry 1512. The input device(s) 1522 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices 1524 are also connected to the interface circuitry 1520 of the illustrated example. The output devices 1524 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or a speaker. The interface circuitry 1520 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 1520 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 1526. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.
The programmable circuitry platform 1500 of the illustrated example also includes one or more mass storage devices 1528 to store software and/or data. Examples of such mass storage devices 1528 include magnetic storage devices (e.g., floppy disk drives, HDDs, etc.), optical storage devices (e.g., Blu-ray disks, CDs, DVDs, etc.), RAID systems, and/or solid-state storage discs or devices such as flash memory devices and/or SSDs.
The machine executable instructions 1532, which may be implemented by the machine readable instructions of
The programmable circuitry platform 1600 of the illustrated example includes programmable circuitry 1612. The programmable circuitry 1612 of the illustrated example is hardware. For example, the programmable circuitry 1612 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The programmable circuitry 1612 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the programmable circuitry 1612 implements the example neural network processor 444, the example trainer 442, and the example training controller 440.
The programmable circuitry 1612 of the illustrated example includes a local memory 1613 (e.g., a cache, registers, etc.). The programmable circuitry 1612 of the illustrated example is in communication with a main memory including a volatile memory 1614 and a non-volatile memory 1616 by a bus 1618. The volatile memory 1614 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 1616 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1614, 1616 of the illustrated example is controlled by a memory controller 1617. In some examples, the memory controller 1617 may be implemented by one or more integrated circuits, logic circuits, microcontrollers from any desired family or manufacturer, or any other type of circuitry to manage the flow of data going to and from the main memory 1614, 1616.
The programmable circuitry platform 1600 of the illustrated example also includes interface circuitry 1620. The interface circuitry 1620 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.
In the illustrated example, one or more input devices 1622 are connected to the interface circuitry 1620. The input device(s) 1622 permit(s) a user (e.g., a human user, a machine user, etc.) to enter data and/or commands into the programmable circuitry 1612. The input device(s) 1622 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices 1624 are also connected to the interface circuitry 1620 of the illustrated example. The output devices 1624 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or a speaker. The interface circuitry 1620 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 1620 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 1626. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.
The programmable circuitry platform 1600 of the illustrated example also includes one or more mass storage devices 1628 to store software and/or data. Examples of such mass storage devices 1628 include magnetic storage devices (e.g., floppy disk drives, HDDs, etc.), optical storage devices (e.g., Blu-ray disks, CDs, DVDs, etc.), RAID systems, and/or solid-state storage discs or devices such as flash memory devices and/or SSDs.
The machine executable instructions 1632, which may be implemented by the machine readable instructions of
The programmable circuitry platform 1700 of the illustrated example includes programmable circuitry 1712. The programmable circuitry 1712 of the illustrated example is hardware. For example, the programmable circuitry 1712 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The programmable circuitry 1712 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the programmable circuitry 1712 implements the example neural network processor 464, the example trainer 462, and the example training controller 460.
The programmable circuitry 1712 of the illustrated example includes a local memory 1713 (e.g., a cache, registers, etc.). The programmable circuitry 1712 of the illustrated example is in communication with a main memory including a volatile memory 1714 and a non-volatile memory 1716 by a bus 1718. The volatile memory 1714 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 1716 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1714, 1716 of the illustrated example is controlled by a memory controller 1717. In some examples, the memory controller 1717 may be implemented by one or more integrated circuits, logic circuits, microcontrollers from any desired family or manufacturer, or any other type of circuitry to manage the flow of data going to and from the main memory 1714, 1716.
The programmable circuitry platform 1700 of the illustrated example also includes interface circuitry 1720. The interface circuitry 1720 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.
In the illustrated example, one or more input devices 1722 are connected to the interface circuitry 1720. The input device(s) 1722 permit(s) a user (e.g., a human user, a machine user, etc.) to enter data and/or commands into the programmable circuitry 1712. The input device(s) 1722 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices 1724 are also connected to the interface circuitry 1720 of the illustrated example. The output devices 1724 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or a speaker. The interface circuitry 1720 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 1720 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 1726. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.
The programmable circuitry platform 1700 of the illustrated example also includes one or more mass storage devices 1728 to store software and/or data. Examples of such mass storage devices 1728 include magnetic storage devices (e.g., floppy disk drives, HDDs, etc.), optical storage devices (e.g., Blu-ray disks, CDs, DVDs, etc.), RAID systems, and/or solid-state storage discs or devices such as flash memory devices and/or SSDs.
The machine executable instructions 1732, which may be implemented by the machine readable instructions of
The cores 1802 may communicate by a first example bus 1804. In some examples, the first bus 1804 may implement a communication bus to effectuate communication associated with one(s) of the cores 1802. For example, the first bus 1804 may implement at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the first bus 1804 may implement any other type of computing or electrical bus. The cores 1802 may obtain data, instructions, and/or signals from one or more external devices by example interface circuitry 1806. The cores 1802 may output data, instructions, and/or signals to the one or more external devices by the interface circuitry 1806. Although the cores 1802 of this example include example local memory 1820 (e.g., Level 1 (L1) cache that may be split into an L1 data cache and an L1 instruction cache), the microprocessor 1800 also includes example shared memory 1810 that may be shared by the cores (e.g., Level 2 (L2) cache) for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory 1810. The local memory 1820 of each of the cores 1802 and the shared memory 1810 may be part of a hierarchy of storage devices including multiple levels of cache memory and the main memory (e.g., the main memory 1814, 1816 of
Each core 1802 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core 1802 includes control unit circuitry 1814, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) 1816, a plurality of registers 1818, the L1 cache 1820, and a second example bus 1822. Other structures may be present. For example, each core 1802 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, etc. The control unit circuitry 1814 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core 1802. The AL circuitry 1816 includes semiconductor-based circuits structured to perform one or more mathematic and/or logic operations on the data within the corresponding core 1802. The AL circuitry 1816 of some examples performs integer-based operations. In other examples, the AL circuitry 1816 also performs floating-point operations. In yet other examples, the AL circuitry 1816 may include first AL circuitry that performs integer-based operations and second AL circuitry that performs floating point operations. In some examples, the AL circuitry 1816 may be referred to as an Arithmetic Logic Unit (ALU).
The registers 1818 are semiconductor-based structures to store data and/or instructions such as results of one or more of the operations performed by the AL circuitry 1816 of the corresponding core 1802. For example, the registers 1818 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), etc. The registers 1818 may be arranged in a bank as shown in
Each core 1802 and/or, more generally, the microprocessor 1800 may include additional and/or alternate structures to those shown and described above. For example, one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)) and/or other circuitry may be present. The microprocessor 1800 is a semiconductor device fabricated to include many transistors interconnected to implement the structures described above in one or more integrated circuits (ICs) contained in one or more packages.
The microprocessor 1800 may include and/or cooperate with one or more accelerators (e.g., acceleration circuitry, hardware accelerators, etc.). In some examples, accelerators are implemented by logic circuitry to perform certain tasks more quickly and/or efficiently than can be done by a general-purpose processor. Examples of accelerators include ASICs and FPGAs such as those discussed herein. A GPU, DSP and/or other programmable device can also be an accelerator. Accelerators may be on-board the microprocessor 1800, in the same chip package as the microprocessor 1800 and/or in one or more separate packages from the microprocessor 1800.
More specifically, in contrast to the microprocessor 1800 of
In the example of
In some examples, the binary file is compiled, generated, transformed, and/or otherwise output from a uniform software platform utilized to program FPGAs. For example, the uniform software platform may translate first instructions (e.g., code or a program) that correspond to one or more operations/functions in a high-level language (e.g., C, C++, Python, etc.) into second instructions that correspond to the one or more operations/functions in an HDL. In some such examples, the binary file is compiled, generated, and/or otherwise output from the uniform software platform based on the second instructions. In some examples, the FPGA circuitry 1900 of
The FPGA circuitry 1900 of
The FPGA circuitry 1900 also includes an array of example logic gate circuitry 1908, a plurality of example configurable interconnections 1910, and example storage circuitry 1912. The logic gate circuitry 1908 and the configurable interconnections 1910 are configurable to instantiate one or more operations/functions that may correspond to at least some of the machine readable instructions of
The configurable interconnections 1910 of the illustrated example are conductive pathways, traces, vias, or the like that may include electrically controllable switches (e.g., transistors) whose state can be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more of the logic gate circuitry 1908 to program desired logic circuits.
The storage circuitry 1912 of the illustrated example is structured to store result(s) of the one or more of the operations performed by corresponding logic gates. The storage circuitry 1912 may be implemented by registers or the like. In the illustrated example, the storage circuitry 1912 is distributed amongst the logic gate circuitry 1908 to facilitate access and increase execution speed.
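For illustration only, the configurable logic gate circuitry, interconnections, and storage circuitry described above can be modeled in software. The following is a minimal sketch; the `LogicCell` class, its truth tables, and the routing are hypothetical illustrations and are not part of the disclosed circuitry:

```python
# Hypothetical software model of configurable FPGA fabric: lookup-table
# (LUT) based logic gate circuitry, a register-like storage element per
# cell, and a configurable interconnection between cells.

class LogicCell:
    """A 2-input lookup table (LUT) plus an output register."""

    def __init__(self, truth_table):
        # truth_table maps (a, b) input pairs to a 0/1 output, analogous
        # to programming the logic gate circuitry for a desired function.
        self.truth_table = truth_table
        self.register = 0  # storage circuitry holding the cell's result

    def clock(self, a, b):
        # Evaluate the LUT and latch the result into the register.
        self.register = self.truth_table[(a, b)]
        return self.register


# "Program" two cells: an AND gate feeding an OR gate through a
# configurable interconnection (here, a plain Python reference).
and_cell = LogicCell({(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1})
or_cell = LogicCell({(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1})

# Route and_cell's registered output into or_cell: computes (a AND b) OR c.
intermediate = and_cell.clock(1, 1)
result = or_cell.clock(intermediate, 0)
print(result)  # -> 1
```

In actual FPGA circuitry the "routing" is performed by electrically controllable switches rather than object references, but the two-step pattern (configure the function, then configure the connections) is the same.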
The example FPGA circuitry 1900 of
Although
It should be understood that some or all of the circuitry of
In some examples, some or all of the circuitry of
In some examples, the programmable circuitry 1512, 1612, 1712 of FIGS. may be in one or more packages. For example, the microprocessor 1800 of
A block diagram illustrating an example software distribution platform 2005 to distribute software such as the example machine readable instructions 1512, 1612, 1712 of
From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that facilitate forced reservation for active probes and introduce a new workflow, based on a predefined policy, to perform a series of automated checks associated with edge platform-based workflows. For example, monitoring and automatic checks can be performed using network schemes (e.g., IPUs and switches) to include complex triggering rules. Furthermore, network schemes (e.g., IPUs and switches) can be programmed to monitor for multi-modal dependency. Additionally, methods and apparatus disclosed herein facilitate mapping a risk intent to deployment methods and mitigations. The risk assessment component builds models of risks over time, based on observed risk occurrence, impact, and mean time to repair, and produces a model for faults in the orchestration layer, infrastructure layer, service orchestration layer, and/or monitor and analytics layer. In examples disclosed herein, intent-based orchestration is developed (e.g., intent driven orchestration). Intent driven orchestration allows for differentiation, through software, of smart orchestration platforms, including actuators and supporting components. For example, the intent driven orchestration-based framework disclosed herein allows intents (e.g., assurance intents) to be mapped to provisioning and life cycle management of resources to provide robust service assurance solutions for various purpose-engineered telco and edge platforms. In some examples, a user can express their intents in the form of objectives (e.g., required latency, throughput, or reliability targets) and the orchestration stack determines what resources in the infrastructure are required to fulfill the objectives.
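For illustration only, the objective-to-resource mapping described above can be sketched in a few lines. The intent fields, the resource catalog, and the selection rule below are hypothetical assumptions, not part of the disclosure:

```python
# Hypothetical sketch of intent driven orchestration: a user expresses
# objectives (latency, throughput) and the orchestration stack selects
# infrastructure resource profiles that fulfill every objective.

intent = {"max_latency_ms": 10, "min_throughput_rps": 5000}

# Candidate resource profiles the orchestrator could provision
# (illustrative names and figures).
resource_catalog = [
    {"name": "cpu-small", "latency_ms": 25, "throughput_rps": 2000},
    {"name": "cpu-large", "latency_ms": 12, "throughput_rps": 6000},
    {"name": "cpu-large+ipu", "latency_ms": 8, "throughput_rps": 9000},
]

def resolve_intent(intent, catalog):
    """Return the resource profiles that satisfy all stated objectives."""
    return [
        r for r in catalog
        if r["latency_ms"] <= intent["max_latency_ms"]
        and r["throughput_rps"] >= intent["min_throughput_rps"]
    ]

candidates = resolve_intent(intent, resource_catalog)
print([r["name"] for r in candidates])  # -> ['cpu-large+ipu']
```

The point of the sketch is the inversion of responsibility: the user states *what* must hold (objectives), and the stack decides *which* resources achieve it.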
Methods and apparatus disclosed herein provide for unique enabling of service assurance probes (e.g., active and/or passive) using intent driven orchestration. As such, methods and apparatus disclosed herein apply to a breadth of infrastructure components (e.g., compute, graphics, IPU, memory, and storage), thereby monitoring these components and bringing them into an integrated operation responsive to the needs of service owners and resource providers.
Example methods, apparatus, systems, and articles of manufacture for mapping active assurance intents to resource orchestration and life cycle management are disclosed herein. Further examples and combinations thereof include the following:
Example 1 includes an apparatus comprising interface circuitry, machine readable instructions, and programmable circuitry to utilize the machine readable instructions to reserve a probe on a compute device in a cluster of compute devices based on a request to satisfy a resource availability criterion associated with a resource of the cluster, apply a risk mitigation operation based on the resource availability criterion before deployment of a workload to the cluster, and monitor whether the criterion is satisfied based on data from the probe after deployment of the workload to the cluster.
Example 2 includes the apparatus of example 1, wherein the probe is used to at least one of (1) monitor a container or (2) validate workload performance.
Example 3 includes the apparatus of example 1, wherein the programmable circuitry is to perform a forced reservation of the probe.
Example 4 includes the apparatus of example 1, wherein when the resource availability criterion corresponds to a cluster level assurance intent to meet a target availability of the resource, the programmable circuitry is to generate a service risk profile, the service risk profile associated with the cluster level assurance intent.
Example 5 includes the apparatus of example 4, wherein the programmable circuitry is to map, based on the service risk profile, (1) an allowable risk associated with a risk tolerance profile to (2) a probability of risk occurrence on a domain of the compute device.
Example 6 includes the apparatus of example 1, wherein the risk mitigation includes adding capacity on failure conditions, applying an automatic remediation, or distributing risk in an orchestration layer or an analytics layer.
Example 7 includes the apparatus of example 1, wherein the programmable circuitry is to train a risk model based on at least one of an observed risk occurrence or a mean time to error repair.
Example 8 includes a method comprising reserving a probe on a compute device in a cluster of compute devices based on a request to satisfy a resource availability criterion associated with a resource of the cluster, applying a risk mitigation operation based on the resource availability criterion before deployment of a workload to the cluster, and monitoring whether the criterion is satisfied based on data from the probe after deployment of the workload to the cluster.
Example 9 includes the method of example 8, wherein the probe is used to at least one of (1) monitor a container or (2) validate workload performance.
Example 10 includes the method of example 8, further including performing a forced reservation of the probe.
Example 11 includes the method of example 8, wherein when the resource availability criterion corresponds to a cluster level assurance intent to meet a target availability of the resource, further including generating a service risk profile, the service risk profile associated with the cluster level assurance intent.
Example 12 includes the method of example 11, further including mapping, based on the service risk profile, (1) an allowable risk associated with a risk tolerance profile to (2) a probability of risk occurrence on a domain of the compute device.
Example 13 includes the method of example 8, wherein the risk mitigation includes adding capacity on failure conditions, applying an automatic remediation, or distributing risk in an orchestration layer or an analytics layer.
Example 14 includes the method of example 8, further including training a risk model based on at least one of an observed risk occurrence or a mean time to error repair.
Example 15 includes a non-transitory machine readable storage medium comprising instructions to cause programmable circuitry to at least reserve a probe on a compute device in a cluster of compute devices based on a request to satisfy a resource availability criterion associated with a resource of the cluster, apply a risk mitigation operation based on the resource availability criterion before deployment of a workload to the cluster, and monitor whether the criterion is satisfied based on data from the probe after deployment of the workload to the cluster.
Example 16 includes the non-transitory machine readable storage medium of example 15, wherein the probe is used to at least one of (1) monitor a container or (2) validate workload performance.
Example 17 includes the non-transitory machine readable storage medium of example 15, wherein the instructions are to cause the programmable circuitry to perform a forced reservation of the probe.
Example 18 includes the non-transitory machine readable storage medium of example 15, wherein when the resource availability criterion corresponds to a cluster level assurance intent to meet a target availability of the resource, the instructions are to cause the programmable circuitry to generate a service risk profile, the service risk profile associated with the cluster level assurance intent.
Example 19 includes the non-transitory machine readable storage medium of example 18, wherein the instructions are to cause the programmable circuitry to map, based on the service risk profile, (1) an allowable risk associated with a risk tolerance profile to (2) a probability of risk occurrence on a domain of the compute device.
Example 20 includes the non-transitory machine readable storage medium of example 15, wherein the instructions are to cause the programmable circuitry to train a risk model based on at least one of an observed risk occurrence or a mean time to error repair.
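For illustration only, the workflow recited in Examples 1, 8, and 15 (reserve a probe, apply a risk mitigation operation before workload deployment, then monitor the criterion from probe data after deployment) can be sketched as follows. The device names, slot counts, mitigation rule, and availability arithmetic are hypothetical assumptions, not claim limitations:

```python
# Hypothetical end-to-end sketch of the probe-reservation workflow.

def reserve_probe(cluster, criterion):
    # Reserve a probe on the first compute device with spare capacity,
    # recording the resource availability criterion it must check.
    for device in cluster:
        if device["free_slots"] > 0:
            device["free_slots"] -= 1
            return {"device": device["name"], "criterion": criterion}
    raise RuntimeError("no capacity for probe reservation")

def apply_risk_mitigation(cluster):
    # Example mitigation from Example 6: add capacity headroom on each
    # device before the workload is deployed to the cluster.
    for device in cluster:
        device["free_slots"] += 1

def criterion_satisfied(probe, samples):
    # Post-deployment check: observed availability (fraction of successful
    # probe observations) must meet the reserved criterion.
    availability = sum(samples) / len(samples)
    return availability >= probe["criterion"]

cluster = [{"name": "node-0", "free_slots": 0},
           {"name": "node-1", "free_slots": 2}]

probe = reserve_probe(cluster, criterion=0.99)   # based on the request
apply_risk_mitigation(cluster)                   # before deployment
samples = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]         # probe data after deployment
print(probe["device"], criterion_satisfied(probe, samples))  # -> node-1 False
```

Here the probe lands on node-1 (the only device with free capacity at reservation time), and the observed availability of 0.9 fails the 0.99 criterion, which is exactly the condition the post-deployment monitoring is intended to surface.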
The following claims are hereby incorporated into this Detailed Description by this reference. Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.