ACCELERATOR OR ACCELERATED FUNCTIONS AS A SERVICE USING NETWORKED PROCESSING UNITS

Information

  • Patent Application
  • 20230133020
  • Publication Number
    20230133020
  • Date Filed
    December 29, 2022
  • Date Published
    May 04, 2023
Abstract
Various approaches for deploying and controlling distributed accelerated compute operations with the use of infrastructure processing units (IPUs) and similar networked processing units are disclosed. A system for orchestrating acceleration functions in a network compute mesh is configured to access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.
Description
TECHNICAL FIELD

Embodiments described herein generally relate to data processing, network communication, and communication system implementations of distributed computing, including the implementations with the use of networked processing units (or network-addressable processing units) such as infrastructure processing units (IPUs) or data processing units (DPUs).


BACKGROUND

System architectures are moving to highly distributed multi-edge and multi-tenant deployments. Deployments may have different limitations in terms of power and space. Deployments also may use different types of compute, acceleration and storage technologies in order to overcome these power and space limitations. Deployments also are typically interconnected in tiered and/or peer-to-peer fashion, in an attempt to create a network of connected devices and edge appliances that work together.


Edge computing, at a general level, has been described as systems that provide the transition of compute and storage resources closer to endpoint devices at the edge of a network (e.g., consumer computing devices, user equipment, etc.). As compute and storage resources are moved closer to endpoint devices, a variety of advantages have been promised such as reduced application latency, improved service capabilities, improved compliance with security or data privacy requirements, improved backhaul bandwidth, improved energy consumption, and reduced cost. However, many deployments of edge computing technologies—especially complex deployments for use by multiple tenants—have not been fully adopted.





BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. Some embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings in which:



FIG. 1 illustrates an overview of a distributed edge computing environment, according to an example;



FIG. 2 depicts computing hardware provided among respective deployment tiers in a distributed edge computing environment, according to an example;



FIG. 3 depicts additional characteristics of respective deployment tiers in a distributed edge computing environment, according to an example;



FIG. 4 depicts a computing system architecture including a compute platform and a network processing platform provided by an infrastructure processing unit, according to an example;



FIG. 5 depicts an infrastructure processing unit arrangement operating as a distributed network processing platform within network and data center edge settings, according to an example;



FIG. 6 depicts functional components of an infrastructure processing unit and related services, according to an example;



FIG. 7 depicts a block diagram of example components in an edge computing system which implements a distributed network processing platform, according to an example;



FIG. 8 is a block diagram illustrating the general flow for providing acceleration as a service, according to an example;



FIG. 9 depicts the decomposition or refactoring of a compound service into its component microservices, functions, etc., according to an example;



FIG. 10 depicts a number of “events” or triggers that either cause the various tasks to be triggered, resumed from a waiting state, or interrupted in order to respond to some event, according to an example;



FIG. 11 depicts various data that is produced, consumed, or produced and consumed by different tasks, according to an example;



FIG. 12 depicts the execution of a task that may generate triggers that affect other tasks, according to an example;



FIG. 13 depicts a process of flow optimization, according to an example;



FIG. 14 depicts a subset of the graph illustrated in FIG. 12, in which fewer tasks and edges are shown, according to an example;



FIG. 15 depicts a table with a correspondence between the logical identifier of a task and the corresponding logical identifiers of available accelerator implementations for that task, according to an example;



FIG. 16 depicts an undirected dataflow graph of the tasks, according to an example;



FIG. 17 depicts the transformation from an unoptimized dataflow graph to an optimized version of acceleration as a service as implemented by the agency of IPUs, for the subset graph shown in FIG. 14, according to an example;



FIG. 18 depicts a database of information that shows for each type of logical artifact various instances (i.e., execution capable resources) that can support its execution, according to an example;



FIG. 19 depicts various functional components of an IPU, according to an example; and



FIG. 20 depicts a flowchart of a method for orchestrating acceleration functions in a network compute mesh, according to an example.





DETAILED DESCRIPTION

Various approaches for providing accelerators and accelerated functions in an edge computing setting are discussed herein. Existing approaches rely on a centralized model where functions are executed at a central store, typically a datacenter, by clients that connect to the central store. In an edge-to-cloud compute continuum, there is a need to weave acceleration into computations at scale. The rise of artificial intelligence (AI) and the need for real time machine-learning guided large-scale distributed operations in all walks of life means that processing devices will increasingly need to spend a lot of time collaborating not just with their remote peers but also with remote accelerators. Infrastructure processing units (IPUs) may offload some of this work, but a large unaddressed need remains, which is to aggregate acceleration capabilities that are accessible at low latency within a host (a VM, a process, a container, etc.) and provide a seamless acceleration-as-a-service (XaaS) capability to high level business/client software layers. The architecture, which includes an acceleration-as-a-service layer, provides a mechanism to optimize the data flows between different CPU/XPU elements that run different parts of a composite workload.


Various approaches and mechanisms are described herein to implement and enable acceleration pooling, intelligent scheduling and orchestration of dependent tasks, and providing acceleration-as-a-service (XaaS) or accelerated-function-as-a-service (XFaaS).


In various examples, the logic that is used to configure the mechanisms for acceleration pooling and providing XaaS or XFaaS is managed by a network switch or other network-addressable component. For instance, a network switch can monitor or orchestrate the execution flow of data between CPU-based and non-CPU-based (e.g., hardware accelerator) functions or microservices among network-addressable compute nodes in a network. Non-CPU-based hardware may include circuitry and devices such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), coarse-grained reconfigurable arrays (CGRAs), systems-on-chip (SoCs), graphics processing units (GPUs), and the like.


Accordingly, the following describes coordinated, intelligent components to configure a combination of memory and compute resources for servicing client workloads and increasing speed. While many of the techniques may be implemented by a switch, orchestrator, or controller, the techniques are also suited for use by networked processing units such as infrastructure processing units (IPUs, such as respective IPUs operating as a memory owner and remote memory consumer).


Additional implementation details of providing acceleration or acceleration as a function in an edge computing network, implemented by way of a network switch or IPUs, are provided in FIGS. 8 to 20, below. General implementation details of an edge computing network and the use of distributed networked processing units in such a network are provided in FIGS. 1 to 7, below.


Distributed Edge Computing and Networked Processing Units



FIG. 1 is a block diagram 100 showing an overview of a distributed edge computing environment, which may be adapted for implementing the present techniques for distributed networked processing units. As shown, the edge cloud 110 is established from processing operations among one or more edge locations, such as a satellite vehicle 141, a base station 142, a network access point 143, an on premise server 144, a network gateway 145, or similar networked devices and equipment instances. These processing operations may be coordinated by one or more edge computing platforms 120 or systems that operate networked processing units (e.g., IPUs, DPUs) as discussed herein.


The edge cloud 110 is generally defined as involving compute that is located closer to endpoints 160 (e.g., consumer and producer data sources) than the cloud 130, such as autonomous vehicles 161, user equipment 162, business and industrial equipment 163, video capture devices 164, drones 165, smart cities and building devices 166, sensors and IoT devices 167, etc. Compute, memory, network, and storage resources that are offered at the entities in the edge cloud 110 can provide ultra-low or improved latency response times for services and functions used by the endpoint data sources as well as reduce network backhaul traffic from the edge cloud 110 toward cloud 130 thus improving energy consumption and overall network usages among other benefits.


Compute, memory, and storage are scarce resources, and generally decrease depending on the edge location (e.g., fewer processing resources being available at consumer end point devices than at a base station or a central office data center). As a general design principle, edge computing attempts to minimize the number of resources needed for network services, through the distribution of more resources that are located closer both geographically and in terms of in-network access time.



FIG. 2 depicts examples of computing hardware provided among respective deployment tiers in a distributed edge computing environment. Here, one tier at an on-premise edge system is an intelligent sensor or gateway tier 210, which operates network devices with low power and entry-level processors and low-power accelerators. Another tier at an on-premise edge system is an intelligent edge tier 220, which operates edge nodes with higher power limitations and may include a high-performance storage.


Further in the network, a network edge tier 230 operates servers including form factors optimized for extreme conditions (e.g., outdoors). A data center edge tier 240 operates additional types of edge nodes such as servers, and includes increasingly powerful or capable hardware and storage technologies. Still further in the network, a core data center tier 250 and a public cloud tier 260 operate compute equipment with the highest power consumption and largest configuration of processors, acceleration, storage/memory devices, and highest throughput network.


In each of these tiers, various forms of Intel® processor lines are depicted for purposes of illustration; it will be understood that other brands and manufacturers of hardware will be used in real-world deployments. Additionally, it will be understood that additional features or functions may exist among multiple tiers. One such example is connectivity and infrastructure management that enable a distributed IPU architecture, that can potentially extend across all of tiers 210, 220, 230, 240, 250, 260. Other relevant functions that may extend across multiple tiers may relate to security features, domain or group functions, and the like.



FIG. 3 depicts additional characteristics of respective deployment tiers in a distributed edge computing environment, based on the tiers discussed with reference to FIG. 2. This figure depicts additional network latencies at each of the tiers 210, 220, 230, 240, 250, 260, and the gradual increase in latency in the network as the compute is located at a longer distance from the edge endpoints. Additionally, this figure depicts additional power and form factor constraints, use cases, and key performance indicators (KPIs).


With these variations and service features in mind, edge computing within the edge cloud 110 may provide the ability to serve and respond to multiple applications of the use cases in real-time or near real-time and meet ultra-low latency requirements. As systems have become highly distributed, networking has become one of the fundamental pieces of the architecture that allow achieving scale with resiliency, security, and reliability. Networking technologies have evolved to provide more capabilities beyond pure network routing capabilities, including to coordinate quality of service, security, multi-tenancy, and the like. This has also been accelerated by the development of new smart network adapter cards and other types of network derivatives that incorporate capabilities such as ASICs (application-specific integrated circuits) or FPGAs (field programmable gate arrays) to accelerate some of those functionalities (e.g., remote attestation).


In these contexts, networked processing units have begun to be deployed at network cards (e.g., smart NICs), gateways, and the like, which allow direct processing of network workloads and operations. One example of a networked processing unit is an infrastructure processing unit (IPU), which is a programmable network device that can be extended to provide compute capabilities with far richer functionalities beyond pure networking functions. Another example of a networked processing unit is a data processing unit (DPU), which offers programmable hardware for performing infrastructure and network processing operations. The following discussion refers to functionality applicable to an IPU configuration, such as that provided by an Intel® line of IPU processors. However, it will be understood that functionality will be equally applicable to DPUs and other types of networked processing units provided by ARM®, Nvidia®, and other hardware OEMs.



FIG. 4 depicts an example compute system architecture that includes a compute platform 420 and a network processing platform comprising an IPU 410. This architecture—and in particular the IPU 410—can be managed, coordinated, and orchestrated by the functionality discussed below, including with the functions described with reference to FIG. 6.


The main compute platform 420 is composed of typical elements that are included with a computing node, such as one or more CPUs 424 that may or may not be connected via a coherent domain (e.g., via Ultra Path Interconnect (UPI) or another processor interconnect); one or more memory units 425; one or more additional discrete devices 426 such as storage devices, discrete acceleration cards (e.g., a field-programmable gate array (FPGA), a visual processing unit (VPU), etc.); a baseboard management controller 421; and the like. The compute platform 420 may operate one or more containers 422 (e.g., with one or more microservices), within a container runtime 423 (e.g., Docker containers). The IPU 410 operates as a networking interface and is connected to the compute platform 420 using an interconnect (e.g., using either PCIe or CXL). The IPU 410, in this context, can be observed as another small compute device that has its own: (1) processing cores (e.g., provided by low-power cores 417); (2) operating system (OS) and cloud native platform 414 to operate one or more containers 415 and a container runtime 416; (3) acceleration functions provided by an ASIC 411 or FPGA 412; (4) memory 418; (5) network functions provided by network circuitry 413; etc.


From a system design perspective, this arrangement provides important functionality. The IPU 410 is seen as a discrete device from the local host (e.g., the OS running in the compute platform CPUs 424) that is available to provide certain functionalities (networking, acceleration, etc.). Those functionalities are typically provided via physical or virtual PCIe functions. Additionally, the IPU 410 is seen as a host (with its own IP address, etc.) that can be accessed by the infrastructure to set up an OS, run services, and the like. The IPU 410 sees all the traffic going to the compute platform 420 and can perform actions (such as intercepting the data or performing some transformation) as long as the correct security credentials are hosted to decrypt the traffic. Traffic going through the IPU passes through all the layers of the Open Systems Interconnection model (OSI model) stack (e.g., from the physical layer to the application layer). Depending on the features that the IPU has, processing may be performed at the transport layer only. However, if the IPU has capabilities to perform traffic intercept, then the IPU also may be able to intercept traffic at the traffic layer (e.g., intercept CDN traffic and process it locally).


Some of the use cases being proposed for IPUs and similar networked processing units include: to accelerate network processing; to manage hosts (e.g., in a data center); or to implement quality of service policies. However, most functionalities today are focused on using the IPU at the local appliance level and within a single system. These approaches do not address how the IPUs could work together in a distributed fashion or how system functionalities can be divided among the IPUs in other parts of the system. Accordingly, the following introduces enhanced approaches for enabling and controlling distributed functionality among multiple networked processing units. This enables the extension of current IPU functionalities to work as a distributed set of IPUs that can work together to achieve stronger features such as resiliency, reliability, etc.


Distributed Architectures of IPUs



FIG. 5 depicts an IPU arrangement operating as a distributed network processing platform within network and data center edge settings. In a first deployment model of a computing environment 510, workloads or processing requests are directly provided to an IPU platform, such as directly to IPU 514. In a second deployment model of the computing environment 510, workloads or processing requests are provided to some intermediate processing device 512, such as a gateway or NUC (next unit of computing) device form factor, and the intermediate processing device 512 forwards the workloads or processing requests to the IPU 514. It will be understood that a variety of other deployment models involving the composability and coordination of one or more IPUs, compute units, network devices, and other hardware may be provided.


With the first deployment model, the IPU 514 directly receives data from use cases 502A. The IPU 514 operates one or more containers with microservices to perform processing of the data. As an example, a small gateway (e.g., a NUC type of appliance) may connect multiple cameras to an edge system that is managed or connected by the IPU 514. The IPU 514 may process data as a small aggregator of sensors that runs on the far edge, or may perform some level of inline processing or preprocessing and send the payload to be further processed by the IPU or the system to which the IPU connects.


With the second deployment model, the intermediate processing device 512 provided by the gateway or NUC receives data from use cases 502B. The intermediate processing device 512 includes various processing elements (e.g., CPU cores, GPUs), and may operate one or more microservices for servicing workloads from the use cases 502B. However, the intermediate processing device 512 invokes the IPU 514 to complete processing of the data.


In either the first or the second deployment model, the IPU 514 may connect with a local compute platform, such as that provided by a CPU 516 (e.g., Intel® Xeon CPU) operating multiple microservices. The IPU may also connect with a remote compute platform, such as that provided at a data center by CPU 540 at a remote server. As an example, consider a microservice that performs some analytical processing (e.g., face detection on image data), where the CPU 516 and the CPU 540 provide access to this same microservice. The IPU 514, depending on the current load of the CPU 516 and the CPU 540, may decide to forward the images or payload to one of the two CPUs. Data forwarding or processing can also depend on other factors such as SLA for latency or performance metrics (e.g., perf/watt) in the two systems. As a result, the distributed IPU architecture may accomplish features of load balancing.
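The forwarding decision described above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical `ComputeTarget` record and a hypothetical `pick_target` helper; the field names, load values, and latency figures are invented for the example and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ComputeTarget:
    """Hypothetical view of a CPU that hosts the same microservice."""
    name: str
    load: float        # current utilization, 0.0 to 1.0
    latency_ms: float  # expected response latency from this IPU


def pick_target(targets, sla_latency_ms):
    """Return the least-loaded target that meets the latency SLA, or None."""
    eligible = [t for t in targets if t.latency_ms <= sla_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda t: t.load)


# Illustrative values: the local CPU is busy, the remote CPU is lightly loaded.
local_cpu = ComputeTarget("cpu-516", load=0.9, latency_ms=2.0)
remote_cpu = ComputeTarget("cpu-540", load=0.3, latency_ms=15.0)
chosen = pick_target([local_cpu, remote_cpu], sla_latency_ms=20.0)
print(chosen.name)
```

With a relaxed latency SLA the sketch prefers the lightly loaded remote CPU; with a tight SLA only the local CPU remains eligible. Other metrics named in the text, such as perf/watt, could be folded into the selection key in the same way.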


The IPU in the computing environment 510 may be coordinated with other network-connected IPUs. In an example, a Service and Infrastructure orchestration manager 530 may use multiple IPUs as a mechanism to implement advanced service processing schemes for the user stacks. This may also enable implementing system functionalities such as failover, load balancing, etc.


In a distributed architecture example, IPUs can be arranged in the following non-limiting configurations. As a first configuration, a particular IPU (e.g., IPU 514) can work with other IPUs (e.g., IPU 520) to implement failover mechanisms. For example, an IPU can be configured to forward traffic to service replicas that run on other systems when a local host does not respond.


As a second configuration, a particular IPU (e.g., IPU 514) can work with other IPUs (e.g., IPU 520) to perform load balancing across other systems. For example, consider a scenario where CDN traffic targeted to the local host is forwarded to another host in the event that I/O or compute in the local host is scarce at a given moment.


As a third configuration, a particular IPU (e.g., IPU 514) can work as a power management entity to implement advanced system policies. For example, consider a scenario where the whole system (e.g., including CPU 516) is placed in a C6 state (a low-power/power-down state available to a processor) while forwarding traffic to other systems (e.g., IPU 520) and consolidating it.


As will be understood, fully coordinating a distributed IPU architecture requires numerous aspects of coordination and orchestration. The following examples of system architecture deployments provide discussion of how edge computing systems may be adapted to include coordinated IPUs, and how such deployments can be orchestrated to use IPUs at multiple locations to expand to the new envisioned functionality.


Distributed IPU Functionality


An arrangement of distributed IPUs offers a set of new functionalities to enable IPUs to be service focused. FIG. 6 depicts functional components of an IPU 610, including services and features to implement the distributed functionality discussed herein. It will be understood that some or all of the functional components provided in FIG. 6 may be distributed among multiple IPUs, hardware components, or platforms, depending on the particular configuration and use case involved.


In the block diagram of FIG. 6, a number of functional components are operated to manage requests for a service running in the IPU (or running in the local host). As discussed above, IPUs can either run services or intercept requests arriving to services running in the local host and perform some action. In the latter case, the IPU can perform the following types of actions/functions (provided as non-limiting examples).


Peer Discovery. In an example, each IPU is provided with Peer Discovery logic to discover other IPUs in the distributed system that can work together with it. Peer Discovery logic may use mechanisms such as broadcasting to discover other IPUs that are available on a network. The Peer Discovery logic is also responsible for working with the Peer Attestation and Authentication logic to validate and authenticate each peer IPU's identity, determine whether the peer is trustworthy, and determine whether the current system tenant allows the current IPU to work with that peer. To accomplish this, an IPU may perform operations such as: retrieve a proof of identity and proof of attestation; connect to a trusted service running in a trusted server; or validate that the discovered system is trustworthy. Various technologies (including hardware components or standardized software implementations) that enable attestation, authentication, and security may be used with such operations.
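As a rough illustration of the validation step, the following sketch filters discovered peers by a proof of identity and by a tenant allow-list. It is an assumption-laden stand-in: a keyed hash (HMAC) substitutes for a real attestation flow and trusted service, and the names `TRUSTED_KEY`, `make_proof`, and `validate_peers` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret standing in for a trusted attestation service.
TRUSTED_KEY = b"demo-shared-secret"


def make_proof(peer_id: str, key: bytes = TRUSTED_KEY) -> str:
    """Produce a stand-in proof of identity for a peer (illustration only)."""
    return hmac.new(key, peer_id.encode(), hashlib.sha256).hexdigest()


def validate_peers(discovered, tenant_allowed):
    """Keep peers whose proof verifies and that the tenant allows."""
    trusted = []
    for peer_id, proof in discovered:
        proof_ok = hmac.compare_digest(proof, make_proof(peer_id))
        if proof_ok and peer_id in tenant_allowed:
            trusted.append(peer_id)
    return trusted


# Peers as they might be returned by a broadcast-based discovery step.
discovered = [
    ("ipu-520", make_proof("ipu-520")),  # valid proof, allowed by tenant
    ("ipu-999", "forged-proof"),         # allowed by tenant, proof fails
    ("ipu-777", make_proof("ipu-777")),  # valid proof, not allowed by tenant
]
trusted_peers = validate_peers(discovered, tenant_allowed={"ipu-520", "ipu-999"})
print(trusted_peers)
```

Only a peer that passes both checks is retained, which mirrors the two conditions in the text: the peer must be trustworthy, and the tenant must permit cooperation with it.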


Peer Attestation. In an example, each IPU provides interfaces to other IPUs to enable attestation of the IPU itself. IPU Attestation logic is used to perform an attestation flow within a local IPU in order to create the proof of identity that will be shared with other IPUs. Attestation here may integrate previous approaches and technologies to attest a compute platform. This may also involve the use of trusted attestation service 640 to perform the attestation operations.


Functionality Discovery. In an example, a particular IPU includes capabilities to discover the functionalities that peer IPUs provide. Once the authentication is done, the IPU can determine what functionalities the peer IPUs provide (using the IPU Peer Discovery Logic) and store a record of such functionality locally. Examples of properties to discover can include: (i) Type of IPU and functionalities provided and associated KPIs (e.g., performance/watt, cost, etc.); (ii) Available functionalities as well as possible functionalities to execute under secure enclaves (e.g., enclaves provided by Intel® SGX or TDX technologies); (iii) Current services that are running on the IPU and on the system that can potentially accept requests forwarded from this IPU; or (iv) Other interfaces or hooks that are provided by an IPU, such as: access to remote storage; access to a remote VPU; access to certain functions. In a specific example, a service may be described by properties such as: UUID; estimated performance KPIs in the host or IPU; average performance provided by the system during the last N units of time (or any other type of indicator); and like properties.
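One possible shape for the locally stored record is sketched below. The class and field names (`ServiceRecord`, `PeerCapabilities`, `registry`, `peers_offering`) are hypothetical and only loosely follow the properties listed above; the disclosure does not prescribe a schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical description of one service offered by a peer."""
    name: str
    estimated_kpi: float   # e.g., estimated requests per second
    recent_avg_kpi: float  # average over the last N units of time
    service_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class PeerCapabilities:
    """Hypothetical record of what a discovered peer IPU provides."""
    ipu_type: str
    perf_per_watt: float
    secure_enclave: bool
    services: list = field(default_factory=list)
    hooks: list = field(default_factory=list)  # e.g., "remote-storage"


registry = {}  # peer id -> PeerCapabilities, stored locally on the IPU


def record_peer(peer_id, caps):
    """Store (or refresh) the capabilities discovered for a peer."""
    registry[peer_id] = caps


def peers_offering(service_name):
    """Return the ids of peers that run a service with the given name."""
    return sorted(
        pid for pid, caps in registry.items()
        if any(s.name == service_name for s in caps.services)
    )


record_peer("ipu-520", PeerCapabilities(
    "type-a", perf_per_watt=4.2, secure_enclave=True,
    services=[ServiceRecord("face-detect", 120.0, 95.0)],
    hooks=["remote-storage"]))
record_peer("ipu-521", PeerCapabilities(
    "type-b", perf_per_watt=3.1, secure_enclave=False))
print(peers_offering("face-detect"))
```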


Service Management. The IPU includes functionality to manage services that are running either on the host compute platform or in the IPU itself. Managing (orchestrating) services includes performing service and resource orchestration for the services that can run on the IPU or that the IPU can affect. Two types of usage models are envisioned:


External Orchestration Coordination. The IPU may enable external orchestrators to deploy services on the IPU compute capabilities. To do so, an IPU includes a component similar to Kubernetes (K8s) compatible APIs to manage the containers (services) that run on the IPU itself. For example, the IPU may run a service that is just providing content to storage connected to the platform. In this case, the orchestration entity running in the IPU may manage the services running in the IPU as it happens in other systems (e.g., keeping the service level objectives).


Further, external orchestrators can be allowed to register with the IPU that services running on the host may require the IPU to broker requests, implement failover mechanisms, and provide other functionalities. For example, an external orchestrator may register that a particular service running on the local compute platform is replicated in another edge node managed by another IPU where requests can be forwarded.


In this latter use case, external orchestrators may provide to the Service/Application Intercept logic the inputs that are needed to intercept traffic for these services (as such traffic is typically encrypted). This may include properties such as the source and destination of the traffic to be intercepted, or the key to use to decrypt the traffic. Likewise, this may be needed to terminate TLS to understand the requests that arrive to the IPU and that the other logics may need to parse to take actions. For example, if there is a CDN read request, the IPU may need to decrypt the packet to understand that the network packet includes a read request and may redirect it to another host based on the content that is being intercepted. Examples of Service/Application Intercept information are depicted in table 620 in FIG. 6.
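A lookup against such intercept information might look like the sketch below. The actual columns of table 620 are not reproduced in this text, so the `InterceptRule` fields (service name, source and destination ranges, and a key identifier) are assumptions made for illustration, as are the addresses used.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class InterceptRule:
    """Hypothetical row registered by an external orchestrator."""
    service: str
    source: str       # CIDR range of traffic sources to intercept
    destination: str  # CIDR range of traffic destinations to intercept
    key_id: str       # reference to the key used to decrypt the traffic


def find_rule(rules, src_ip, dst_ip):
    """Return the first rule matching the packet's source and destination."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if (src in ipaddress.ip_network(rule.source)
                and dst in ipaddress.ip_network(rule.destination)):
            return rule
    return None


rules = [
    InterceptRule("cdn-read", "10.0.0.0/24", "192.168.1.10/32", "key-cdn"),
]
match = find_rule(rules, "10.0.0.7", "192.168.1.10")
print(match.service if match else "pass through")
```

Traffic that matches no rule is simply passed through untouched; only matched traffic would be handed to the decryption and parsing steps the text describes.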


External Orchestration Implementation. External orchestration can be implemented in multiple topologies. One supported topology includes having the orchestrator managing all the IPUs running on the backend public or private cloud. Another supported topology includes having the orchestrator managing all the IPUs running in a centralized edge appliance. Still another supported topology includes having the orchestrator running in another IPU that is working as the controller or having the orchestrator running distributed in multiple other IPUs that are working as controllers (master/primary node), or in a hierarchical arrangement.


Functionality for Broker requests. The IPU may include Service Request Brokering logic and Load Balancing logic to perform brokering actions on the arrival of requests for target services running in the local system. For instance, the IPU may determine whether those requests can be executed by other peer systems (e.g., accessible through Service and Infrastructure Orchestration 630). This can be caused, for example, because load in the local systems is high. The local IPU may negotiate with other peer IPUs for the possibility to forward the request. Negotiation may involve metrics such as cost. Based on such negotiation metrics, the IPU may decide to forward the request.
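The brokering decision can be illustrated with a small sketch. The function name `broker_request`, the load threshold, the cost ceiling, and the idea that each peer answers with a single cost figure are all assumptions chosen to keep the example short.

```python
def broker_request(local_load, peer_offers, load_threshold=0.8, max_cost=10.0):
    """Decide where a request runs: "local" or the id of the cheapest peer.

    peer_offers maps a peer id to the cost that peer quoted during
    negotiation (a hypothetical single-number metric).
    """
    if local_load < load_threshold:
        return "local"  # local system has headroom, no brokering needed
    affordable = {p: c for p, c in peer_offers.items() if c <= max_cost}
    if not affordable:
        return "local"  # no acceptable offer, keep the request local
    return min(affordable, key=affordable.get)


print(broker_request(0.95, {"ipu-520": 3.0, "ipu-521": 2.0}))
```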


Functionality for Load Balancing requests. The Service Request Brokering and Load Balancing logic may distribute requests arriving to the local IPU to other peer IPUs. In this case, the other IPUs and the local IPU work together and do not necessarily need brokering. Such logic acts similarly to a cloud native sidecar proxy. For instance, requests arriving to the system may be sent to the service X running in the local system (either IPU or compute platform) or forwarded to a peer IPU that has another instance of service X running. The load balancing distribution can be based on existing algorithms, such as selecting the systems that have lower load, using round robin, etc.
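The two distribution policies named above, round robin and lowest load, can be sketched in a few lines. The instance names and load values are illustrative assumptions.

```python
import itertools


def round_robin(instances):
    """Return an endless iterator that cycles through service instances."""
    return itertools.cycle(instances)


def least_loaded(loads):
    """Return the instance with the lowest reported load."""
    return min(loads, key=loads.get)


# Instances of service X: one local, two reachable through peer IPUs.
rr = round_robin(["local", "ipu-520", "ipu-521"])
first_four = [next(rr) for _ in range(4)]
print(first_four)
print(least_loaded({"local": 0.7, "ipu-520": 0.2, "ipu-521": 0.5}))
```

Round robin needs no load information from peers, while the lowest-load policy depends on reasonably fresh load reports from each instance.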


Functionality for failover, resiliency and reliability. The IPU includes Reliability and Failover logic to monitor the status of the services running on the compute platform or the status of the compute platform itself. The Reliability and Failover logic may require the Load Balancing logic to transiently or permanently forward requests that target specific services in situations such as where: i) the compute platform is not responding; ii) the service running inside the compute node is not responding; and iii) the compute platform load prevents the targeted service from meeting its service level objectives (SLOs). Note that the logic must know the required SLOs for the services. Such functionality may be coordinated with service information 650 including SLO information.
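The three failover conditions can be expressed as a simple check, sketched below. The `HealthSample` fields and the use of observed latency as the SLO metric are assumptions for illustration; an actual SLO may be expressed in other terms.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Hypothetical monitoring snapshot gathered by the IPU."""
    platform_responding: bool
    service_responding: bool
    observed_latency_ms: float


def should_fail_over(sample, slo_latency_ms):
    """Return (decision, reason) following the three conditions above."""
    if not sample.platform_responding:
        return True, "compute platform not responding"
    if not sample.service_responding:
        return True, "service not responding"
    if sample.observed_latency_ms > slo_latency_ms:
        return True, "load prevents meeting the SLO"
    return False, "healthy"


decision, reason = should_fail_over(
    HealthSample(True, True, observed_latency_ms=48.0), slo_latency_ms=20.0)
print(decision, reason)
```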


Functionality for executing parts of the workloads. Use cases such as video analytics tend to be decomposed into different microservices that form a pipeline of actions that can be used together. The IPU may include workload pipeline execution logic that understands how workloads are composed and manages their execution. Workloads can be defined as a graph that connects different microservices. The load balancing and brokering logic may be able to understand those graphs and decide what parts of the pipeline are executed where. Further, to perform these and other operations, Intercept logic will also decode what requests are included as part of the requests.


Resource Management


A distributed network processing configuration may enable IPUs to perform an important role in managing resources of edge appliances. As further shown in FIG. 6, the functional components of an IPU can operate to perform these and similar types of resource management functionalities.


As a first example, an IPU can provide management of or access to external resources that are hosted in other locations and expose them as local resources using constructs such as Compute Express Link (CXL). For example, the IPU could potentially provide access to a remote accelerator that is hosted in a remote system via CXL.mem/cache and IO. Another example includes providing access to a remote storage device hosted in another system. In this latter case, the local IPU could work with another IPU in the storage system and expose the remote system as PCIE VF/PF (virtual functions/physical functions) to the local host.


As a second example, an IPU can provide access to IPU-specific resources. Those IPU resources may be physical (such as storage or memory) or virtual (such as a service that provides access to random number generation).


As a third example, an IPU can manage local resources that are hosted in the system where it belongs. For example, the IPU can manage power of the local compute platform.


As a fourth example, an IPU can provide access to other types of elements that relate to resources (such as telemetry or other types of data). In particular, telemetry provides data that is useful for deciding where to execute workloads or for identifying problems.


I/O Management. Because the IPU acts as a connection proxy between external peer resources (compute systems, remote storage, etc.) and the local compute, the IPU can also include functionality to manage I/O from the system perspective.


Host Virtualization and XPU Pooling. The IPU includes Host Virtualization and XPU Pooling logic responsible for managing access to resources that are outside the system domain (or within the IPU) and that can be offered to the local compute system. Here, “XPU” refers to any type of a processing unit, whether CPU, GPU, VPU, an acceleration processing unit, etc. The IPU logic, after discovery and attestation, can agree with other systems to share external resources with the services running in the local system. IPUs may advertise available resources to other peers, or those resources can be discovered during the discovery phase as introduced earlier. IPUs may request access to those resources from other IPUs. For example, an IPU on system A may request access to storage on system B managed by another IPU. Remote and local IPUs can work together to establish a connection between the target resources and the local system.


Once the connection and resource mapping is completed, resources can be exposed to the services running in the local compute node using the VF/PF PCIE and CXL Logic. Each of those resources can be offered as VF/PF. The IPU logic can expose to the local host resources that are hosted in the IPU. Examples of resources to expose may include local accelerators, access to services, and the like.


Power Management. Power management is one of the key features to achieve favorable system operational expenditures (OPEXs). The IPU is well positioned to optimize the power that the local system consumes. The Distributed and local power management unit is responsible for metering the power that the system is consuming and the load that the system is receiving, and for tracking the service level agreements that the various services running in the system are achieving for the arriving requests. Likewise, when power efficiencies (e.g., power usage effectiveness (PUE)) are not achieving certain thresholds or the local compute demand is low, the IPU may decide to forward the requests to local services to other IPUs that host replicas of the services. Such power management features may also coordinate with the Brokering and Load Balancing logic discussed above. As will be understood, IPUs can work together to decide where requests can be consolidated to establish higher power efficiency as a system. When traffic is redirected, the local power consumption can be reduced in different ways. Example operations that can be performed include: changing the system to C6 State; changing the base frequencies; performing other adaptations of the system or system components.
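In an example, the decision to redirect requests for power reasons may be sketched as follows. PUE is computed in the conventional way, as total facility power divided by IT equipment power, so lower values are better. The function names and the `pue_limit` and `low_demand` defaults are hypothetical thresholds chosen only for illustration.

```python
def pue(total_facility_power_w: float, it_equipment_power_w: float) -> float:
    """Power usage effectiveness: total power divided by IT power."""
    if it_equipment_power_w <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_power_w / it_equipment_power_w


def should_redirect(pue_value: float,
                    local_demand: float,
                    pue_limit: float = 1.5,
                    low_demand: float = 0.2) -> bool:
    """Return True when requests should be forwarded to peer replicas.

    Mirrors the two conditions in the text: the power efficiency misses
    its threshold, or the local compute demand is low.
    """
    return pue_value > pue_limit or local_demand < low_demand
```

Once requests are redirected, the local system can be moved to a lower power state using the operations listed above.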


Telemetry Metrics. The IPU can generate multiple types of metrics that can be of interest to services, orchestration, or tenants owning the system. In various examples, telemetry can be accessed, including: (i) Out of band via side interfaces; (ii) In band by services running in the IPU; or (iii) Out of band using PCIE or CXL from the host perspective. Relevant types of telemetries can include: Platform telemetry; Service Telemetry; IPU telemetry; Traffic telemetry; and the like.


System Configurations for Distributed Processing


Further to the examples noted above, the following configurations may be used for processing with distributed IPUs:


1) Local IPUs connected to a compute platform by an interconnect (e.g., as shown in the configuration of FIG. 4);


2) Shared IPUs hosted within a rack/physical network—such as in a virtual slice or multi-tenant implementation of IPUs connected via CXL/PCI-E (local), or extension via Ethernet/Fiber for nodes within a cluster;


3) Remote IPUs accessed via an IP Network, such as within certain latency for data plane offload/storage offloads (or, connected for management/control plane operations); or


4) Distributed IPUs providing an interconnected network of IPUs, including as many as hundreds of nodes within a domain.


Configurations of distributed IPUs working together may also include fragmented distributed IPUs, where each IPU or pooled system provides part of the functionalities, and each IPU becomes a malleable system. Configurations of distributed IPUs may also include virtualized IPUs, such as provided by a gateway, switch, or an inline component (e.g., inline between the service acting as IPU), and in some examples, in scenarios where the system has no IPU.


Other deployment models for IPUs may include IPU-to-IPU in the same tier or a close tier; IPU-to-IPU in the cloud (data to compute versus compute to data); integration in small device form factors (e.g., gateway IPUs); gateway/NUC+IPU which connects to a data center; multiple GW/NUC (e.g. 16) which connect to one IPU (e.g. switch); gateway/NUC+IPU on the server; and GW/NUC and IPU that are connected to a server with an IPU.


The preceding distributed IPU functionality may be implemented among a variety of types of computing architectures, including one or more gateway nodes, one or more aggregation nodes, or edge or core data centers distributed across layers of the network (e.g., in the arrangements depicted in FIGS. 2 and 3). Accordingly, such IPU arrangements may be implemented in an edge computing system by or on behalf of a telecommunication service provider (“telco”, or “TSP”), internet-of-things service provider, cloud service provider (CSP), enterprise entity, or any other number of entities. Various implementations and configurations of the edge computing system may be provided dynamically, such as when orchestrated to meet service objectives. Such edge computing systems may be embodied as a type of device, appliance, computer, or other “thing” capable of communicating with other edge, networking, or endpoint components.



FIG. 7 depicts a block diagram of example components in a computing device 750 which can operate as a distributed network processing platform. The computing device 750 may include any combinations of the components referenced above, implemented as integrated circuits (ICs), as a package or system-on-chip (SoC), or as portions thereof, discrete electronic devices, or other modules, logic, instruction sets, programmable logic or algorithms, hardware, hardware accelerators, software, firmware, or a combination thereof adapted in the computing device 750, or as components otherwise incorporated within a larger system. Specifically, the computing device 750 may include processing circuitry comprising one or both of a network processing unit 752 (e.g., an IPU or DPU, as discussed above) and a compute processing unit 754 (e.g., a CPU).


The network processing unit 752 may provide a networked specialized processing unit such as an IPU, DPU, network processing unit (NPU), or other “xPU” outside of the central processing unit (CPU). The processing unit may be embodied as a standalone circuit or circuit package, integrated within an SoC, integrated with networking circuitry (e.g., in a SmartNIC), or integrated with acceleration circuitry, storage devices, or AI or specialized hardware, consistent with the examples above.


The compute processing unit 754 may provide a processor as a central processing unit (CPU) microprocessor, multi-core processor, multithreaded processor, an ultra-low voltage processor, an embedded processor, or other forms of a special purpose processing unit or specialized processing unit for compute operations.


Either the network processing unit 752 or the compute processing unit 754 may be a part of a system on a chip (SoC) which includes components formed into a single integrated circuit or a single package. The network processing unit 752 or the compute processing unit 754 and accompanying circuitry may be provided in a single socket form factor, multiple socket form factor, or a variety of other formats.


The processing units 752, 754 may communicate with a system memory 756 (e.g., random access memory (RAM)) over an interconnect 755 (e.g., a bus). In an example, the system memory 756 may be embodied as volatile (e.g., dynamic random access memory (DRAM), etc.) memory. Any number of memory devices may be used to provide for a given amount of system memory. A storage 758 may also couple to the processor 752 via the interconnect 755 to provide for persistent storage of information such as data, applications, operating systems, and so forth. In an example, the storage 758 may be implemented as non-volatile storage such as a solid-state disk drive (SSD).


The components may communicate over the interconnect 755. The interconnect 755 may include any number of technologies, including industry-standard architecture (ISA), extended ISA (EISA), peripheral component interconnect (PCI), peripheral component interconnect extended (PCIx), PCI express (PCIe), Compute Express Link (CXL), or any number of other technologies. The interconnect 755 may couple the processing units 752, 754 to a transceiver 766, for communications with connected edge devices 762.


The transceiver 766 may use any number of frequencies and protocols. For example, a wireless local area network (WLAN) unit may implement Wi-Fi® communications in accordance with the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard, or a wireless wide area network (WWAN) unit may implement wireless wide area communications according to a cellular, mobile network, or other wireless wide area protocol. The wireless network transceiver 766 (or multiple transceivers) may communicate using multiple standards or radios for communications at a different range. A wireless network transceiver 766 (e.g., a radio transceiver) may be included to communicate with devices or services in the edge cloud 110 or the cloud 130 via local or wide area network protocols.


The communication circuitry (e.g., transceiver 766, network interface 768, external interface 770, etc.) may be configured to use any one or more communication technology (e.g., wired or wireless communications) and associated protocols (e.g., a cellular networking protocol such a 3GPP 4G or 5G standard, a wireless local area network protocol such as IEEE 802.11/Wi-Fi®, a wireless wide area network protocol, Ethernet, Bluetooth®, Bluetooth Low Energy, an IoT protocol such as IEEE 802.15.4 or ZigBee®, Matter®, low-power wide-area network (LPWAN) or low-power wide-area (LPWA) protocols, etc.) to effect such communication. Given the variety of types of applicable communications from the device to another component or network, applicable communications circuitry used by the device may include or be embodied by any one or more of components 766, 768, or 770. Accordingly, in various examples, applicable means for communicating (e.g., receiving, transmitting, etc.) may be embodied by such communications circuitry.


The computing device 750 may include or be coupled to acceleration circuitry 764, which may be embodied by one or more AI accelerators, a neural compute stick, neuromorphic hardware, an FPGA, an arrangement of GPUs, one or more SoCs, one or more CPUs, one or more digital signal processors, dedicated ASICs, or other forms of specialized processors or circuitry designed to accomplish one or more specialized tasks. These tasks may include AI processing (including machine learning, training, inferencing, and classification operations), visual data processing, network data processing, object detection, rule analysis, or the like. Accordingly, in various examples, applicable means for acceleration may be embodied by such acceleration circuitry.


The interconnect 755 may couple the processing units 752, 754 to a sensor hub or external interface 770 that is used to connect additional devices or subsystems. The devices may include sensors 772, such as accelerometers, level sensors, flow sensors, optical light sensors, camera sensors, temperature sensors, global navigation system (e.g., GPS) sensors, pressure sensors, and the like. The hub or interface 770 further may be used to connect the edge computing node 750 to actuators 774, such as power switches, valve actuators, an audible sound generator, a visual warning device, and the like.


In some optional examples, various input/output (I/O) devices may be present within or connected to, the edge computing node 750. For example, a display or other output device 784 may be included to show information, such as sensor readings or actuator position. An input device 786, such as a touch screen or keypad may be included to accept input. An output device 784 may include any number of forms of audio or visual display, including simple visual outputs such as LEDs or more complex outputs such as display screens (e.g., LCD screens), with the output of characters, graphics, multimedia objects, and the like being generated or produced from the operation of the edge computing node 750.


A battery 776 may power the edge computing node 750, although, in examples in which the edge computing node 750 is mounted in a fixed location, it may have a power supply coupled to an electrical grid, or the battery may be used as a backup or for temporary capabilities. A battery monitor/charger 778 may be included in the edge computing node 750 to track the state of charge (SoCh) of the battery 776. The battery monitor/charger 778 may be used to monitor other parameters of the battery 776 to provide failure predictions, such as the state of health (SoH) and the state of function (SoF) of the battery 776. A power block 780, or other power supply coupled to a grid, may be coupled with the battery monitor/charger 778 to charge the battery 776.


In an example, the instructions 782 on the processing units 752, 754 (separately, or in combination with the instructions 782 of the machine-readable medium 760) may configure execution or operation of a trusted execution environment (TEE) 790. In an example, the TEE 790 operates as a protected area accessible to the processing units 752, 754 for secure execution of instructions and secure access to data. Other aspects of security hardening, hardware roots-of-trust, and trusted or protected operations may be implemented in the edge computing node 750 through the TEE 790 and the processing units 752, 754.


The computing device 750 may be a server, an appliance computing device, and/or any other type of computing device with the various form factors discussed above. For example, the computing device 750 may be provided by an appliance computing device that is a self-contained electronic device including a housing, a chassis, a case, or a shell.


In an example, the instructions 782 provided via the memory 756, the storage 758, or the processing units 752, 754 may be embodied as a non-transitory, machine-readable medium 760 including code to direct the processor 752 to perform electronic operations in the edge computing node 750. The processing units 752, 754 may access the non-transitory, machine-readable medium 760 over the interconnect 755. For instance, the non-transitory, machine-readable medium 760 may be embodied by devices described for the storage 758 or may include specific storage units such as optical disks, flash drives, or any number of other hardware devices. The non-transitory, machine-readable medium 760 may include instructions to direct the processing units 752, 754 to perform a specific sequence or flow of actions, for example, as described with respect to the flowchart(s) and block diagram(s) of operations and functionality discussed herein. As used herein, the terms “machine-readable medium”, “machine-readable storage”, “computer-readable storage”, and “computer-readable medium” are interchangeable.


In further examples, a machine-readable medium also includes any tangible medium that is capable of storing, encoding, or carrying instructions for execution by a machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. A “machine-readable medium” thus may include but is not limited to, solid-state memories, and optical and magnetic media. The instructions embodied by a machine-readable medium may further be transmitted or received over a communications network using a transmission medium via a network interface device utilizing any one of a number of transfer protocols (e.g., HTTP).


A machine-readable medium may be provided by a storage device or other apparatus which is capable of hosting data in a non-transitory format. In an example, information stored or otherwise provided on a machine-readable medium may be representative of instructions, such as instructions themselves or a format from which the instructions may be derived. This format from which the instructions may be derived may include source code, encoded instructions (e.g., in compressed or encrypted form), packaged instructions (e.g., split into multiple packages), or the like. The information representative of the instructions in the machine-readable medium may be processed by processing circuitry into the instructions to implement any of the operations discussed herein. For example, deriving the instructions from the information (e.g., processing by the processing circuitry) may include: compiling (e.g., from source code, object code, etc.), interpreting, loading, organizing (e.g., dynamically or statically linking), encoding, decoding, encrypting, unencrypting, packaging, unpackaging, or otherwise manipulating the information into the instructions.


In an example, the derivation of the instructions may include assembly, compilation, or interpretation of the information (e.g., by the processing circuitry) to create the instructions from some intermediate or preprocessed format provided by the machine-readable medium. The information, when provided in multiple parts, may be combined, unpacked, and modified to create the instructions. For example, the information may be in multiple compressed source code packages (or object code, or binary executable code, etc.) on one or several remote servers.


In further examples, a software distribution platform (e.g., one or more servers and one or more storage devices) may be used to distribute software, such as the example instructions discussed above, to one or more devices, such as example processor platform(s) and/or example connected edge devices noted above. The example software distribution platform may be implemented by any computer server, data facility, cloud service, etc., capable of storing and transmitting software to other computing devices. In some examples, the providing entity is a developer, a seller, and/or a licensor of software, and the receiving entity may be consumers, users, retailers, OEMs, etc., that purchase and/or license the software for use and/or re-sale and/or sub-licensing.


Turning now to FIGS. 8-20, these illustrate various mechanisms for accelerator pooling and exposing accelerators as a service. In a distributed IPU environment (e.g., IPU mesh), one or more IPU elements may provide accelerated functions. The IPU-managed accelerators may be pooled into a resource pool that other compute platforms can tap into and borrow from. IPU and non-IPU elements may act as either producers or consumers in such a resource-sharing environment in order to provide or use accelerator services. In addition to contributing its own surplus accelerator resources into a pool from which others can borrow, a particular IPU may also borrow accelerator resources from the pool to which others (either local or remote compute platforms) contribute. Further, the IPU may provide management of such pooling, so that it acts flexibly as producer, consumer, aggregator, and broker of accelerator resources between providers and consumers.


As part of the role of orchestrator, an IPU may allow for smart chaining of tasks based on various attributes. Such attributes may identify the “locality” of the task. This locality may be used to estimate how much overhead may be incurred when transferring data and control from one task to another. Tasks on CPUs and XPUs that request accelerator operations at an IPU may also use a channel application programming interface (API) at the IPU to route data from one accelerator operation to another acceleration operation. Tasks may be implemented as microservices, serverless functions, deployable artifacts, or the like. Thus, the IPU removes the need to move data across links when the channel being established is local to accelerator slices at the same IPU, and it also removes the need to route data across sidecars, or other intermediaries because the chained flows can be set up from accelerator to accelerator, whether by local CXL or some other high-speed fabric, by memory pooling, or over network transport.
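In an example, a "locality" attribute of the kind described above may be sketched as a coarse classification of where a producer task and a consumer task are placed. The three levels, the `Placement` fields, and the assumption that a lower level implies lower transfer overhead are illustrative only.

```python
from enum import IntEnum
from typing import NamedTuple


class Locality(IntEnum):
    """Coarse locality levels; lower values imply less transfer overhead."""
    SAME_IPU = 0   # channel is local to accelerator slices at one IPU
    SAME_HOST = 1  # e.g., local CXL or another high-speed fabric
    REMOTE = 2     # data crosses a network transport


class Placement(NamedTuple):
    host: str
    ipu: str


def locality(producer: Placement, consumer: Placement) -> Locality:
    """Classify how far data must travel between two chained tasks."""
    if producer == consumer:
        return Locality.SAME_IPU
    if producer.host == consumer.host:
        return Locality.SAME_HOST
    return Locality.REMOTE
```

Because the levels are ordered, a scheduler could compare them directly when deciding which of several candidate chains incurs the least overhead.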


When acting as an orchestrator or scheduler, an IPU may operate based on characteristics of service level policies (e.g., SLAs or SLOs). For instance, the IPU may select a scheduling policy under which data flows from task_1 to task_2 in a way that maximizes the effectiveness of local caching of the producer-to-consumer streams, rather than a policy under which flows between tasks require data to be spilled to and filled from memory or storage. The indirect reads and writes to memory or storage are relatively slow and increase latency along with increasing memory requirements.


The aggregation of accelerator functions may be both at a hardware level in provisioning and switching setup, and at a higher, software/driver level where the IPU acts as a capacity aggregator to dispatch accelerator invocations transparently between software and pooled versions of hardware.


The IPU may expose accelerator functions as a service (XFaaS).


In such an implementation, the IPU offloads from the CPU the responsibility of mapping an event to an accelerated function, so that the event is simply forwarded to the IPU. The IPU then handles the invocation of the corresponding accelerator function. This is possible since there is no long duration state with a function (unlike a stateful service on a CPU which in general may not be movable to an IPU).


The accelerator functions may be virtualized. In such an implementation, the user of an accelerator is able to invoke a standard interface and discover or direct the accelerator capability using a simple, parameterized call instead of having to invoke the accelerator function in a low-level manner. The IPU maps the parameters in the call to the type of acceleration that needs to be performed. This is synergistic with pooled use of acceleration resources and XFaaS because it hides the low-level minutiae of how the acceleration is being performed, where it is being performed, etc., from the invoker.
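In an example, the parameterized call described above may be sketched as a single entry point that maps a requested kind of acceleration to a backend. The `invoke` function and the `_BACKENDS` table are hypothetical, and ordinary Python standard-library routines stand in for the accelerator functions, which in practice would be hardware-backed.

```python
import hashlib
import zlib
from typing import Callable, Dict

# Software stand-ins for accelerated functions (assumption for illustration).
_BACKENDS: Dict[str, Callable[[bytes], bytes]] = {
    "compress": zlib.compress,
    "digest": lambda data: hashlib.sha256(data).digest(),
}


def invoke(kind: str, payload: bytes) -> bytes:
    """Map the 'kind' parameter to an acceleration backend and run it."""
    try:
        backend = _BACKENDS[kind]
    except KeyError:
        raise ValueError(f"no accelerated function registered for {kind!r}") from None
    return backend(payload)
```

The caller names only what it wants done; how and where the operation runs stays hidden behind the table lookup.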



FIG. 8 is a block diagram illustrating the general flow for providing acceleration as a service, according to an example. To provide acceleration as a service, there are three main phases: capture the task dependencies in a graph (operation 802), map the graph to available acceleration functions (operation 804), and orchestrate the flow of data between CPU and non-CPU based functions (operation 806).


At 802, the logical organization of a solution is captured in the form of a graph of computations (tasks). These computations (tasks) may be serverless functions, stateful serverless, or microservices.


At 804, the graph of computations is mapped to available accelerator functions at some fine granularity of a cluster. For instance, a clique comprising between 1K and 10K cores may be used as an accelerator service where execution can be considered to be tightly coupled from a scheduling perspective.


At 806, the flow of data between the CPU based and non-CPU based functions or microservices is scheduled and optimized, with inter-operable networking capabilities from IPUs in the clique, so that clique computations flow seamlessly between CPU-XPU implemented microservices with very light mediation by CPU based software.



FIG. 9 depicts the decomposition or refactoring of a compound service into its component microservices, functions, etc., shown as circles R, S, . . . , Z. Here, FIG. 9 is used as a common example through the remainder of this writeup. Unless it is necessary to be specific, the term “Task” will refer to these component functions, microservices, etc. Performing multiple tasks in a prescribed sequence may be used to fulfil or complete a job or workload instance.



FIG. 10 depicts a number of “events” or triggers (a, b, . . . , g) that either cause the various tasks R-Z to be triggered, resumed from a waiting state, or interrupted in order to respond to some event. Triggers or events that cause one task to be resumed or be otherwise affected in some way may be generated externally or may arise from the execution of one of the other tasks R-Z. Thus, trigger g causes tasks S and W to be notified and consequently activate or begin processing in some way; similarly, trigger b causes task Y to be notified and consequently activate or begin processing.
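In an example, the relationship between triggers and notified tasks may be sketched as a lookup table. Only the two mappings stated above (trigger g to tasks S and W, and trigger b to task Y) are included; the table and function names are hypothetical.

```python
from typing import Dict, List

# Trigger-to-task subscriptions stated in the description of FIG. 10.
SUBSCRIPTIONS: Dict[str, List[str]] = {
    "g": ["S", "W"],
    "b": ["Y"],
}


def tasks_to_notify(trigger: str) -> List[str]:
    """Return the tasks to notify for a trigger (empty if none are known)."""
    return list(SUBSCRIPTIONS.get(trigger, []))
```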



FIG. 11 depicts various data that is produced, consumed, or produced and consumed by the different tasks R-Z in a job. Data produced by one or more of these tasks may be stored in a distributed datastore (e.g., a database, a datalake, a datastream, etc.). Similarly, data used by one of these tasks as input may be retrieved from the distributed data store.



FIG. 12 depicts the execution of a task that may generate triggers that affect other tasks (shown by dashed arrows). Similarly, FIG. 12 shows the actual data dependencies (e.g., production/consumption relationships) between tasks with the solid thick arrows.


In FIG. 11, each task was sourcing or sinking data that it respectively consumed or produced into a datalake, a datastream, etc., and FIG. 12 shows the actual logical producer-consumer relationships between tasks in which the datalakes, datastreams, etc. of FIG. 11 are carriers of the data transiting between the producers and consumers. Not all data that is produced or consumed needs to come from the execution of another task. For example, as shown in FIG. 12, any data produced by the execution of tasks Y, V, Z, S, and U is not consumed by any of the other tasks, and tasks X, U, and S do not consume data produced by any of the other tasks. FIG. 12 depicts the dataflow and execution dependencies of the tasks as shown graphically, which may be called a flowgraph. A flowgraph is a representation, using graph notation, of all paths that may be traversed through a program during its execution.
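In an example, a flowgraph may be sketched as a mapping from each producing task to the tasks that consume its data. The sketch below identifies tasks whose output is not consumed by any other task and tasks that consume nothing from other tasks. The function names are hypothetical, and any task labels supplied to them are placeholders rather than the edges of FIG. 12.

```python
from typing import Dict, Set

Flowgraph = Dict[str, Set[str]]  # producer task -> set of consumer tasks


def all_tasks(graph: Flowgraph) -> Set[str]:
    """Collect every task that appears as a producer or a consumer."""
    found = set(graph)
    for consumers in graph.values():
        found |= consumers
    return found


def tasks_without_downstream(graph: Flowgraph) -> Set[str]:
    """Tasks whose produced data is not consumed by any other task."""
    return {task for task in all_tasks(graph) if not graph.get(task)}


def tasks_without_upstream(graph: Flowgraph) -> Set[str]:
    """Tasks that do not consume data produced by any other task."""
    consumed = set().union(*graph.values())
    return all_tasks(graph) - consumed
```

For a hypothetical graph `{"A": {"B", "C"}, "B": {"C"}, "D": set()}`, tasks C and D have no downstream consumers, and tasks A and D have no upstream producers.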



FIG. 13 depicts a process 1300 of flow optimization, according to an example. A flowgraph is accessed (operation 1302). The flowgraph may be specified to IPUs through an interface that is supported by either hardware or software logic at one or more IPUs (operation 1304). Those tasks that are able to be accelerated by available acceleration resources, along with tasks that are able to be alternatively implemented in classic (traditional) CPU-based software logic, are then seamlessly initiated in a clique-wide distribution by the IPU-based acceleration-as-a-service logic (operation 1306). The operation of this logic also achieves flow optimization (operation 1308). Optimization may attempt to achieve various goals, such as minimizing the amount of data movement, minimizing latency, maximizing resource utilization, or maximizing the capacity of available acceleration resources. Other goals may be used in determining what is considered a local optimization. For instance, data flow optimization may mean either eliminating or reducing to a minimum the amount of data that is moved from producing tasks into a datalake/datastore/datastream of FIG. 13, when that data is just ephemeral and is consumed by one of the other tasks as shown by the dataflow relationships in FIG. 12, which are described to the service logic by the specifying of the flowgraph in FIG. 13.
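In an example, one of the optimization goals named above, minimizing data movement, may be sketched as an exhaustive search over task placements. This is an illustrative brute-force sketch and not the optimization method of operation 1308: the function names, the per-host `capacity` limit, and the per-edge traffic sizes are assumptions, and exhaustive search is only practical for very small flowgraphs.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

Edge = Tuple[str, str]  # (producer task, consumer task)


def bytes_moved(placement: Dict[str, str], traffic: Dict[Edge, int]) -> int:
    """Sum the data on edges whose endpoints run on different hosts."""
    return sum(size for (src, dst), size in traffic.items()
               if placement[src] != placement[dst])


def best_placement(tasks: Sequence[str],
                   hosts: Sequence[str],
                   traffic: Dict[Edge, int],
                   capacity: int) -> Tuple[Optional[Dict[str, str]], Optional[int]]:
    """Try every assignment; keep the feasible one moving the least data."""
    best, best_cost = None, None
    for combo in product(hosts, repeat=len(tasks)):
        # Assumed constraint: each host can run at most 'capacity' tasks.
        if any(combo.count(host) > capacity for host in hosts):
            continue
        placement = dict(zip(tasks, combo))
        cost = bytes_moved(placement, traffic)
        if best_cost is None or cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost
```

With three hypothetical tasks on two hosts of capacity two, the search co-locates the pair of tasks that exchange the most data, so only the lighter edge crosses hosts.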


Next, for convenience of description, a subset of the flowgraph illustrated in FIG. 12 is depicted in FIG. 14 to show some of the additional mechanisms and data structures. In particular, FIG. 14 shows that subset of FIG. 12, in which fewer tasks and edges are shown. For each task such as R, T, Y, and W, a chart shown in FIG. 15 contains a correspondence between the logical ID of the task and the corresponding logical IDs of available accelerator implementations for that task. Thus, for a logical task R, three possible implementations (artifacts) exist. The first is R0, which is a CPU-based software function (e.g., source, binary, or intermediate form). The second is R1, which is in the form of a GPU-oriented software function (e.g., in one of source/binary/intermediate forms). The third is R2 and is in the form of an FPGA bitstream. The IDs R0, R1, and R2 may be URIs, URLs, UUIDs, etc. that help distinguish, identify, and locate the artifacts. The artifacts may be optionally replicated for easy distributed availability so that the service can launch a standard (unaccelerated) version of R as R0 on a CPU or an accelerated version of R as version R1 on a GPU. Alternatively, the task may be implemented as FPGA version R2 on one or more FPGAs that the service can obtain from a distributed clique orchestration service such as K8s, Openstack, etc. Generally, artifacts include files used to create and run an application. As such, an artifact may refer to compiled code, an executable file, a bytecode file, a configuration binary, a bitstream, or the like.


When referring to Java, an artifact may be identified by name, version, and scope. Scope may indicate whether it is a user artifact that exists in a namespace and cannot be accessed by other namespaces, or a system scope artifact, which can be accessed by any namespace. In Kubernetes, artifacts represent resources that are updated as a part of a deployment pipeline.



FIG. 15 also shows, for example, that task Y can only be run as CPU-based software logic using a software artifact Y0. Conversely, task T may run in the form of a traditional CPU-based program using artifact T0, on an FPGA in the form of a bitstream artifact T1, or on a special purpose ASIC in the form of an artifact T2. Artifacts may be in the form of a bitstream, bit file, programming file, executable file, binary file, or other configuration file used to configure an FPGA, CGRA, ASIC, or general CPU to execute an acceleration operation.
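The task-to-artifact correspondence described for FIG. 15 can be pictured as a small lookup table. The following Python sketch is illustrative only: the `Artifact` class, the `CATALOG` name, and the helper functions are hypothetical and not part of the disclosure. The entries for R, T, and Y follow the text above; task W is omitted because its full artifact list is not given in this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str  # in practice this may be a URI, URL, or UUID
    task: str         # logical task ID
    target: str       # kind of hardware the artifact runs on


# Hypothetical in-memory form of the FIG. 15 chart (R, T, and Y only).
CATALOG = {
    "R": [Artifact("R0", "R", "CPU"), Artifact("R1", "R", "GPU"), Artifact("R2", "R", "FPGA")],
    "T": [Artifact("T0", "T", "CPU"), Artifact("T1", "T", "FPGA"), Artifact("T2", "T", "ASIC")],
    "Y": [Artifact("Y0", "Y", "CPU")],
}


def artifacts_for(task, available_targets):
    """Return the artifacts of a task that can run on the hardware at hand."""
    return [a for a in CATALOG.get(task, []) if a.target in available_targets]


def accelerated_options(task):
    """Return IDs of the non-CPU (accelerated) artifacts of a task."""
    return [a.artifact_id for a in CATALOG.get(task, []) if a.target != "CPU"]
```

With this table, a host that offers only CPUs and GPUs can run R as R0 or R1, while task Y has no accelerated option at all.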


A registry of artifacts may be stored in a distributed database, a centralized database, or otherwise available to one or more IPUs that are orchestrating and scheduling tasks. The registry may include identifiers of the artifacts, their location, and other metadata about the artifacts, such as reliability, capability, security features, geographical location, service costs, and the like. The registry may be stored in a datastore, datalake, database, or the like, such as illustrated and described in FIGS. 11, 17, and 18.



FIG. 16 depicts an undirected dataflow graph of the tasks R, T, Y, and W, according to an example. In particular, FIG. 16 illustrates a logical dataflow relationship in the form of an undirected edge adjacency list for each task. The undirected dataflow graph can be used to schedule and orchestrate when tasks should be set into execution, along with where and how ephemeral data should flow between the executing instances of those tasks in order to optimize the latencies, schedules, and resources (e.g., memory, storage, network allocations, etc.) at the respective execution resources hosting the accelerated/unaccelerated versions of their artifacts. In general, for each logical task, there may be multiple different logical artifacts, and each logical artifact may be capable of being instanced (i.e., set up as an executing instance) on hardware assets like CPUs, FPGAs, GPUs, ASICs, etc. on different hosts. This information is available in resource databases such as etcd (see https://etcd.io/) or others in an orchestration system, and is used in the flow optimization step referred to in operation 1308 of FIG. 13.
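As a minimal sketch of an undirected edge adjacency list, the Python below builds one from producer-consumer pairs. The edge list is an assumption inferred from the data movements described for FIG. 17 (R to T, T to Y, T to W, W to R); FIG. 16 itself is not reproduced here, and the function and variable names are hypothetical.

```python
from collections import defaultdict

# Assumed producer -> consumer pairs for the subset graph (R, T, Y, W).
EDGES = [("R", "T"), ("T", "Y"), ("T", "W"), ("W", "R")]


def undirected_adjacency(edges):
    """Build a per-task adjacency list that ignores edge direction."""
    adjacency = defaultdict(set)
    for producer, consumer in edges:
        adjacency[producer].add(consumer)
        adjacency[consumer].add(producer)
    # Sort for a stable, readable result.
    return {task: sorted(neighbors) for task, neighbors in sorted(adjacency.items())}


ADJACENCY = undirected_adjacency(EDGES)
```

Each entry lists the tasks that exchange data with a given task, which is the information a scheduler needs when deciding which tasks to place close together.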



FIG. 17 depicts the transformation from an unoptimized dataflow graph 1700A to an optimized version 1700B of acceleration as a service as implemented by the agency of IPUs, for the subset graph shown in FIG. 14.


For the purposes of this discussion, each IPU illustrated in FIG. 17 is indicated with a numerical index identifier, such as “IPU4.” The numerical index identifier corresponds to a host on which the IPU is colocated. So, in the case of IPU4, it is considered to be on “host 4.”



FIG. 17 illustrates an unoptimized model 1700A where task R runs as R0 1702 software in host 1, with IPU1 1704. Note that from the chart in FIG. 15, it is understood that the artifact R0 1702 is a CPU software artifact and as such, runs on one or more CPUs at host 1.


Further consulting the table in FIG. 15 in combination with the dataflow graph 1700A, task T runs as artifact T2 1712 on an ASIC available in host 4, with IPU4 1714, task Y runs as Y0 1722 software on CPUs in host 1, with IPU1 1704, and task W runs as FPGA-accelerated W1 1732 in host 7, with IPU7 1734.


The dataflow graph 1700A illustrates how data is received by an IPU (e.g., IPU1 1704), provided as input to a task (e.g., R0 1702), and how the resultant data is then returned to the IPU from the task, which may transmit the resultant data back to the datastore, database, datastream, or other storage. Consequently, in the unoptimized model 1700A, each execution consumes its data from the datastore/datastream 1740 (available as a distributed storage service) and produces its results and pushes them into the datastore/datastream for use by other tasks.


Among other services, one service that the respective IPUs in the different hosts perform is that of sourcing or sinking the data into that datastore/datastream, so that the execution logic of each instance is not burdened with this responsibility, and also so that some other CPU(s) in each of the hosts does not have to be interrupted in order to tend to these data movement operations.


A more optimized or streamlined flow 1700B achieved by computing placement and routing is illustrated in FIG. 17. In this optimized version 1700B, R0 1702 and Y0 1722 are moved to host 4 (they were executing on host 1) to be closer to T2 1712, which is also in host 4. Other flow actions are also undertaken so that: 1) the output of R0 1702 is conveyed to T2 1712 by IPU4 1714; 2) the output of T2 1712 is also conveyed to Y0 1722 by IPU4 1714, and directly to IPU7 1734 by IPU4 1714; 3) the output of Y0 1722 is conveyed to datastore/datastream 1740 directly by the host software on host 4 which executes Y0 1722, because there is no latency criticality and the CPUs already have the data available for streaming into the datastore; and 4) the output of W1 1732 (the FPGA-based implementation) in host 7 is fed into IPU4 1714 by IPU7 1734, for input to R0 1702.
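One simple way to compare the two placements is to count the dataflow edges whose producer and consumer sit on different hosts. The Python sketch below uses the host assignments described for 1700A and 1700B; the metric itself is an illustrative proxy chosen here, not the optimization formulation of the disclosure, and all names are hypothetical.

```python
# Producer -> consumer pairs taken from the flow actions described above.
EDGES = [("R", "T"), ("T", "Y"), ("T", "W"), ("W", "R")]

# Task -> host index, as described for the unoptimized and optimized models.
UNOPTIMIZED = {"R": 1, "T": 4, "Y": 1, "W": 7}
OPTIMIZED = {"R": 4, "T": 4, "Y": 4, "W": 7}


def cross_host_edges(placement, edges=EDGES):
    """Return the edges whose data must leave the producing host."""
    return [(p, c) for p, c in edges if placement[p] != placement[c]]
```

Under this proxy, the unoptimized placement has four cross-host edges, while the optimized placement keeps only the two edges that involve W on host 7.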


Secondary types of flow optimizations are possible but are not shown in the above example for simplicity. They include, for example, choosing power efficient alternative implementations, choosing latency-optimized implementations, choosing performance/watt or performance/$/watt and other such criteria (and essentially mapping in different resource-artifact combinations into the flow optimization formulation) based on any additional requirements specified in the task. These are considered as using layering of an optimization strategy that targets an evolving number of cost functions according to specified cost parameters between an application cohort and the IPU-XaaS service.
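A secondary criterion such as performance/watt or performance/$/watt can be treated as an interchangeable scoring function over candidate implementations. The Python sketch below is illustrative only: the throughput, power, and price figures are invented, and reading perf/$/watt as throughput divided by both cost and power is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    artifact_id: str
    throughput: float        # work items per second (hypothetical figure)
    watts: float             # power draw (hypothetical figure)
    dollars_per_hour: float  # service cost (hypothetical figure)


def perf_per_watt(c):
    return c.throughput / c.watts


def perf_per_dollar_watt(c):
    # Assumed reading of "perf/$/watt".
    return c.throughput / (c.dollars_per_hour * c.watts)


def best(candidates, metric):
    """Pick the candidate that maximizes the chosen cost function."""
    return max(candidates, key=metric).artifact_id


CANDIDATES = [
    Candidate("T0", throughput=100.0, watts=50.0, dollars_per_hour=1.0),
    Candidate("T1", throughput=400.0, watts=80.0, dollars_per_hour=4.0),
    Candidate("T2", throughput=600.0, watts=100.0, dollars_per_hour=10.0),
]
```

With these invented numbers, the ASIC artifact T2 wins on performance/watt, while the CPU artifact T0 wins once cost is included, which shows how the chosen cost function changes the selected resource-artifact combination.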



FIG. 18 illustrates a database of information (also referred to as a registry) that stores each instance and type of logical artifact such as T0, T1, and T2 for a given task T. These instances (i.e., execution capable resources) support the execution of the task. The registry database contains various information about the instances, such as throughput (capacity) of the implementation, the current running average of the utilization of those resources, and many other secondary flow optimization objectives referred to in the previous paragraph. This registry database is also referred to as XDB (for accelerator database) and may be stored as a distributed eventually consistent datastore available to the various orchestration modules that run on CPUs or IPUs in the clique for looking up and making decisions about flow optimization.
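A lookup against such a registry might resemble the following Python sketch. The record layout, the host numbers, and the capacity and utilization values are hypothetical; only the idea of storing capacity and running-average utilization per instance comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    task: str
    artifact_id: str
    host: int
    capacity: float     # throughput the instance can sustain (hypothetical units)
    utilization: float  # running average, from 0.0 to 1.0

    @property
    def headroom(self):
        """Capacity that is not currently in use."""
        return self.capacity * (1.0 - self.utilization)


# Hypothetical registry contents for task T.
XDB = [
    Instance("T", "T0", host=1, capacity=100.0, utilization=0.20),
    Instance("T", "T1", host=3, capacity=400.0, utilization=0.90),
    Instance("T", "T2", host=4, capacity=600.0, utilization=0.50),
]


def pick_instance(task, registry, demand):
    """Return the instance with the most headroom that can absorb the demand."""
    fits = [i for i in registry if i.task == task and i.headroom >= demand]
    return max(fits, key=lambda i: i.headroom, default=None)
```

Because the datastore is described as eventually consistent, a real orchestrator would need to tolerate stale utilization values; this sketch ignores that concern.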



FIG. 19 depicts various functional components of an IPU 1900, according to an example. Although the functions are illustrated as being contained within a single IPU, in some examples, the functions are provided collectively across multiple IPUs in a clique of hosts through a common API that is implemented between the CPUs and the IPUs of the hosts. In an example, a CPU contains a software version of an IPU (essentially a virtualized IPU) so that in hosts that do not contain a discrete IPU, the IPU capabilities may be made available in a virtual form by the CPU. As such, various functionalities described with respect to IPU 1900 may be provided in software in a virtual IPU 1950 executed on a CPU 1952 on a host 1954.


The IPU 1900 includes a transfer logic 1902, a repossess logic 1904, an aggregate logic 1906, a disaggregate logic 1908, an allocate logic 1910, a deallocate logic 1912, an acquire logic 1914, a release logic 1916, a flow route selection logic 1918, and a software implementation offload logic 1920. Additional IPU functions may be provided by auxiliary logic 1922. The various logic described in FIG. 19 read and store information and data in an accelerator database (XDB) 1940. A registry of artifacts may be stored in the XDB 1940. Further, the XDB 1940 may be used to store telemetry data, service level agreements (SLAs), service level objectives (SLOs), and other information used to optimize workflow, distribute tasks to resources, assign or instantiate artifacts at resources, lend and retrieve resources, aggregate resources, and handle offloading of CPU tasks to IPUs.


The transfer logic 1902 may also be referred to as a lending logic in that it provides the functionality to transfer or lend resources from the IPU to other CPUs or IPUs. The repossess logic 1904 may also be referred to as a “retake” logic or recover logic, in that it provides the functionality to repossess control of resources that were lent or transferred. Together, the transfer logic 1902 and repossess logic 1904 provide the borrowing or lending of resources available at a remote peer in order to achieve scaling of an acceleration-based service that can be scaled by creating multiple smaller capacity instances between which the IPUs perform load balancing. Such borrowing or lending may be scheduled in advance and extended as needed. Additionally, the repossess logic 1904 may be configured to retake previously lent resources when they are either returned early, or when the time duration of lending closes and the borrower implicitly or explicitly vacates the use of the resource.
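The lend, extend, and repossess behavior can be sketched as a small lease table. The Python below is a hypothetical illustration, not the IPU's actual interface: the `Lender` class, its method names, and the use of explicit timestamps are all assumptions made to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Loan:
    resource: str
    borrower: str
    expires_at: float


@dataclass
class Lender:
    free: set
    loans: dict = field(default_factory=dict)

    def transfer(self, resource, borrower, now, duration):
        """Lend a free resource to a borrower for a fixed duration."""
        if resource not in self.free:
            raise ValueError(f"{resource} is not available to lend")
        self.free.remove(resource)
        self.loans[resource] = Loan(resource, borrower, now + duration)

    def extend(self, resource, extra):
        """Extend an existing loan as needed."""
        self.loans[resource].expires_at += extra

    def repossess(self, now, returned=()):
        """Retake resources that were returned early or whose loan has lapsed."""
        reclaimed = sorted(
            r for r, loan in self.loans.items()
            if r in returned or loan.expires_at <= now
        )
        for r in reclaimed:
            del self.loans[r]
            self.free.add(r)
        return reclaimed
```

A resource returns to the free pool either when its loan expires or when the borrower hands it back early, matching the two cases described above.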


The aggregate logic 1906 is configured to relate resources together as an aggregated resource. In a corresponding manner, the disaggregate logic 1908 removes the relationships between individual resources. Together, the aggregate logic 1906 and disaggregate logic 1908 provide local aggregation of available resources for acceleration in order to scale up local capacity for a given task. Such aggregation may be homogeneous (e.g., aggregation is across identical artifact ID and using multiple tiles of the hardware resource for that artifact), or heterogeneous (e.g., different artifacts of the same logical task are combined across different hardware resources such as CPUs, GPUs, and FPGAs). The local aggregation/disaggregation also handles hot plug capability so that acceleration capabilities may be dynamically furnished to a clique through hardware upgrades, or CXL-based access to resource pools in which new acceleration capabilities are made available by bringing more racks, pods, etc. online.


The allocate logic 1910 is configured to allocate resources of an accelerator artifact to a task. In a corresponding manner, the deallocate logic 1912 removes associations between resources and tasks. In particular, together, the allocate logic 1910 and deallocate logic 1912 provide the allocation or binding of resources such as CPUs, GPUs, FPGAs, ASICs, etc., in order to produce a running (executing) version of an artifact on the allocated resources, and releasing the resources upon completion or suspension of the task.


The acquire logic 1914 is configured to find and assign resources. Correspondingly, the release logic 1916 is configured to release any assignment of resources. Together, the acquire logic 1914 and release logic 1916 are used to acquire and release locally-available resources for short-term purposes, such as when a task is just a serverless function and it is not necessary to allocate and deallocate resources for a long duration.
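Short-lived acquire and release pairs map naturally onto a scoped construct. The Python sketch below uses a context manager so that the resource is released even if the short task fails; the `LocalPool` class and the resource naming scheme are hypothetical.

```python
from contextlib import contextmanager


class LocalPool:
    """Hypothetical pool of locally available resources, named by kind."""

    def __init__(self, resources):
        self._free = list(resources)

    @contextmanager
    def acquire(self, kind):
        # Find the first free resource whose name starts with the requested kind.
        match = next((r for r in self._free if r.startswith(kind)), None)
        if match is None:
            raise RuntimeError(f"no free {kind} resource")
        self._free.remove(match)
        try:
            yield match
        finally:
            # Release happens even if the short-lived task raises.
            self._free.append(match)

    def free(self):
        return sorted(self._free)
```

For a serverless-style function, the body of the `with` block is the whole lifetime of the assignment, so no separate allocate and deallocate bookkeeping is needed.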


The flow route selection logic 1918 is used to select a flowgraph and provision an optimized implementation. An example of the functionality of the flow route selection logic 1918 is provided in FIGS. 17 and 18 above. The flowgraph and implementation control flows may be stored in the XDB 1940.


The software implementation offload logic 1920 is used to offload a workload from a CPU-based artifact to an IPU-based artifact. This is sometimes desirable not just for managing the CPU's burden, but also to streamline communication and dataflow when a task is very light and it may save data transfers between the IPU and a CPU by just having the IPU perform the CPU's work instead.


The IPU 1900 provides a multitude of technical advantages including 1) transparent merging of acceleration resources through an as-a-service consumption format; 2) optimized dataflows and low latencies; 3) low or no burden on CPUs for managing data flows between different types of acceleration resources that are pressed into service; and 4) streamlined software because application logic does not have to be concerned with physical movements of data.


Additionally, IPU and acceleration pooling may leverage 5G and other wireless architectures. Therefore, pooling may be included as part of the UPF for performing traffic steering. This functionality can be leveraged to expose remote accelerators via UPF.


The IPU and acceleration pooling can also include security aspects that can be used to perform the chain of tasks. These aspects may include attestation, trust, etc. Such aspects may apply when there are pooled resources that are distributed from edge to cloud and the security and trust boundaries are different.


Finally, the IPU and acceleration pooling can also include a mapping into a K8s or cloud native architecture. For instance, K8s plugins and operators can manage accelerators that are managed or exposed by the IPUs, enabling many more K8s construct types.



FIG. 20 is a flowchart illustrating a method 2000 for orchestrating acceleration functions in a network compute mesh, according to an example. A network compute mesh includes a plurality of compute nodes, where each node includes at least a central processing unit (CPU) or set of CPUs. Some compute nodes in the network compute mesh may include network-addressable processing units or networked processing units, such as IPUs or DPUs.


A network-addressable processing unit (also referred to as a networked processing unit) is a processing unit that has a unique network address and is able to process network traffic. A network-addressable processing unit may work in concert with other processing units on a compute node. For instance, a network-addressable processing unit may be integrated with a network interface card (NIC) and process network traffic for a general CPU. Although a network-addressable processing unit may provide network management facilities, a network-addressable processing unit may also be used to offload workloads from a CPU, expose accelerator functions to other network-addressable processing units, and orchestrate workflows between CPUs and network-addressable processing units on various compute nodes in the network compute mesh. In some implementations, a network-addressable processing unit may have a distinct separate network address from the host that the network-addressable processing unit is installed within so that the network-addressable processing unit is separately addressable from the host and can process network traffic that is not for the host.


Compute nodes in a network compute mesh may be organized into cliques. A clique is a group of two or more compute nodes where each of the compute nodes in the clique is adjacent to every other node in the clique. This tight communication coupling allows for lightweight workflow administration.
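The clique property can be checked directly from a list of node-to-node links. The Python sketch below is illustrative; the function name and the sample link list are hypothetical.

```python
from itertools import combinations


def is_clique(nodes, links):
    """Return True if there are at least two nodes and every pair is adjacent."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return False
    linked = {frozenset(link) for link in links}  # direction does not matter
    return all(frozenset(pair) in linked for pair in combinations(nodes, 2))


# Hypothetical adjacency between hosts, identified by host index.
LINKS = [(1, 4), (4, 7), (1, 7), (7, 9)]
```

Here hosts 1, 4, and 7 form a clique, while hosts 1, 4, and 9 do not because host 9 is adjacent only to host 7.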


At 2002, the method 2000 includes accessing a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload.


At 2004, the method 2000 includes identifying available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, where an artifact is an instance of a function to perform a task of the plurality of tasks.


In an embodiment, the artifact comprises a bitstream to program a field-programmable gate array (FPGA). In an embodiment, the artifact comprises an executable file to execute on a central processing unit (CPU). In an embodiment, the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).


In an embodiment, the resources comprise a central processing unit. In an embodiment, the resources comprise a network-accessible processing unit. In an embodiment, the resources comprise a graphics processing unit. In an embodiment, the resources comprise an application specific integrated circuit (ASIC). In an embodiment, the resources comprise a field-programmable gate array (FPGA). In an embodiment, the resources comprise a coarse-grained reconfigurable array (CGRA).


At 2006, the method 2000 includes determining a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph.


In an embodiment, determining the configuration includes analyzing a service level objective (SLO) and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.


In an embodiment, determining the configuration includes performing one or more of minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.


At 2008, the method 2000 includes scheduling, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources. In an embodiment, scheduling the plurality of tasks includes communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit. In a further embodiment, scheduling the plurality of tasks includes lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.


In an embodiment, a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.
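The four operations of method 2000 can be strung together as in the following Python sketch. Everything concrete in it is an assumption made for illustration: the two-task flowgraph, the latency figures, the greedy lowest-latency selection, and the treatment of the SLO as a bound on the sum of per-task latencies along a serial chain.

```python
# Hypothetical candidates: task -> list of (artifact_id, resource_kind, latency_ms).
CANDIDATES = {
    "R": [("R0", "CPU", 8.0), ("R1", "GPU", 3.0)],
    "T": [("T0", "CPU", 9.0), ("T2", "ASIC", 2.0)],
}


def access_flowgraph():
    """Operation 2002: obtain tasks and producer-consumer edges."""
    return {"tasks": ["R", "T"], "edges": [("R", "T")]}


def identify(flowgraph, available_kinds):
    """Operation 2004: keep artifacts that have a matching available resource."""
    return {
        t: [c for c in CANDIDATES[t] if c[1] in available_kinds]
        for t in flowgraph["tasks"]
    }


def determine_configuration(options, slo_ms):
    """Operation 2006: choose the lowest-latency option per task, then check the SLO."""
    config = {}
    for task, candidates in options.items():
        if not candidates:
            raise ValueError(f"no runnable artifact for task {task}")
        config[task] = min(candidates, key=lambda c: c[2])
    total = sum(c[2] for c in config.values())
    if total > slo_ms:
        raise ValueError(f"configuration needs {total} ms, SLO is {slo_ms} ms")
    return config


def schedule(flowgraph, config):
    """Operation 2008: emit (task, artifact, resource) in flowgraph order."""
    return [(t, config[t][0], config[t][1]) for t in flowgraph["tasks"]]
```

A caller would run the steps in order, for example `schedule(fg, determine_configuration(identify(fg, {"CPU", "ASIC"}), slo_ms=12.0))` with `fg = access_flowgraph()`. A real implementation would weigh the other goals listed above, such as data movement and power, rather than latency alone.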


Although these implementations have been described concerning specific exemplary aspects, it will be evident that various modifications and changes may be made to these aspects without departing from the broader scope of the present disclosure. Many of the arrangements and processes described herein can be used in combination or in parallel implementations that involve terrestrial network connectivity (where available) to increase network bandwidth/throughput and to support additional edge services. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show, by way of illustration, and not of limitation, specific aspects in which the subject matter may be practiced. The aspects illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other aspects may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various aspects is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.


Such aspects of the inventive subject matter may be referred to herein, individually and/or collectively, merely for convenience and without intending to voluntarily limit the scope of this application to any single aspect or inventive concept if more than one is disclosed. Thus, although specific aspects have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific aspects shown. This disclosure is intended to cover any adaptations or variations of various aspects. Combinations of the above aspects and other aspects not specifically described herein will be apparent to those of skill in the art upon reviewing the above description.


Embodiments may be implemented in one or a combination of hardware, firmware, and software. Embodiments may also be implemented as instructions stored on a machine-readable storage device, which may be read and executed by at least one processor to perform the operations described herein. A machine-readable storage device may include any non-transitory mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable storage device may include read-only memory (ROM), random-access memory (RAM), magnetic disk storage media, optical storage media, flash-memory devices, and other storage devices and media.


Examples, as described herein, may include, or may operate on, logic or a number of components, such as modules, intellectual property (IP) blocks or cores, or mechanisms. Such logic or components may be hardware, software, or firmware communicatively coupled to one or more processors in order to carry out the operations described herein. Logic or components may be hardware modules (e.g., IP block), and as such may be considered tangible entities capable of performing specified operations and may be configured or arranged in a certain manner. In an example, circuits may be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner as an IP block, IP core, system-on-chip (SoC), or the like.


In an example, the whole or part of one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware processors may be configured by firmware or software (e.g., instructions, an application portion, or an application) as a module that operates to perform specified operations. In an example, the software may reside on a machine-readable medium. In an example, the software, when executed by the underlying hardware of the module, causes the hardware to perform the specified operations. Accordingly, the term hardware module is understood to encompass a tangible entity, be that an entity that is physically constructed, specifically configured (e.g., hardwired), or temporarily (e.g., transitorily) configured (e.g., programmed) to operate in a specified manner or to perform part or all of any operation described herein.


Considering examples in which modules are temporarily configured, each of the modules need not be instantiated at any one moment in time. For example, where the modules comprise a general-purpose hardware processor configured using software, the general-purpose hardware processor may be configured as respective different modules at different times. Software may accordingly configure a hardware processor, for example, to constitute a particular module at one instance of time and to constitute a different module at a different instance of time. Modules may also be software or firmware modules, which operate to perform the methodologies described herein.


An IP block (also referred to as an IP core) is a reusable unit of logic, cell, or integrated circuit. An IP block may be used as a part of a field programmable gate array (FPGA), application-specific integrated circuit (ASIC), programmable logic device (PLD), system on a chip (SoC), or the like. It may be configured for a particular purpose, such as digital signal processing or image processing. Example IP cores include central processing unit (CPU) cores, integrated graphics, security, input/output (I/O) control, system agent, graphics processing unit (GPU), artificial intelligence, neural processors, image processing unit, communication interfaces, memory controller, peripheral device control, platform controller hub, or the like.


In some examples, the instructions are stored on storage devices of the software distribution platform in a particular format. A format of computer readable instructions includes, but is not limited to a particular code language (e.g., Java, JavaScript, Python, C, C#, SQL, HTML, etc.), and/or a particular code state (e.g., uncompiled code (e.g., ASCII), interpreted code, linked code, executable code (e.g., a binary), etc.). In some examples, the computer readable instructions stored in the software distribution platform are in a first format when transmitted to an example processor platform(s). In some examples, the first format is an executable binary in which particular types of the processor platform(s) can execute. However, in some examples, the first format is uncompiled code that requires one or more preparation tasks to transform the first format to a second format to enable execution on the example processor platform(s). For instance, the receiving processor platform(s) may need to compile the computer readable instructions in the first format to generate executable code in a second format that is capable of being executed on the processor platform(s). In still other examples, the first format is interpreted code that, upon reaching the processor platform(s), is interpreted by an interpreter to facilitate execution of instructions.


Use Cases and Additional Examples

An IPU can be hosted in any of the tiers that go from device to cloud. Any compute platform that needs connectivity can potentially include an IPU. Some examples of places where IPUs can be placed are: Vehicles; Far Edge; Data center Edge; Cloud; Smart Cameras; Smart Devices.


Some of the use cases for a distributed IPU may include the following.


1) Service orchestrator (local, shared, remote, or distributed): Power, Workload perf, ambient temp prediction and optimization tuning and service orchestration not just locally but across distributed Edge Cloud


2) Infrastructure offload (for local machine): same as traditional IPU use cases to offload network, storage, host virtualization, etc., but with additional Edge-specific usages for network security, storage, and virtualization


3) IPU as a host to augment compute capacity (using ARM/x86 cores) for running edge-specific “functions” on demand, integrated as an API/service or running as a K8s worker node for certain types of services: side car proxies, security attestation services, scrubbing traffic for SASE/L7 inspection firewall, load balancer/forward or reverse proxy, service mesh side cars (for each pod running on the local host), 5G UPF and other RAN offloads, etc.


Additional examples of the presently described method, system, and device embodiments include the following, non-limiting implementations. Each of the following non-limiting examples may stand on its own or may be combined in any permutation or combination with any one or more of the other examples provided below or throughout the present disclosure.


Example 1 is a system for orchestrating acceleration functions in a network compute mesh, comprising: a memory device configured to store instructions; and a processor subsystem, which when configured by the instructions, is operable to: access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.


In Example 2, the subject matter of Example 1 includes, wherein the artifact comprises a bitstream to program a field-programmable gate array (FPGA).


In Example 3, the subject matter of Examples 1-2 includes, wherein the artifact comprises an executable file to execute on a central processing unit (CPU).


In Example 4, the subject matter of Examples 1-3 includes, wherein the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).


In Example 5, the subject matter of Examples 1-4 includes, wherein the resources comprise a central processing unit.


In Example 6, the subject matter of Examples 1-5 includes, wherein the resources comprise a network-accessible processing unit.


In Example 7, the subject matter of Examples 1-6 includes, wherein the resources comprise a graphics processing unit.


In Example 8, the subject matter of Examples 1-7 includes, wherein the resources comprise an application specific integrated circuit (ASIC).


In Example 9, the subject matter of Examples 1-8 includes, wherein the resources comprise a field-programmable gate array (FPGA).


In Example 10, the subject matter of Examples 1-9 includes, wherein the resources comprise a coarse-grained reconfigurable array (CGRA).


In Example 11, the subject matter of Examples 1-10 includes, wherein determining the configuration comprises: analyzing a service level objective (SLO); and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.


In Example 12, the subject matter of Examples 1-11 includes, wherein determining the configuration comprises performing one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.


In Example 13, the subject matter of Examples 1-12 includes, wherein scheduling the plurality of tasks comprises: communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.


In Example 14, the subject matter of Example 13 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.


In Example 15, the subject matter of Examples 1-14 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.


Example 16 is a method for orchestrating acceleration functions in a network compute mesh, comprising: accessing a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identifying available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determining a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and scheduling, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.


In Example 17, the subject matter of Example 16 includes, wherein the artifact comprises a bitstream to program a field-programmable gate array (FPGA).


In Example 18, the subject matter of Examples 16-17 includes, wherein the artifact comprises an executable file to execute on a central processing unit (CPU).


In Example 19, the subject matter of Examples 16-18 includes, wherein the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).


In Example 20, the subject matter of Examples 16-19 includes, wherein the resources comprise a central processing unit.


In Example 21, the subject matter of Examples 16-20 includes, wherein the resources comprise a network-accessible processing unit.


In Example 22, the subject matter of Examples 16-21 includes, wherein the resources comprise a graphics processing unit.


In Example 23, the subject matter of Examples 16-22 includes, wherein the resources comprise an application specific integrated circuit (ASIC).


In Example 24, the subject matter of Examples 16-23 includes, wherein the resources comprise a field-programmable gate array (FPGA).


In Example 25, the subject matter of Examples 16-24 includes, wherein the resources comprise a coarse-grained reconfigurable array (CGRA).


In Example 26, the subject matter of Examples 16-25 includes, wherein determining the configuration comprises: analyzing a service level objective (SLO); and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.
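As a non-limiting illustration of Example 26, the following Python sketch analyzes an SLO expressed as a maximum end-to-end latency and selects an assignment that satisfies it. The function `assign_for_slo`, the candidate tuples, and the latency and power figures are hypothetical. The sketch assumes that tasks run one after another so that latencies add, and it breaks ties among feasible assignments by lowest total power; both are simplifying assumptions.

```python
from itertools import product


def assign_for_slo(candidates, max_latency_ms):
    """Return the lowest-power assignment whose summed latency meets the SLO.

    candidates: dict mapping task -> list of
                (artifact, resource, latency_ms, watts) tuples
    Returns a dict mapping task -> chosen tuple, or None if the SLO cannot be met.
    """
    tasks = list(candidates)
    best, best_power = None, float("inf")
    # Exhaustive search is acceptable only for small flowgraphs.
    for combo in product(*(candidates[t] for t in tasks)):
        latency = sum(c[2] for c in combo)
        power = sum(c[3] for c in combo)
        if latency <= max_latency_ms and power < best_power:
            best, best_power = dict(zip(tasks, combo)), power
    return best


candidates = {
    "decode": [("decode.elf", "cpu0", 4.0, 10.0)],
    "infer": [
        ("infer.elf", "cpu0", 20.0, 15.0),
        ("infer.bit", "fpga0", 5.0, 25.0),
    ],
}

tight = assign_for_slo(candidates, max_latency_ms=10.0)    # only the FPGA path fits
relaxed = assign_for_slo(candidates, max_latency_ms=30.0)  # the CPU path uses less power
```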


In Example 27, the subject matter of Examples 16-26 includes, wherein determining the configuration comprises performing one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
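As a non-limiting illustration of Example 27, the following Python sketch scores candidate configurations under several of the listed objectives. The dictionary `OBJECTIVES`, the function `pick_configuration`, the field names, and all figures are hypothetical. The sketch interprets perf/$/watt as performance divided by the product of cost and power, which is an assumption about the metric rather than a definition given in this document.

```python
# Each objective maps a configuration to a score where larger is better,
# so quantities to be minimized are negated.
OBJECTIVES = {
    "min_latency": lambda c: -c["latency_ms"],
    "min_power": lambda c: -c["watts"],
    "min_data_movement": lambda c: -c["bytes_moved"],
    "perf_per_watt": lambda c: c["ops_per_s"] / c["watts"],
    "perf_per_dollar_watt": lambda c: c["ops_per_s"]
    / (c["dollars_per_hour"] * c["watts"]),
}


def pick_configuration(configs, objective):
    """Return the configuration with the best score under the named objective."""
    return max(configs, key=OBJECTIVES[objective])


configs = [
    {"name": "fpga", "latency_ms": 5.0, "watts": 25.0, "bytes_moved": 1000,
     "ops_per_s": 1000.0, "dollars_per_hour": 2.0},
    {"name": "cpu", "latency_ms": 20.0, "watts": 10.0, "bytes_moved": 4000,
     "ops_per_s": 300.0, "dollars_per_hour": 0.5},
]

choice = pick_configuration(configs, "perf_per_watt")
```

With these illustrative figures, the two configurations rank differently under different objectives, which is why the objective may be selected per workload.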


In Example 28, the subject matter of Examples 16-27 includes, wherein scheduling the plurality of tasks comprises: communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.


In Example 29, the subject matter of Example 28 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.


In Example 30, the subject matter of Examples 16-29 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.


Example 31 is at least one machine-readable medium including instructions for orchestrating acceleration functions in a network compute mesh, which when executed by a machine, cause the machine to: access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.


In Example 32, the subject matter of Example 31 includes, wherein the artifact comprises a bitstream to program a field-programmable gate array (FPGA).


In Example 33, the subject matter of Examples 31-32 includes, wherein the artifact comprises an executable file to execute on a central processing unit (CPU).


In Example 34, the subject matter of Examples 31-33 includes, wherein the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).


In Example 35, the subject matter of Examples 31-34 includes, wherein the resources comprise a central processing unit.


In Example 36, the subject matter of Examples 31-35 includes, wherein the resources comprise a network-accessible processing unit.


In Example 37, the subject matter of Examples 31-36 includes, wherein the resources comprise a graphics processing unit.


In Example 38, the subject matter of Examples 31-37 includes, wherein the resources comprise an application specific integrated circuit (ASIC).


In Example 39, the subject matter of Examples 31-38 includes, wherein the resources comprise a field-programmable gate array (FPGA).


In Example 40, the subject matter of Examples 31-39 includes, wherein the resources comprise a coarse-grained reconfigurable array (CGRA).


In Example 41, the subject matter of Examples 31-40 includes, wherein the instructions to determine the configuration comprise the instructions to: analyze a service level objective (SLO); and assign artifacts and resources to each of the plurality of tasks to satisfy the SLO.


In Example 42, the subject matter of Examples 31-41 includes, wherein the instructions to determine the configuration comprise instructions to perform one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.


In Example 43, the subject matter of Examples 31-42 includes, wherein the instructions to schedule the plurality of tasks comprise instructions to: communicate from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.


In Example 44, the subject matter of Example 43 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.


In Example 45, the subject matter of Examples 31-44 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.


Example 46 is at least one machine-readable medium including instructions that, when executed by processing circuitry, cause the processing circuitry to perform operations to implement any of Examples 1-45.


Example 47 is an apparatus comprising means to implement any of Examples 1-45.


Example 48 is a system to implement any of Examples 1-45.


Example 49 is a method to implement any of Examples 1-45.


The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show, by way of illustration, specific embodiments that may be practiced. These embodiments are also referred to herein as “examples.” Such examples may include elements in addition to those shown or described. However, also contemplated are examples that include the elements shown or described. Moreover, also contemplated are examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof), or with respect to other examples (or one or more aspects thereof) shown or described herein.


Publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated reference(s) is supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.


In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended, that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to suggest a numerical order for their objects.


The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with others. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. However, the claims may not set forth every feature disclosed herein as embodiments may feature a subset of said features. Further, embodiments may include fewer features than those disclosed in a particular example. Thus, the following claims are hereby incorporated into the Detailed Description, with a claim standing on its own as a separate embodiment. The scope of the embodiments disclosed herein is to be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims
  • 1. A system for orchestrating acceleration functions in a network compute mesh, comprising: a memory device configured to store instructions; and a processor subsystem, which when configured by the instructions, is operable to: access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.
  • 2. The system of claim 1, wherein the artifact comprises a bitstream to program a field-programmable gate array (FPGA).
  • 3. The system of claim 1, wherein the artifact comprises an executable file to execute on a central processing unit (CPU).
  • 4. The system of claim 1, wherein the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).
  • 5. The system of claim 1, wherein the resources comprise a central processing unit.
  • 6. The system of claim 1, wherein the resources comprise a network-accessible processing unit.
  • 7. The system of claim 1, wherein the resources comprise a graphics processing unit.
  • 8. The system of claim 1, wherein the resources comprise an application specific integrated circuit (ASIC).
  • 9. The system of claim 1, wherein the resources comprise a field-programmable gate array (FPGA).
  • 10. The system of claim 1, wherein the resources comprise a coarse-grained reconfigurable array (CGRA).
  • 11. The system of claim 1, wherein determining the configuration comprises: analyzing a service level objective (SLO); and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.
  • 12. The system of claim 1, wherein determining the configuration comprises performing one or more of minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
  • 13. The system of claim 1, wherein scheduling the plurality of tasks comprises: communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.
  • 14. The system of claim 13, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.
  • 15. The system of claim 1, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.
  • 16. A method for orchestrating acceleration functions in a network compute mesh, comprising: accessing a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identifying available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determining a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and scheduling, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.
  • 17. The method of claim 16, wherein determining the configuration comprises: analyzing a service level objective (SLO); and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.
  • 18. The method of claim 16, wherein determining the configuration comprises performing one or more of minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
  • 19. The method of claim 16, wherein scheduling the plurality of tasks comprises: communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.
  • 20. The method of claim 19, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.
  • 21. The method of claim 16, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.
  • 22. At least one machine-readable medium including instructions for orchestrating acceleration functions in a network compute mesh, which when executed by a machine, cause the machine to: access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.
  • 23. The at least one machine-readable medium of claim 22, wherein the instructions to determine the configuration comprise the instructions to: analyze a service level objective (SLO); and assign artifacts and resources to each of the plurality of tasks to satisfy the SLO.
  • 24. The at least one machine-readable medium of claim 22, wherein the instructions to determine the configuration comprise instructions to perform one or more of minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
  • 25. The at least one machine-readable medium of claim 22, wherein the instructions to schedule the plurality of tasks comprise instructions to: communicate from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit.
PRIORITY CLAIM

This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/425,857, filed Nov. 16, 2022, and titled “COORDINATION OF DISTRIBUTED NETWORKED PROCESSING UNITS”, which is incorporated herein by reference in its entirety.

Provisional Applications (1)
Number Date Country
63425857 Nov 2022 US