The present disclosure pertains in general to data processing systems and in particular to technology for retaining data in cache until it has been read.
To handle input/output (IO) traffic in network communications, a data processing system may use the technology from Intel Corp. known as “Intel® Data Direct I/O Technology” or “Intel® DDIO.” For instance, when a network interface controller (NIC) in such a data processing system receives incoming data, the NIC may use Intel® DDIO to write that data directly to cache in the data processing system, thereby avoiding costly writes to and reads from memory. Other types of data processing systems may use other technologies to allow NICs or other components to write directly to cache. For purposes of this disclosure, any IO that is written directly to cache may be referred to in general as “direct to cache” (DTC) IO. Likewise, the term “DTC” may be used in general to refer to Intel® DDIO and to similar technologies from other suppliers.
Oftentimes, DTC IO is first in first out (FIFO) in nature. For instance, when a producer such as a NIC writes a sequence of IO data items to the cache, oftentimes the consumer (e.g., a processing core) will read those items in the same order as that in which they were written.
However, a data processing system may include a caching agent that uses a least recently used (LRU) algorithm or a pseudo-LRU (PLRU) algorithm to manage cache. Such algorithms or policies tend to keep recently used data in the cache. Consequently, such policies may be well suited for temporal reuse traffic. However, such policies may not be well suited to handle data traffic of a FIFO nature. For instance, if the cache is too small to hold all of the IO data supplied by the producer before the consumer begins reading that data, as the producer keeps writing data to the cache, the PLRU policy may cause the writes to wrap around the cache and overwrite some or all of the data that has not yet been read by the consumer. Consequently, some or all of the data from the producer may be evicted from the cache to memory before it has been read by the consumer. In other words, the consumer may get relatively few or no cache hits when the cache is too small to hold the data for the time between write and read.
Features and advantages of the present invention will become apparent from the appended claims, the following detailed description of one or more example embodiments, and the corresponding figures, in which:
When compared with IO technology which sends writes to memory, DTC IO may deliver lower latency of access, reduced interconnect and memory bandwidth usage, and reduced power usage by placing data directly in the cache hierarchy instead of requiring the data to first go through the memory. These benefits may be realized by central processing units (CPUs) and systems on chips (SoCs) with large monolithic dies with large caches that are shared across cores and IO, and also by architectures with disaggregated dies with separate smaller caches in core dies and in IO dies.
As indicated above, a PLRU cache management policy may lead to sub-optimal performance for traffic that is FIFO in nature, such as DTC IO. Typically, a DTC IO workload has a circular data buffer which is written and read by both IO components (such as a NIC, an infrastructure processing unit (IPU), or an accelerator) and processing cores in a FIFO manner. This pattern of processing networking packets is significantly different from the temporal reuse and locality characteristic of core-bound compute traffic for applications such as machine learning, artificial intelligence (AI), databases, etc. Cache partitioning may be used to separate out ways for DTC IO traffic versus core compute/temporal reuse traffic. However, the same PLRU cache management policy may be used across the entire cache. For instance, the caching agent may use a PLRU cache management policy that uses two status bits to denote four different ages, and that policy may be referred to as “PLRU with 2-bit (2b) quad age.” A data processing system may include a large last level cache that is shared across all cores and IO components. Such a cache may be referred to as an “aggregated/monolithic large last level cache.” A caching agent may create a partition in that cache for DTC IO traffic. Such a partition may be referred to as a “DTC IO partition.” If the DTC IO partition is large enough (e.g., if it includes a sufficient number of ways), the PLRU cache management policy may handle the DTC IO traffic well enough.
However, if the DTC IO partition is not large enough to hold all of the data that a producer (e.g., a NIC or an accelerator) writes to the partition before a consumer (e.g., a processing core or another accelerator) reads the data, the producer may end up overwriting data in that partition before the consumer has had a chance to read it. Moreover, such overwrites cause the overwritten data to be written to memory and subsequently read back from memory, thereby reducing or eliminating the benefit of writing the data directly to the cache in the first place.
According to Little's Law, the average number of items “L” in a queuing system equals the average arrival rate “A” of items to the system, multiplied by the average waiting time “W” of an item in the system. In other words, L=A*W. Accordingly, if the cache partition size (L) is smaller than that required for the desired bandwidth (data arrival rate A) given the produce-to-consume latency (waiting time W) on the system, then IO data written into a cache partition is likely to get evicted from that cache partition (and replaced with new IO writes) before the original IO data has been read.
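For purposes of illustration, the following sketch applies Little's Law to size a DTC cache partition. The arrival rate and produce-to-consume latency in the sketch are hypothetical values, chosen only to make the arithmetic concrete; they are not measurements from any particular system.

```python
# Hypothetical sizing of a DTC cache partition via Little's Law (L = A * W).
# The arrival rate and latency below are illustrative assumptions.

arrival_rate_bytes_per_s = 100e9  # A: 100 GB/s of incoming DTC IO traffic
produce_to_consume_s = 180e-6     # W: 180 microseconds from write to read

required_bytes = arrival_rate_bytes_per_s * produce_to_consume_s  # L = A * W
print(f"required partition size: {required_bytes / 1e6:.0f} MB")  # 18 MB

# If the partition is smaller than L, IO data written into it is likely
# to be evicted before the consumer reads it.
```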
Furthermore, a data processing system may include a processor package with disaggregated dies, possibly including one or more IO dies that each have their own IO caches. Such IO caches may be used not only for DTC traffic that is FIFO in nature (e.g., NIC traffic) but also by components which may have more of a temporal-locality behavior, such as CPU IO stack components (e.g., IO memory management units (IOMMUs) or features that enable virtualization of IO resources), on-CPU accelerators, or other agents which may use, for example, the interconnect known by the name or trademark of “Compute Express Link” (CXL). Consequently, a data processing system with a disaggregated die architecture may feature separate smaller caches in processing cores and in IO dies, yet those disaggregated IO caches may be required to hold a mix of traffic, including some traffic with FIFO behaviors and other traffic with temporal reuse behaviors. Hence, the issue discussed above for monolithic large caches shared across IO and cores, such as an undersized cache partition for FIFO traffic, also applies to disaggregated IO caches in a processor package with a disaggregated architecture.
The present disclosure describes a caching agent which implements a cache management policy that is suitable for FIFO traffic, in that it may prevent data that has not yet been read by the consumer from being evicted from the cache, even when the cache is full and the producer is still providing more data. In other words, the cache management policy provides for FIFO traffic retention (FTR). Accordingly, for purposes of this disclosure, this type of cache management policy may be referred to as an “FTR policy.” In particular, as described in greater detail below, the cache management policy provides for the aging of cache ways that contain DTC data, and for retaining the data until the data has been read or until the way containing the data has reached a predetermined maximum age.
By implementing an FTR policy, the caching agent may prevent new lines from replacing old lines before the old lines have been read by a consumer, thereby giving old lines a chance to be read after a fill-to-read latency longer than the latency that the cache size would otherwise cover at a given bandwidth. Consequently, the FTR policy may enable the caching agent to realize at least some of the cache hit benefits of DTC IO. By contrast, in an undersized cache scenario, DTC IO into a cache (or a cache partition) that is managed with the traditional PLRU policy may produce no cache hit benefits.
Additionally, the caching agent may divide the cache into multiple partitions, and the caching agent may use different caching algorithms for different partitions. For instance, the caching agent may divide the cache into one or more PLRU partitions and one or more FTR partitions, the caching agent may use a PLRU algorithm to manage the PLRU partitions, and the caching agent may use an FTR algorithm to manage the FTR partitions.
For purposes of this disclosure, all of the ways in all of the sets of a cache which belong to a particular partition are collectively referred to as a “global partition.” In addition, depending on context, the term “partition” may be used to refer to a global partition or to the portion of a global partition that resides in a cache set. For instance, the phrase “partition in a set” may be used to refer to the portion of a global partition that resides in a set. Similarly, all of the ways in a particular set which belong to the same global partition may be referred to collectively as a “partition.”
In the embodiment of
In the embodiment of
In an alternative embodiment, a processor package includes one or more processing cores, one or more IO units, and a monolithic uncore. The uncore includes a cache cluster with a large monolithic cache that is shared across all of the cores and IO units. The uncore also includes a caching agent which uses an FTR policy to manage one or more FTR partitions in the large monolithic cache. A processor package with such a caching agent is described in greater detail below (e.g., with regard to
In the embodiment of
In addition to caching agent 36, IO cluster 30 also includes an IO cache 32. Caching agent 36 may configure IO cache 32 according to various cache configuration settings, such as FTR settings 18, some or all of which may reside in any suitable location or locations within (or outside of) data processing system 10, as indicated above. IO cluster 30 may also include one or more accelerators and a home agent/memory controller. In an alternative embodiment, one or more accelerators and/or a home agent/memory controller reside in one or more compute clusters.
In one embodiment or scenario, FTR settings 18 specify the number and type of global partitions to be implemented or instantiated within IO cache 32. In particular, FTR settings 18 may specify how many global pseudo-LRU (PLRU) partitions are to be instantiated and how many global FTR partitions are to be instantiated. FTR settings 18 may also specify how many ways are to be assigned to each partition. Caching agent 36 may configure IO cache 32 to include one or more global PLRU cache partitions and one or more global FTR cache partitions, according to FTR settings 18.
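For purposes of illustration, such settings might be modeled in software as follows. The type and field names in this sketch are hypothetical and do not correspond to any actual register layout or product interface.

```python
from dataclasses import dataclass, field

# Illustrative model of FTR settings; all names here are hypothetical.
@dataclass
class PartitionConfig:
    policy: str        # "PLRU" or "FTR"
    num_ways: int      # ways assigned to this global partition
    max_age: int = 3   # for FTR partitions, e.g., 2 = AGED, 3 = TWICE-AGED

@dataclass
class FtrSettings:
    partitions: list[PartitionConfig] = field(default_factory=list)

# One global PLRU partition plus two global FTR partitions, four ways each,
# matching the example scenario described later in this disclosure.
settings = FtrSettings(partitions=[
    PartitionConfig(policy="PLRU", num_ways=4),
    PartitionConfig(policy="FTR", num_ways=4, max_age=3),  # FTR partition A
    PartitionConfig(policy="FTR", num_ways=4, max_age=3),  # FTR partition B
])
```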
In the embodiment of
In the embodiment of
In addition, each line (or way) in each FTR partition is associated with (or is connected to, or contains) two status bits which serve as an age attribute, and the caching agent does not evict a line (or way) from the cache until that line (or way) has either been read or has reached a maximum age. In addition, for PLRU partitions, the caching agent may use those two status bits to denote four different ages for a cache management policy such as PLRU with 2b quad age. In the embodiment of
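In a software model of such a policy, the four ages that two status bits can encode might be represented as follows; the names mirror the age values used throughout this disclosure:

```python
from enum import IntEnum

# The four ages encodable in two status bits, as used by the FTR policy.
class Age(IntEnum):
    FREE = 0        # way is available: never filled, already read, or reset
    NEW = 1         # way holds DTC data that has not yet been read
    AGED = 2        # unread data that has survived one aging event
    TWICE_AGED = 3  # unread data that has survived two aging events
```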
Also, caching agent 36 may use a different state machine to manage each partition. For instance, caching agent 36 may use state machine 38 to manage PLRU partition 34, state machine 38A to manage FTR partition A, and state machine 38B to manage FTR partition B. Additionally, the age attributes (or simply “ages”) of each way in a partition may be part of a state machine for that partition. In particular, caching agent 36 uses a state machine to manage the age attributes for each FTR partition according to a predetermined policy or algorithm. Such a state machine, and a process for implementing such a policy or algorithm, are described below respectively, with regard to
In an example scenario, FTR settings 18 specify that IO cache 32 is to include a PLRU partition with four ways and two FTR partitions with four ways each, and that the max age for each FTR partition is to be TWICE-AGED. Accordingly, at block 110, caching agent 36 configures IO cache 32 with PLRU partition 34 and FTR partitions A and B, and caching agent 36 instantiates respective state machines 38, 38A, and 38B. And when instantiating the FTR state machines (e.g., state machine 38A), caching agent 36 may initialize the age of each way in the partition to FREE (which may be represented by the value 0, for instance). Caching agent 36 may then wait for read or write operations involving any of the partitions, as shown at block 112. For the purpose of illustration, the description below focuses on operations involving set 0 in FTR partition A. However, caching agent 36 may perform the same kinds of operations with the other sets in FTR partition A and with FTR partition B.
After caching agent 36 has configured IO cache 32 and the corresponding state machines, when a producer (e.g., an IO component such as a NIC or an accelerator) writes to an FTR partition, caching agent 36 may write that data into a FREE way (if one exists), while incrementing the age attribute of that way to NEW (which may be represented by the value 1, for instance). In particular, as shown at blocks 120, 130, and 132, on a write miss (“no” branch from block 130) when a way in the corresponding set in FTR partition A has the age of FREE (“yes” branch from block 140), caching agent 36 updates (or writes to) that way with the data from the IO write, and caching agent 36 updates the age of that way to NEW.
Also, as described in greater detail below, if a producer performs a write operation when FTR partition A is full, and the write is a miss, and any corresponding ways are NEW, caching agent 36 updates one of those ways to AGED (which may be represented by the value 2, for instance). However, caching agent 36 will not write the data to the cache, but will instead write the data to memory, thereby allowing older data to stay in the cache. And caching agent 36 may respond to additional write misses similarly. Furthermore, caching agent 36 may provide for a maximum age, and once all of the ways of a partition in a set have reached the maximum age, the caching agent may reset the age attribute for all of those ways to “FREE,” to prevent lines from being retained perpetually.
However, in an alternative embodiment, the caching agent implements an FTR policy by designating one of the ways in the partition of each set as a reserved way or a staging way. And when the ways in a partition in a set are full, instead of writing new data to memory, the caching agent writes the new data to the staging way for that set, thereby causing the old data from the staging way to get evicted to memory, while allowing the data in all of the other ways in the partition in the set to be retained.
However, referring again to
However, as shown at block 142 of
Referring again to the embodiment of
However, as shown at block 160, if none of the ways (in the relevant partition in the relevant set) have the age of NEW, caching agent 36 determines whether any of the ways have the age of AGED. If any of the ways has the age of AGED, caching agent 36 updates the age of one of those ways to TWICE-AGED (which may be represented by the value 3, for instance), as shown at block 162.
Thus, when the partition is full, caching agent 36 ages the ways in that partition, incrementing the age of one way in the partition each time the data from a write miss gets redirected to memory (or, in an alternative embodiment, to the staging way). After caching agent 36 sets the age of a way to TWICE-AGED, the process may then pass through page connector C to
However, referring again to block 160, on a write miss, if none of the ways have the age of AGED (or FREE or NEW), then caching agent 36 may conclude that all of the ways have an age of TWICE-AGED. And as indicated earlier, in the example scenario, FTR settings 18 specify a max age of TWICE-AGED. Alternatively, as shown at blocks 162 and 170, caching agent 36 may check whether all of the ways have the max age each time caching agent 36 sets one of the ways to TWICE-AGED. As shown at block 172, if all of the ways are at the max age, caching agent 36 updates the ages for all of the ways to FREE.
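The write path described above may be summarized in software as follows. This is a minimal sketch, assuming the Age encoding shown earlier and a default max age of TWICE-AGED; tag matching, set selection, and the actual movement of data are abstracted away, and the function merely reports where the data would go.

```python
# Minimal sketch of the basic FTR write path for one set of an FTR
# partition. Tag matching and data movement are abstracted away.

def handle_write(ways: list[Age], hit_way: int | None,
                 max_age: Age = Age.TWICE_AGED) -> str:
    """Apply one write to the set; return a way index or 'memory'."""
    if hit_way is not None:              # write hit: update the way in place
        ways[hit_way] = Age.NEW
        return f"way {hit_way}"
    for i, age in enumerate(ways):       # write miss: fill a FREE way, if any
        if age == Age.FREE:
            ways[i] = Age.NEW
            return f"way {i}"
    # The partition in the set is full: redirect the write to memory and
    # advance the age of one way (a NEW way first, otherwise an AGED way).
    for target in (Age.NEW, Age.AGED):
        if target < max_age and target in ways:
            ways[ways.index(target)] = Age(target + 1)
            break
    if all(age >= max_age for age in ways):  # all ways at max age: reset
        for i in range(len(ways)):
            ways[i] = Age.FREE
    return "memory"
```

In this sketch, a write miss to a full set never fills the cache; the miss only ages one way and sends the new data to memory, which is what allows older unread lines to be retained.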
The process of
Also, as shown at block 120, if caching agent 36 has not received a write operation, the process may pass through page connector B to block 210, and caching agent 36 may determine whether it has received a read operation. And if caching agent 36 has received a read operation, caching agent 36 may determine whether the read hits any of the ways in FTR partition A, as shown at block 220. On a read hit, caching agent 36 may read the data from the indicated way, and caching agent 36 may update the age of that way to FREE, as shown at block 222.
However, referring again to block 220, on a read miss, caching agent 36 may satisfy the read from another source (e.g., from RAM or from another cache) without writing that data to IO cache 32.
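The read path described above may be summarized in a sketch as well, under the same assumptions as the write-path sketch:

```python
# Minimal sketch of the FTR read path for one set, under the same
# assumptions as the write-path sketch above.

def handle_read(ways: list[Age], hit_way: int | None) -> str:
    if hit_way is not None:       # read hit: the data has been consumed,
        ways[hit_way] = Age.FREE  # so the way becomes available again
        return f"hit in way {hit_way}"
    # Read miss: the read is satisfied from another source (e.g., RAM or
    # another cache) without allocating into the FTR partition.
    return "miss: serviced from another source"
```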
However, referring again to block 210, if caching agent 36 has not received a read request, caching agent 36 may determine whether it has received a reconfiguration request, as shown at block 212. For instance, caching agent 36 may receive such a reconfiguration request from OS 17 or from an external out-of-band (OOB) management agent. If reconfiguration has not been requested, the process may return to block 120 via page connector A. However, if reconfiguration has been requested, the process may return to block 110 via page connector D, and caching agent 36 may then modify the configuration of IO cache 32 and any corresponding state machines according to the new FTR settings or other new caching parameters associated with the reconfiguration request. Caching agent 36 may then continue to process reads, writes, and reconfiguration requests as indicated above, but in accordance with the new parameters.
However, as indicated above, in another embodiment or scenario, FTR settings 18 may specify a different max age, such as ONCE-AGED or simply AGED. In that case, caching agent 36 may perform operations like those described above but modified according to the specified max age. For instance, when max age is AGED, caching agent 36 may use a state machine like the one in
Furthermore, as indicated above, in another embodiment, the caching agent implements an FTR policy that processes a write miss to a full FTR partition by writing the new data to a predetermined staging way in the partition, rather than writing the new data to memory. In particular, when configuring the IO cache to include an FTR partition, the caching agent reserves one of the ways in each set in the FTR partition to serve as a staging way. Accordingly, the reserved way in such a partition may be referred to as the “staging way,” and the other ways in the partition may be referred to as the “regular ways.” Subsequently, when processing write operations, if the partition is full, the caching agent writes data from write misses to the staging way, instead of writing that data to memory. However, the caching agent still retains the data in the regular ways until a way has been read or until all of the ways have reached the max age. For purposes of this disclosure, such a cache management policy may be referred to as an “FTR with staging policy.” Such a cache management policy may be beneficial in a data processing system that does not provide the caching agent with a mechanism or path for redirecting DTC writes to memory, or in a data processing system with a caching agent that (a) makes the decision to allocate a write into cache before the way age is known and that (b) cannot then cancel the allocation and divert the new write to memory. An example embodiment of a data processing system with a caching agent that implements a cache management policy of FTR with staging is described in greater detail below with regard to
For the purpose of illustration, the description below focuses on operations involving set 0 of FTR partition C. However, the caching agent may perform the same kinds of operations on the other sets in FTR partition C and on FTR partition D.
When processing reads, the caching agent may operate like caching agent 36. For instance, for a read hit on any of the ways in FTR partition C, the caching agent may read the data from the indicated way and reset that way to FREE. And on a read miss, the caching agent may satisfy the read from another source (e.g., from RAM or from another cache) without writing that data to the IO cache.
When processing a write hit to any of the ways in FTR partition C, the caching agent may update the data in that way with the data from the write request, and the caching agent may update the age of that way to NEW.
And when processing a write miss involving FTR partition C, if none of the ways in the relevant set are FREE, the caching agent may write the data to the staging way for that set (e.g., staging way 338C), thereby retaining the data in the regular ways for that set (e.g., regular ways 336C). And in conjunction with that write to the staging way, the caching agent may update the age of the staging way to be the max age, and the caching agent may increment the age of a regular way in that set, if any of those ways is less than the max age (starting with a NEW way or, if there is none, an AGED way). And when all of the ways in the set reach the max age, the caching agent may reset all of those ways to FREE.
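For purposes of illustration, the write path under the FTR with staging policy might look as follows. This sketch makes the same assumptions as the earlier sketches and arbitrarily treats the last way of the set as the staging way.

```python
# Sketch of the FTR-with-staging write path for one set. The last way is
# treated as the staging way; that choice is arbitrary in this sketch.

def handle_write_with_staging(ways: list[Age], hit_way: int | None,
                              max_age: Age = Age.TWICE_AGED) -> str:
    staging = len(ways) - 1                  # the reserved staging way
    if hit_way is not None:                  # write hit: update in place
        ways[hit_way] = Age.NEW
        return f"way {hit_way}"
    for i in range(staging):                 # fill a FREE regular way, if any
        if ways[i] == Age.FREE:
            ways[i] = Age.NEW
            return f"way {i}"
    # Regular ways are full: write to the staging way (evicting its old
    # contents to memory), mark it max age, and age one regular way.
    ways[staging] = max_age
    regular = ways[:staging]
    for target in (Age.NEW, Age.AGED):
        if target < max_age and target in regular:
            ways[regular.index(target)] = Age(target + 1)
            break
    if all(age >= max_age for age in ways):  # all ways at max age: reset
        for i in range(len(ways)):
            ways[i] = Age.FREE
    return f"staging way {staging}"
```

Here the new data always lands in the cache (in the staging way), so no separate path to memory is needed for redirected writes.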
The FTR policy illustrated in
Also, for any of the cache management policies described above, the caching agent may always use ways that are marked as INVALID as the top priority for replacement. In other words, when processing a write miss, the caching agent may always fill to an INVALID way before filling to any VALID way, without regard to the ages of the ways. Accordingly, when a way is invalidated for some reason, the caching agent may not update the age of that way to FREE (or 0), because INVALID ways will be top priority. However, if there is no INVALID way in the relevant set in an FTR partition, the caching agent may then consider the ages of the ways in that set and make decisions according to the policy set forth above with regard to
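If validity is modeled as a flag that is separate from the age attribute (an assumption of this sketch, since this disclosure does not specify how VALID/INVALID state is stored), the fill-priority rule described above might be expressed as follows:

```python
# Victim selection that gives INVALID ways top priority for replacement,
# regardless of age; validity is modeled as a separate flag here.

def pick_fill_way(ways: list[Age], valid: list[bool]) -> int | None:
    for i, v in enumerate(valid):
        if not v:              # an INVALID way always wins
            return i
    for i, age in enumerate(ways):
        if age == Age.FREE:    # otherwise, fall back to the age-based policy
            return i
    return None                # no fillable way; proceed per the FTR policy
```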
For the basic FTR policy with a max age of TWICE-AGED (or 3), the meanings of the different ages may be summarized as follows:
FREE (0): The way is available to receive new data, because it has never been filled, because its data has been read, or because its age has been reset.
NEW (1): The way holds DTC data that has been written but not yet read.
AGED (2): The way holds unread data that has survived one aging event (i.e., one write miss has been redirected while the partition in the set was full).
TWICE-AGED (3): The way holds unread data that has survived two aging events; in this example, TWICE-AGED is the max age.
Similarly, the basic FTR policy with a max age of TWICE-AGED (or 3) may be summarized as follows:
On a write hit, update the data in the way and set the age of that way to NEW.
On a write miss, if the partition in the set has a FREE way, fill that way and set its age to NEW.
On a write miss, if the partition in the set has no FREE way, redirect the write to memory and advance the age of one way (from NEW to AGED if any way is NEW, otherwise from AGED to TWICE-AGED).
On a read hit, supply the data and reset the age of that way to FREE.
On a read miss, satisfy the read from another source without allocating into the partition.
Once all of the ways in the partition in the set reach the max age, reset the ages of all of those ways to FREE.
Similarly, the state transitions for the basic FTR policy may be summarized as follows:
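For purposes of illustration, those transitions might also be expressed as a small transition table; the event labels below are informal names for the conditions described above, not terms used elsewhere in this disclosure:

```python
# Informal transition table for the basic FTR policy; the event labels
# are descriptive names for the conditions discussed above.
FTR_TRANSITIONS = {
    (Age.FREE, "fill on write miss"): Age.NEW,
    (Age.NEW, "write hit"): Age.NEW,
    (Age.NEW, "read hit"): Age.FREE,
    (Age.NEW, "aging event (redirected write miss)"): Age.AGED,
    (Age.AGED, "read hit"): Age.FREE,
    (Age.AGED, "aging event (redirected write miss)"): Age.TWICE_AGED,
    (Age.TWICE_AGED, "read hit"): Age.FREE,
    (Age.TWICE_AGED, "all ways at max age (global reset)"): Age.FREE,
}
```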
The following example scenarios describe the performance of a caching agent under various different system configurations, each of which involves an optimum DTC FTR cache partition size of 18 MB, according to Little's Law (throughput * latency):
Table 1 below illustrates how ages change in an example scenario involving a 1-set, 8-way FTR partition in an IO cache (IO$), where all the ways reach the max age of three and all the way ages get reset. This scenario is like Scenario 4 above, in that an eight-way cache provides a maximum of three times coverage with a max age of three. For instance, when no read has happened after 24 writes, older lines start getting replaced even though they have not been read yet. If the latency is so high that reads only happen beyond this point, the hit rate will be about zero percent. Also, in this scenario, when there are multiple candidate ways for replacement, the caching agent chooses the lowest way number.
Table 2 below illustrates how ages change in an example scenario involving a 1-set 8-way FTR partition in an IO$ with a max age of two. Also, in this scenario, an IO agent starts reading the FIFO traffic writes before the full set (i.e., all ways in the set) hits max age. For instance, the read to the 1st write happens after 12 writes. Had the caching agent been using a PLRU policy, the 9th write would have replaced the 1st write, which would have resulted in a miss for the read for the 1st write. And misses would have happened for the next two reads as well, with a PLRU policy. But because the FTR policy retains the first eight lines written longer, the read now gets a hit in the FTR partition. And the reads cause the caching agent to reset the ages of the ways as they are read, enabling new writes to go into those ways rather than memory.
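The scenario of Table 2 can be approximated by driving the earlier sketches, as shown below. The tag handling and way-selection order here are assumptions of the sketch, so the driver demonstrates the retention behavior rather than reproducing the table exactly.

```python
# Illustrative driver approximating the Table 2 scenario: a 1-set, 8-way
# FTR partition with a max age of two (AGED), where the read to the 1st
# write happens after 12 writes. Tag handling here is an assumption.

ways = [Age.FREE] * 8
tags = [None] * 8                          # which line each way holds

def write_line(line: int) -> None:
    hit = tags.index(line) if line in tags else None
    dest = handle_write(ways, hit, max_age=Age.AGED)
    if dest.startswith("way"):             # record the tag on a cache fill
        tags[int(dest.split()[1])] = line

def read_line(line: int) -> str:
    hit = tags.index(line) if line in tags else None
    return handle_read(ways, hit)

for line in range(1, 13):                  # 12 writes before any read
    write_line(line)
print(read_line(1))                        # "hit in way 0": line 1 retained
# Under a PLRU policy, the 9th write would have replaced line 1,
# so this read would have missed.
```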
In some embodiments, the caching agent decides whether or not to allocate (or fill) after determining the way age. However, in other embodiments, the allocation decision may be made much earlier in the pipeline (e.g., based on the transaction opcode type), and the way age may be determined much later in the pipeline. In such embodiments, other implementation options may be used. For example, if it is possible to cancel the decision to allocate later (e.g., after the way age is determined and it indicates no fill), the caching agent may cancel the cache allocation and send the write to memory at that point. Also, this sending of the new write to memory could be emulated by sending the new write out as though an eviction had happened due to the allocation.
Alternatively, as indicated above, a caching agent may implement a policy of FTR with staging, with a reserved way in the cache for staging new writes to memory. Tables 3 and 4 below illustrate how the FTR with staging policy works for the scenarios illustrated above in Tables 1 and 2, respectively, with way 7 being the reserved way. As shown, the reserved way reduces the effective cache size, causing the reset to happen earlier in the example in Table 3, compared to Table 1. And one fewer line is retained at the end of the series in the example in Table 4, compared to Table 2. However, the caching agent continues to get some hits in the example in Table 4, which is an improvement relative to the PLRU policy.
As has been described, a processor package includes a cache and a caching agent that manages an FTR partition in the cache using an FTR policy. Accordingly, the caching agent may retain data in a set of the FTR partition when that set is full, the data in the cache has not yet been read by the consumer, and the producer is still providing more data. Rather than replacing data in the set with the new data from the producer, the caching agent may redirect the new data to memory, or the caching agent may direct the new data to a reserved way in the set. Thus, the caching agent may receive a first sequence of DTC write operations that are write misses to a partition in a set in a cache. In response to receiving that sequence of DTC write operations, the caching agent may write the data from the first sequence of DTC write operations to the partition in the set. In one scenario, the partition in the set comprises W ways, where W is greater than two, and the first sequence of DTC write operations includes enough data to fill the partition in the set. Accordingly, the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises writing data to each of the ways in the partition in the set (i.e., filling the partition in the set). Furthermore, before any data from the partition in the set has been read, the caching agent may receive a second sequence of at least two DTC write operations that are write misses to the partition in the set. In response, the caching agent may complete the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set. For instance, if W is eight, the caching agent may retain the data from the first sequence of DTC write operations in at least seven of those ways. For example, the caching agent may complete the DTC write operations in the second sequence by redirecting the writes to memory, thereby retaining the data from the first sequence in all eight of the ways, or by directing all of the writes from the second sequence to a particular reserved way among the eight ways, thereby retaining the data from the first sequence in the other seven ways.
Example Computer Architectures.
Detailed below are descriptions of example computer architectures. Other system designs and configurations known in the arts for laptop, desktop, and handheld personal computers (PCs), personal digital assistants, engineering workstations, servers, disaggregated servers, network devices, network hubs, switches, routers, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, hand-held devices, and various other electronic devices are also suitable. In general, a variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.
Processors 510 and 520 are shown including integrated memory controller (IMC) circuitry 512 and 522, respectively. Processor 510 also includes interface circuits 514 and 516; similarly, second processor 520 includes interface circuits 524 and 526. Processors 510 and 520 may exchange information via the interface 502 using interface circuits 516, 526. IMCs 512 and 522 couple the processors 510, 520 to respective memories, namely a memory 530 and a memory 540, which may be portions of main memory locally attached to the respective processors.
Processors 510, 520 may each exchange information with a network interface (NW I/F) 550 via individual interfaces 511, 521 using interface circuits 514, 556, 524, 558. The network interface 550 (e.g., one or more of an interconnect, bus, and/or fabric, and in some examples is a chipset) may optionally exchange information with a coprocessor 560 via an interface circuit 552. In some examples, the coprocessor 560 is a special-purpose processor, such as, for example, a high-throughput processor, a network or communication processor, compression engine, graphics processor, general purpose graphics processing unit (GPGPU), neural-network processing unit (NPU), embedded processor, or the like.
A shared cache (not shown) may be included in either processor 510, 520 or outside of both processors, yet connected with the processors via an interface such as a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Network interface 550 may be coupled to a first interface 562 via interface circuit 554. In some examples, first interface 562 may be an interface such as a Peripheral Component Interconnect (PCI) interconnect, a PCI Express interconnect or another I/O interconnect. In some examples, first interface 562 is coupled to a power control unit (PCU) 563, which may include circuitry, software, and/or firmware to perform power management operations with regard to the processors 510, 520 and/or co-processor 560. PCU 563 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate the appropriate regulated voltage. PCU 563 also provides control information to control the operating voltage generated. In various examples, PCU 563 may include a variety of power management logic units (circuitry) to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and which may be triggered by workload and/or power, thermal or other processor constraints) and/or the power management may be performed responsive to external sources (such as a platform or power management source or system software).
PCU 563 is illustrated as being present as logic separate from the processor 510 and/or processor 520. In other cases, PCU 563 may execute on a given one or more of cores (not shown) of processor 510 or 520. In some cases, PCU 563 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code, sometimes referred to as P-code. In yet other examples, power management operations to be performed by PCU 563 may be implemented externally to a processor, such as by way of a separate power management integrated circuit (PMIC) or another component external to the processor. In yet other examples, power management operations to be performed by PCU 563 may be implemented within BIOS or other system software.
Various I/O devices 564 may be coupled to first interface 562, along with a bus bridge 565 which couples first interface 562 to a second interface 570. In some examples, one or more additional processor(s) 566, such as coprocessors, high throughput many integrated core (MIC) processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays (FPGAs), or any other processor, are coupled to first interface 562. In some examples, second interface 570 may be a low pin count (LPC) interface. Various devices may be coupled to second interface 570 including, for example, a keyboard and/or mouse 572, communication devices 573 and storage circuitry 574. Storage circuitry 574 may be one or more non-transitory machine-readable storage media as described below, such as a disk drive or other mass storage device which may include instructions/code and data 575 and may implement the storage 16 in some examples. Further, an audio I/O 576 may be coupled to second interface 570. Note that other architectures than the point-to-point architecture described above are possible. For example, instead of the point-to-point architecture, a system such as multiprocessor system 500 may implement a multi-drop interface or other such architecture.
Example Core Architectures, Processors, and Computer Architectures.
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high-performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip (SoC) that may be included on the same die as the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Example core architectures are described next, followed by descriptions of example processors and computer architectures.
Thus, different implementations of the processor 600 may include: 1) a CPU with the special purpose logic 608 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), and the cores 602(A)-(N) being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 602(A)-(N) being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 602(A)-(N) being a large number of general purpose in-order cores. Thus, the processor 600 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor 600 may be implemented on one or more chips, e.g., as part of a processor package. The processor 600 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, complementary metal oxide semiconductor (CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (PMOS), or N-type metal oxide semiconductor (NMOS).
A memory hierarchy includes one or more levels of cache unit(s) circuitry 604(A)-(N) within the cores 602(A)-(N), a set of one or more shared cache unit(s) circuitry 606, and external memory (not shown) coupled to the set of integrated memory controller unit(s) circuitry 614. The set of one or more shared cache unit(s) circuitry 606 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as an LLC, and/or combinations thereof. While in some examples interface network circuitry 612 (e.g., a ring interconnect) interfaces the special purpose logic 608 (e.g., integrated graphics logic), the set of shared cache unit(s) circuitry 606, and the system agent unit circuitry 610, alternative examples use any number of well-known techniques for interfacing such units. In some examples, coherency is maintained between one or more of the shared cache unit(s) circuitry 606 and cores 602(A)-(N). In some examples, interface controller unit(s) circuitry 616 couple the cores 602 to one or more other devices 618 such as one or more I/O devices, storage, one or more communication devices (e.g., wireless networking, wired networking, etc.), etc.
In one embodiment, the shared cache unit(s) 606 is a large monolithic LLC (e.g., an L3 cache) that is shared across all cores and IO in processor package 600. Accordingly, the processor 600 may be referred to as a “monolithic processor.” Also, one or more of the I/O devices use DTC IO to write to the shared cache unit(s) 606. Also, the processor 600 includes a caching agent 615 which uses techniques like those described above with regard to caching agent 36 of
Alternatively, as indicated above, in other embodiments caching agents which create and manage FTR partitions may be implemented within a disaggregated processor that includes a core cluster and an IO cluster that is separate from the core cluster.
Referring again to
The cores 602(A)-(N) may be homogenous in terms of instruction set architecture (ISA). Alternatively, the cores 602(A)-(N) may be heterogeneous in terms of ISA; that is, a subset of the cores 602(A)-(N) may be capable of executing an ISA, while other cores may be capable of executing only a subset of that ISA or another ISA.
Example Core Architectures—In-order and out-of-order core block diagram.
By way of example, the example register renaming, out-of-order issue/execution architecture core of
The front-end unit circuitry 730 may include branch prediction circuitry 732 coupled to instruction cache circuitry 734, which is coupled to an instruction translation lookaside buffer (TLB) 736, which is coupled to instruction fetch circuitry 738, which is coupled to decode circuitry 740. In one example, the instruction cache circuitry 734 is included in the memory unit circuitry 770 rather than the front-end circuitry 730. The decode circuitry 740 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode circuitry 740 may further include address generation unit (AGU, not shown) circuitry. In one example, the AGU generates an LSU address using forwarded register ports, and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). The decode circuitry 740 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one example, the core 790 includes a microcode ROM (not shown) or other medium that stores microcode for certain macroinstructions (e.g., in decode circuitry 740 or otherwise within the front-end circuitry 730). In one example, the decode circuitry 740 includes a micro-operation (micro-op) or operation cache (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during the decode or other stages of the processor pipeline 700. The decode circuitry 740 may be coupled to rename/allocator unit circuitry 752 in the execution engine circuitry 750.
The execution engine circuitry 750 includes the rename/allocator unit circuitry 752 coupled to retirement unit circuitry 754 and a set of one or more scheduler(s) circuitry 756. The scheduler(s) circuitry 756 represents any number of different schedulers, including reservation stations, central instruction window, etc. In some examples, the scheduler(s) circuitry 756 can include arithmetic logic unit (ALU) scheduler/scheduling circuitry, ALU queues, address generation unit (AGU) scheduler/scheduling circuitry, AGU queues, etc. The scheduler(s) circuitry 756 is coupled to the physical register file(s) circuitry 758. Each of the physical register file(s) circuitry 758 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one example, the physical register file(s) circuitry 758 includes vector registers unit circuitry, writemask registers unit circuitry, and scalar register unit circuitry. These register units may provide architectural vector registers, vector mask registers, general-purpose registers, etc. The physical register file(s) circuitry 758 is coupled to the retirement unit circuitry 754 (also known as a retire queue or a retirement queue) to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) (ROB(s)) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using a register maps and a pool of registers; etc.). The retirement unit circuitry 754 and the physical register file(s) circuitry 758 are coupled to the execution cluster(s) 760. The execution cluster(s) 760 includes a set of one or more execution unit(s) circuitry 762 and a set of one or more memory access circuitry 764. The execution unit(s) circuitry 762 may perform various arithmetic, logic, floating-point or other types of operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some examples may include a number of execution units or execution unit circuitry dedicated to specific functions or sets of functions, other examples may include only one execution unit circuitry or multiple execution units/execution unit circuitry that all perform all functions. The scheduler(s) circuitry 756, physical register file(s) circuitry 758, and execution cluster(s) 760 are shown as being possibly plural because certain examples create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline that each have their own scheduler circuitry, physical register file(s) circuitry, and/or execution cluster—and in the case of a separate memory access pipeline, certain examples are implemented in which only the execution cluster of this pipeline has the memory access unit(s) circuitry 764). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
In some examples, the execution engine unit circuitry 750 may perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AMB) interface (not shown), and address phase and writeback, data phase load, store, and branches.
The set of memory access circuitry 764 is coupled to the memory unit circuitry 770, which includes data TLB circuitry 772 coupled to data cache circuitry 774 coupled to level 2 (L2) cache circuitry 776. In one example, the memory access circuitry 764 may include load unit circuitry, store address unit circuitry, and store data unit circuitry, each of which is coupled to the data TLB circuitry 772 in the memory unit circuitry 770. The instruction cache circuitry 734 is further coupled to the level 2 (L2) cache circuitry 776 in the memory unit circuitry 770. In one example, the instruction cache 734 and the data cache 774 are combined into a single instruction and data cache (not shown) in L2 cache circuitry 776, level 3 (L3) cache circuitry (not shown), and/or main memory. The L2 cache circuitry 776 is coupled to one or more other levels of cache and eventually to a main memory.
The core 790 may support one or more instructions sets (e.g., the x86 instruction set architecture (optionally with some extensions that have been added with newer versions); the MIPS instruction set architecture; the ARM instruction set architecture (optionally with optional additional extensions such as NEON)), including the instruction(s) described herein. In one example, the core 790 includes logic to support a packed data instruction set architecture extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
Example Execution Unit(s) Circuitry.
Example Register Architecture.
In some examples, the register architecture 1000 includes writemask/predicate registers 1015. For example, in some examples, there are 8 writemask/predicate registers (sometimes called k0 through k7) that are each 16-bit, 32-bit, 64-bit, or 128-bit in size. Writemask/predicate registers 1015 may allow for merging (e.g., allowing any set of elements in the destination to be protected from updates during the execution of any operation) and/or zeroing (e.g., zeroing vector masks allow any set of elements in the destination to be zeroed during the execution of any operation). In some examples, each data element position in a given writemask/predicate register 1015 corresponds to a data element position of the destination. In other examples, the writemask/predicate registers 1015 are scalable and consist of a set number of enable bits for a given vector element (e.g., 8 enable bits per 64-bit vector element).
The register architecture 1000 includes a plurality of general-purpose registers 1025. These registers may be 16-bit, 32-bit, 64-bit, etc. and can be used for scalar operations. In some examples, these registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
In some examples, the register architecture 1000 includes scalar floating-point (FP) register file 1045 which is used for scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set architecture extension or as MMX registers to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
One or more flag registers 1040 (e.g., EFLAGS, RFLAGS, etc.) store status and control information for arithmetic, compare, and system operations. For example, the one or more flag registers 1040 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some examples, the one or more flag registers 1040 are called program status and control registers.
Segment registers 1020 contain segment pointers for use in accessing memory. In some examples, these registers are referenced by the names CS, DS, SS, ES, FS, and GS.
Machine specific registers (MSRs) 1035 control and report on processor performance. Most MSRs 1035 handle system-related functions and are not accessible to an application program. Machine check registers 1060 consist of control, status, and error reporting MSRs that are used to detect and report on hardware errors.
One or more instruction pointer register(s) 1030 store an instruction pointer value. Control register(s) 1055 (e.g., CR0-CR4) determine the operating mode of a processor (e.g., processor 510, 520, 560, 566, and/or 600) and the characteristics of a currently executing task. Debug registers 1050 control and allow for the monitoring of a processor or core's debugging operations.
Memory (mem) management registers 1065 specify the locations of data structures used in protected mode memory management. These registers may include a global descriptor table register (GDTR), interrupt descriptor table register (IDTR), task register, and a local descriptor table register (LDTR) register.
Alternative examples may use wider or narrower registers. Additionally, alternative examples may use more, fewer, or different register files and registers. The register architecture 1000 may, for example, be used in physical register file(s) circuitry 758.
Example A1 is a method for managing cache. The method comprises, in response to receiving a first sequence of DTC write operations that are write misses to a partition in a set in a cache, writing data from the first sequence of DTC write operations to the partition in the set, wherein the partition in the set comprises W ways, W is greater than two, and the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises writing data from the first sequence of DTC write operations to all W ways in the partition in the set. The method also comprises, after writing data from the first sequence of DTC write operations to all W ways in the partition in the set, and before any data from the partition in the set has been read, receiving a second sequence of at least two DTC write operations that are write misses to the partition in the set, and in response to receiving the second sequence of DTC write operations, completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set.
Example A2 is a method according to Example A1, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to memory.
Example A3 is a method according to Example A1, wherein the ways in the partition in the set comprise a staging way, and the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to the staging way.
Example A4 is a method according to Example A1, wherein the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises determining whether the partition in the set comprises a way with an age attribute of FREE, and if the partition in the set comprises a way with an age attribute of FREE, (a) writing the data from the IO component to that way and (b) updating the age attribute of that way to NEW. Example A4 may also include the features of Example A2 or Example A3.
Example A5 is a method according to Example A4, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises: if the partition in the set does not comprise a way with an age attribute of FREE, determining whether the partition in the set comprises a way with an age attribute of NEW; and if the partition in the set comprises a way with an age attribute of NEW, (a) updating the age attribute of that way to AGED and (b) completing an individual DTC write operation from the second sequence of DTC write operations without writing the data from that individual DTC write operation to that way.
Example A6 is a method according to Example A4, and further comprising, in response to a read operation that hits one of the ways in the partition in the set, updating the age attribute of that way to FREE. Example A6 may also include the features of Example A5.
Example A7 is a method according to Example A6, and further comprising: determining whether all of the ways in the partition in the set have age attributes at a maximum age; and in response to determining that all of the ways in the partition in the set have age attributes at the maximum age, updating the age attributes for all of the ways in the partition in the set to FREE.
Example A8 is a method according to Example A7, wherein the maximum age is based on a maximum age parameter, and the method further comprises, when the maximum age parameter specifies a maximum age of TWICE-AGED, in response to an individual DTC write operation from the second sequence of DTC write operations, if all of the ways in the partition in the set have age attributes of AGED, updating the age attribute for one of the ways in the partition in the set to TWICE-AGED.
Example A9 is a method according to Example A1, wherein the operations of (i) writing data from the first sequence of DTC write operations to the partition in the set and (ii) completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set are performed by a caching agent in a processor package in a data processing system. Also, the DTC write operations from the first and second sequences involve data from a NIC in the data processing system. Example A9 may also include the features of any one or more of Examples A2-A8.
Example B1 is a processor package comprising an integrated circuit, a cache in the integrated circuit, and a caching agent in the integrated circuit. Also, the caching agent is operable to, in response to receiving a first sequence of DTC write operations that are write misses to a partition in a set in the cache, write data from the first sequence of DTC write operations to the partition in the set, wherein the partition in the set comprises W ways, W is greater than two, and the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises writing data from the first sequence of DTC write operations to all W ways in the partition in the set. In addition, the caching agent is operable to, after writing data from the first sequence of DTC write operations to all W ways in the partition in the set, and before any data from the partition in the set has been read, receive a second sequence of at least two DTC write operations that are write misses to the partition in the set, and in response to receiving the second sequence of DTC write operations, complete the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set.
Example B2 is a processor package according to Example B1, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to memory.
Example B3 is a processor package according to Example B1, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to a staging way among the ways in the partition in the set.
Example B4 is a processor package according to Example B1, wherein the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises determining whether the partition in the set comprises a way with an age attribute of FREE, and if the partition in the set comprises a way with an age attribute of FREE, (a) writing data from the first sequence of DTC write operations to that way and (b) updating the age attribute of that way to NEW. Example B4 may also include the features of Example B2 or Example B3.
Example B5 is a processor package according to Example B4, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises: if the partition in the set does not comprise a way with an age attribute of FREE, determining whether the partition in the set comprises a way with an age attribute of NEW; and if the partition in the set comprises a way with an age attribute of NEW, (a) updating the age attribute of that way to AGED and (b) completing an individual DTC write operation from the second sequence of DTC write operations without writing the data from that individual DTC write operation to that way.
Example B6 is a processor package according to Example B4, wherein the caching agent is operable to perform further operations comprising, in response to a read operation that hits one of the ways in the partition in the set, updating the age attribute of that way to FREE. Example B6 may also include the features of Example B5.
Example B7 is a processor package according to Example B6, wherein the caching agent is operable to perform further operations comprising: determining whether all of the ways in the partition in the set have age attributes at a maximum age; and in response to determining that all of the ways in the partition in the set have age attributes at the maximum age, updating the age attributes for all of the ways in the partition in the set to FREE.
Example B8 is a processor package according to Example B7, wherein the maximum age is based on a maximum age parameter, and the caching agent is operable to perform further operations comprising, when the maximum age parameter specifies a maximum age of TWICE-AGED, in response to an individual DTC write operation from the second sequence of DTC write operations, if all of the ways in the partition in the set have age attributes of AGED, updating the age attribute for one of the ways in the partition in the set to TWICE-AGED.
Example C1 is a data processing system comprising RAM, a processor package in communication with the RAM, an integrated circuit in the processor package, a cache in the integrated circuit, and a caching agent in the integrated circuit. Also, the caching agent is operable to, in response to receiving a first sequence of DTC write operations that are write misses to a partition in a set in the cache, write data from the first sequence of DTC write operations to the partition in the set, wherein the partition in the set comprises W ways, W is greater than two, and the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises writing data from the first sequence of DTC write operations to all W ways in the partition in the set. In addition, the caching agent is operable to, after writing data from the first sequence of DTC write operations to all W ways in the partition in the set, and before any data from the partition in the set has been read, receive a second sequence of at least two DTC write operations that are write misses to the partition in the set, and in response to receiving the second sequence of DTC write operations, complete the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set.
Example C2 is a data processing system according to Example C1, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to the RAM.
Example C3 is a data processing system according to Example C1, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises writing data from the second sequence of DTC write operations to a staging way among the ways in the partition in the set.
Example C4 is a data processing system according to Example C1, wherein the operation of writing data from the first sequence of DTC write operations to the partition in the set comprises determining whether the partition in the set comprises a way with an age attribute of FREE, and if the partition in the set comprises a way with an age attribute of FREE, (a) writing data from the first sequence of DTC write operations to that way and (b) updating the age attribute of that way to NEW. Example C4 may also include the features of Example C2 or Example C3.
Example C5 is a data processing system according to Example C4, wherein the operation of completing the second sequence of DTC write operations while retaining the data from the first sequence of DTC write operations in at least W−1 of the ways in the partition in the set comprises: if the partition in the set does not comprise a way with an age attribute of FREE, determining whether the partition in the set comprises a way with an age attribute of NEW; and if the partition in the set comprises a way with an age attribute of NEW, (a) updating the age attribute of that way to AGED and (b) completing an individual DTC write operation from the second sequence of DTC write operations without writing the data from that individual DTC write operation to that way.
Example C6 is a data processing system according to Example C4, wherein the caching agent is operable to perform further operations comprising, in response to a read operation that hits one of the ways in the partition in the set, updating the age attribute of that way to FREE. Example C6 may also include the features of Example C5.
Example C7 is a data processing system according to Example C6, wherein the caching agent is operable to perform further operations comprising: determining whether all of the ways in the partition in the set have age attributes at a maximum age; and in response to determining that all of the ways in the partition in the set have age attributes at the maximum age, updating the age attributes for all of the ways in the partition in the set to FREE.
Example C8 is a data processing system according to Example C7, wherein the maximum age is based on a maximum age parameter, and the caching agent is operable to perform further operations comprising, when the maximum age parameter specifies a maximum age of TWICE-AGED, in response to an individual DTC write operation from the second sequence of DTC write operations, if all of the ways in the partition in the set have age attributes of AGED, updating the age attribute for one of the ways in the partition in the set to TWICE-AGED.
Example C9 is a data processing system according to Example C8, further comprising a NIC, and the DTC write operations from the first and second sequences involve data from the NIC.
Example D is a processor package comprising means to perform a method as recited in any of Examples A1-A9.
Example E is a data processing system comprising means to perform a method as recited in any of Examples A1-A9.
References to “one example,” “an example,” etc., indicate that the example described may include a particular feature, structure, or characteristic, but not every example necessarily includes that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same example. Further, when a particular feature, structure, or characteristic is described in connection with an example, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other examples, whether or not explicitly described.
Moreover, in the various examples described above, unless specifically noted otherwise, disjunctive language such as the phrase “at least one of A, B, or C” or “A, B, and/or C” is intended to be understood to mean either A, B, or C, or any combination thereof (i.e., A and B; A and C; B and C; and A, B, and C).
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.