Patch management is a commonly used phrase to describe a process of distributing and applying updates (i.e., patches) to software. The updates may add features, fix bugs, remedy vulnerabilities, or change the software in some other manner. A Wide Area Network (WAN) is capable of connecting large numbers of computing systems operated/controlled by an entity (e.g., business). There is risk involved when software is updated because the software being updated, and any other software that relies upon the software being updated, will be unavailable during the update. Accounting for software being unavailable on a computing system can be difficult, if not impossible, when large numbers of computing systems are involved and need to be updated. Therefore, handling patch management on those computing systems can still lead to undesired outages affecting the performance of activities by the computing systems in the aggregate.
The technology disclosed herein enables deployment of updates to computing systems connected to a Wide Area Network (WAN). In a particular example, a method includes identifying an update to be performed on computing systems connected to a Wide Area Network (WAN) and identifying first attributes of a first computing system of the computing systems. The method further includes determining an update time in which the update should be performed at the first computing system based on the first attributes relative to other attributes of other ones of the computing systems. The method also includes performing the update at the first computing system at the update time.
In other examples, an apparatus performs the above-recited method and computer readable storage media directs a processing system to perform the above-recited method.
The update services described herein optimize the deployment of software updates across computing systems connected to a WAN. Specifically, the deployment of a software update is optimized to reduce risk associated with downtime during the software update. The risk of a particular update relates to an impact of the downtime on activities (e.g., services) being performed by the affected software component(s). For instance, the computing systems of a particular entity may be providing a computing service over the WAN. If all the systems are updated at once, the entire service will be unavailable until the software update is complete. Thus, the update services herein perform software updates in a manner that avoids complete outage of any service/activity being performed by the computing systems being updated (e.g., update a subset of the computing systems while another subset remains active). If it is not possible to avoid a complete outage (either systemwide or in a particular region), then the update service performs the update when the outage will have less of an impact (e.g., when usage of the software is relatively low).
Moreover, it is prudent to update the systems in phases rather than all at once to avoid disruptions caused by a bad update (e.g., an update that includes a bug). For instance, if all systems providing a particular service are updated and that update prevents the systems from effectively providing the service, then the service is unavailable until the issue caused by the update is resolved. Thus, only updating a portion of the systems enables other systems providing the service, which have not received the update, to continue operating. Should the update prove to be reliable, then the other systems can be updated as well. Similarly, updates to systems supporting more important activities (i.e., higher risk systems) may be performed after the updates have already proven themselves to be stable.
In operation, computing systems 102 execute software to perform activities thereon that involve communications over WAN 103 on behalf of an operating entity. For instance, computing system 102-1 may host a network application that is accessible over WAN 103. As with most modern software, the software executing on computing systems 102 will require updating at some point. The software may include applications, operating systems, virtualization software (e.g., hypervisors, container engines, container orchestration platforms, etc.), device firmware, or any other type of program instructions executing on computing systems 102 that can be updated over a network. While only two of computing systems 102 are shown, any number of computing systems may be included in computing systems 102. Also, each of computing systems 102 may be identical or computing systems 102 may include different types of computing systems (e.g., computing systems with different purposes, from different vendors, etc.). Update service 101 is implemented by one or more computing systems that may be similar in architecture to those of computing systems 102. Update service 101 may communicate with computing systems 102 over WAN 103, as shown, or may communicate over an out-of-band channel. Update service 101 is also operated by, or otherwise controlled by, the entity and optimizes updates to the software on computing systems 102 to reduce or remove any impact of downtime caused by the updates. Operation 200 is performed by update service 101 to optimize when updates to computing systems 102 occur.
Update service 101 also identifies attributes of computing system 102-1 (202). The attributes may include computing system 102-1's system type (e.g., hardware specs, manufacturer, model number, etc.), an amount of WAN 103 bandwidth managed by computing system 102-1, an average utilization of that bandwidth by computing system 102-1, the geographic location of computing system 102-1, a business unit associated with computing system 102-1 or software executing thereon that will be affected by the update (i.e., subject to the update itself or will otherwise be down during the update), the current patch level of the software being updated, status of workloads executing on computing system 102-1 (e.g., activities being performed by the software being updated or by other software that will be affected by the update), or any other type of information that may be pertinent to when update service 101 determines the update should be applied. In some examples, the attributes may be what indicate which of computing systems 102 are executing the software to which the update will be applied, but update service 101 may identify those computing systems in some other manner. A portion of the attributes may be received from computing system 102-1 itself or from some other system monitoring computing system 102-1. For example, computing system 102-1 may report on the status of workloads thereon or another system controlling the workloads may report the status. A portion of the attributes may be calculated by update service 101 itself based on previous attributes received (e.g., to calculate average bandwidth utilization). A portion of the attributes may be received from other components managing computing systems 102 or from information received from a user. For instance, update service 101 may be a component of a control plane that manages computing systems 102 and maintains information about computing systems 102, such as their business unit, geographic location, etc. Update service 101 need not wait until an update is identified to receive the attributes. Rather, attributes may be received periodically to ensure they are up to date whenever an update is received.
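As a minimal, non-limiting sketch of how such attributes might be represented and kept current, the following Python structure uses hypothetical field names (not drawn from the figures) for a subset of the attributes listed above:

```python
# Illustrative only: a subset of the attributes listed above, with assumed
# field names, and a helper that records periodically reported attributes.
from dataclasses import dataclass, field

@dataclass
class SystemAttributes:
    system_id: str
    system_type: str                  # e.g., manufacturer/model of the system
    managed_bandwidth_mbps: float     # WAN bandwidth managed by the system
    avg_utilization: float            # average fraction of that bandwidth used
    geographic_region: str            # e.g., country or site identifier
    business_unit: str                # business unit affected by the update
    patch_level: str                  # current patch level of the software
    workload_status: dict = field(default_factory=dict)  # per-workload state

def record_attributes(store: dict, reported: SystemAttributes) -> None:
    """Keep the latest report so attributes are current when an update arrives."""
    store[reported.system_id] = reported
```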
Update service 101 determines an update time in which the update should be performed at computing system 102-1 based on the attributes of computing system 102-1 relative to other attributes of other ones of computing systems 102 (203). The attributes of the other computing systems may be similar to those described above for the attributes of computing system 102-1 and may, likewise, be received by update service 101 in a similar manner. The attributes of the other computing systems are considered to better ensure computing system 102-1 is updated at a time that causes a minimal impact to the operations of computing systems 102 in the aggregate. For instance, computing system 102-1 may be one of two or more computing systems that provide a service in a particular geographic region (e.g., in a particular country). Update service 101 may choose an update time that ensures computing system 102-1 is not updated at the same time as the other computing systems in the region so that all of the computing systems providing the service to the region are not down at once (e.g., if they all were down at the same time, the service may either not be provided in that region during the downtime or the service may be provided by systems outside of the region, which may introduce undesired latency or improper region-specific service). In a similar example, update service 101 may determine a timeframe when the load on computing system 102-1 is at a relatively low level and then select the update time from within that timeframe. Other factors and combinations of factors may be considered in other examples.
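The following is a minimal Python sketch of one way such an update time could be selected: slots already reserved by other systems in the same region are avoided, and a low-load hour is preferred. The hour-granularity slots, parameter names, and data shapes are illustrative assumptions, not a definitive implementation:

```python
# Illustrative only: pick an hour-of-day slot for `target` that no same-region
# peer has reserved, preferring the hour with the lowest observed load.
def choose_update_time(target, peers, reserved_slots, hourly_load):
    same_region = {p.system_id for p in peers
                   if p.geographic_region == target.geographic_region}
    taken = {slot for sys_id, slot in reserved_slots.items()
             if sys_id in same_region}
    candidates = [hour for hour in range(24) if hour not in taken]
    if not candidates:                 # every slot reserved; fall back to all
        candidates = list(range(24))
    # Prefer the candidate hour with the lowest load on the target system.
    return min(candidates, key=lambda hour: hourly_load.get(hour, 0.0))
```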
In some examples, update service 101 may factor in other updates that have yet to be performed on computing system 102-1 when determining the update time. For instance, the update of this present operation may be at a higher level of the software stack executing on computing system 102-1 than a second update for which update service 101 is also determining an update time. When update service 101 performs the second update, the software to be updated by the present update will also be affected. So that software does not have to be affected twice (i.e., due to the performance of the two updates at different times), update service 101 may determine a single update time to perform both updates. While only software/firmware updates have been discussed thus far, update service 101 may also be able to schedule update times for hardware components of computing system 102-1 (e.g., may receive notification from a technician that a component will be changed and respond to the technician with a time to take computing system 102-1 offline to perform the change). Similarly, even if update service 101 is not used to schedule a hardware update, update service 101 may determine when an update has occurred (e.g., by receiving updated computing system 102-1 attributes) and perform the update when computing system 102-1 is back online (i.e., to extend the downtime that has already occurred due to the hardware change).
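A minimal sketch of coalescing pending updates into a single window, assuming an illustrative layer ordering (lower layers applied first so higher-layer software is only taken down once), might look as follows:

```python
# Illustrative only: order pending updates for one system from the lowest
# layer upward so that software higher in the stack is only taken down once,
# within a single shared update window. The layer ordering is an assumption.
LAYER_ORDER = {"firmware": 0, "operating_system": 1,
               "container_engine": 2, "application": 3}

def coalesce_updates(pending_updates):
    return sorted(pending_updates,
                  key=lambda update: LAYER_ORDER.get(update["layer"], 99))
```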
In some examples, the update time for computing system 102-1 may be dependent upon a risk score determined for computing system 102-1 by update service 101. For instance, the risk score may represent risk on a numeric scale (e.g., 1-10) with one end of the scale representing higher risk and the other representing lower risk. The risk score may be determined from at least a subset of the attributes of computing system 102-1. The risk score indicates an impact on computing systems 102 should the operation of computing system 102-1 be disrupted by the update (e.g., the update may include a bug that prevents computing system 102-1 from performing properly). For example, if the update affects a workload on computing system 102-1 that is very important to a critical business unit or is critical to the operations of other workloads on computing systems 102, then the risk score may be set relatively high compared to less important workloads or workloads for less important business units.
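As a minimal sketch, and assuming the attribute structure from the earlier example plus a hypothetical set of critical business units, a 1-10 risk score might be derived along the following lines:

```python
# Illustrative only: derive a 1-10 risk score from a subset of attributes.
# The weights and the set of "critical" business units are assumptions.
CRITICAL_UNITS = {"payments", "core-networking"}   # hypothetical examples

def risk_score(attrs) -> int:
    score = 1
    if attrs.business_unit in CRITICAL_UNITS:
        score += 4                                  # impacts a critical unit
    score += round(4 * min(attrs.avg_utilization, 1.0))  # heavily used system
    if attrs.workload_status.get("depended_on_by_others"):
        score += 1                                  # other workloads rely on it
    return max(1, min(score, 10))                   # clamp to the 1-10 scale
```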
Update service 101 similarly determines a risk score for others of computing systems 102 that will also be updated. Update service 101 determines a relative risk of computing system 102-1 relative to those other computing systems. The relative risk indicates how risky it is to update computing system 102-1 relative to how risky it is to update others of computing systems 102. Generally, update service 101 will apply the update to computing systems with lower relative risk before those having higher relative risk. Update service 101 may apply an update policy that specifically defines when computing systems should be updated based on risk. For example, the policy may define that a certain percentage of the computing systems having the lowest relative risk should be updated first followed by one or more groups of higher relative risk computing systems. It is also possible for the relative risk to be equal across computing systems. Update service 101 may still stagger updates to those computing systems to ensure a defective update does not take down all systems at once, as already discussed above. While the update policy may define update groups for computing systems 102 to stagger performance of the update, update service 101 still determines a best fit update time for computing system 102-1 within whichever group the policy places computing system 102-1. For instance, update service 101 may determine that computing system 102-1 has a low relative risk and is in the first update group. All computing systems in that first update group may be updated over the course of a day, with update service 101 selecting an update time for computing system 102-1 within that day, and then update service 101 may allow a few days to pass (e.g., to allow time to identify bugs) before starting updates to a second update group defined by the policy.
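A minimal Python sketch of such a policy, assuming a hypothetical rule that the lowest-risk fraction of the computing systems forms the first update group, might be:

```python
# Illustrative only: order systems by relative risk and split them into
# staggered update groups; the first-group fraction is an assumed policy value.
def build_update_groups(scores, first_group_fraction=0.2):
    ordered = sorted(scores, key=scores.get)        # lowest relative risk first
    first_n = max(1, int(len(ordered) * first_group_fraction))
    return [ordered[:first_n], ordered[first_n:]]   # low-risk group, then the rest
```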
After update service 101 has determined the update time for computing system 102-1, update service 101 performs the update at computing system 102-1 at the update time (204). To perform the update, update service 101 may transfer the program instructions comprising the update to computing system 102-1 (e.g., in a file) and instruct computing system 102-1 to implement the update. Alternatively, update service 101 may instruct computing system 102-1 to perform the update at the update time in examples where computing system 102-1 retrieves the update program instructions itself. Other manners of performing the update at computing system 102-1 may also be used at the time instructed by update service 101. Update service 101 similarly performs the update at other ones of computing systems 102 at update times determined for each respective one of those other systems. By coordinating the update times across all computing systems 102, the impact of the updated software being down during the update is minimized by update service 101.
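For illustration only, the two delivery styles described above might be sketched as follows, where wait_until, transfer_file, and send_command are assumed placeholder helpers rather than an actual API:

```python
# Illustrative only: either push the update payload and trigger it, or
# instruct the system to fetch and apply the update itself at the scheduled
# time. All helpers and update fields are assumed placeholders.
def perform_update(system_id, update, update_time, push=True):
    wait_until(update_time)                        # block until the update time
    if push:
        transfer_file(system_id, update.payload)   # push the update payload
        send_command(system_id, "apply_update", update.version)
    else:
        # The system retrieves the update program instructions itself.
        send_command(system_id, "fetch_and_apply", update.source_url)
```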
In operation, SD-WAN edges 302 may be operated by an entity. A typical SD-WAN setup enables those of SD-WAN edges 302 located at branch sites of the entity to communicate with one or more of SD-WAN edges 302 located at a data center for the entity. SD-WAN edges 302 are also able to communicate with cloud services (not shown) over WAN 303. Each of SD-WAN edges 302 determines which of their respective wired/wireless links 313 should be used for any given traffic (e.g., one link may be down, overloaded, or otherwise unable to properly send data and the other link may be used instead). SD-WAN edges 302 may simply be gateways to WAN 303 for other computing systems at their respective sites or may execute processes thereon that communicate over WAN 303 (e.g., provide a service over WAN 303). Other examples may include more than just two redundant connections to WAN 303. Similarly, in some examples, SD-WAN edges 302 may include connections that do not travel through WAN 303. For instance, a Multi-Protocol Label Switching (MPLS) link may connect one or more of the SD-WAN edges 302 located at a branch site to one of SD-WAN edges 302 located at the data center. Those of SD-WAN edges 302 at branch sites may then also select the MPLS link for data being sent to the data center instead of sending the data over one of wired/wireless links 313. Like out-of-band link 331 and out-of-band links 332, the MPLS links may be carried by networking hardware also used by WAN 303.
In this example, SD-WAN edge 302-1 executes containerized applications 411-414, which themselves work independently, or together, to provide one or more services over WAN 303. SD-WAN edge 302-1 may be dedicated to supporting the activities of containerized applications 411-414 or may also act as a gateway to WAN 303 for other computing systems co-located with SD-WAN edge 302-1. Containerized applications are applications executing within containers provided by container engine 403. Container engine 403 is software that virtualizes operating system resources between the containerized applications executing thereon, which are containerized applications 411-414 in this example. While this example focuses on containerized applications, it should be understood that other types of applications may also be updated using similar mechanisms. For example, a computing system may execute applications in virtual machines or natively on operating system 402.
Update service 301 also receives attributes from SD-WAN edges 302 (502). At least a portion of the attributes that will be used by update service 301 to determine an update time may already be accessible to update service 301 (e.g., stored by update service 301 or otherwise accessible through control plane 321). The attributes may be received at any time even though they are shown as being received after the updates are identified. In some examples, all attributes needed by update service 301 to determine update times may already be available to update service 301 and retrieval from SD-WAN edges 302 is unnecessary. Similarly, in some examples, update service 301 may receive updates to attributes that may change over time to ensure the attributes being considered for update times are up to date. For example, the WAN traffic bandwidth being handled by SD-WAN edge 302-1 over time may change (e.g., while SD-WAN edge 302-1 may have been handling a large amount of bandwidth previously, an updated attribute may show that the amount has fallen off recently).
In operation 500, rather than converting the attributes directly into an update time, update service 301 generates risk scores for SD-WAN edges 302 from the attributes (503). A risk score indicates the impact on the entity, or a division within the entity (e.g., business unit), should software affected by the update (i.e., the software being updated or other software that will also not function during the update), or the edge as a whole, be adversely affected by the update. As an example, if SD-WAN edge 302-1 provides a vital service on behalf of a business unit, then update service 301 may consider SD-WAN edge 302-1 a high-risk edge and provide an appropriate risk score for that level of risk. For instance, update service 301 may assign SD-WAN edge 302-1 a risk score of 8 out of 10, with 10 being the riskiest, due to SD-WAN edge 302-1's importance, but may not go closer to 10 because the attributes of SD-WAN edges 302 may have indicated that another of SD-WAN edges 302 serves as a backup to SD-WAN edge 302-1 and may be able to handle its duties while SD-WAN edge 302-1 is down for updating.
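A minimal sketch of that adjustment, using assumed attribute names (vital_service, backs_up, edge_id) purely for illustration, might be:

```python
# Illustrative only: a vital edge starts with a high score, and the presence
# of a backup edge in the fleet attributes pulls the score down slightly
# (e.g., toward 8 rather than 10). Attribute names are assumptions.
def edge_risk_score(edge_attrs, fleet_attrs) -> int:
    score = 10 if edge_attrs.get("vital_service") else 5
    has_backup = any(other.get("backs_up") == edge_attrs["edge_id"]
                     for other in fleet_attrs)
    if has_backup:
        score -= 2          # another edge can cover while this one updates
    return max(1, min(score, 10))
```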
Update service 301 organizes SD-WAN edges 302 into update groups based on their respective risk scores (504). Update service 301 may apply an update policy to form the groups. The update policy may use any convention for organization but, preferably, the lowest risk of SD-WAN edges 302 are updated first (e.g., those closest to 1 on the above-mentioned scale from 1-10). If an issue occurs during the update or is caused by the update (e.g., a bug in the update), updating lower risk edges first reduces impact to the entity relative to that which may have occurred had higher risk edges been updated first. Similarly, the update policy may update a relatively small number of edges first to minimize the impact of any potential update-related issues.
In this example, prior to determining update times for each of the edges in a first update group, update service 301 refreshes the attributes of those edges in the first update group (505). Refreshing the attributes again ensures the most up to date information is being used to determine an optimal update time for each edge. Since the first group may be updated close enough in time to when the attributes were received above, refreshing the attributes at this point may not be necessary in some cases. Update service 301 uses the attributes to determine when the first group of SD-WAN edges 302 should be updated (506). In some examples, update service 301 may determine that every edge in the group can be updated at the same time. In other examples, update service 301 may determine different update times for at least a portion of the edges in the group. For instance, while the attributes of one edge may enable it to be updated during the day, another edge may need to be updated overnight. In another example, update service 301 may recognize that applying the update to all edges in the first group at the same time will cause all instances of containerized application 412 in a particular geographic region to go down at the same time. Since that may not be desirable (e.g., in accordance with an update policy defining allowed/disallowed update conditions), update service 301 may stagger the updates to ensure a required number of containerized application 412 instances remain operational at any given time. Update service 301 performs the updates at each of the first group of SD-WAN edges 302 at the respective update times determined above (507). In some examples, all three of the updates may not be performed at exactly the same time due to certain updates needing to be performed sequentially. For instance, it may not be possible for containerized applications to be updated at the same time as container engine 403. Update service 301 may then update containerized application 412 and containerized application 413 just before or just after container engine 403 to minimize the instances in which those applications go down.
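As a minimal sketch of staggering update times within a group so that a required number of application instances remain operational per region, assuming an illustrative mapping of regions to edges and hour-long update slots:

```python
# Illustrative only: assign each edge in a group a start offset so that no
# region ever has fewer than `min_live_instances` application instances up
# while its peers are updating. Slot length and data shapes are assumptions.
def stagger_group(edges_by_region, min_live_instances, slot_hours=1):
    schedule = {}
    for region, edges in edges_by_region.items():
        # Number of edges allowed to update concurrently in this region;
        # if the region has too few edges, fall back to one at a time.
        concurrent = max(1, len(edges) - min_live_instances)
        for index, edge_id in enumerate(edges):
            schedule[edge_id] = (index // concurrent) * slot_hours
    return schedule
```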
After performing the updates, update service 301 determines whether all groups of SD-WAN edges 302 have been updated (508). If yes, update service 301 ends operation 500 and waits until another update is identified to restart the process. In this case, since update service 301 has only updated the first group, update service 301 repeats step 505 on the next group of edges to refresh the attributes of the edges in the next group prior to determining update times for edges in that group. In some examples, update service 301 may wait a period of time before refreshing attributes due to a desired waiting period (e.g., indicated by an update policy) for determining whether issues occurred due to the update in the previously updated group. If an issue did occur, update service 301 may be notified to hold off on performing further updates until a solution is found. In those examples, update service 301 may proceed with determining update times for software not affected (e.g., if the issue was caused in containerized application 412's update, update service 301 may continue its update process for the updates to container engine 403 and containerized application 413). Waiting to refresh the attributes means that, when the time comes to determine update times for the next group, the attributes on which those update times are based will be up to date. After refreshing the attributes, update service 301 determines update times for the next group (506) and updates the edges in the group at those times (507). Update service 301 again determines whether any groups remain to be updated and, if there are groups, update service 301 repeats steps 505-508 accordingly.
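A minimal sketch of this group-by-group loop, with refresh_group_attributes, determine_update_times, apply_updates, wait_days, and updates_on_hold as assumed placeholder helpers corresponding to steps 505-508, might be:

```python
# Illustrative only: the group-by-group loop, with a soak period after each
# group and a hold check before starting the next one. All helpers are
# assumed placeholders corresponding to steps 505-508.
def run_update_rollout(groups, soak_days=2):
    for group in groups:
        attrs = refresh_group_attributes(group)               # step 505
        update_times = determine_update_times(group, attrs)   # step 506
        apply_updates(group, update_times)                    # step 507
        wait_days(soak_days)                       # watch for update issues
        if updates_on_hold():                      # issue found in the group
            break                                  # stop before the next group
```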
Also, while there is only one overlap period in each of the update-time graphs above, multiple overlaps may exist. For instance, update timeframe 902 may be split into multiple segments rather than being one contiguous period. It should further be understood that, when determining the update timeframes for each update, update service 301 also considers the impact on other software not being updated. In the case of SD-WAN edge 302-1, containerized application 411 and containerized application 414 are not being updated. Update service 301 also factors in attributes of those applications, which will have to go down when container engine 403 is updated. While the above examples only discuss updates to container engine 403 and containerized applications thereon, other examples may involve fewer/additional or different layers, such as operating system 402, a container orchestration platform, firmware, or physical hardware 401. In any example, update service 301 will always be aware of the effect updating one layer will have on another.
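As a minimal sketch of finding a shared downtime window between two update timeframes (represented here, by assumption, as simple start/end pairs), which can be repeated over the segments of a split timeframe to yield multiple overlap periods:

```python
# Illustrative only: the overlap between two update timeframes, each
# represented as a (start, end) pair; None means the frames do not overlap.
def overlap(frame_a, frame_b):
    start = max(frame_a[0], frame_b[0])
    end = min(frame_a[1], frame_b[1])
    return (start, end) if start < end else None
```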
Communication interface 1160 comprises components that communicate over communication links, such as network cards, ports, radio frequency (RF), processing circuitry and software, or some other communication devices. Communication interface 1160 may be configured to communicate over metallic, wireless, or optical links. Communication interface 1160 may be configured to use Time Division Multiplex (TDM), Internet Protocol (IP), Ethernet, optical networking, wireless protocols, communication signaling, or some other communication format—including combinations thereof. Communication interface 1160 may be configured to communicate with one or more web servers and other computing systems via one or more networks.
Processing system 1150 comprises a microprocessor and other circuitry that retrieves and executes operating software from storage system 1145. Storage system 1145 may include volatile and nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Storage system 1145 may be implemented as a single storage device but may also be implemented across multiple storage devices or sub-systems. Storage system 1145 may comprise additional elements, such as a controller to read operating software from the storage systems. Examples of storage media include random access memory, read only memory, magnetic disks, optical disks, and flash memory, as well as any combination or variation thereof, or any other type of storage media. In some implementations, the storage media may be a non-transitory storage media. In some instances, at least a portion of the storage media may be transitory. In no examples would storage media of storage system 1145, or any other storage medium herein, be considered a transitory form of signal transmission (often referred to as “signals per se”), such as a propagating electrical or electromagnetic signal or carrier wave.
Processing system 1150 is typically mounted on a circuit board that may also hold the storage system. The operating software of storage system 1145 comprises computer programs, firmware, or some other form of machine-readable program instructions. The operating software of storage system 1145 comprises update service 1130. The operating software on storage system 1145 may further include an operating system, utilities, drivers, network interfaces, applications, or some other type of software. When read and executed by processing system 1150, the operating software on storage system 1145 directs computing system 1100 to operate update services as described herein.
In at least one implementation, update service 1130 directs processing system 1150 to identify an update to be performed on computing systems connected to a Wide Area Network (WAN) and identify first attributes of a first computing system of the computing systems. Update service 1130 further directs processing system 1150 to determine an update time in which the update should be performed at the first computing system based on the first attributes relative to other attributes of other ones of the computing systems. Update service 1130 also directs processing system 1150 to perform the update at the first computing system at the update time.
The included descriptions and figures depict specific implementations to teach those skilled in the art how to make and use the best mode. For teaching inventive principles, some conventional aspects have been simplified or omitted. Those skilled in the art will appreciate variations from these implementations that fall within the scope of the invention. Those skilled in the art will also appreciate that the features described above can be combined in various ways to form multiple implementations. As a result, the invention is not limited to the specific implementations described above, but only by the claims and their equivalents.