Benefit is claimed under 35 U.S.C. 119(a)-(d) to Foreign Application Serial No. 202341045741 filed in India entitled “CERTIFICATE MANAGEMENT IN REMOTE COLLECTORS WITH HIGH AVAILABILITY”, on Jul. 7, 2023, by VMware, Inc., which is herein incorporated in its entirety by reference for all purposes.
The present disclosure relates to computing environments, and more particularly to methods, techniques, and systems for managing certificates in remote collectors with high availability in a computing environment.
In application/operating system (OS) monitoring environments, a management node that runs a monitoring tool (i.e., a monitoring application) may communicate with multiple endpoints (e.g., virtual computing instances (VCIs)) to monitor the endpoints via a remote collector (e.g., a cloud proxy). For example, an endpoint may be implemented in a physical computing environment, a virtual computing environment, or a cloud computing environment. Further, the endpoints may execute different applications via virtual machines (VMs), physical host computing systems, containers, and the like. In such environments, the endpoints may send performance data/metrics (e.g., application metrics, operating system metrics, and the like) from underlying operating system and/or services to the remote collector. Further, the remote collector may provide the performance metrics to the monitoring tool for storage and performance analysis (e.g., to detect and diagnose issues).
The drawings described herein are for illustrative purposes and are not intended to limit the scope of the present subject matter in any way.
Examples described herein may provide an enhanced computer-based and/or network-based method, technique, and system to manage certificates in remote collectors with high availability in a computing environment. The paragraphs [0010] to [0019] present an overview of the computing environment, existing methods to manage certificates in remote collectors, and drawbacks associated with the existing methods.
The computing environment may be a virtual computing environment (e.g., a cloud computing environment, a virtualized environment, and the like). The virtual computing environment may be a pool or collection of cloud infrastructure resources designed for enterprise needs. The resources may be a processor (e.g., a central processing unit (CPU)), memory (e.g., random-access memory (RAM)), storage (e.g., disk space), and networking (e.g., bandwidth). Further, the virtual computing environment may be a virtual representation of the physical data center, complete with servers, storage clusters, and networking components, all of which may reside in virtual space being hosted by one or more physical data centers. The virtual computing environment may include multiple physical computers (e.g., servers) executing different computing-instances or workloads (e.g., virtual machines, containers, and the like). The workloads may execute different types of applications or software products. Thus, the computing environment may include multiple endpoints such as physical host computing systems, virtual machines, software defined data centers (SDDCs), containers, and/or the like.
Further, performance monitoring of the endpoints has become increasingly important because performance monitoring may aid in troubleshooting (e.g., to rectify abnormalities or shortcomings, if any) the endpoints, provide better health of data centers, analyse the cost, capacity, and/or the like. An example performance monitoring tool or application or platform may be VMware® vRealize Operations (vROps), VMware Wavefront™, Grafana, and the like. Such performance monitoring tools may be used to monitor a datacentre on a private, public, and/or hybrid cloud.
In some examples, the endpoints may include monitoring agents (e.g., Telegraf™, Collectd, Micrometer, and the like) to collect the performance metrics from the respective endpoints and provide, via a network, the collected performance metrics to a remote collector (e.g., a Cloud Proxy (CP)). For example, a monitoring agent such as Telegraf™ agent running in an endpoint may collect metrics from the endpoint and publish them to a metrics receiver. In this example, an Apache HTTPD server serves as the metrics receiver in the CP. The Apache HTTPD server running in the CP may listen on a specific location directive on port 443 to receive the metrics from the Telegraf™ agent.
Further, the remote collector may receive the performance metrics from the monitoring agents and transmit the performance metrics to a monitoring tool or a monitoring application for metric analysis. A remote collector may refer to a service/program that is installed in an additional cluster node (e.g., a virtual machine). The remote collector may allow the monitoring application (e.g., vROps Manager) to gather objects into the remote collector's inventory for monitoring purposes. The remote collector collects the data from the endpoints and then forwards the data to an application monitoring server that executes the monitoring application. For example, remote collectors may be deployed at remote location sites while the monitoring tool may be deployed at a primary location. In an example, vROps is a multi-node application that can monitor geographically distributed datacentres. In such a distributed environment, remote collectors are installed at each geo location to monitor and control endpoints at respective datacentres. These remote collectors act as a communication medium between the master node (i.e., the monitoring application) and the datacentre. Furthermore, the monitoring application may receive the performance metrics, analyse the received performance metrics, and display the analysis in the form of dashboards, for instance. The displayed analysis may facilitate visualizing the performance metrics and diagnosing a root cause of issues, if any.
In such examples, the monitoring application (e.g., vROps) may use the remote collector (e.g., a cloud proxy) to support application and operating system monitoring. The cloud proxy may install the agents on the endpoints to monitor applications and an operating system running in the endpoints. For example, the agents installed on the endpoints may include a monitoring agent (e.g., Telegraf™), a supporting agent (e.g., UCP-minion), and a configuration agent (e.g., salt-minion). In an example software-as-a-service (SaaS) platform, the cloud proxy includes a data plane provided by an Apache HTTPD web server via hypertext transfer protocol secure (HTTPS) protocol and a control plane provided via Salt. In such an example SaaS platform, each endpoint may host the monitoring agent (e.g., Telegraf Agent) for posting application and operating system metrics to the remote collector, the supporting agent for posting service discovery and health metrics to the remote collector, and the configuration agent for receiving control actions/commands from the remote collector. Further, the Telegraf agent and the UCP minion of the data plane may publish metrics to the Apache HTTPD web server running in the cloud proxy. Furthermore, the Salt minion of the control plane may communicate with the Salt master running in the remote collector. Further, control commands such as updating the agents, starting/stopping the agents, and the like may be performed via the Salt minions upon the request of the Salt master.
The remote collector may use the Apache httpd service for the data plane. The Apache httpd service may use certificate-based authentication for metrics being posted at the cloud proxy. In this example, as part of agent installation at the endpoint, client certificates or client authentication certificates (e.g., OpenSSL certificates) are placed at the endpoint. Client authentication for metrics being posted from the endpoints to the remote collector is performed using the remote collector's Certificate Authority (CA) certificate and the client certificates placed at the endpoints during the agent install operation.
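As a rough analogue of the client-authentication requirement described above, the following Python sketch configures a TLS server context that refuses clients lacking a certificate signed by a trusted CA. It uses Python's standard `ssl` module rather than the Apache httpd service, and the function name `build_receiver_context` and its `ca_file` parameter are hypothetical names chosen for illustration only.

```python
import ssl
from typing import Optional


def build_receiver_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side TLS context that demands a client certificate.

    ca_file is a hypothetical path to the remote collector's CA certificate
    in PEM format; when omitted, no trust anchor is loaded.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Fail the handshake unless the client presents a certificate that
    # chains to a CA loaded below.
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        context.load_verify_locations(cafile=ca_file)
    return context


receiver_context = build_receiver_context()
```

A real receiver would also load its own server certificate and key (for instance with `load_cert_chain`) before accepting connections; that step is omitted here because it requires certificate files.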
Further, the remote collector (e.g., the cloud proxy) may support high availability for application monitoring by deploying at least two remote collectors and linking them with a collector group. The collector group may be a virtual entity that allows the remote collectors to be grouped together. For example, cloud proxies may provide high availability within the cloud environment, in which two or more cloud proxies are grouped to form the collector group. The cloud proxy collector group may ensure that there is no single point of failure in the cloud environment. If one of the cloud proxies experiences a network interruption or becomes unavailable, the other cloud proxy from the collector group takes charge and ensures that there is no downtime. In the example of cloud proxy collector group, a “KeepaliveD” service may be utilized at the remote collector to support high availability within the collector group. The “KeepaliveD” service is a framework for both load balancing and high availability that implements a virtual router redundancy protocol (VRRP). The VRRP creates a virtual IP (or VIP, or floating IP) that acts as a gateway to route traffic from the monitored endpoints.
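The failover behaviour described above may be illustrated with a simplified sketch in which the available cloud proxy with the highest priority holds the virtual IP. This is not the KeepaliveD implementation; the `CloudProxy` class, the `priority` field, and the `elect_master` function are hypothetical names introduced for illustration, and the model ignores VRRP advertisements and timers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CloudProxy:
    name: str
    priority: int            # hypothetical: higher value wins the election
    available: bool = True


def elect_master(group: List[CloudProxy]) -> Optional[CloudProxy]:
    """Return the available proxy with the highest priority, if any."""
    candidates = [proxy for proxy in group if proxy.available]
    if not candidates:
        return None
    return max(candidates, key=lambda proxy: proxy.priority)


group = [CloudProxy("cp-1", priority=150), CloudProxy("cp-2", priority=100)]
master_before = elect_master(group)   # cp-1 holds the virtual IP

group[0].available = False            # cp-1 suffers a network interruption
master_after = elect_master(group)    # cp-2 takes charge
```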
For example, when the cloud proxy belongs to the collector group, the KeepaliveD service acts as a receiver of the metrics from the endpoints being monitored by the cloud proxy. When the cloud proxy is not a member of the collector group, the Apache httpd service may be utilized to receive the metrics from the endpoints.
In the existing architecture, the remote collector can move into or out of the collector group. In such a scenario, depending on whether the remote collector is a member of the collector group or not, the agents (e.g., the monitoring agent, the supporting agent, and the like) installed in the endpoints may have to be modified to post metrics to either the Apache httpd service or the KeepaliveD service.
The CA certificate is generated during a first bootstrap of the remote collector and the generated CA certificate may be used by httpd-south configuration to validate client requests to accept performance metric data. During the agent (e.g., the monitoring agent and the supporting agent) installation in the endpoint, the client certificate for the endpoint may be generated using the remote collector's CA certificate and is pushed to the endpoint to be used by the monitoring agent and the supporting agent while posting the performance metrics to the remote collector. If the collector group is used to accomplish high availability of application monitoring, in case of failover, the endpoint may fail to communicate to the next available remote collector in the collector group because the CA certificate differs from one remote collector to another remote collector, and the client certificate validation may not work with different remote collectors. Therefore, the cloud proxies in the collector group may have to share the same server (e.g., the cloud proxy) CA certificate to support a high-availability or load balancing operation.
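The failover problem described above can be reproduced with a toy model. The sketch below is an assumption-laden stand-in: it uses an HMAC over the endpoint identifier in place of a real X.509 signature, and the names `new_ca_key`, `issue_client_certificate`, and `validate` are hypothetical. It shows only that a client certificate issued under one collector's CA does not validate against a different collector's CA.

```python
import hashlib
import hmac
import secrets
from typing import NamedTuple


class ClientCertificate(NamedTuple):
    endpoint_id: str
    signature: str


def new_ca_key() -> bytes:
    """Stand-in for the CA generated at a collector's first bootstrap."""
    return secrets.token_bytes(32)


def _sign(ca_key: bytes, endpoint_id: str) -> str:
    return hmac.new(ca_key, endpoint_id.encode(), hashlib.sha256).hexdigest()


def issue_client_certificate(ca_key: bytes, endpoint_id: str) -> ClientCertificate:
    """Stand-in for generating a client certificate from the collector's CA."""
    return ClientCertificate(endpoint_id, _sign(ca_key, endpoint_id))


def validate(ca_key: bytes, certificate: ClientCertificate) -> bool:
    """Accept the certificate only if it was issued under this CA."""
    expected = _sign(ca_key, certificate.endpoint_id)
    return hmac.compare_digest(certificate.signature, expected)


# Each remote collector bootstraps its own, different CA.
ca_collector_a = new_ca_key()
ca_collector_b = new_ca_key()

# The endpoint's client certificate is issued under collector A's CA.
client_certificate = issue_client_certificate(ca_collector_a, "endpoint-1")

accepted_by_a = validate(ca_collector_a, client_certificate)
accepted_by_b = validate(ca_collector_b, client_certificate)  # after failover
```

In this model `accepted_by_a` is true while `accepted_by_b` is false, which mirrors why the collectors in a group need a common CA certificate.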
Examples described herein may provide a management node to manage certificates in remote collectors with high availability. In an example, the management node may generate a collector group that supports a high availability operation. The collector group may include a first remote collector to receive metrics from a first endpoint and a second remote collector to receive metrics from a second endpoint. Further, the management node may generate a CA certificate for the collector group. Furthermore, the management node may assign the CA certificate to the first remote collector and the second remote collector. Upon assigning the CA certificate, the management node may enable the collector group to validate a request to accept metrics from the first endpoint and the second endpoint based on the CA certificate.
Thus, examples described herein may provide an approach to manage the certificates of the remote collectors in the collector group for application monitoring in case of high availability, load-balance, and the like. Further, this approach can withstand failure scenarios as the data may now be able to flow to another remote collector in the collector group without any intervention. Also, examples described herein may address the security concern for the remote collectors as the CA certificate is generated for each collector group.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present techniques. However, the example apparatuses, devices, and systems, may be practiced without these specific details. Reference in the specification to “an example” or similar language means that a particular feature, structure, or characteristic described may be included in at least that one example but may not be in other examples.
Referring now to the figures,
For example, system 100 may be a data center that includes multiple endpoints 118A and 118B. In an example, an endpoint may include, but is not limited to, a virtual machine, a physical host computing system, a container, a software defined data center (SDDC), or any other computing instance that executes different applications. The endpoint can be deployed either on an on-premises platform or an off-premises platform (e.g., a cloud managed SDDC). An SDDC may refer to a data center where infrastructure is virtualized through abstraction, resource pooling, and automation to deliver Infrastructure-as-a-service (IAAS). Further, the SDDC may include various components such as a host computing system, a virtual machine, a container, or any combinations thereof. An example host computing system may be a physical computer. The physical computer may be a hardware-based device (e.g., a personal computer, a laptop, or the like) including an operating system (OS). The virtual machine may operate with its own guest operating system on the physical computer using resources of the physical computer virtualized by virtualization software (e.g., a hypervisor, a virtual machine monitor, and the like). The container may be a data computer node that runs on top of the host's operating system without the need for the hypervisor or a separate operating system.
Further, first endpoint 118A may include an application monitoring agent 134 to monitor applications, services, and/or programs running in first endpoint 118A. In an example, application monitoring agent 134 may be installed in first endpoint 118A to fetch the metrics from various components of first endpoint 118A. For example, application monitoring agent 134 may monitor first endpoint 118A in real time to collect the metrics (e.g., telemetry data) associated with an application or an operating system running in first endpoint 118A. Example application monitoring agent 134 may be a Telegraf agent, a Collectd agent, or the like. Example metrics may include performance metric values associated with at least one of central processing unit (CPU), memory, storage, graphics, network traffic, applications, or the like.
Furthermore, first endpoint 118A may include a supporting agent 132 (e.g., a UCP-minion) and a configuration agent 130 (e.g., a salt-minion). For example, supporting agent 132 may obtain service discovery metrics including a list of services running in first endpoint 118A, health metrics of application monitoring agent 134, or a combination thereof. Further, configuration agent 130 may receive control commands from a configuration master 124 of remote collector 120. For example, configuration master 124 may run as part of a docker container on second endpoint 118B that executes remote collector 120. Thus, remote collector 120 may perform the control commands such as updating the agents, starting/stopping the agents, and the like on first endpoint 118A via configuration agent 130.
As shown in
In an example, remote collector 120 may include a validation unit 126 to establish a communication between first endpoint 118A and remote collector 120 based on client certificate 136. Upon establishing the communication, validation unit 126 may enable service 128 to receive the performance metrics from first endpoint 118A.
In some examples, first endpoint 118A, second endpoint 118B, and management node 102 may be communicatively connected via a network. An example network can be a managed Internet protocol (IP) network administered by a service provider. For example, the network may be implemented using wireless protocols and technologies, such as Wi-Fi, WiMAX, and the like. In other examples, the network can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. In yet other examples, the network may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.
In an example, system 100 may include a storage device 112 to store a second CA certificate 114 for collector group 116 that shares responsibility for a monitoring function to support high availability. In an example, the monitoring function that supports high availability for collector group 116 may be a high availability failover operation. In the high availability failover operation, if one of the remote collectors (e.g., cloud proxies) experiences a network interruption or becomes unavailable, the other remote collector from the collector group takes charge and ensures that there is no downtime. In another example, the monitoring function that supports high availability for collector group 116 may be a load balancing operation. The load balancing operation may refer to a process of distributing the incoming metrics from the endpoints (e.g., first endpoint 118A) among remote collectors in collector group 116.
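The load balancing operation described above may be sketched as follows. The description does not specify a distribution policy, so round-robin is assumed here purely for illustration; the `distribute` function and the collector names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def distribute(metrics: Iterable[str], collectors: List[str]) -> Dict[str, List[str]]:
    """Assign incoming metrics to remote collectors in round-robin order."""
    assignment: Dict[str, List[str]] = {name: [] for name in collectors}
    # zip stops when the metrics run out; cycle repeats the collectors.
    for metric, name in zip(metrics, cycle(collectors)):
        assignment[name].append(metric)
    return assignment


assignment = distribute(["m1", "m2", "m3", "m4"], ["collector-1", "collector-2"])
```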
During a process of creating collector group 116, second CA certificate 114 for collector group 116 is generated at management node (e.g., vROps cluster node) and the same is distributed to all remote collectors (e.g., cloud proxies) in collector group 116. In an example, second CA certificate 114 is stored in storage device 112 using an associated collector group identifier.
As shown in
Consider that remote collector 120 is not part of collector group 116. During operation, certificate management module 106 may receive a request to add second endpoint 118B to collector group 116. In response to receiving the request, certificate management module 106 may add second endpoint 118B to collector group 116. Further, certificate management module 106 may retrieve second CA certificate 114 from storage device 112, for instance, using the collector group identifier.
Further, certificate management module 106 may replace first CA certificate 122 of remote collector 120 with second CA certificate 114 of collector group 116. In an example, certificate management module 106 may restart service 128 running in remote collector 120 using second CA certificate 114 to enable collector group 116 to validate the request to accept the metrics from first endpoint 118A. Furthermore, certificate management module 106 may enable collector group 116 to validate a request to accept the metrics from first endpoint 118A based on second CA certificate 114. In an example, certificate management module 106 may enable one of remote collectors of collector group 116 that acts as a master to validate the request to accept the metrics from an agent (e.g., application monitoring agent 134 and supporting agent 132) running in first endpoint 118A based on second CA certificate 114.
In an example, when remote collector 120 is added to collector group 116, second CA certificate 114 of collector group 116 is used to generate a client certificate 152 for first endpoint 118A. In this example, client certificate 136 of first endpoint 118A (e.g., as shown in
In this example, validation unit 126 may establish a communication between first endpoint 118A and remote collector 120 based on client certificate 152 and second CA certificate 114. Upon establishing the communication, certificate management module 106 may enable service 128 to receive the performance metrics from first endpoint 118A.
In some examples, the functionalities described in
Further, the cloud computing environment illustrated in
At 208, a user interface of monitoring application 202 may receive a request to add first cloud proxy 204A to collector group 206. At 210, monitoring application 202 may add first cloud proxy 204A to collector group 206. At 212, a CA certificate for collector group 206 may be generated and stored in a storage device associated with the monitoring application. Further at 214, a certificate of first cloud proxy 204A may be replaced by the generated CA certificate. At 216, a service (e.g., a httpd-south service) may be restarted at first cloud proxy 204A using the CA certificate.
At 218, the user interface of monitoring application 202 may receive a request to add second cloud proxy 204B to collector group 206. At 220, monitoring application 202 may add second cloud proxy 204B to collector group 206. At 222, collector group 206 may retrieve the CA certificate from the storage device. Further at 224, a certificate of second cloud proxy 204B may be replaced by the retrieved CA certificate. At 226, a service (e.g., a httpd-south service) may be restarted at second cloud proxy 204B using the CA certificate.
Thus, when a cloud proxy is added to a collector group that is being created for the first time, the CA certificate may be generated for the collector group. Further, the generated CA certificate may be maintained or stored using a collector group identifier. Furthermore, the generated CA certificate may be updated in the cloud proxy. In another example, when a cloud proxy is added to an existing collector group, the stored CA certificate may be retrieved using the collector group identifier and updated in the cloud proxy. Furthermore, a service may be restarted with the new CA certificate in each of the cloud proxies to receive metrics from the corresponding endpoints.
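The generate-on-first-use and retrieve-thereafter behaviour summarized above may be sketched as follows. The classes `CloudProxy` and `ManagementNode`, the method `add_to_group`, and the use of a random string in place of a real CA certificate are all assumptions made for illustration; the sketch shows only the bookkeeping keyed by the collector group identifier.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CloudProxy:
    name: str
    ca_certificate: str          # starts as the proxy's own bootstrap CA
    service_restarts: int = 0

    def restart_service(self) -> None:
        """Stand-in for restarting the metrics-receiving service."""
        self.service_restarts += 1


@dataclass
class ManagementNode:
    # Hypothetical storage: collector group identifier -> CA certificate.
    ca_store: Dict[str, str] = field(default_factory=dict)
    members: Dict[str, List[str]] = field(default_factory=dict)

    def add_to_group(self, group_id: str, proxy: CloudProxy) -> None:
        self.members.setdefault(group_id, []).append(proxy.name)
        if group_id not in self.ca_store:
            # First member: generate and store the group's CA certificate.
            self.ca_store[group_id] = "group-ca-" + secrets.token_hex(8)
        # Every member: replace the proxy's CA with the stored group CA,
        # then restart the service so that it picks up the new CA.
        proxy.ca_certificate = self.ca_store[group_id]
        proxy.restart_service()


node = ManagementNode()
first = CloudProxy("cp-1", ca_certificate="bootstrap-ca-1")
second = CloudProxy("cp-2", ca_certificate="bootstrap-ca-2")
node.add_to_group("group-206", first)    # generates the group CA
node.add_to_group("group-206", second)   # retrieves the stored group CA
```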
Examples described herein may resolve communication issues in two different use cases: high availability in failover mode, and high availability in load-balancing and failover mode. In failover mode, the remote collector may offer resistance against the failure of the services in a data-plane path. In load-balancing and failover mode, the remote collector may offer resistance against the failure of the services in the data plane. Further, a load-balancing component such as ‘HAProxy’ may load-balance the metric data. The ‘HAProxy’ may distribute the data among all available HTTPD/controller services on the remote collectors in the collector group. This mode may also facilitate horizontal scaling of the remote collector components.
At 302, a collector group that shares responsibility for a monitoring function to support high availability may be generated. In an example, the monitoring function that supports high availability for the collector group is a high availability failover operation, a load balancing operation, or a combination thereof.
At 304, a first certificate authority (CA) certificate may be generated for the collector group. At 306, a first request to add a first remote collector to the collector group may be received. In an example, the first remote collector may validate a request to accept metrics from a first endpoint based on a second CA certificate.
At 308, in response to receiving the first request, the processes in blocks 310, 312, 314, and 316 may be executed. At 310, the first remote collector may be added to the collector group. Further at 312, the first CA certificate associated with the collector group may be retrieved.
At 314, the second CA certificate of the first remote collector may be replaced with the retrieved first CA certificate. At 316, the collector group may be enabled to validate the request to accept the metrics from the first endpoint based on the first CA certificate. In an example, enabling the collector group to validate the request to accept the metrics from the first endpoint may include restarting a service running in the remote collector using the first CA certificate to enable the collector group to validate the request to accept metrics from the first endpoint.
In an example, a client certificate for the first endpoint may be generated using the first CA certificate of the collector group. The client certificate may be used by the first endpoint to post the metrics to the collector group.
Further, example method 300 may include receiving a second request to add a second remote collector to the collector group. The second remote collector may validate a request to accept metrics from a second endpoint based on a third CA certificate.
In response to receiving the second request, the second remote collector may be added to the collector group. Further, the first CA certificate associated with the collector group may be retrieved. Furthermore, the third CA certificate of the second remote collector may be replaced with the retrieved first CA certificate. Further, the collector group may be enabled to validate a request to accept the metrics from the second endpoint based on the retrieved first CA certificate.
In an example, enabling the collector group to validate the request to accept the metrics from the first endpoint and the second endpoint may include enabling one of the first remote collector and the second remote collector of the collector group that acts as a master to validate the request to accept the metrics from the first endpoint and the second endpoint based on the first CA certificate.
In the examples described herein, during the process of a collector group creation, the CA certificate for each collector group may be generated and the generated CA certificate is distributed to the remote collectors in the collector group. Further, a service (e.g., a httpd-south service) at the remote collector may be restarted using the common CA certificate. Also, the certificates may be maintained at the collector group so that they can be retrieved for future purposes.
Further at the endpoint, the application monitoring agent and the supporting agent may use the certificate for communicating with the remote collector. With this implementation, when the collector group is used for application monitoring in case of remote collector failover, the endpoint using the same certificate may be able to connect and post the metrics to the next available remote collector in the collector group.
Computer-readable storage medium 404 may store instructions 406, 408, 410, and 412. Instructions 406 may be executed by processor 402 to generate a collector group that supports a high availability operation. The collector group may include a first remote collector that receives metrics from a first endpoint and a second remote collector that receives metrics from a second endpoint. In an example, the high availability operation may include a load balancing operation in which the metrics from the first endpoint and the second endpoint are distributed among the first remote collector and the second remote collector for load balancing. In another example, the high availability operation may include a failover operation in which, during a failover of the master remote collector, the standby remote collector may be enabled to validate the request to accept the metrics from the first endpoint and the second endpoint using the CA certificate.
Instructions 408 may be executed by processor 402 to generate a certificate authority (CA) certificate for the collector group. Instructions 410 may be executed by processor 402 to assign the CA certificate to the first remote collector and the second remote collector.
Instructions 412 may be executed by processor 402 to enable the collector group to validate a request to accept metrics from the first endpoint and the second endpoint based on the CA certificate. In an example, instructions 412 to enable the collector group to validate the request may include instructions to restart a first service running in the first remote collector using the CA certificate to enable the collector group to validate the request to accept the metrics from the first endpoint and restart a second service running in the second remote collector using the CA certificate to enable the collector group to validate the request to accept the metrics from the second endpoint.
In an example, instructions 412 to enable the collector group to validate the request may include instructions to enable a master remote collector of the collector group to validate the request to accept the metrics from the first endpoint and the second endpoint based on the CA certificate. For example, one of the first remote collector and the second remote collector may act as the master remote collector while a remaining one of the first remote collector and the second remote collector is to act as a standby remote collector.
In an example, instructions 412 to enable the collector group to validate the request comprise instructions to generate a first client certificate and a second client certificate for the first endpoint and the second endpoint, respectively, using the CA certificate of the collector group. Further, the first client certificate and the second client certificate may be assigned to the first endpoint and the second endpoint, respectively. In response to receiving a request to accept the metrics from the first endpoint, the first endpoint may be validated based on the first client certificate and the CA certificate. Further in response to receiving a request to accept the metrics from the second endpoint, the second endpoint may be validated based on the second client certificate and the CA certificate. Furthermore, the collector group may be enabled to accept the metrics from the first endpoint and the second endpoint based on validating the first endpoint and the second endpoint.
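The validation flow described above, in which client certificates issued from the group's CA certificate are accepted by whichever remote collector currently acts as the master, may be sketched as follows. This is a toy model under stated assumptions: an HMAC stands in for a real certificate signature, and the `CollectorGroup` class with its `issue_client_certificate`, `fail_over`, and `post_metrics` methods is hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List, NamedTuple, Tuple


class ClientCertificate(NamedTuple):
    endpoint_id: str
    signature: str


class CollectorGroup:
    def __init__(self, collector_names: List[str]) -> None:
        self.ca_key = secrets.token_bytes(32)   # one CA shared by the group
        self.collectors = list(collector_names)
        self.master = self.collectors[0]        # remaining ones are standby
        self.received: Dict[str, List[Tuple[str, str]]] = {
            name: [] for name in self.collectors
        }

    def _sign(self, endpoint_id: str) -> str:
        return hmac.new(self.ca_key, endpoint_id.encode(), hashlib.sha256).hexdigest()

    def issue_client_certificate(self, endpoint_id: str) -> ClientCertificate:
        return ClientCertificate(endpoint_id, self._sign(endpoint_id))

    def fail_over(self) -> None:
        """Promote the first standby collector to master."""
        standbys = [name for name in self.collectors if name != self.master]
        if standbys:
            self.master = standbys[0]

    def post_metrics(self, certificate: ClientCertificate, metric: str) -> bool:
        """Validate against the group CA; on success the master stores the metric."""
        expected = self._sign(certificate.endpoint_id)
        if not hmac.compare_digest(certificate.signature, expected):
            return False
        self.received[self.master].append((certificate.endpoint_id, metric))
        return True


group = CollectorGroup(["collector-1", "collector-2"])
cert_1 = group.issue_client_certificate("endpoint-1")
cert_2 = group.issue_client_certificate("endpoint-2")

accepted_before = group.post_metrics(cert_1, "cpu=0.42")
group.fail_over()                                  # collector-2 becomes master
accepted_after = group.post_metrics(cert_2, "mem=0.73")
forged = group.post_metrics(ClientCertificate("endpoint-3", "0" * 64), "cpu=0.99")
```

Because both collectors validate against the same group CA in this model, the endpoints keep their client certificates unchanged across the failover.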
The above-described examples are for the purpose of illustration. Although the above examples have been described in conjunction with example implementations thereof, numerous modifications may be possible without materially departing from the teachings of the subject matter described herein. Other substitutions, modifications, and changes may be made without departing from the spirit of the subject matter. Also, the features disclosed in this specification (including any accompanying claims, abstract, and drawings), and any method or process so disclosed, may be combined in any combination, except combinations where some of such features are mutually exclusive.
The terms “include,” “have,” and variations thereof, as used herein, have the same meaning as the term “comprise” or appropriate variation thereof. Furthermore, the term “based on”, as used herein, means “based at least in part on.” Thus, a feature that is described as based on some stimulus can be based on the stimulus or a combination of stimuli including the stimulus. In addition, the terms “first” and “second” are used to identify individual elements and are not meant to designate an order or number of those elements.
The present description has been shown and described with reference to the foregoing examples. It is understood, however, that other forms, details, and examples can be made without departing from the spirit and scope of the present subject matter that is defined in the following claims.
Number | Date | Country | Kind |
---|---|---|---|
202341045741 | Jul 2023 | IN | national |