Computing systems are currently in wide use. Some such computing systems are cloud-based computing systems or computing systems deployed in other remote server environments. Such computing systems may host applications or services that are accessed by a wide variety of different users. Some global cloud applications are composed of thousands of different components that each generate large volumes of network traffic.
In order to perform continuous traffic optimization control, a control system attempts to identify the contributors to the network traffic. However, identification of contributors to network traffic can be problematic. Some current systems attempt to use on-server traffic monitor systems, and other current systems attempt to use on-router traffic sampling systems.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
A network traffic computing system obtains on-router traffic data, on-server traffic data and application log data. A data processing system extracts features from the data sources, splits the data based upon destination and source ports and performs component-level aggregation of the features. The aggregated data is used in monitoring and traffic control.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
As discussed above, some computing systems use many components that each generate large volumes of network traffic. In order to perform network traffic optimization, the contributors to the network traffic are to be identified.
There are currently monitors that monitor network traffic for incidents and performance regressions. These types of monitors are based on availability or latency metrics, which are generally insensitive to some types of traffic issues. For instance, because global cloud applications are often composed of many components developed and maintained by engineers from various teams, and because such components send a very large volume of traffic across data centers worldwide, any small defect or bug in even a single component may lead to large increases in internal traffic (traffic between components of the application). In addition, due to the large number of components in such applications, the bandwidth that is shared among those components can easily be consumed by low-priority traffic. These types of traffic issues can result in customers suffering long latency or even connection loss. Many current traffic monitors and performance analysis monitors are insensitive to these types of traffic issues.
Thus, unnecessary traffic may still be caused by hidden issues such as code bugs or configuration errors. Over time, these hidden issues may become extremely difficult to trace and may simply be accepted as necessary bandwidth requirements.
To accomplish network traffic optimization, traffic measurements should be known at the component level. A component is, as one example, a set of service functionality that is maintained as a single unit, such as by a single engineering team. On-router monitor systems cannot identify components in the application layer of a computing system. Other monitor systems are on-server network analysis tools. These tools also cannot identify component-level traffic. Instead, the on-server types of tools can only observe the process that is sending the traffic, but multiple components can share a single process. Thus, the on-server monitoring tools cannot distinguish between traffic emitted from different components of a single service.
Further, to perform quick traffic control, the measurement data should be managed to maintain low latency in querying the results. For instance, in order to draw an effective conclusion with respect to the measurement results, the results are often queried over relatively large time intervals. However, global cloud applications are constantly generating vast amounts of traffic data. For instance, some on-router monitors may measure more than 10 terabytes of traffic per day. On-server monitors and application logs may generate data on the order of petabytes per day. Running queries on these types of data sources may introduce an unacceptable latency. Further, generation of measurement data in a production environment can consume large amounts of computing system resources, so that the global application may not meet customer expectations.
The present description thus proceeds with respect to data generated by on-server monitors, on-router monitors, and application logs to distinguish between traffic contributed by different components and to obtain component-level measurement results. The data size is reduced by performing feature extraction, data splitting, and data aggregation, so that the results are relatively small (such as on the order of gigabytes per day). In addition, in order to reduce resource consumption in generating the measurement data, the data generation may be restricted to obtaining data from the top K ports, in terms of traffic volume.
In one example deployment, the present system was deployed on a global application architecture and generated a negligible impact on production servers (less than a 1% increase in CPU and disk I/O operations). Further, the data processing cost utilized less than 0.01% of the processing cores utilized by the application. For user queries in which the traffic measurement data generated over a 60-day period was queried, the response was returned within 30 seconds. This is just one example of the results of an actual deployment.
More specifically with respect to the architecture 100 shown in
Location C includes a set of user identification servers which can serve requests for user and mailbox metadata. The servers use a plurality of different processes 144, each of which may have a plurality of different components 146 and application logs 148. A set of data stores 150 can also be deployed at location C, along with an on-server monitor system 152, and other user identification server functionality 154.
Location D includes both a set of frontend servers 156 and a set of backend servers 158, as well as one or more data stores 160. Frontend servers 156 can include a set of processes 162, which may have multiple components 164 and application logs 166. Frontend servers 156 may also have on-server monitor system 168 and other functionality 170. Backend servers 158 may include a set of processes 172, each of which may have a plurality of components 174, as well as application logs 176. Backend servers 158 may include on-server monitor system 178 and other items 180. In the example shown in
Component-based network traffic computing system 198 can obtain network traffic data from a variety of different data sources (such as the on-server monitor systems 140, 152, 168, 178, and 192, the on-router flow monitor system 108, application logs 136, 148, 166, 176, and 188, as well as other data sources) and generate a result data store of traffic measurement results. Those results, or representations of those results, may be provided as an output 200 to other computer systems. Component-based network traffic computing system 198 is described in greater detail below with respect to
In order to provide an example of how traffic is generated in the computing system architecture 100 shown in
In one example, a component is a functionally independent unit of functionality in a service that is deployed as a process, or as part of a process, and is owned (maintained) by a single engineering team. One example of a component includes REST. A component performs a function and may receive an input and/or produce an output.
The on-router flow monitor system 108 samples packets with a certain probability and aggregates them into flows. In the present example, a flow is a sequence of packets with the same Internet protocol 5-tuple, which includes: source/destination IP addresses, source/destination ports, and protocol. Each flow is loaded by the on-router flow monitor system 108 into a centralized data store, which may be located in system 108 or elsewhere.
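As a minimal sketch of this sampling-and-aggregation step (the record fields, names, and sampling rate below are illustrative assumptions, not details of system 108):

```python
import random
from collections import defaultdict

SAMPLING_RATE = 1 / 4096  # assumed sampling probability; the actual rate is not stated


def aggregate_sampled_packets(packets):
    """Sample packets and aggregate the samples into flows keyed by the 5-tuple."""
    flows = defaultdict(lambda: {"packet_number": 0, "packet_bytes": 0})
    for pkt in packets:
        if random.random() >= SAMPLING_RATE:
            continue  # this packet was not sampled
        # The 5-tuple: source/destination IP addresses, source/destination ports, protocol.
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flows[key]["packet_number"] += 1
        flows[key]["packet_bytes"] += pkt["size"]
    return flows  # each flow record would then be loaded into the centralized data store
```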
The on-server monitor systems 140, 152, 168, 178, and 192 monitor the traffic usage of all processes on a particular machine. The results are also uploaded to the centralized data store. Thus, the on-router flow monitor system 108 collects traffic data on routers 104, while the on-server monitor systems collect system network events and provide traffic statistics for processes on servers.
The application logs 136, 148, 166, 176, and 188 are generated within the services and may be shared by the different engineering teams that debug the corresponding components. For each request that is logged, the application logs store a record that includes the timestamp of the request, the particular component that serves the request, the local server name, the remote server name, the latency, the request and response sizes, and the remote port that is used for the request, among other things. The measurement capabilities of each of these three sources of information (the on-router flow monitor system 108, the on-server monitor systems, and the application logs) are summarized in Table 1 below.
A checkmark in Table 1 indicates that the corresponding monitor system or application log collects the information, while an X indicates that the system or application log does not collect the information. In Table 1, the timestamp, an IP address pair (source and destination), and a port pair (source/destination) identify a unique flow. A differentiated services code point (DSCP) tag is used by a bandwidth broker for traffic quality of service classification. Packets with different DSCP tags are classified into different priority traffic tiers. The process entry identifies the processes that are sending and receiving the traffic. The on-router flow monitor system 108 obtains the IP addresses, port identifiers, DSCP tag, and traffic size, but cannot obtain the process and component information, which are available only on the servers. The on-server monitor systems 140, 152, 168, 178, and 192 obtain the timestamp, IP addresses, and port identifiers, as well as the process identifier and traffic size, but cannot obtain the DSCP tag. While the on-server monitor systems identify the processes, they cannot identify exact components when many components share the same process. The application logs 136, 148, 166, 176, and 188 obtain all of the information except the DSCP tags and the exact traffic size. The application logs 136, 148, 166, 176, and 188 can be used to obtain the request and response sizes of services, but not the sizes of the request headers.
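Based on the capabilities just described, the content of Table 1 can be reconstructed approximately as follows (✓ = collected, ✗ = not collected; the exact layout of the original table may differ):

Field | On-router flow monitor 108 | On-server monitor systems | Application logs
---|---|---|---
Timestamp | ✓ | ✓ | ✓
IP address pair | ✓ | ✓ | ✓ (as server names)
Port pair | ✓ | ✓ | ✓ (remote port)
DSCP tag | ✓ | ✗ | ✗
Process | ✗ | ✓ | ✓
Component | ✗ | ✗ | ✓
Traffic size | ✓ | ✓ | ✗ (request/response sizes only)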
System 198 is shown in
The data from data sources 224 may be uploaded by data upload system 203 in component-based network traffic computing system 198 intermittently, such as on an hourly basis, or otherwise. The data may be uploaded into a distributed data storage and processing system, to a local data store, or in other ways. The data is illustratively converted into a uniform format, such as that shown in Table 1 above. Also, the different types of data may be uploaded at different intervals. For instance, since the management data 232 is relatively static, relative to the other data in data sources 224, it may be that management data 232 is only updated daily, or at a different interval. Component-based network traffic computing system 198 processes the sources of data in data sources 224 independently and stores aggregation and other processing results in a result data store 236. Result data store 236 illustratively stores an identifier of the top K ports 238 (the K ports having the most traffic), a set of source port tables 240, a set of destination port tables 242, process tables 244, component tables 246, local traffic tables 248, and validation tables 250, and there may be other tables or information 252 as well. The schema corresponding to some of the tables in result data store 236 is shown below with respect to Table 2.
It can be seen in Table 2 that the source and destination port tables 240 and 242, respectively, are obtained from the on-router traffic monitor data 226. The schema for those tables includes Timestamp, ServerRole, RateRegion, Port, DSCP tag, and TrafficSize. The process tables 244 are obtained from the on-server traffic monitor data 228 and include Timestamp, ServerRole, RateRegion, Port, Process, and TrafficSize. The component tables 246 are obtained from the application log data 230 and include Timestamp, ServerRole, RateRegion, Port, Process, Component, and TrafficSize. The local traffic tables 248 and validation tables 250 are discussed in greater detail below and are used with respect to data validation, which increases the likelihood that data integrity is being maintained.
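Based on that description, the schemas in Table 2 are approximately the following (a reconstruction; the original layout may differ):

Table | Source data | Schema
---|---|---
Source port tables 240 / destination port tables 242 | On-router traffic monitor data 226 | Timestamp, ServerRole, RateRegion, Port, DSCP, TrafficSize
Process tables 244 | On-server traffic monitor data 228 | Timestamp, ServerRole, RateRegion, Port, Process, TrafficSize
Component tables 246 | Application log data 230 | Timestamp, ServerRole, RateRegion, Port, Process, Component, TrafficSize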
Result data store 236 is output to consumer systems 260, which consume the information in result data store 236. The consumer systems 260 can include monitor(s) 262, control system(s) 263, a web user interface system 264, and any of a wide variety of other consumer systems 266.
Data upload system 203 then loads the data from the data sources 224 so that the data is accessible by data validation system 204 and data processing system 206, as indicated by block 290 in the flow diagram of
Data processing system 206 then performs data processing on the data in order to reduce the data volume and generate result tables in the result data store 236. Performing data processing is indicated by block 292 in the flow diagram of
Data validation system 204 also performs data validation, as indicated by block 302. Because of the complexity of computing system architecture 100, there is a relatively high possibility that data loss can occur. Any control performed on incorrect data may lead to unintended consequences. Therefore, data validation system 204 performs data validation. The data validation system also identifies the top K machine pairs in terms of traffic volume, as indicated by block 304, and can perform other operations 306 as well. Data validation is described in greater detail below with respect to
The top K ports 238 are identified using aggregated on-router measurement data. The top K ports are also returned to the on-server monitor systems so that the top K ports can be used as a filter, to only monitor data from the top K ports. Returning the top K ports to filter the on-server data monitors is indicated by block 308. Filtering in this way reduces the amount of computing system resources that are required in order to generate data sources 224.
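A sketch of this feedback loop, with assumed record layouts and function names, might look like the following:

```python
def top_k_ports(port_rows, k):
    """Rank ports by aggregated on-router traffic volume and keep the top K."""
    totals = {}
    for row in port_rows:  # rows such as {"port": 443, "traffic_size": 1_000_000}
        totals[row["port"]] = totals.get(row["port"], 0) + row["traffic_size"]
    return set(sorted(totals, key=totals.get, reverse=True)[:k])


def filter_on_server_events(events, top_ports):
    """On-server monitors keep only events whose port is in the top-K set,
    reducing the resources consumed in generating the measurement data."""
    return [event for event in events if event["port"] in top_ports]
```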
The result tables in result data store 236 are then provided to consumer systems 260, where they can be exposed for analysis and control, as indicated by block 310. In one example, the consumer systems 260 include a web UI system 264 which exposes a web user interface 312. The web user interface 312 exposes the information in result data store 236 to users, such as engineers. In another example, control system 263 can perform traffic optimization 314 based upon the information in result data store 236. The data can be used to perform traffic discovery, in order to identify the component-level contributions to the network traffic, as indicated by block 316. The data can be used to identify anomalous traffic bursts 318 and to validate network features, network configurations, and other controllable items on the network, as indicated by block 320. The data can be exposed for analysis and control in other ways as well, as indicated by block 322.
In one example, the web user interface 312 is a dashboard that provides engineers, other users, or automated control systems, a way to analyze the traffic. In one example, a user or an automated system can provide an input requesting component-level traffic analysis or metrics, as indicated by block 324. The request may specify a particular component C for which results are requested, as indicated by block 326. The request may be received through the web UI 264 in other ways as well, as indicated by block 328.
Web user interface system 264 can then generate an analysis for the identified component, as indicated by block 330. In doing so, web user interface system 264 can execute a calculation algorithm as described below in Table 3, and as indicated by block 332 in the flow diagram of
In Table 3, the algorithm receives an input identifying a component (component C), and the output is a value "TrafficSize" which identifies the traffic volume contributed by component C over a specified period. The steps taken to identify the traffic size depend on whether component C shares a process with other components. In lines 1 and 2 of the algorithm (and blocks 333 and 335 in the flow diagram), the logged request and response data is used to estimate the total request and response size of all components with remote port P. The ratio between component [C, P] and component [P], together with the port traffic, is used (as indicated by block 343 in
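Since Table 3 itself is not reproduced above, the following sketch captures the estimation logic as just described, using assumed table layouts and names:

```python
def estimate_traffic_size(component_c, port_p, process_table, component_table, port_table):
    """Estimate the traffic volume contributed by component C (sketch of Table 3).

    If C owns its process exclusively, the process table gives the size directly;
    otherwise the logged request/response sizes apportion the port-level traffic
    among the components that share the process.
    """
    process = component_c["process"]
    sharing = {row["component"] for row in component_table if row["process"] == process}
    if sharing == {component_c["component"]}:
        # C does not share its process: read the size from the process table.
        return sum(r["traffic_size"] for r in process_table if r["process"] == process)
    # C shares the process: the ratio of the [C, P] log size to the total [P]
    # log size is multiplied by the measured port traffic.
    log_c = sum(r["traffic_size"] for r in component_table
                if r["component"] == component_c["component"] and r["port"] == port_p)
    log_all = sum(r["traffic_size"] for r in component_table if r["port"] == port_p)
    port_traffic = sum(r["traffic_size"] for r in port_table if r["port"] == port_p)
    return port_traffic * log_c / log_all if log_all else 0.0
```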
The results of the analysis can then be output in a wide variety of different ways, as indicated by block 334 in the flow diagram of
Also, in one example, monitors 262 monitor the source port table and destination port table in result data store 236 for overall traffic usage of the application deployed in the computing system architecture 100. Executing these monitors is indicated by block 338 in the flow diagram of
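One simple form such a monitor could take (the window and factor below are assumed knobs, not values from the description) is a baseline comparison over the port tables:

```python
def is_traffic_burst(traffic_series, window=24, factor=3.0):
    """Flag an anomalous burst when the newest reading exceeds a multiple of
    the recent mean of a port's traffic series (hourly time slots assumed)."""
    if len(traffic_series) <= window:
        return False  # not enough history to form a baseline
    baseline = sum(traffic_series[-window - 1:-1]) / window
    return traffic_series[-1] > factor * baseline
```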
Data processing system 206 then obtains access to the other data sources 224, as indicated by block 360 in the flow diagram of
Feature extraction component 218 then translates the IP addresses to locations and server roles using the management data, as indicated by block 366. Feature extraction component 218 uses the location pairs (source-destination pairs) to identify the cost of traffic flow, as indicated by block 368. A longer distance between the source and the destination corresponds to a higher traffic cost. In one example, a cost feature (referred to herein as a RateRegion feature) replaces the location pair of a flow. The RateRegions may correspond to flows traveling over a geographical continent, across an ocean, locally within an organization, or over a different geographical distance. In one example, there are approximately ten RateRegions, and translating the location pairs into the ten RateRegions greatly reduces the data size, in some examples by over 99%. Other features can be extracted as well, as indicated by block 370 in the flow diagram of
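A sketch of this translation, assuming a simple management-data lookup and illustrative region names, is:

```python
def to_rate_region(src_ip, dst_ip, ip_to_location):
    """Replace a (source, destination) location pair with a coarse cost feature.

    ip_to_location stands in for the management data; the region names below
    are illustrative, not the actual RateRegion taxonomy.
    """
    src = ip_to_location[src_ip]  # e.g., {"metro": "X", "continent": "Y"}
    dst = ip_to_location[dst_ip]
    if src["metro"] == dst["metro"]:
        return "Local"
    if src["continent"] == dst["continent"]:
        return "IntraContinent"
    return "CrossOcean"  # longer distances correspond to higher traffic cost
```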
Data splitting component 220 then performs data splitting based on the source port and the destination port in the source-destination port pairs. Splitting the data in this way also reduces the size of the data by turning a product relationship between sources and destinations into a sum relationship. Also, the data splitting surfaces highly used ports when the split data is ranked, because the traffic of low-usage ports converges to smaller volume values after aggregation.
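As an illustrative sketch of why splitting helps (the counts are assumed for illustration, not figures from the deployment): with $m$ distinct source ports and $n$ distinct destination ports in a time slot, joint source-destination records can require up to $m \times n$ rows, while the two split tables together require only $m + n$ rows:

$$ m \times n \;\longrightarrow\; m + n, \qquad \text{e.g., } 10^4 \times 10^4 = 10^8 \;\longrightarrow\; 2 \times 10^4. $$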
Data aggregation component 222 then performs data aggregation, as indicated by block 374. In one example, the data is aggregated based on source port and destination port, separately, as shown in
It has been observed that a relatively small number of ports dominates the total traffic usage in the network. Therefore, data aggregation component 222 also aggregates the low-volume ports in the source port table and the destination port table. For each time slot, for example, all ports that contributed less than 1% of the total traffic to a particular record in the table can be aggregated and marked with a tag, to reduce overall data size (or log cost). The particular threshold (e.g., 1%) can be changed in order to change the overall data size (or log cost). Aggregating the low-traffic ports is indicated by block 382 in the flow diagram of
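A sketch of this low-volume aggregation for one time slot (the "OTHER" tag and record layout are assumed) is:

```python
def aggregate_low_volume_ports(rows, threshold=0.01):
    """Merge ports contributing less than the threshold share of a time slot's
    traffic into a single tagged record, reducing overall data size."""
    total = sum(r["traffic_size"] for r in rows)
    kept, merged = [], 0
    for r in rows:
        if total and r["traffic_size"] / total < threshold:
            merged += r["traffic_size"]
        else:
            kept.append(r)
    if merged:
        kept.append({"port": "OTHER", "traffic_size": merged})  # aggregated tag
    return kept
```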
After the data in the data sources 224 is processed by data processing system 206, data processing system 206 stores the data to result data store 236, as indicated by block 386 in the flow diagram of
Given a pair of machines that continuously send high levels of traffic to one another, an effective estimate recovered from the on-router data should be close to the on-server data. The present description does not use application logs to validate the traffic size, because application logs typically capture the content sizes of the requests and responses without capturing the headers.
The on-router data recovery component 212 performs recovery of the on-router data using Equation 2 below.
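A plausible form of the recovery, consistent with the quantities discussed next (a sketch under the assumption that the sampling rate is the per-packet sampling probability, not necessarily the exact Equation 2), is:

$$\widehat{\mathrm{TrafficSize}} = \sum_{\mathrm{flows}} \frac{(\mathrm{PacketSize} + \mathrm{EthernetHeaderLength}) \times \mathrm{PacketNumber}}{\mathrm{SamplingRate}}$$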
In Equation 2, the packet size, packet number, and sampling rate are available in the on-router flow monitor data 226. The Ethernet header length is added to the packet size in order to obtain the frame size for each packet.
After selecting a machine pair, data recovery component 212 uses Equation 2 to recover the on-router data for the pair, as indicated by block 402 in
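A sketch of the resulting comparison for one machine pair (the tolerance is an assumed parameter) is:

```python
def validate_machine_pair(recovered_router_bytes, on_server_bytes, tolerance=0.10):
    """Compare recovered on-router traffic with the on-server measurement for a
    machine pair; a large relative gap suggests data loss somewhere upstream."""
    if on_server_bytes == 0:
        return recovered_router_bytes == 0
    gap = abs(recovered_router_bytes - on_server_bytes) / on_server_bytes
    return gap <= tolerance
```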
It can thus be seen that the present description describes a system which performs feature extraction and data splitting on large data sources to reduce the size of the data. The data is aggregated to obtain component-level traffic measurement values which can be output to consumer systems for monitoring, control, etc. Data validation is performed by recovering on-router data from sampled on-router data and comparing it with on-server data for different sets of machines to ensure that data has not been lost.
It will be noted that the above discussion has described a variety of different systems, components and/or logic. It will be appreciated that such systems, components and/or logic can be comprised of hardware items (such as processors and associated memory, or other processing components, some of which are described below) that perform the functions associated with those systems, components and/or logic. In addition, the systems, components and/or logic can be comprised of software that is loaded into a memory and is subsequently executed by a processor or server, or other computing component, as described below. The systems, components and/or logic can also be comprised of different combinations of hardware, software, firmware, etc., some examples of which are described below. These are only some examples of different structures that can be used to form the systems, components and/or logic described above. Other structures can be used as well.
The present discussion has mentioned processors and servers. In one example, the processors and servers include computer processors with associated memory and timing circuitry, not separately shown. The processors and servers are functional parts of the systems or devices to which they belong and are activated by, and facilitate the functionality of the other components or items in those systems.
Also, a number of user interface (UI) displays have been discussed. The UI displays can take a wide variety of different forms and can have a wide variety of different user actuatable input mechanisms disposed thereon. For instance, the user actuatable input mechanisms can be text boxes, check boxes, icons, links, drop-down menus, search boxes, etc. The mechanisms can also be actuated in a wide variety of different ways. For instance, the mechanisms can be actuated using a point and click device (such as a track ball or mouse). The mechanisms can be actuated using hardware buttons, switches, a joystick or keyboard, thumb switches or thumb pads, etc. The mechanisms can also be actuated using a virtual keyboard or other virtual actuators. In addition, where the screen on which the mechanisms are displayed is a touch sensitive screen, the mechanisms can be actuated using touch gestures. Also, where the device that displays them has speech recognition components, the mechanisms can be actuated using speech commands.
A number of data stores have also been discussed. It will be noted that the data stores can each be broken into multiple data stores. All can be local to the systems accessing them, all can be remote, or some can be local while others are remote. All of these configurations are contemplated herein.
Also, the figures show a number of blocks with functionality ascribed to each block. It will be noted that fewer blocks can be used so the functionality is performed by fewer components. Also, more blocks can be used with the functionality distributed among more components.
The description is intended to include both public cloud computing and private cloud computing. Cloud computing (both public and private) provides substantially seamless pooling of resources, as well as a reduced need to manage and configure underlying hardware infrastructure.
A public cloud is managed by a vendor and typically supports multiple consumers using the same infrastructure. Also, a public cloud, as opposed to a private cloud, can free up the end users from managing the hardware. A private cloud may be managed by the organization itself and the infrastructure is typically not shared with other organizations. The organization still maintains the hardware to some extent, such as installations and repairs, etc.
It is also contemplated that some elements of computing system architecture 100 can be disposed in cloud 502 while others are not. By way of example, a data store can be disposed outside of cloud 502 and accessed through cloud 502. In another example, other items can be outside of cloud 502. Regardless of where the items are located, the items can be accessed directly by device 504 through a network (either a wide area network or a local area network), the items can be hosted at a remote site by a service, or the items can be provided as a service through a cloud or accessed by a connection service that resides in the cloud. All of these architectures are contemplated herein.
It will also be noted that architecture 100, or portions of it, can be disposed on a wide variety of different devices. Some of those devices include servers, desktop computers, laptop computers, tablet computers, or other mobile devices, such as palm top computers, cell phones, smart phones, multimedia players, personal digital assistants, etc.
Computer 810 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 810 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media is different from, and does not include, a modulated data signal or carrier wave. It includes hardware storage media including both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 810. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 830 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 831 and random access memory (RAM) 832. A basic input/output system 833 (BIOS), containing the basic routines that help to transfer information between elements within computer 810, such as during start-up, is typically stored in ROM 831. RAM 832 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 820. By way of example, and not limitation,
The computer 810 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 810 through input devices such as a keyboard 862, a microphone 863, and a pointing device 861, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 820 through a user input interface 860 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A visual display 891 or other type of display device is also connected to the system bus 821 via an interface, such as a video interface 890. In addition to the monitor, computers may also include other peripheral output devices such as speakers 897 and printer 896, which may be connected through an output peripheral interface 895.
The computer 810 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 880. The remote computer 880 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 810. The logical connections depicted in
When used in a LAN networking environment, the computer 810 is connected to the LAN 871 through a network interface or adapter 870. When used in a WAN networking environment, the computer 810 typically includes a modem 872 or other means for establishing communications over the WAN 873, such as the Internet. The modem 872, which may be internal or external, may be connected to the system bus 821 via the user input interface 860, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 810, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
It should also be noted that the different examples described herein can be combined in different ways. That is, parts of one or more examples can be combined with parts of one or more other examples. All of this is contemplated herein.
Example 1 is a computer system, comprising:
Example 2 is the computer system of any or all previous examples wherein generating a control signal comprises:
Example 3 is the computer system of any or all previous examples wherein generating a control signal comprises:
Example 4 is the computer system of any or all previous examples wherein generating a control signal comprises:
Example 5 is the computer system of any or all previous examples wherein the network traffic includes a plurality of packets, each packet being sent from a source port, of a plurality of source ports, to a destination port, of a plurality of destination ports, in a network.
Example 6 is the computer system of any or all previous examples and further comprising:
Example 7 is the computer system of any or all previous examples wherein aggregating comprises:
Example 8 is the computer system of any or all previous examples and further comprising:
Example 9 is the computer system of any or all previous examples and further comprising:
Example 10 is the computer system of any or all previous examples and further comprising:
Example 11 is the computer system of any or all previous examples wherein the on-router flow monitor data comprises flow monitor data generated from network traffic data samples and wherein performing data validation comprises:
Example 12 is the computer system of any or all previous examples wherein recovering additional router data comprises:
Example 13 is a computer system, comprising:
Example 14 is the computer system of any or all previous examples and further comprising:
Example 15 is a computer implemented method, comprising:
Example 16 is the computer implemented method of any or all previous examples wherein generating a control signal comprises:
Example 17 is the computer implemented method of any or all previous examples wherein the network traffic includes a plurality of packets, each packet being sent from a source port, of a plurality of source ports, to a destination port, of a plurality of destination ports, in a network.
Example 18 is the computer implemented method of any or all previous examples and further comprising:
Example 19 is the computer implemented method of any or all previous examples wherein aggregating comprises:
Example 20 is the computer implemented method of any or all previous examples and further comprising:
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CN2022/102932 | 6/30/2022 | WO |