The disclosure relates to a computer-implemented method for analysing one or more network attributes, a computer-implemented method for training a machine learning model to analyse one or more network attributes, and a system configured to operate in accordance with those methods.
Many existing techniques for network analysis are limited in that they require extensive manual input in order to be performed. For instance, when an event (e.g. a failure) occurs in a network, the event will typically be recorded and an alarm or notification may be sent to a user of the network (e.g. a network operator). In general, the user of the network responds by creating a report on the event which has occurred in the network, and the report may then be stored (e.g. in a database). At some time later, the report is typically picked up by another user of the network, such as a network engineer or a software designer. The other user evaluates the event and takes further action if necessary. For example, if the event is associated with a problem occurring in the network, the further action taken by the other user of the network will typically be an attempt to fix the problem that caused the event.
The existing techniques for network analysis thus rely heavily on manual input, and this can cause certain issues. For example, it can result in some procedures being repeated unnecessarily, which can in turn lead to users spending excessive time and effort performing network analysis, and it can also produce unreliable and/or inaccurate results.
Another issue associated with the existing techniques for network analysis is that the procedures that need to be carried out manually may be complicated. As a result, the existing techniques for network analysis can be inefficient. For example, it can be difficult to manually resolve a network event, e.g. due to the nature of the network event, and thus a network user (e.g. a network engineer and/or software designer) may spend hours or even days manually completing the network analysis necessary to understand (and potentially fix) the network event.
It is an object of the disclosure to obviate or eliminate at least some of the above-described disadvantages associated with existing techniques.
Therefore, according to an aspect of the disclosure, there is provided a first computer-implemented method for analysing one or more network attributes of a network. The method comprises analysing one or more first network attributes of the network using a first machine learning model to generate a first output. The first output comprises information about an event in the network. The method also comprises, if an estimated confidence level for the first output is less than a confidence level threshold, analysing the one or more first network attributes using a second machine learning model to generate a second output. The second output is indicative of one or more second network attributes of the network to analyse using the first machine learning model.
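By way of non-limiting illustration only, the two-model flow of the first method may be sketched as follows. The stub model classes, attribute names, and the particular threshold value below are hypothetical examples introduced purely for illustration and do not limit the disclosure.

```python
# Illustrative sketch only: the two-model analysis flow of the first
# method, with hypothetical stub classes standing in for the first
# (classification) and second (recommendation) machine learning models.

CONFIDENCE_LEVEL_THRESHOLD = 0.85  # configurable; a hypothetical value

class FirstModel:
    """Hypothetical first machine learning model."""
    def classify(self, first_attributes):
        # Returns (information about the event, estimated confidence).
        return ("service-level failure", 0.60)

class SecondModel:
    """Hypothetical second machine learning model."""
    def recommend(self, first_attributes):
        # Returns second network attributes to analyse next.
        return ["pm_counter_y", "trace_file_z"]

def analyse(first_model, second_model, first_attributes):
    first_output, confidence = first_model.classify(first_attributes)
    if confidence >= CONFIDENCE_LEVEL_THRESHOLD:
        return {"event_info": first_output}
    # Confidence below the threshold: obtain a recommendation of
    # further network attributes to analyse with the first model.
    return {"recommended_attributes":
            second_model.recommend(first_attributes)}

result = analyse(FirstModel(), SecondModel(), ["cpu_load", "log_line"])
```

In this sketch, the estimated confidence (0.60) is below the threshold (0.85), so the second model is invoked and `result` carries the recommended second network attributes rather than the event information.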
According to another aspect of the disclosure, there is provided a second computer-implemented method for training a machine learning model to analyse one or more network attributes of a network. The second method comprises training a first machine learning model to analyse one or more first network attributes of the network to generate a first output. The first output comprises information about an event in the network. The second method also comprises training a second machine learning model to analyse the one or more first network attributes to generate a second output if an estimated confidence level for the first output is less than a confidence level threshold. The second output is indicative of one or more second network attributes of the network to analyse using the first machine learning model.
According to another aspect of the disclosure, there is provided a first entity configured to operate in accordance with the first method described earlier. In some embodiments, the first entity may comprise processing circuitry configured to operate in accordance with the first method described earlier. In some embodiments, the first entity may comprise at least one memory for storing instructions which, when executed by the processing circuitry, cause the first entity to operate in accordance with the first method described earlier.
According to another aspect of the disclosure, there is provided a second entity configured to operate in accordance with the second method described earlier. In some embodiments, the second entity may comprise processing circuitry configured to operate in accordance with the second method described earlier. In some embodiments, the second entity may comprise at least one memory for storing instructions which, when executed by the processing circuitry, cause the second entity to operate in accordance with the second method described earlier.
According to another aspect of the disclosure, there is provided a system comprising any one or more of the first entity described earlier and the second entity described earlier.
According to another aspect of the disclosure, there is provided a computer program comprising instructions which, when executed by processing circuitry, cause the processing circuitry to perform the first method described earlier and/or the second method described earlier.
According to another aspect of the disclosure, there is provided a computer program product, embodied on a non-transitory machine-readable medium, comprising instructions which are executable by processing circuitry to cause the processing circuitry to perform the first method described earlier and/or the second method described earlier.
Therefore, there is provided an advantageous technique for analysing network attributes of a network. There is also provided an advantageous technique for training a machine learning model to analyse the network attributes of the network. The manner in which the machine learning model is trained and used enables a more accurate analysis of network attributes. This allows more reliable information about an event in the network to be provided. The technique can be applied to reduce or even eliminate the requirement for certain network analysis procedures to be carried out manually, which can further improve the accuracy of the analysis and also reduces the burden on users of the network. The technique is also more efficient than the existing, largely manual techniques.
For a better understanding of the techniques, and to show how they may be put into effect, reference will now be made, by way of example, to the accompanying drawings, in which:
Some of the embodiments contemplated herein will now be described more fully with reference to the accompanying drawings. Other embodiments, however, are contained within the scope of the subject-matter disclosed herein; the disclosed subject-matter should not be construed as limited to only the embodiments set forth herein. Rather, these embodiments are provided by way of example to convey the scope of the subject-matter to those skilled in the art.
As mentioned earlier, there is described herein an advantageous technique for analysing one or more network attributes of a network. This technique can be performed by a first entity. There is also described herein an advantageous technique for training a machine learning model to analyse one or more network attributes of a network. This technique can be performed by a second entity. The first entity and the second entity described herein may communicate with each other, e.g. over a communication channel, to implement the techniques described herein. In some embodiments, the first entity and the second entity may communicate over the cloud. The techniques described herein can be implemented in the cloud according to some embodiments. The techniques described herein are computer-implemented. The techniques described herein involve the use of artificial intelligence or machine learning (AI/ML).
The network referred to herein can be any type of network. For example, the network referred to herein can be a telecommunications network. In some embodiments, the network referred to herein can be a mobile network, such as a fourth generation (4G) mobile network, a fifth generation (5G) mobile network, a sixth generation (6G) mobile network, or any other generation mobile network. In some embodiments, the network referred to herein can be a radio access network (RAN). In some embodiments, the network referred to herein can be a local network, such as a local area network (LAN). In some embodiments, the network referred to herein may be a content delivery network (CDN). In some embodiments, the network referred to herein may be a software defined network (SDN). In some embodiments, the network referred to herein can be a fog computing environment or an edge computing environment. In some embodiments, the network referred to herein can be a virtual network or an at least partially virtual network.
The resource and functional level 216 comprises a physical infrastructure. The physical infrastructure comprises, for example, a wireless and fixed access network (e.g. including one or more transport networks), a mobile edge cloud (MEC) comprising at least one data center, a core network, etc. The core network can, for example, be a core (or central) cloud. The core network can communicate with the MEC (or, more specifically, at least one data center of the MEC) via a wide area network, such as a wide area transport network.
As illustrated in
As illustrated in
The 5G network can comprise one or more devices 306, 318, such as one or more wireless devices, e.g. one or more user equipments (UEs) 318 and/or one or more utility or smart city devices 306. The 5G network can also comprise a RAN 316, an edge network 320, an MEC node 322, a core (or metro) network 324, an edge node 326, an Internet connection 328, and some premises (e.g. tenant premises) 330. The RAN 316 can comprise one or more small cell Wi-Fi nodes. The small cells can be cloud enabled. Thus, a small cell can be referred to as a cloud enabled small cell (CESC). From an E2E point of view, the 5G network can comprise one or more data centers, such as one or more data centers at a CESC, one or more data centers at the MEC node 322, and/or one or more data centers at an edge node 326. At the CESC, a light version of a data center may be provided, since it is located next to the base station. At the MEC node and/or the edge node, a main data center may be deployed.
The CESC manager 402 is responsible for managing the resources within the edge data center 408 through the VIM 416. The VIM 416 can control a network function virtualisation infrastructure (NFVI). The NFVI may include the computing, storage and networking resources of the edge data centres. The NFVI may create and control the CESC clusters. On top of the physical and virtual infrastructure, users 400, 404, 406, 418, 420, 422 may deploy their applications and/or provide services (e.g. to customers in the case of service providers).
In existing techniques, when the SW failure 500 occurs, the incident is recorded and an alarm or notification is sent to an operator. The operator then creates a trouble report (TR) 508 and stores the TR 508 in a database. Subsequently, a support engineer and/or a SW designer may retrieve the report, evaluate the issue and attempt to fix it. The techniques described herein can replace these manual processes to provide advantageous improvements.
Although some examples have been provided for the type of network referred to herein and the events that might occur in such a network, it will be understood that the network referred to herein can be any other type of network and the event referred to herein can be any other type of event.
In order to aid with understanding the techniques described herein and the associated technical advantages, an example of an existing technique will first be described.
The support flow illustrated in
The three levels of support will now be described in more detail with reference to the example where failed SW is responsible for the problem. The SW failure may happen as illustrated in
If the support engineer concludes that there is not enough information, as illustrated at block 608 of
As illustrated at block 614 of
On the other hand, as illustrated at block 616 of
In summary, in this three level support system for a network operations and management (O&M) model, there are two manual repeated procedures that are designed to obtain accurate information for a SW failure. Those two procedures may require a significant amount of time and effort. For example, if the problem is not obvious, then experienced engineers and/or designers may need to spend hours or even days to resolve the problem. It has proved to be challenging to shorten the time and effort spent collecting information relevant to a problem accurately and efficiently. However, the techniques described herein address the challenges associated with the existing techniques such as that illustrated in
As illustrated in
Briefly, the processing circuitry 12 of the first entity 10 is configured to analyse the one or more first network attributes of the network using a first machine learning model to generate a first output comprising information about an event in the network. The processing circuitry 12 of the first entity 10 is also configured to, if an estimated confidence level for the first output is less than a confidence level threshold, analyse the one or more first network attributes using a second machine learning model to generate a second output. The second output is indicative of one or more second network attributes of the network to analyse using the first machine learning model.
As illustrated in
The processing circuitry 12 of the first entity 10 can be connected to the memory 14 of the first entity 10. In some embodiments, the memory 14 of the first entity 10 may be for storing program code or instructions which, when executed by the processing circuitry 12 of the first entity 10, cause the first entity 10 to operate in the manner described herein in respect of the first entity 10. For example, in some embodiments, the memory 14 of the first entity 10 may be configured to store program code or instructions that can be executed by the processing circuitry 12 of the first entity 10 to cause the first entity 10 to operate in accordance with the method described herein in respect of the first entity 10.
Alternatively or in addition, the memory 14 of the first entity 10 can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein. The processing circuitry 12 of the first entity 10 may be configured to control the memory 14 of the first entity 10 to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
In some embodiments, as illustrated in
Although the first entity 10 is illustrated in
With reference to
The first machine learning model referred to herein may also be referred to as a classification model. The second machine learning model referred to herein may also be referred to as a recommendation model, such as a recommendation of extra network measurements model (RENMM) according to some embodiments. The first machine learning model referred to herein and/or the second machine learning model referred to herein can be any type of machine learning model. Examples of a machine learning model that can be used for the first machine learning model referred to herein and/or the second machine learning model referred to herein include, but are not limited to, a neural network (e.g. a deep neural network), a decision tree, or any other type of machine learning model. In some embodiments, the first machine learning model referred to herein and/or the second machine learning model referred to herein can be a supervised or semi-supervised machine learning model.
Although not illustrated in
Although also not illustrated in
In some embodiments, the first method may comprise, if the estimated confidence level for the first output is equal to or greater than the confidence level threshold, generating a report on the first output. Alternatively or in addition, in some embodiments, the first method may comprise, if the estimated confidence level for the first output is less than the confidence level threshold, generating a report on the second output. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to generate the report on the first output and/or the report on the second output according to some embodiments.
In some embodiments, if a report on the first output is generated, the first method may comprise storing the report on the first output, initiating transmission of a notification indicating that the report on the first output has been generated, and/or initiating transmission of the report on the first output. Alternatively or in addition, in some embodiments, if a report on the second output is generated, the first method may comprise storing the report on the second output, initiating transmission of a notification indicating that the report on the second output has been generated, and/or initiating transmission of the report on the second output. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to store the report on the first output and/or the report on the second output (e.g. in a memory 14 of the first entity 10) according to some embodiments. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to initiate transmission of (e.g. itself transmit or cause another entity to transmit) the notification indicating that the report on the first output has been generated and/or the notification indicating that the report on the second output has been generated according to some embodiments. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to initiate transmission of (e.g. itself transmit or cause another entity to transmit) the report on the first output and/or the report on the second output according to some embodiments.
In some embodiments, the report on the first output may comprise information indicative of the one or more first network attributes. In some embodiments, the one or more first network attributes and the first output may be analysed using the second machine learning model to generate the second output. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to perform this analysis according to some embodiments.
Herein, the information about the event in the network may be any type of information about the event. In some embodiments, for example, the information about the event in the network may comprise any one or more of information indicative of a time of the event in the network, information indicative of a level within the network at which the event occurs, and information indicative of a cause of the event in the network. In embodiments where the information about the event in the network comprises information indicative of a level within the network at which the event occurs, the first machine learning model referred to herein may also be referred to as a network event level classification model (NELCM) or a network failure level classification model (NFLCM) where the event is a failure in the network. The first machine learning model can be trained to classify at which level within the network the event occurs. The levels within the network can be referred to herein as categories of the first machine learning model. The first machine learning model can be trained to classify the event according to one or more other categories in addition to, or alternatively to, the level within the network the event occurs.
Herein, in some embodiments, the level within the network can be any one or more of a level (namely, a node level) at which one or more network nodes are deployed in the network, a level (namely, a platform level) at which one or more computing platforms are deployed in the network, a level (namely, a virtualization or containerization level) at which virtualization or containerization occurs in the network, and a level (namely, a service level) at which one or more services are hosted or executed in the network. In some embodiments, the one or more services referred to herein may comprise one or more applications and/or one or more functions. Thus, in some embodiments, a four level classification can be provided. However, it will be understood that any other number of classification levels can be used. For example, in some embodiments, additional classification levels may be defined depending on the evolution of network platform architecture.
In some embodiments, the information about the event in the network referred to herein may comprise, for one or more levels within the network, a percentage value indicative of a likelihood that the event in the network occurs at that level within the network. In some of these embodiments, the highest percentage value can be the information indicative of the level within the network at which the event occurs. In some embodiments, the first output generated by analysing the one or more first network attributes using the first machine learning model may be a probability distribution over a plurality of levels (e.g. any two or more of the node level, platform level, virtualization/containerization level, and service level) within the network. In some embodiments, the probability distribution can be expressed in percentage values, e.g. between 1% and 100%. Herein, a level may also be referred to as a class and, similarly, the plurality of levels can be referred to as a plurality of classes.
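By way of non-limiting illustration, a first output of this kind may be represented as follows. The level names follow the four-level example above, while the percentage values shown are hypothetical.

```python
# Illustrative sketch only: a first output expressed as a probability
# distribution (in percent) over the four example levels, with the
# highest value taken as the predicted level.

def predict_level(distribution):
    """Return the most likely level and its percentage value."""
    level = max(distribution, key=distribution.get)
    return level, distribution[level]

first_output = {"node": 5.0, "platform": 10.0,
                "virtualization": 15.0, "service": 70.0}
level, confidence = predict_level(first_output)
# With a confidence level threshold of e.g. 80%, the 70% value here
# would be below the threshold and would trigger the second model.
```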
In some embodiments, the event in the network referred to herein may be a failure in the network, such as a software and/or hardware failure in the network.
In some embodiments, the confidence level threshold referred to herein can be configurable. In some embodiments, the confidence level threshold may be a percentage value. In some embodiments, the confidence level threshold may be at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, or at least 95%. In some embodiments, for example, the confidence level threshold may be set to 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95%. The confidence level referred to herein can be defined as a level (e.g. a value, such as a percentage value) indicative of a confidence with which the first output is accurate or indicative of a probability that the first output is accurate. The confidence level threshold referred to herein can be defined as a threshold that a confidence level is to meet or exceed to be accepted as accurate. When a confidence level is below the confidence level threshold, it can be indicative that one or more second network attributes may be needed to improve the confidence level, e.g. to improve the probability for a certain class.
Although not illustrated in
In some embodiments, all of the one or more second network attributes referred to herein may be different from the one or more first network attributes referred to herein. Alternatively, in other embodiments, the one or more second network attributes referred to herein may comprise at least one of the one or more first network attributes referred to herein and at least one other network attribute of the network. The one or more first network attributes referred to herein can comprise a single first network attribute or multiple first network attributes. Similarly, the one or more second network attributes referred to herein can comprise a single second network attribute or multiple second network attributes.
Although not illustrated in
In some embodiments, the one or more first network attributes may be acquired from any one or more of a log file, a trace file, any other file, a performance management (PM) counter, any other counter, and a memory (or database). Alternatively or in addition, in some embodiments, the one or more second network attributes may be acquired from any one or more of a log file, a trace file, any other file, a PM counter, any other counter, and a memory (or database). In embodiments where the one or more first network attributes are acquired from a file, each line in the file may comprise one or more first network attributes. Similarly, in embodiments where the one or more second network attributes are acquired from a file, each line in the file may comprise one or more second network attributes.
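By way of non-limiting illustration, acquiring network attributes from a file in which each line comprises one or more attributes may be sketched as follows. The line format, attribute name, and value shown are hypothetical examples.

```python
# Illustrative sketch only: acquiring a network attribute from a log
# file in which each line carries a time stamp and one attribute.
# The '<timestamp> <name>=<value>' line format is a hypothetical example.

def parse_log_line(line):
    """Parse one line of the form '<timestamp> <name>=<value>'."""
    timestamp, pair = line.strip().split(" ", 1)
    name, value = pair.split("=", 1)
    return {"timestamp": timestamp, "name": name, "value": float(value)}

attribute = parse_log_line("2024-01-01T00:00:00 cpu_load=0.93")
```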
In some embodiments, the one or more first network attributes referred to herein may comprise one or more local network attributes and/or one or more global network attributes. Alternatively or in addition, in some embodiments, the one or more second network attributes referred to herein may comprise one or more local network attributes and/or one or more global network attributes. Herein, a local network attribute can be an attribute of (e.g. information about) a particular node of the network, such as the node from which this attribute is acquired. A local network attribute may only be available to the node from which it is acquired. That is, it may not be available (e.g. accessible) to other nodes in the network. Herein, a global network attribute can comprise an attribute of (e.g. information about) the network, such as one or more nodes in the network. A global network attribute may be available (e.g. accessible) to each of the one or more network nodes in the network, e.g. all network nodes in the network.
Herein, a network attribute can refer to any type of attribute (e.g. parameter, feature, or characteristic) of the network. A network attribute can provide information about the network, e.g. about the overall network or a node of the network. In some embodiments, the one or more first network attributes can comprise one or more network measurements, e.g. one or more network performance measurements. Alternatively or in addition, the one or more second network attributes can comprise one or more network measurements, e.g. one or more network performance measurements. In some embodiments, the one or more first network attributes and/or the one or more second network attributes may be in the form of a digital representation. In some embodiments, the one or more first network attributes and/or the one or more second network attributes can comprise one or more values from a counter (e.g. a PM counter) and/or text from a file (e.g. a log and/or trace file).
In some embodiments, the one or more first network attributes may comprise at least two first network attributes and the at least two first network attributes may be in a time series. Alternatively or in addition, in some embodiments, the one or more second network attributes may comprise at least two second network attributes and the at least two second network attributes may be in a time series. In some embodiments, the one or more first network attributes may comprise at least two first network attributes and each of the at least two first network attributes may have the same time stamp. Alternatively or in addition, in some embodiments, the one or more second network attributes may comprise at least two second network attributes and each of the at least two second network attributes may have the same time stamp.
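By way of non-limiting illustration, attributes sharing the same time stamp may be grouped as follows, so that attributes measured together can be analysed together. The sample tuples are hypothetical.

```python
# Illustrative sketch only: grouping attribute samples by their shared
# time stamp; the time stamps, names, and values are hypothetical.

from collections import defaultdict

def group_by_timestamp(samples):
    """samples: iterable of (timestamp, name, value) tuples."""
    grouped = defaultdict(dict)
    for timestamp, name, value in samples:
        grouped[timestamp][name] = value
    return dict(grouped)

samples = [("t0", "cpu_load", 0.9), ("t0", "mem_use", 0.4),
           ("t1", "cpu_load", 0.2)]
grouped = group_by_timestamp(samples)
```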
In some embodiments, the one or more first network attributes may comprise a plurality of first network attributes organised in vector form and/or the one or more second network attributes may comprise a plurality of second network attributes organised in vector form. For example, in some embodiments, the one or more first network attributes may be encoded as a first vector and/or the one or more second network attributes may be encoded as a second vector. In embodiments where each line in a file comprises one or more first network attributes, each line in the file may be encoded as the vector or the one or more first network attributes that the line comprises may be encoded individually. Similarly, in embodiments where each line in a file comprises one or more second network attributes, each line in the file may be encoded as the vector or the one or more second network attributes that the line comprises may be encoded individually.
In some embodiments, the one or more first network attributes may comprise at least two first network attributes that are in different formats. In some of these embodiments, the first method may comprise converting the at least two first network attributes into the same format. In some embodiments, the one or more second network attributes may comprise at least two second network attributes that are in different formats. In some of these embodiments, the first method may comprise converting the at least two second network attributes into the same format. In some embodiments, at least one of the one or more first network attributes may not be in a machine-readable format. In some of these embodiments, the first method may comprise converting the at least one of the one or more first network attributes into a machine-readable format. In some embodiments, at least one of the one or more second network attributes may not be in a machine-readable format. In some of these embodiments, the first method may comprise converting the at least one of the one or more second network attributes into a machine-readable format. In some embodiments, the machine-readable format referred to herein may be a series of numerical digits. The first entity 10 (e.g. the processing circuitry 12 of the first entity 10) can be configured to perform any one or more of the conversions described herein according to some embodiments.
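By way of non-limiting illustration, the conversion of mixed-format attributes into a machine-readable numeric vector may be sketched as follows. The text-to-number vocabulary and the attribute values are hypothetical.

```python
# Illustrative sketch only: converting mixed-format network attributes
# into a machine-readable numeric vector; the vocabulary mapping
# textual tokens to numbers is a hypothetical example.

VOCABULARY = {"OK": 0.0, "WARN": 1.0, "FAIL": 2.0}

def encode_attributes(attributes):
    """Encode numbers and known text tokens as one numeric vector."""
    vector = []
    for value in attributes:
        if isinstance(value, str):
            vector.append(VOCABULARY[value])  # textual attribute
        else:
            vector.append(float(value))       # e.g. a PM counter value
    return vector

vector = encode_attributes([0.93, "FAIL", 12])
```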
In some embodiments, analysing the one or more first network attributes using the second machine learning model to generate the second output may comprise using the second machine learning model to compare the one or more first network attributes to one or more second outputs previously generated using the second machine learning model and generate the second output based on a result of the comparison. Each previously generated second output may be indicative of one or more second network attributes of the network previously analysed using the first machine learning model.
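By way of non-limiting illustration, the comparison described above may be sketched as a nearest-case lookup: the current first attributes are compared against the first attributes of past cases, and the second output of the most similar past case is reused. The history entries and the squared-distance measure are hypothetical choices for illustration.

```python
# Illustrative sketch only: a second model generating its second output
# by comparing the current first attribute vector to vectors from past
# cases and reusing the recommendation of the nearest case.

def recommend(first_vector, history):
    """history: list of (past_first_vector, past_second_output) pairs.
    Return the second output of the most similar past case."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, best_output = min(history,
                         key=lambda case: distance(first_vector, case[0]))
    return best_output

history = [
    ([0.1, 0.2], ["trace_file_x"]),
    ([0.9, 0.8], ["pm_counter_y", "log_z"]),
]
recommendation = recommend([0.85, 0.75], history)
```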
As illustrated in
Briefly, the processing circuitry 22 of the second entity 20 is configured to train a first machine learning model to analyse one or more first network attributes of the network to generate a first output. The first output comprises information about an event in the network. The processing circuitry 22 of the second entity 20 is also configured to train a second machine learning model to analyse the one or more first network attributes to generate a second output if an estimated confidence level for the first output is less than a confidence level threshold. The second output is indicative of one or more second network attributes of the network to analyse using the first machine learning model.
As illustrated in
The processing circuitry 22 of the second entity 20 can be connected to the memory 24 of the second entity 20. In some embodiments, the memory 24 of the second entity 20 may be for storing program code or instructions which, when executed by the processing circuitry 22 of the second entity 20, cause the second entity 20 to operate in the manner described herein in respect of the second entity 20. For example, in some embodiments, the memory 24 of the second entity 20 may be configured to store program code or instructions that can be executed by the processing circuitry 22 of the second entity 20 to cause the second entity 20 to operate in accordance with the method described herein in respect of the second entity 20. Alternatively or in addition, the memory 24 of the second entity 20 can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein. The processing circuitry 22 of the second entity 20 may be configured to control the memory 24 of the second entity 20 to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
In some embodiments, as illustrated in
Although the second entity 20 is illustrated in
With reference to
In some embodiments, the first machine learning model may be trained using a first training dataset. The first training dataset may comprise information indicative of a past occurrence of the event in the network and one or more first network attributes of the network corresponding to the past occurrence of the event. In some embodiments, the information indicative of a past occurrence of the event in the network may be acquired from a memory (or database), e.g. comprising a repository of trouble reports and/or trouble report tickets. In some embodiments, the one or more first network attributes may be acquired from any one or more of a log file, a trace file, any other file, a PM counter, any other counter, and a memory (or database). The information indicative of the past occurrence of the event in the network and the one or more first network attributes of the network corresponding to the past occurrence of the event may be labelled.
In some embodiments, the second machine learning model may be trained using a second training dataset. The second training dataset may comprise the one or more first network attributes of the network corresponding to the past occurrence of the event and one or more second network attributes of the network corresponding to the past occurrence of the event. In some embodiments, the one or more first network attributes and/or the one or more second network attributes may be acquired from any one or more of a log file, a trace file, any other file, a PM counter, any other counter, and a memory (or database). The one or more first network attributes and/or the one or more second network attributes may be labelled.
The labelling can be used to train the first and second machine learning models, such as using a supervised or semi-supervised machine learning process, e.g. an auto-encoding process or a bidirectional encoder representations from transformers (BERT) process.
In some embodiments, the dataset used to train any of the machine learning models referred to herein may be a time-series dataset. The dataset can itself comprise different types of datasets according to some embodiments. For example, a dataset may comprise one or more log files, one or more trace files, one or more PM counters, and/or any other type of dataset. For datasets (e.g. log and trace files) comprising text, text feature processing may be implemented, such as in the manner described later with reference to
With regard to labelling a dataset, in some embodiments, a label can be assigned to each interval in time in order for the machine learning model to learn the label from the one or more respective network attributes during the same time interval. In some embodiments, the labels may include categories such as keywords (e.g. “node failure”), alarms, post-mortem dumps, or any other information included in the dataset that can aid in understanding the event (e.g. failure/error) in the network. This approach may include the use of domain expertise to categorise events into the selected categories. An example is as follows:
Alternatively or in addition, the criteria to label a dataset can be based on information retrieved about a corresponding report (e.g. a TR) stored in a memory (e.g. database), such as the reason for closing the report. The report may comprise information on how the same event was handled in the past. The closed report can then itself be used as a labelled dataset to train the machine learning model. Thus, a training dataset and the labelling of the training dataset can be based on previous experience (i.e. experience gained in the past), e.g. with or without human intervention. For reports having corresponding closing codes, the labels may be fixed as the closing codes. As more than 50% of reports have data center graphics processor usage manager (DCGM) logs attached to them, a time duration of these logs can be acquired from them. Once the time duration is acquired, the corresponding PM counters and report closing code can be acquired.
In some embodiments, a semi-supervised clustering approach may be used for labelling, whereby a plurality of network attributes are grouped into a plurality of clusters. A combination of expert input and report closure codes can be used to label a subset of time stamps belonging to different clusters using the network attributes.
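The semi-supervised labelling described above can be sketched as follows. This is a minimal illustration, assuming numeric network attributes per time stamp and a small set of seed labels supplied by expert input or report closure codes; the cluster count, attribute values, and label names are illustrative assumptions, and a naive k-means stands in for whatever clustering process is actually used.

```python
# Semi-supervised cluster labelling sketch: group per-time-stamp network
# attributes into clusters, then let a few labelled seed time stamps
# propagate their labels to every member of their cluster.

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans(points, k, iters=20):
    """Minimal k-means: returns centroids and a cluster index per point."""
    # Spread the initial centroids across the dataset (assumes k >= 2).
    step = max(1, (len(points) - 1) // max(1, k - 1))
    centroids = [list(points[min(i * step, len(points) - 1)]) for i in range(k)]
    assign = []
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for ci in range(k):
            members = [p for p, a in zip(points, assign) if a == ci]
            if members:
                centroids[ci] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def propagate_labels(points, k, seed_labels):
    """seed_labels: {point_index: label} from expert input / report closure
    codes. Every time stamp in a cluster inherits the seed label found in
    that cluster (None if the cluster has no labelled seed)."""
    _, assign = kmeans(points, k)
    cluster_label = {assign[idx]: label for idx, label in seed_labels.items()}
    return [cluster_label.get(a) for a in assign]

# Two PM-counter attributes per time stamp; two labelled seed time stamps.
attrs = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)]
labels = propagate_labels(attrs, 2, {0: "normal", 2: "node failure"})
```

In this way, only a subset of time stamps needs manual labelling, and the clustering extends those labels to the remaining time stamps in the training dataset.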
When applying a trained machine learning model, the same process used for training may be followed for cleaning the input dataset. However, the input dataset in the case of applying the trained machine learning model need not be labelled.
In some embodiments, the information indicative of the past occurrence of the event referred to herein can have a time stamp indicative of a time of the past occurrence of the event and/or each first network attribute of the one or more first network attributes can have a time stamp indicative of a time at which the first network attribute was recorded. In some of these embodiments, for each first network attribute of the one or more first network attributes, the time at which the first network attribute was recorded may be a time that falls within a predefined time interval that precedes the time of the past occurrence of the event.
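The time-stamp windowing described above can be sketched as follows, assuming each first network attribute is recorded with a timestamp; the interval length and attribute names are illustrative assumptions.

```python
# Select the first network attributes whose record times fall within a
# predefined time interval that precedes the time of the past occurrence
# of the event.
from datetime import datetime, timedelta

def attributes_before_event(attributes, event_time, window):
    """Keep attributes recorded within `window` before `event_time`.

    attributes: list of (timestamp, name, value) tuples.
    """
    start = event_time - window
    return [a for a in attributes if start <= a[0] < event_time]

event = datetime(2024, 1, 1, 12, 0)
attrs = [
    (datetime(2024, 1, 1, 11, 50), "pm_counter_a", 17),  # inside the window
    (datetime(2024, 1, 1, 10, 0), "pm_counter_a", 12),   # precedes the window
    (datetime(2024, 1, 1, 12, 5), "pm_counter_a", 40),   # after the event
]
selected = attributes_before_event(attrs, event, timedelta(minutes=30))
```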
Although not illustrated in
Thus, in the manner described herein, a first machine learning model is built and trained. The first machine learning model can be trained using one or more datasets (or data sources), such as one or more logs, one or more traces, one or more PM counters, and/or any other dataset(s). The trained first machine learning model can learn a mechanism for analysing one or more network attributes of the network (e.g. correlated from different datasets) to identify information about an event in the network.
According to some embodiments, the information about the event can be a level in the network at which the event occurs. In some of these embodiments, it may be determined whether a probability for classifying the event to one or more of those defined levels is high enough based on the given criterion of the confidence level threshold mentioned earlier. The criterion can be configurable and/or set by another machine learning model, such as based on the accumulated dataset from the network. If the probability for classifying the event to one or more of those defined levels is not high enough, the configuration settings for the (e.g. logging and/or tracing) mechanism used to capture information from the network may be updated to capture additional and/or alternative information from the network in the future. This feedback procedure may be repeated (e.g. automatically, such as with minimum human intervention or without human intervention) until the given criterion is satisfied.
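The feedback procedure described above can be sketched as follows. The `classify` and `collect` callables below are hypothetical stand-ins for the first machine learning model and the (e.g. logging and/or tracing) capture mechanism; the level names, parameter names, and probabilities are illustrative assumptions.

```python
# Repeat capture/classify until the top-class probability meets the
# confidence criterion, raising the capture detail each round.

def feedback_loop(classify, collect, config, threshold, max_rounds=5):
    for _ in range(max_rounds):
        probs = classify(collect(config))          # probability per level
        level = max(probs, key=probs.get)
        if probs[level] >= threshold:
            return level, config                    # criterion satisfied
        # Update the capture configuration to collect more detail next round.
        config = {**config, "log_level": "DEBUG", "trace": True}
    return None, config                             # criterion never satisfied

# Stand-in behaviour: confidence improves once detailed capture is enabled.
def collect(config):
    return {"detailed": config.get("log_level") == "DEBUG"}

def classify(data):
    top = 0.9 if data["detailed"] else 0.4
    rest = (1.0 - top) / 3
    return {"service": top, "node": rest, "platform": rest, "virtualization": rest}

level, cfg = feedback_loop(classify, collect, {"log_level": "INFO"}, threshold=0.8)
```

The loop terminates either when the classification probability satisfies the criterion or after a bounded number of reconfiguration rounds.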
The techniques described herein can be applied to any one or more of the manual processes of the existing techniques described earlier. In some embodiments, an event that comprises a service failure may be treated differently from an event comprising any other (e.g. node, platform, or virtualization/containerization) failure. The technique described herein can use information from different locations in the network instead of a single data center. This can be particularly beneficial where service instances appear and disappear in certain locations, such as due to the service demands from end users.
There is also provided a system comprising the first entity 10 described herein and the second entity 20 described herein. A computer-implemented method performed by the system comprises the method described herein in respect of the first entity 10 and the method described herein in respect of the second entity 20.
In the example illustrated in
As illustrated in
A processing unit (or processing circuitry) can be deployed for each DC to be managed. In the example illustrated in
In the example illustrated in
In the example illustrated in
As described earlier, one or more first network attributes of the network are analysed using a first machine learning model to generate a first output comprising information about an event in the network and, if an estimated confidence level for the first output is less than a confidence level threshold, the one or more first network attributes are analysed using a second machine learning model to generate a second output. This second output is indicative of one or more second network attributes of the network to analyse using the first machine learning model.
As also described earlier, the information about an event in the network may be any type of information about the event in the network but, in some embodiments, can comprise information indicative of a level within the network at which the event occurs. In this respect, a four level classification is used in the example illustrated in
As described earlier, the one or more first network attributes that are analysed may comprise one or more local network attributes and/or one or more global network attributes. For a service level event, it can be beneficial for one or more local attributes (e.g. from logs, traces, etc.) and one or more global (e.g. meta) attributes (e.g. location, etc.) to be taken into account in the analysis. For a node level event, platform level event, and virtualization or containerization level event, it can be beneficial for only one or more local attributes (e.g. from logs, traces, etc.) to be taken into account in the analysis. This may be particularly beneficial when the event is considered to only be a local event. For more sophisticated network platform deployment (e.g. a platform that manages heterogeneous resources deployed across different geographic locations), other attributes may be taken into account in the analysis, such as its location and/or any other attributes.
The one or more first attributes can be acquired from one or more (e.g. a plurality of different) locations, such as from any one or more of the different DCs 700, 718, 734, 748. The one or more first attributes can be consolidated in the CU at the fourth DC 748. Thus, the CU at the fourth DC 748 may collect all first attributes from different DUs 710, 728, 738 deployed in the first, second, and third DCs 700, 718, 734 located across different geographic regions. The CU at the fourth DC 748 can, for example, retrieve the one or more first attributes from access logs, system logs, and/or traces, which may be for a specific service from all of the involved DCs 700, 718, 734 (e.g. based on a deployment footprint, which may be changed dynamically according to the demands from end users/devices, such as the UEs 746).
The CU at the fourth DC 748 can be configured to apply the first trained machine learning model described herein and, if required, the second machine learning model described herein, e.g. to identify which one or more features in an offered service led to the event in the network. As described earlier, the second machine learning model can be trained using a training dataset of past events to identify (e.g. a list of) one or more additional and/or alternative second network attributes needed to diagnose the incoming or current event. If the CU at the fourth DC 748 needs one or more additional and/or alternative second network attributes (e.g. more detailed log and/or trace information), the CU at the fourth DC 748 can be configured to update one or more (e.g. CM) parameters for this. This update may be according to a rule or policy given by a service provider.
The CU at the fourth DC 748 may send the one or more updated parameters to one or more of the DUs 710, 728, 738 of the DCs 700, 718, 734 to reconfigure the attribute acquisition (e.g. logging and tracing). For example, the attribute acquisition may be reconfigured to acquire more detailed service traffic in any one or more of the first, second, and third DCs 700, 718, 734. At any one or more of the first, second, and third DCs 700, 718, 734, the corresponding DU 710, 728, 738 may send the one or more updated parameters (e.g. for logging and tracing) to at least one DUA 730 that is deployed at node level to configure the corresponding components, e.g. the components that start to log or trace more detailed information for the event occurring in the network.
In
As illustrated at block 804 of
For the purpose of illustration, when the trained first machine learning model analyses the one or more first network attributes, it classifies the level within the network at which the event occurs as the service level. Thus, as illustrated at block 810 of
As illustrated at block 816 of
In more detail, if the estimated confidence level for the first output is less than the confidence level threshold, the method proceeds to block 814 of
As illustrated at block 812 of
As illustrated at block 824 of
As illustrated at blocks 818 and 830 of
As illustrated at block 830 of
As illustrated at block 822 of
As described earlier, it can be checked at block 816 of
If the first machine learning model accurately identifies that the event occurs at node level, platform level, or virtualization/containerization level, the DU 802 may generate the final report for the event, store the final report in a memory (or database) and then notify the CU 836. If the first machine learning model accurately identifies that the event occurs at service level, the DU 802 may include links to the datasets (e.g. log files, trace files, PM counters, and/or any other datasets) in the final report and then transmit the final report towards the CU 836.
In
As illustrated in
The one or more first network attributes may be acquired from the one or more acquired datasets 900, 902, 904, 906, 908. For example, the one or more acquired datasets 900, 902, 904, 906, 908 can be processed by the one or more DUs 914, 920, 922 to acquire one or more first network attributes of the network and the one or more first network attributes may be transmitted by the one or more DUs 914, 920, 922 towards the CU 926. Alternatively, the one or more DUs 914, 920, 922 may transmit towards the CU 926 the one or more acquired datasets 900, 902, 904, 906, 908 or a report comprising a link to the one or more acquired datasets. For instance, the CU 926 may receive reports from two DUs 920, 914 at two locations (Y and Z) for the service S4. By following the link in a report, the CU 926 can retrieve the associated datasets from storage.
The CU 926 may process the one or more datasets to acquire the one or more first network attributes of the network. At block 928 of
If the outcome of the trained machine learning models requires the additional and/or alternative information, at block 916 of
If the estimated confidence level for the first output is equal to or greater than the confidence level threshold (e.g. the information about the event in the network is provided by the first machine learning model with a high probability), at block 912 of
A network operator administrator 930 or a service provider administrator 932 may be responsible for defining one or more features related to a service to be offered, as illustrated at block 934 of
In
As described earlier, the first machine learning model can be trained to generate a first output comprising information about an event in the network, e.g. classifying a fault in the network. The first machine learning model can be trained to generate the first output based on one or more first network attributes of the network that are provided as input. The one or more first network attributes of the network can be acquired from one or more datasets, such as one or more log files, one or more trace files, one or more PM counters, and/or any other datasets.
As illustrated at blocks 1004 and 1012 of
As illustrated at blocks 1006 and 1014 of
As illustrated at block 1008 of
As illustrated at block 1016 of
One or more datasets are used as inputs into the encoder 1104. As mentioned earlier, the one or more datasets may comprise data (specifically, network attributes) acquired during a predefined time interval. In the case of multiple datasets, the multiple datasets may each comprise data acquired during the same predefined time interval. For example, the multiple datasets may comprise PM counter values accumulated during the predefined time interval as well as logs recorded during the same time interval. The one or more first network attributes of the network are acquired from the one or more datasets. The encoder 1104 can capture the latent space representation of the one or more first network attributes.
In more detail, as illustrated at blocks 1100 and 1102 of
The latent space representation of the one or more first network attributes, which is generated by the encoder 1104, can be used as the input into the first and/or second machine learning model referred to herein according to some embodiments. Thus, in some embodiments, the encoder 1104 can provide the input for the first and/or second machine learning model referred to herein. As described earlier, the first machine learning model is trained to generate a first output comprising information about an event (e.g. a fault) in a network, such as a type of the event or a status of the event. The status of the event may even be an indication that there is no such event. The first output can be generated for a corresponding timestamp.
In embodiments where decoding is performed, as illustrated at blocks 1114, 1116, 1118, and 1120 of
The second entity 20 (or, more specifically, the processing circuitry 22 of the second entity 20) described earlier with reference to
As illustrated in
As also illustrated in
As mentioned earlier, the data is specifically the first or second network attributes referred to herein. Thus, in some embodiments such as that illustrated in
Table 1 below illustrates an example of values from some PM counters (“PM counter A” and “PM counter B”), which are acquired at a plurality of predefined time intervals. In the example, each predefined time interval is a 15 minute time interval. The acquired PM counter values are in their raw format as a time series of values.
It can be beneficial to convert logs into a time series format. This can be achieved by aggregating log values to the same granularity as the PM counter values. Table 2 illustrates an example of some logs which are acquired at the same plurality of predefined (15 minute) time intervals as the PM counter values of Table 1.
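The aggregation described above can be sketched as follows, assuming each log entry carries a timestamp and a severity; the 15 minute interval matches the PM counter granularity of Table 1, while the fields and entry values are illustrative assumptions.

```python
# Aggregate action-driven log entries to the same 15 minute granularity as
# the PM counter series, producing a per-interval count of entries by
# severity that can be aligned with the PM counter values.
from collections import Counter
from datetime import datetime

def bucket_start(ts, minutes=15):
    """Floor a timestamp to the start of its interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

def aggregate_logs(entries, minutes=15):
    """entries: list of (timestamp, severity) tuples."""
    series = {}
    for ts, severity in entries:
        series.setdefault(bucket_start(ts, minutes), Counter())[severity] += 1
    return series

logs = [
    (datetime(2024, 1, 1, 12, 3), "ERROR"),
    (datetime(2024, 1, 1, 12, 14), "INFO"),
    (datetime(2024, 1, 1, 12, 16), "ERROR"),
]
series = aggregate_logs(logs)
```

After this step, each 15 minute interval carries both its PM counter values and its aggregated log features, so the two sources share one time axis.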
However, as can be seen from the above Tables 1 and 2, even though the PM counters and logs may be collected during the same time intervals, they can differ quite significantly in nature and format. For example, PM counters are generally time driven, whereas logs are generally action driven. PM counters are typically recorded continuously at certain time intervals, whereas logs are commonly recorded in response to a trigger (e.g. by a network event such as a fault/crash, or by a script like a trace). PM counters usually have a fixed granularity (e.g. values collected every 15 minutes), whereas the granularity of logs usually depends on the type of logging (e.g. logs are typically collected continuously for a system health-check). PM counters are numerical in nature (e.g. in a tabular format), whereas logs can comprise text data (e.g. with a highly domain-specific vocabulary). Also, PM counters may contain empty values (e.g. timestamps may be missing due to a system error). Logs may also contain empty values but, unlike PM counters, logs can include texts of different formats (and different logs can have different formats in general).
Rather than using a raw input from a dataset (e.g. logs, PM counters, and/or any other dataset) for the first and/or second machine learning model referred to herein, the raw input may first be converted into a machine-readable language. Due to the differences between the types of inputs in terms of their nature and/or format, different processing methods may be used to handle different inputs, such as the numerical inputs from PM counters and the textual inputs from logs.
As illustrated in
As illustrated in
The processing of the raw input 1402 may also comprise an embedding step 1406 that generates a representation (e.g. an embedding) 1408 of the cleaned input. For example, the embedding step 1406 can comprise passing the cleaned input through a bidirectional encoder representations from transformers (BERT) model or any other suitable model to generate the representation 1408 of the cleaned input. In some embodiments, a backbone pre-trained (e.g. BERT) model may be acquired at block 1400 of
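The cleaning and embedding steps can be sketched as follows, assuming plain-text log lines. A fixed-size hashed bag-of-words vector stands in here for the pre-trained (e.g. BERT) embedding model; the regular expressions, example log line, and vector size are illustrative assumptions.

```python
# Clean a raw log line (strip timestamps, hex identifiers, punctuation,
# and casing), then map the cleaned text to a fixed-size vector.
import re

def clean(line):
    """Strip timestamps, hex ids, and punctuation; lowercase the rest."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*", " ", line)
    line = re.sub(r"0x[0-9a-fA-F]+", " ", line)
    line = re.sub(r"[^a-zA-Z ]", " ", line)
    return " ".join(line.lower().split())

def embed(text, dim=32):
    """Hashed bag-of-words stand-in for a learned (e.g. BERT) embedding."""
    vec = [0.0] * dim
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    return vec

raw = "2024-01-01T12:03:07Z node-3 ERROR: pod crashed at 0xDEADBEEF!"
cleaned = clean(raw)        # "node error pod crashed at"
embedding = embed(cleaned)  # fixed-size numeric representation
```

In a deployment, the `embed` stand-in would be replaced by the pre-trained model so that semantically similar log lines map to nearby vectors.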
In some embodiments, where a raw input is processed prior to training the first and/or second machine learning model referred to herein, the processed numerical and/or textual data can then be fed as input into the encoder 1008, 1104 described earlier to learn the latent representation of the one or more network attributes. This phase of the operation can be completely autonomous with the aim of minimising errors between the input to the encoder 1008, 1104 and the output of the decoder 1112.
In the embodiment illustrated in
The reports provide a dataset, which can comprise data indicative of past experience. The query 1504 itself can also be indicative of past experience. In the example illustrated in
The query 1504 and/or the reports may be processed prior to inputting them into the second machine learning model 1500. For example, in some embodiments, the query 1504 and/or the reports may be digitised prior to input into the second machine learning model 1500, such as according to the method that will be described later with reference to
The second machine learning model 1500 is trained using the input query 1504 and reports. Thus, according to the example illustrated in
For example, as illustrated at block 1502 of
When the trained second machine learning model is used for inference, the information about an event in a network (e.g. an event type) can be taken to form a query. The information about the event can, for example, be one of {NODE, VIRTUALIZATION, CONTAINER, SERVICE}. The information about the event is determined by the categories in the first machine learning model referred to herein.
Although not illustrated, the method of how to build a first training dataset to train the first machine learning model referred to herein can be implemented in a similar manner to that described with reference to
The output of the second machine learning model can be a list of reports related to a query. From those related reports, the corresponding attached datasets (e.g. logs, traces, PM counters, and/or any other datasets) are retrieved. Then, the comparison between the datasets in the current query and those in the related reports can be made and the one or more second network attributes (e.g. alternative and/or additional information required) in terms of the datasets can be identified, e.g. as illustrated in
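The comparison step described above can be sketched as follows: given the reports that the second machine learning model relates to the current query, determine which attached datasets the current query is still missing. The report identifiers and dataset names are illustrative assumptions.

```python
# Identify the one or more second network attributes as the datasets that
# were attached to related past reports but are absent from the current
# query's datasets.

def missing_attributes(current_datasets, related_reports):
    """Union of datasets attached to the related reports, minus those the
    current query already has."""
    needed = set()
    for report in related_reports:
        needed |= set(report["datasets"])
    return sorted(needed - set(current_datasets))

related = [
    {"id": "TR-101", "datasets": ["log", "pm_counters"]},
    {"id": "TR-245", "datasets": ["log", "trace", "dcgm_log"]},
]
extra = missing_attributes(["log", "pm_counters"], related)
```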
As illustrated in
At block 1704 of
In some embodiments, the first and/or second machine learning models referred to herein can be trained off-line, i.e. as “off-line” models. A model training pipeline that can be used can comprise collecting datasets, training the model, and delivering the trained model (e.g. to one or more involved nodes in the network). The first and/or second trained machine learning models can be integrated into a DU.
The system illustrated in
At block 2108 of
The procedure from step 2124 to step 2134 in
In more detail, as illustrated by arrow 2118 of
As illustrated by arrow 2128 of
The procedure from step 2136 to step 2160 in
As illustrated by arrow 2140 of
As illustrated by arrow 2150 of
An example of the techniques described herein will now be described in relation to a practical implementation in which an event occurs in a network. For the purpose of this example, the event is a failure but it will be understood that other events are also possible. According to the example, the network comprises a local node and a central node. The local node can be located at the edge of the network, e.g. closer to the end users of the network. The central node can be located in a cloud data center. The local node is responsible for dealing with the failure at a node, virtualization, and container level in the network. There is information associated with the local node that is only accessible by that local node deployed in the network. This information is referred to as local information. The central node is responsible for dealing with the failure at a service level in the network. There is information associated with the network that is accessible by all nodes deployed in the network. This information is referred to as global information.
The process that is implemented at the local node level according to the example will now be described. The process involves analysing one or more attributes of traffic flow in the network using the first machine learning model referred to herein and optionally also the second machine learning model referred to herein. The one or more attributes of traffic flow in this example are referred to herein as one or more first attributes. At flow level, four failure-type classes are defined (e.g. node, virtualization, container and service). The first machine learning model is used to classify the failure that occurs in the network according to these failure-types, such as with a probability distribution. Thus, according to the example, the information about the failure in the network that forms the first output referred to herein is a classification of the failure.
It is identified whether an estimated confidence level for the classification meets the confidence level threshold referred to herein. If the estimated confidence level for the classification meets (i.e. is equal to or greater than) the confidence level threshold, a report is generated on the classification and the report is transmitted towards the central node. On the other hand, if the estimated confidence level for the classification does not meet (i.e. is less than) the confidence level threshold, the second machine learning model is used.
The second machine learning model is used to recommend extra and/or alternative information to be collected from the network to improve the estimated confidence level (e.g. classification accuracy) for the classification output by the first machine learning model. This recommended information in this example is referred to herein as one or more second attributes. The configuration is built in order to retrieve the recommended information from the network. The relevant node(s) deployed in the network are instructed to enable the network to produce the recommended information.
The training of the first machine learning model for the purpose of the example can involve a plurality of data processing steps. A first data processing step may comprise text feature retrieval for log and trace files. A second data processing step may comprise numerical time series feature retrieval for PM counters. A third data processing step may comprise synchronization of the log and trace files with the PM counters to provide the first training dataset for training the first machine learning model. A fourth data processing step may comprise using a semi-supervised machine learning process for labelling the first dataset (failure-type classes). The labelling can be based on the first dataset itself and/or one or more reports (e.g. TRs) from a repository, which relate to past experience. A fifth data processing step may comprise using a supervised machine learning (e.g. auto-encoder or BERT) process to train the first machine learning model.
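The third data processing step above (synchronising log/trace features with PM counters) can be sketched as follows, assuming both sources have already been reduced to per-interval feature dictionaries; the interval keys and feature names are illustrative assumptions.

```python
# Join per-interval text features from logs/traces with PM counter values
# on their shared time stamps to form one training row per interval.

def synchronise(pm_rows, text_rows):
    """pm_rows / text_rows: {interval_start: feature dict}. Returns one
    merged feature dict per interval present in both sources."""
    shared = sorted(set(pm_rows) & set(text_rows))
    return [{"t": t, **pm_rows[t], **text_rows[t]} for t in shared]

pm = {"12:00": {"pm_a": 17, "pm_b": 4}, "12:15": {"pm_a": 21, "pm_b": 6}}
txt = {"12:00": {"error_count": 2}, "12:30": {"error_count": 0}}
rows = synchronise(pm, txt)  # only the 12:00 interval appears in both
```

Each merged row, together with its failure-type label from the fourth step, then forms one example in the first training dataset.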
The training of the second machine learning model for the purpose of the example can also involve a plurality of data processing steps. A first data processing step may comprise building a query based on reports (e.g. TRs) from a repository, which relate to past experience. A second data processing step may comprise using the failure-type as a key feature for the query to create an input for the first machine learning model. A third data processing step may comprise using data augmentation to build a training and validation dataset. A fourth data processing step may comprise using tokenization to convert text into a digital representation. A fifth data processing step may comprise using a supervised machine learning (e.g. auto-encoder or BERT) process to train the second machine learning model.
The process that is implemented at the global node level according to the example will now be described. At the global node level, only service failure is considered. Normally each service has one or more features. In order to find the root cause of a service failure, the process at the global node level can involve identifying which one or more features lead to the service failure, whether the current information is enough to identify the cause of the service failure, and what (e.g. alternative and/or additional) information from the network is to be collected for further analysis of the root cause.
The process at the global node level involves analysing one or more attributes of traffic flow in the network using the first machine learning model referred to herein and optionally also the second machine learning model referred to herein. The one or more attributes of traffic flow are retrieved from (e.g. consolidated) reports, which are received from nodes deployed at different locations in the network. The reports can comprise log and trace information as well as PM counters related to the service. The one or more attributes of traffic flow in this example are referred to herein as one or more first attributes. The first machine learning model is applied to classify which feature causes the service failure. Thus, according to the example, the information about the failure in the network that forms the first output referred to herein is a classification of which feature causes the service failure.
It is identified whether an estimated confidence level for the classification meets the confidence level threshold referred to herein. If the estimated confidence level for the classification does not meet (i.e. is less than) the confidence level threshold, the second machine learning model is used. The second machine learning model is used to recommend extra and/or alternative information to be retrieved from the service to improve the estimated confidence level (e.g. classification accuracy) for the classification output by the first machine learning model. This recommended information in this example is referred to herein as one or more second attributes. The recommended information may be retrieved by setting up a different service log level (e.g. from INFO to DEBUG), setting up a trace for the identified feature, and locating the nodes that are currently providing the service. The configuration is sent to these located nodes. On the other hand, if the estimated confidence level for the classification meets (i.e. is equal to or greater than) the confidence level threshold, a final report is generated on the service failure with the identified features.
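The reconfiguration path described above can be sketched as follows: raise the service log level, enable a trace for the identified feature, and send the configuration to the nodes currently providing the service. The registry contents, service and feature names, and parameter names are illustrative assumptions.

```python
# Build the updated capture configuration for the recommended information
# and fan it out to the nodes located as currently providing the service.

def build_config(feature):
    return {"log_level": "DEBUG", "trace_feature": feature}  # INFO -> DEBUG

def reconfigure(service, feature, registry):
    """registry: {service: [node, ...]}. Returns per-node config sends."""
    config = build_config(feature)
    return [(node, config) for node in registry.get(service, [])]

registry = {"S4": ["DU-Y", "DU-Z"]}
sends = reconfigure("S4", "handover", registry)
```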
The training of the first machine learning model can involve inputting a dataset (e.g. log and trace files) from a test lab of the service provider or from real traffic. The dataset comprises traffic patterns within which the relation between service and feature is captured. The input dataset is labelled according to the different features. The dataset can comprise different log and trace files. The dataset can have a mixture of traffic patterns that are related to different implemented features. A supervised machine learning process is used to train the first machine learning model to understand the relationship between the service and feature. A similar process can be used to train the second machine learning model.
There is also provided a computer program comprising instructions which, when executed by processing circuitry (such as the processing circuitry 12 of the first entity 10 described herein and/or the processing circuitry 22 of the second entity 20 described herein), cause the processing circuitry to perform at least part of the method described herein. There is provided a computer program product, embodied on a non-transitory machine-readable medium, comprising instructions which are executable by processing circuitry (such as the processing circuitry 12 of the first entity 10 described herein and/or the processing circuitry 22 of the second entity 20 described herein) to cause the processing circuitry to perform at least part of the method described herein. There is provided a computer program product comprising a carrier containing instructions for causing processing circuitry (such as the processing circuitry 12 of the first entity 10 described herein and/or the processing circuitry 22 of the second entity 20 described herein) to perform at least part of the method described herein. In some embodiments, the carrier can be any one of an electronic signal, an optical signal, an electromagnetic signal, an electrical signal, a radio signal, a microwave signal, or a computer-readable storage medium.
In some embodiments, the first entity functionality and/or second entity functionality described herein can be performed by hardware. Thus, in some embodiments, the first entity 10 and/or second entity 20 described herein can be a hardware entity. However, it will also be understood that optionally at least part or all of the first entity functionality and/or second entity functionality described herein can be virtualized. For example, the functions performed by the first entity 10 and/or second entity 20 described herein can be implemented in software running on generic hardware that is configured to orchestrate the first entity functionality and/or second entity functionality described herein. Thus, in some embodiments, the first entity 10 and/or second entity 20 described herein can be a virtual entity. In some embodiments, at least part or all of the first entity functionality and/or second entity functionality described herein may be performed in a network-enabled cloud. Thus, the method described herein can be realised as a cloud implementation according to some embodiments. The first entity functionality and/or second entity functionality described herein may all be at the same location or at least some of the first entity functionality and/or second entity functionality may be distributed, e.g. the first entity functionality and/or second entity functionality may be performed by one or more different entities.
It will be understood that at least some or all of the method steps described herein can be automated in some embodiments. That is, in some embodiments, at least some or all of the method steps described herein can be performed automatically. The method described herein can be a computer-implemented method.
Therefore, as described herein, there is provided an advantageous technique for analysing one or more network attributes of a network and an advantageous technique for training a machine learning model to analyse one or more network attributes of a network. The techniques described herein can provide a variety of benefits for the network operator. In particular, the techniques described herein can reduce the effort and time required to resolve an event (e.g. fix a failure, such as a SW failure) occurring in the network. This can lead to a significant saving on the cost of network operation and/or maintenance. The techniques described herein can provide a variety of benefits for the vendors of the network. In particular, the techniques described herein can significantly reduce the effort and time required by support engineers and/or design engineers to resolve an event (e.g. fix a failure, such as a SW failure) occurring in the operator network. The techniques described herein can also be applied during a test phase of product development, resulting in significant savings. The techniques described herein can provide a variety of benefits for the end users of the network. In particular, the techniques described herein can reduce the service interruption period significantly. This can be especially beneficial in a service-oriented architecture (SOA), which may require a high service availability (e.g. of 99.999%).
It should be noted that the above-mentioned embodiments illustrate rather than limit the idea, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/IB2021/060645 | 11/17/2021 | WO |