SYSTEMS AND METHODS FOR USE IN DETECTING ANOMALOUS CALL BEHAVIOR

TECHNICAL FIELD

The current application relates to detecting anomalous call behavior.

BACKGROUND

Scam telephone calls and robocalls are an increasing problem. There are solutions that can be used to block inbound calls from numbers known to be used by robocallers and/or for scam calls. While these solutions may be useful, they require an end user to install an application or use a device to provide the desired functionality.

It is difficult to adapt existing solutions from end-user devices to a telephone network level, as it may be unacceptable for the telephone network to block a number that was incorrectly identified as being associated with a robocall or scam call. Identifying telephone numbers associated with robocalls or scam calls based on data available to network operators can be a difficult task given the volume of data that must be processed.

It would be desirable to have new, additional and/or improved tools for use by telephone network operators in identifying and blocking telephone numbers associated with making robocalls and/or scam calls.

BRIEF DESCRIPTION OF THE DRAWINGS

Further features and advantages of the present disclosure will become apparent from the following detailed description taken in combination with the appended drawings, in which:

FIG. 1 depicts a system for identifying and blocking phone numbers associated with robocalls and/or scam calls;

FIG. 2 depicts a user interface presenting identified phone numbers;

FIG. 3 depicts a method for identifying and blocking phone numbers associated with robocalls and/or scam calls;

FIG. 4 depicts a method of pre-processing raw call log records;

FIG. 5 depicts a method for unblocking blocked phone numbers;

FIG. 6 depicts a Precision-Recall curve;

FIG. 7 depicts a method of processing call detail records;

FIG. 8 depicts a method of determining a dialed callee number in the call detail records; and

FIG. 9 depicts a method of identifying call detail records associated with a same call event.

DETAILED DESCRIPTION

In accordance with the present disclosure there is provided a system for use in blocking phone numbers in a telephone network comprising: one or more processors for executing instructions; and at least one memory for storing instructions, which when executed by at least one of the one or more processors configure the system to perform a method comprising: receiving from a plurality of telephone network elements a plurality of raw call log records; periodically processing the received plurality of raw call log records comprising: formatting each of the raw call log records into a corresponding call record having a common format; and identifying raw call log records or call records associated with a same call; and aggregating raw call log records or call records associated with the same call together; periodically processing the call logs comprising: processing the call logs using a first trained model to identify phone numbers associated with anomalous call behaviour as anomalous phone numbers; and processing the call logs using a second trained model to identify phone numbers associated with a first undesirable type of call behaviour as first undesirable call type phone numbers; and blocking at least one phone number of the anomalous phone numbers and the first undesirable call type phone numbers from making calls over the telephone network.

In an embodiment of the system, the first undesirable call type is a Wangiri type scam call.

In an embodiment of the system, the at least one phone number that is blocked is further processed to ensure the number should be blocked prior to being blocked.

In an embodiment of the system, the method provided by executing the instructions further comprises: automatically calling at least one of the phone numbers of the anomalous phone numbers and the first undesirable call type phone numbers; and recording a portion of the calls made automatically.

In an embodiment of the system, the method provided by executing the instructions further comprises: generating a user interface including an indication of one or more of the anomalous phone numbers and the first undesirable call type phone numbers; providing the generated user interface to an investigator of the telephone network operator; and receiving from the user interface a selection including the at least one phone number for blocking.

In an embodiment of the system, the generated user interface further includes an indication of the recorded portion of the calls.

In an embodiment of the system, the method provided by executing the instructions further comprises: retrieving additional information from one or more sources on the anomalous phone numbers and the first undesirable call type phone numbers; and including the additional information in the generated user interface.

In an embodiment of the system, the method provided by executing the instructions further comprises: unblocking blocked phone numbers.

In an embodiment of the system, unblocking blocked phone numbers comprises: identifying blocked phone numbers; and for each blocked phone number, determining if there has been no call activity over the telephone network associated with the blocked phone number for a threshold period of days, and unblocking the blocked phone number when it is determined that there has been no call activity for the threshold period of days.
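By way of a non-limiting illustration, the inactivity check described above may be sketched in Python as follows. The function name, the `last_activity` mapping, and the choice to treat a blocked number with no recorded activity as inactive are assumptions made for this example only and are not taken from the disclosure.

```python
from datetime import datetime, timedelta


def numbers_to_unblock(blocked, last_activity, now, threshold_days):
    """Return blocked numbers with no call activity for threshold_days.

    blocked: iterable of blocked phone numbers.
    last_activity: dict mapping a phone number to the datetime of its most
        recent call activity on the network (hypothetical data source).
    """
    cutoff = now - timedelta(days=threshold_days)
    release = []
    for number in blocked:
        seen = last_activity.get(number)
        # Assumption: a number with no recorded activity counts as inactive.
        if seen is None or seen <= cutoff:
            release.append(number)
    return release


now = datetime(2019, 1, 31)
blocked = ["15550001", "15550002", "15550003"]
last_activity = {
    "15550001": datetime(2019, 1, 30),  # activity one day ago
    "15550002": datetime(2018, 12, 1),  # quiet for roughly two months
}
result = numbers_to_unblock(blocked, last_activity, now, threshold_days=30)
```

In this sketch only the second and third numbers would be released, since the first still shows activity inside the 30-day window.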

In accordance with the present disclosure there is further provided a method for use in detecting fraudulent phone numbers associated with undesirable behavior in a telephone network.

In accordance with the present disclosure there is further provided a system for detecting fraudulent phone numbers associated with undesirable behavior in a telephone network.

In accordance with the present disclosure there is further provided a method of processing call detail records (CDRs), comprising: receiving a plurality of CDRs, each of the CDRs comprising a calling party number, a callee number, a gap value, a gap type value, a start time, an end time, a ringing time, and a conversation time; determining a dialed callee number for each of the CDRs; identifying at least two CDRs associated with a call event based on one or more similarity thresholds being met between the at least two CDRs; determining a maximum conversation time among all CDRs associated with the call event; and generating a processed CDR for the call event comprising at least the calling party number, the dialed callee number, and the maximum conversation time.

In an embodiment of the method, determining the dialed callee number in each of the CDRs comprises, for each respective CDR: determining whether an additional number field in the respective CDR has a value and whether the value differs from the calling party number; and determining that the dialed callee number in the respective CDR is the value in the additional number field when it is determined that there is the value in the additional number field and that the value differs from the calling party number.

In an embodiment of the method, when it is determined that the additional number field is blank or that the additional number field has a value that is the same as the calling party number, the method further comprises: determining whether the gap type value denotes one of a destination number, a ported number, or a transfer number; and determining that the dialed callee number in the respective CDR is the gap value when the gap type value denotes one of a destination number, a ported number, or a transfer number.

In an embodiment of the method, when it is determined that the gap type value does not denote one of a destination number, a ported number, or a transfer number, determining that the dialed callee number is the callee number of the respective CDR.
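The three-step fallback described in the preceding embodiments may be illustrated by the following Python sketch. The dictionary keys (`additional_number`, `gap_type`, `gap_value`, and so on) and the gap type labels are hypothetical; actual CDR field names and encodings are not specified here and may differ.

```python
# Hypothetical labels for gap types that carry the number actually dialed.
REDIRECT_GAP_TYPES = {"destination", "ported", "transfer"}


def dialed_callee(cdr):
    """Resolve the dialed callee number for one CDR (given as a dict)."""
    additional = cdr.get("additional_number")
    # Step 1: use the additional number field when it is populated and
    # differs from the calling party number.
    if additional and additional != cdr["calling_number"]:
        return additional
    # Step 2: otherwise use the gap value when the gap type denotes a
    # destination, ported, or transfer number.
    if cdr.get("gap_type") in REDIRECT_GAP_TYPES:
        return cdr["gap_value"]
    # Step 3: fall back to the callee number recorded in the CDR.
    return cdr["callee_number"]


cdr_additional = {"calling_number": "100", "callee_number": "200",
                  "additional_number": "300", "gap_type": "ported",
                  "gap_value": "400"}
cdr_gap = {"calling_number": "100", "callee_number": "200",
           "additional_number": "100", "gap_type": "ported",
           "gap_value": "400"}
cdr_plain = {"calling_number": "100", "callee_number": "200",
             "additional_number": "", "gap_type": "other",
             "gap_value": "400"}
```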

In an embodiment of the method, identifying the at least two CDRs associated with the call event comprises: identifying two consecutive CDRs based on the calling party number, the dialed callee number, and the start time; and determining that the two consecutive CDRs are associated with the call event based on the one or more similarity thresholds being met between the two consecutive CDRs.

In an embodiment of the method, the method further comprises generating a sorted list of CDRs by sorting the plurality of CDRs based on the calling party number, the dialed callee number, and the start time, for use in identifying the two consecutive CDRs.

In an embodiment of the method, the similarity thresholds comprise one or more of: a difference of start time is less than w seconds; a difference of end time is less than x seconds; a difference of ringing time is less than y seconds; and a difference of conversation time is less than z seconds, wherein each of w, x, y, and z are predetermined threshold values.

In an embodiment of the method, the method further comprises determining a maximum ringing time among all CDRs associated with the call event; and generating the processed CDR for the call event further comprising the maximum ringing time.
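As a non-limiting sketch, the sorting, pairwise comparison, and aggregation steps described above might be written in Python as follows. The key names, the representation of times as seconds, the default thresholds, and the decision to require all four thresholds at once (the embodiments recite "one or more") are assumptions made for illustration.

```python
def group_call_events(cdrs, w=2, x=2, y=2, z=2):
    """Merge consecutive CDRs that appear to describe the same call event."""
    ordered = sorted(cdrs, key=lambda c: (c["caller"], c["dialed"], c["start"]))
    events = []
    prev = None
    for cdr in ordered:
        same_event = (
            prev is not None
            and cdr["caller"] == prev["caller"]
            and cdr["dialed"] == prev["dialed"]
            and abs(cdr["start"] - prev["start"]) < w
            and abs(cdr["end"] - prev["end"]) < x
            and abs(cdr["ringing"] - prev["ringing"]) < y
            and abs(cdr["conversation"] - prev["conversation"]) < z
        )
        if same_event:
            events[-1].append(cdr)
        else:
            events.append([cdr])
        prev = cdr
    # One processed CDR per event, keeping the maximum observed durations.
    return [
        {
            "caller": group[0]["caller"],
            "dialed": group[0]["dialed"],
            "conversation": max(c["conversation"] for c in group),
            "ringing": max(c["ringing"] for c in group),
        }
        for group in events
    ]


cdrs = [
    {"caller": "A", "dialed": "B", "start": 100, "end": 160, "ringing": 5, "conversation": 55},
    {"caller": "A", "dialed": "B", "start": 101, "end": 161, "ringing": 6, "conversation": 54},
    {"caller": "A", "dialed": "B", "start": 500, "end": 510, "ringing": 10, "conversation": 0},
    {"caller": "C", "dialed": "B", "start": 100, "end": 160, "ringing": 5, "conversation": 55},
]
processed = group_call_events(cdrs)
```

Here the first two records, which might come from two hops of one call, collapse into a single processed CDR, while the later call from the same caller and the call from a different caller remain separate.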

In an embodiment of the method, the plurality of CDRs comprise CDRs generated from a multi-hop call event.

In an embodiment of the method, the plurality of CDRs are received from a plurality of network switches.

In accordance with the present disclosure there is further provided a system for processing call detail records (CDRs), comprising: one or more processors for executing instructions; and at least one non-transitory computer-readable memory storing instructions which, when executed by at least one of the one or more processors, configure the system to perform a method comprising: receiving a plurality of CDRs, each of the CDRs comprising a calling party number, a callee number, a gap value, a gap type value, a start time, an end time, a ringing time, and a conversation time; determining a dialed callee number for each of the CDRs; identifying at least two CDRs associated with a call event based on one or more similarity thresholds being met between the at least two CDRs; determining a maximum conversation time among all CDRs associated with the call event; and generating a processed CDR for the call event comprising at least the calling party number, the dialed callee number, and the maximum conversation time.

In an embodiment of the system, determining the dialed callee number in each of the CDRs comprises, for each respective CDR: determining whether an additional number field in the respective CDR has a value and whether the value differs from the calling party number; and determining that the dialed callee number in the respective CDR is the value in the additional number field when it is determined that there is the value in the additional number field and that the value differs from the calling party number.

In an embodiment of the system, when it is determined that the additional number field is blank or that the additional number field has a value that is the same as the calling party number, the method performed by the system further comprises: determining whether the gap type value denotes one of a destination number, a ported number, or a transfer number; and determining that the dialed callee number in the respective CDR is the gap value when the gap type value denotes one of a destination number, a ported number, or a transfer number.

In an embodiment of the system, when it is determined that the gap type value does not denote one of a destination number, a ported number, or a transfer number, determining that the dialed callee number is the callee number of the respective CDR.

In an embodiment of the system, identifying the at least two CDRs associated with the call event comprises: identifying two consecutive CDRs based on the calling party number, the dialed callee number, and the start time; and determining that the two consecutive CDRs are associated with the call event based on the one or more similarity thresholds being met between the two consecutive CDRs.

In an embodiment of the system, the method performed by the system further comprises generating a sorted list of CDRs by sorting the plurality of CDRs based on the calling party number, the dialed callee number, and the start time, for use in identifying the two consecutive CDRs.

In an embodiment of the system, the similarity thresholds comprise one or more of: a difference of start time is less than w seconds; a difference of end time is less than x seconds; a difference of ringing time is less than y seconds; and a difference of conversation time is less than z seconds, wherein each of w, x, y, and z are predetermined threshold values.

In an embodiment of the system, the method performed by the system further comprises: determining a maximum ringing time among all CDRs associated with the call event; and generating the processed CDR for the call event further comprising the maximum ringing time.

In an embodiment of the system, the plurality of CDRs comprise CDRs generated from a multi-hop call event.

In an embodiment of the system, the plurality of CDRs are received from a plurality of network switches.

Undesirable phone calls can be a problem for consumers. These calls may include various types of scams or other undesirable calls. For example, some calls may impersonate a revenue agency such as the Canada Revenue Agency (CRA) or the Internal Revenue Service (IRS) and have the victim transfer money or other payments to the perpetrator. Other types of scam calls may include Wangiri, or “one ring”, calls in which a scammer calls a target from a phone number and hangs up after one or two rings, or just long enough to register as a missed call. This process may be repeated from the same or a slightly different phone number. If the target calls back the phone number, for example out of curiosity, the return number may be a “pay to call” or premium rate number, causing the target to incur charges. These types of scam calls may be made by robocalls, or may use robocalls to identify possible phone numbers that are active. As described further below, a telephone network operator may collect and process call data from their telephone network in order to identify phone numbers associated with the undesirable behaviours. Once such phone numbers are identified, they may be blocked from making and/or receiving calls on the telephone network operator's network.

FIG. 1 depicts a system for identifying and blocking phone numbers associated with robocalls and/or scam calls. The system 100 can be implemented by an operator of a telephone network 102, which may include different telephony technologies including for example, Voice over IP (VoIP), cellular, and landline or SS7. Regardless of the particular type or composition of telephone network, it will comprise a plurality of network elements 104a, 104b, 104c (referred to collectively as network elements 104) for completing telephone calls. The network elements 104 may connect the telephone network 102 to consumer (or end user) equipment such as telephones 106a, 106b, 106c (referred to collectively as telephones 106) as well as to other telephone networks 108 or other telephony equipment. Each of the network elements 104 may generate logs for each call, or attempted call, handled by the network elements 104. The logs may include various information about the call such as the telephone number of the party being called (called party or destination number), the telephone number of the party calling (calling party or source number), the time the call was placed, if the call was answered, if the call was answered by a voice message system, a geographical location of the party calling, a geographical location of the party being called, as well as other possible information such as identifying information about the caller and callee devices. In specific embodiments, as described further below, the network elements are network switches, such as Integrated Services Digital Network User Part (ISUP) Signaling System No. 7 (SS7) switches, which generate call detail records (CDRs) having specific fields and information contained therein. As described in further detail below, the log information collected for calls may be processed to identify and block phone numbers associated with undesirable behaviour.

The processing of the data collected from the various network elements 104 may be performed by one or more servers 110. The server(s) 110 comprises one or more processing units 112 for executing instructions and memory units 114 for storing instructions which when executed by the processing units 112 configure the server(s) 110 to provide functionality for identifying and blocking phone numbers associated with undesirable behaviour. The server(s) 110 may also include non-volatile (NV) storage 116 as well as one or more input/output (I/O) interfaces 118 for connecting internal and/or external components, devices and/or peripherals to the server(s) 110.

The functionality 120, which is provided by executing the instructions stored in the memory, includes data collection functionality 122 for processing the data collected by the network elements 104, detection functionality 124 for detecting, or rather identifying, phone numbers associated with undesirable behaviour, action functionality 126 for blocking and unblocking phone numbers, investigative interface functionality 128 for providing an interface to investigators of the telephone network operator, as well as additional investigative processing functionality 130.

Broadly, the data collected by the network elements 104 is pre-processed by the data collection functionality 122 and the pre-processed data is used by the detection functionality 124 to identify phone numbers associated with undesirable call behaviour. The identified phone numbers associated with undesirable call behaviour can be blocked/unblocked or other actions may be taken by the action functionality 126. The actions may be taken automatically, or may be taken based on additional user (e.g. network operator level) input. In a non-limiting example, the additional user input may be provided by an investigator using an interface provided by the investigative interface functionality 128. The investigative interface functionality 128 may also use or solicit additional information that may be useful to the investigator and provided by the investigative processing functionality 130.

As described above, the data collection functionality 122 pre-processes data collected by the network elements 104. The raw call log data may be stored or accessed in numerous different ways, which are depicted schematically as a database 132 in FIG. 1. The raw call data log records from the network elements 104 are processed by log pre-processing functionality 134 to generate processed call records 136. The pre-processing may include minor processing such as cleaning and standardization of records for ensuring dates and times of records provided from different network elements, and thus possibly in different formats, are in the same format, as well as more major processing. For example, the processing may include identifying and aggregating raw call records, and/or possibly previously processed call records, that are associated with the same call. Aggregating call records associated with the same call can be achieved in various ways. For example, the records may be aggregated together into a single aggregate call record. Additionally or alternatively, the call records associated with the same call may be labeled with a unique call identifier to allow aggregated records to be quickly identified. Additionally or alternatively, a record or other indicator can be provided that identifies all of the related call records that are associated with the same call. In addition to the unique call identification, the processing may further include computing or determining any metrics or features used in the anomaly and/or scam detection.

The raw call data logs may be periodically processed in relatively short periods. For example, the raw call data logs may be processed every 5 minutes. Alternatively, this processing may be done in longer or shorter intervals, or possibly in real time. Regardless of the time intervals of processing the raw call data logs, once the records are processed by the log pre-processing functionality 134 the resulting call records 136 can be stored for subsequent processing by the detection functionality 124.

The detection functionality 124 may comprise various different functionality for processing the call records 136 to identify phone numbers associated with undesirable behaviour. As depicted in FIG. 1 the detection functionality may include general anomaly detection functionality 138 that detects anomalous behaviour in call patterns. The phone numbers that are identified by the general anomaly detection functionality 138 may be associated with behaviours that are out of the ordinary, although they may not need to be blocked. In a non-limiting example, the anomalous phone numbers identified by the general anomaly detection functionality 138 may be presented to investigators, which may help speed the identification of additional scams or undesirable call behaviour. The general anomaly detection may be done in various ways using algorithms or techniques for identifying anomalies. In addition to the general anomaly detection functionality 138, the detection functionality 124 may further include specialized detection models 140 that detect specific undesirable call behaviour. For example, the specific detection models 140 may include a Wangiri fraud detection model 142 that detects phone numbers, and in particular caller phone numbers, associated with Wangiri fraud calls. Additional detection models 144 may include models trained to detect other specific types of possibly undesirable call behaviour, such as revenue service call fraud, Microsoft™ support scam, etc.

Each of the detection functionalities 138, 142, 144 may label or otherwise provide some other indication of the phone numbers that were detected by the various functionalities as possibly being associated with undesirable call behaviour. That is, for example, the general anomaly detection functionality 138 may provide an indication of one or more phone numbers that were determined to be anomalous, the Wangiri detection functionality 142 may provide an indication of one or more phone numbers that were determined to be associated with Wangiri fraud calls, etc. Illustrative implementations of both the general anomaly detection functionality 138 and the Wangiri detection functionality 142 are described in further detail below.

Once one or more phone numbers have been identified by the detection functionality 124, one or more actions may be taken on the phone numbers by action functionality 126. The actions may be taken automatically, or may be taken after some form of user interaction, for example by an investigator of the network operator. For example, an anomalous phone number may not be blocked automatically, but may be marked for blocking after an investigation or further review by an investigator. As depicted, the action functionality 126 may include phone number blocking functionality 146 and phone number unblocking functionality 148.

Depending upon how phone numbers are marked for blocking as well as the level of acceptability of potentially blocking a valid phone number, the blocking functionality 146 may, in a non-limiting example, simply automatically block all provided or marked phone numbers. Alternatively, the blocking functionality may include one or more checks or business rules that are applied to the phone numbers marked for blocking and only those phone numbers passing all of the checks may be blocked.

The phone numbers identified by the detection functionality 124 may be automatically passed to the phone number blocking functionality 146, or they may first be passed to investigative interface functionality 128 for generating an interface for use by an investigator. The investigative interface functionality 128 may include a graphical user interface (GUI) generation functionality 150 that generates an investigative interface that may present the identified telephone numbers to an investigator, which may allow the investigator to determine whether or not the phone number(s) should be blocked. The GUI that is generated may include an indication, such as a button or other GUI element, that allows the investigator to select a phone number for subsequent blocking by the phone number blocking functionality 146. In addition to providing an indication of one or more of the phone numbers identified by the detection functionality 124, the GUI may further include additional information that may be helpful to an investigator in determining whether to block a phone number.

In order to provide the additional information, the investigative interface functionality 128 may include data collection functionality 152 for retrieving or accessing the additional information presented in the generated GUI. The data collection functionality 152 may retrieve information from various sources. For example, the data collection functionality may retrieve information from one or more subscriber data sources of the telephone network operator to retrieve information associated with phone numbers that are provided by the telephone network operator. Additionally, the data collection functionality 152 may retrieve information from other sources such as provided by the investigative processing functionality 130.

The investigative processing functionality 130 may include one or more different functionalities or elements for providing additional relevant information. For example, the investigative processing functionality may include honey pot number functionality 154 that provides a honey pot phone number that is not used for other purposes and as such any numbers calling the honey pot phone number may be considered anomalous or presenting undesirable behaviour. Additionally, the investigative processing functionality 130 may include automated call-back functionality 156 that can call back identified phone numbers, including for example suspicious numbers or those potentially associated with undesirable behaviours, and record the phone call. The automated call-back functionality 156 may simulate a call. Additionally, the investigative processing functionality 130 may include 3rd party data collection functionality 158 that can retrieve or access information from 3rd party sources such as yellow-page information or 3rd party sources collecting information about robocalls or possible fraudulent calls.

FIG. 2 depicts a portion of an illustrative user interface. The GUI 200 may include, for example, an area 202 indicating the phone numbers as well as an area with the predictions 204 for each of the numbers, such as either being a relatively certain Wangiri, or a Wangiri that requires manual review. The GUI may also include an area 206 that enables a user to provide their own categorization of the call, as well as another area showing other information such as a recording of the call 208. It will be appreciated that other GUIs and/or layouts are possible.

Returning to the general anomaly detection functionality 138 depicted in FIG. 1, the functionality 138 may use an Isolation Forest approach for detecting anomalies. The anomalies may be detected over various time periods such as hours, days, weeks, etc. As will be appreciated by those skilled in the art, the Isolation Forest algorithm is an unsupervised variant of the Random Forest algorithm, which ensembles multiple weak predictors, also known as trees. In a non-limiting example, the features used by the Isolation Forest model may include, among others:

- num_incoming_calls which is the number of unique incoming calls;
- num_outgoing_calls which is the number of unique outgoing calls;
- incoming_call_rate which is num_incoming_calls/num_outgoing_calls;
- call_duration which is how long a conversation lasts for a given call record;
- num_callees which is the number of unique callees; and
- inter_start_time which is the start time of a call.
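In a non-limiting illustration, per-number features of the kind listed above might be derived from processed call records as sketched below in Python. The record keys, the assumption that the records have already been de-duplicated per call, the use of a mean to summarize call duration per number, and the handling of numbers with no outgoing calls are choices made for the example only. The inter_start_time feature is omitted from the sketch.

```python
from collections import defaultdict


def build_features(records):
    """Compute per-number features from de-duplicated call records."""
    outgoing = defaultdict(int)
    incoming = defaultdict(int)
    callees = defaultdict(set)
    talk_time = defaultdict(float)
    for r in records:
        outgoing[r["caller"]] += 1
        incoming[r["callee"]] += 1
        callees[r["caller"]].add(r["callee"])
        talk_time[r["caller"]] += r["duration"]
    features = {}
    for number in set(outgoing) | set(incoming):
        n_out = outgoing.get(number, 0)
        n_in = incoming.get(number, 0)
        features[number] = {
            "num_incoming_calls": n_in,
            "num_outgoing_calls": n_out,
            # Assumption: the ratio is left undefined without outgoing calls.
            "incoming_call_rate": n_in / n_out if n_out else None,
            "mean_call_duration": talk_time.get(number, 0.0) / n_out if n_out else 0.0,
            "num_callees": len(callees.get(number, ())),
        }
    return features


records = [
    {"caller": "A", "callee": "B", "duration": 0},
    {"caller": "A", "callee": "C", "duration": 0},
    {"caller": "A", "callee": "D", "duration": 4},
    {"caller": "B", "callee": "A", "duration": 60},
]
features = build_features(records)
```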

The Isolation Forest model, tuned using features including those mentioned above, may assign an anomaly score to each anumber, or originating phone number, which may also be known as the calling party or caller. Experiments have shown that the more anomalous the behavior of a particular anumber (i.e. calling party number) is, as defined by the features including those mentioned above, the more likely it is to be assigned a higher anomaly score by the Isolation Forest algorithm, as compared to anumbers that demonstrate “normal” behavior. It will be appreciated that in order to evaluate the detection performance of the Isolation Forest model, one or more sources of verified anomalous phone numbers may be used. For example, the sources used may be Yellow Pages and/or Nomorobo or other similar sources, which are relatively less biased sources of information due to their crowd-sourced nature.

During performance tuning using Yellow Pages sourced data, it was found that the naïve Isolation Forest model did not result in acceptable accuracy when evaluated by the Yellow Pages reported rate (as defined below). By performing experiments, however, it was found that the addition of a filtering step that eliminated all anumbers with fewer outgoing calls than a threshold improved the accuracy by approximately 30% when compared to the baseline. Note that the filtering step was not used when evaluating the model using Nomorobo data, yet that model still achieved acceptable accuracy.

As a result of the performance tuning experiments, it will be appreciated that the general anomaly detection functionality 138 may, in a non-limiting example, include two Isolation Forest models: (1) the naïve Isolation Forest model that detects anomalies that are likely also to be reported by Nomorobo, and (2) the filter controlled Isolation Forest model that detects anomalies that are likely also to be reported by Yellow Pages.

Various measures may be used to evaluate the performance of the two Isolation Forest models. To evaluate the filter controlled model using Yellow Pages sourced data, one measure that may be used is the Yellow Pages reported rate Y which, for an anumber a that is flagged by the model and is also reported on Yellow Pages, is given by:
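To illustrate how an Isolation Forest arrives at an anomaly score, the following is a deliberately simplified, standard-library-only Python sketch of the general technique. It is not the scikit-learn implementation and is not asserted to match the tuned models described herein; the tree count, sample size, seed, and the two-feature example data are assumptions. Points that are separated from the rest after few random splits have short average path lengths and therefore scores closer to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)  # random split value on a random feature
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point, node, depth=0):
    if "size" in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + average_path_length(node["size"])
    branch = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Return one score in (0, 1] per point; higher means more anomalous."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = average_path_length(size)
    return [2 ** (-sum(path_length(p, t) for t in trees) / (n_trees * norm))
            for p in points]


# Hypothetical (num_outgoing_calls, num_callees) pairs: 40 ordinary callers
# followed by one caller with very high volume.
inliers = [(10 + i % 5, 3 + i // 5) for i in range(40)]
points = inliers + [(500, 400)]
scores = anomaly_scores(points)
```

In this sketch the last entry of `scores`, corresponding to the high-volume caller, is expected to be the largest because that point is usually isolated by the first split.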

$Y = \frac{\sum_{a} YP_{reported}(a)}{N}$

where

$YP_{reported}(a) = \begin{cases} 1 & \text{if } \dfrac{YP_{scammer}(a) + YP_{debt}(a)}{YP_{total}(a)} > 0.5 \\ 0 & \text{otherwise} \end{cases}$

and N is the total number of anomalies detected by the model.
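As a non-limiting sketch, the Yellow Pages reported rate may be computed as shown below in Python. The layout of `yp_counts` as (scammer, debt collector, total) report counts per anumber, and the treatment of anumbers with no Yellow Pages reports as not reported, are assumptions made for the example.

```python
def yp_reported(scammer, debt, total):
    """Indicator: 1 when more than half of the reports are scammer or debt."""
    if total == 0:
        return 0  # assumption: no reports means not reported
    return 1 if (scammer + debt) / total > 0.5 else 0


def yellow_pages_rate(flagged, yp_counts):
    """Fraction of flagged anumbers that count as reported on Yellow Pages."""
    if not flagged:
        return 0.0
    hits = sum(yp_reported(*yp_counts.get(a, (0, 0, 0))) for a in flagged)
    return hits / len(flagged)


flagged = ["A", "B", "C", "D"]
yp_counts = {
    "A": (3, 1, 5),  # 4 of 5 reports are scammer or debt collector
    "B": (1, 0, 4),  # 1 of 4
    "C": (0, 2, 2),  # 2 of 2
}
rate = yellow_pages_rate(flagged, yp_counts)  # 2 of 4 flagged numbers: 0.5
```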

To evaluate the naïve model using Nomorobo sourced data, one measure that may be used for example is the Nomorobo reported rate Φ, which, for an anumber a that is flagged by the model, is given by:

$\Phi = \frac{\sum_{a} Nomorobo_{reported}(a)}{N}$

where

$Nomorobo_{reported}(a) = \begin{cases} 1 & \text{if the number is found reported as a robocaller in Nomorobo} \\ 0 & \text{otherwise} \end{cases}$

and N is the total number of anomalies detected by the model.

The Yellow Pages reported rate indicates how many anumbers out of the detected anomalies are reported as scammers or debt collectors on Yellow Pages, while the Nomorobo reported rate indicates how many anumbers out of the detected anomalies are reported as robocallers in Nomorobo.

The execution time for performing a grid search in order to tune the parameters of the Isolation Forest models was found to be prohibitive. Therefore, random search was performed instead, using the popular Python library scikit-learn. For the model tuned using Nomorobo sourced data, with 3-fold cross-validation, the resulting accuracy in terms of Nomorobo reported rate was 48.4%.

For the filter controlled Isolation Forest model, an iterative grid search was performed. With 3-fold cross-validation, the resulting accuracy in terms of Yellow Pages reported rate was 60%.
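The difference between the two search strategies can be illustrated with the following plain Python sketch of random search, which evaluates only a fixed number of randomly drawn parameter combinations rather than every combination in the grid. It does not use or reproduce the scikit-learn API; the parameter names and values and the `toy_score` function, which stands in for a cross-validated reported rate, are hypothetical.

```python
import random


def random_search(param_space, evaluate, n_iter, seed=0):
    """Evaluate n_iter random parameter combinations and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: 5 * 4 * 4 = 80 combinations in a full grid.
param_space = {
    "n_estimators": [50, 100, 200, 400, 800],
    "max_samples": [64, 128, 256, 512],
    "contamination": [0.001, 0.005, 0.01, 0.05],
}


def toy_score(params):
    # Stand-in for a cross-validated score; peaks at 200 trees and 0.01.
    return -abs(params["n_estimators"] - 200) - 1000 * abs(params["contamination"] - 0.01)


best_params, best_score = random_search(param_space, toy_score, n_iter=10)
```

With `n_iter=10` the sketch performs 10 evaluations instead of 80, at the cost of possibly missing the best combination.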

The Precision score was evaluated in a real run, and was calculated by dividing the total number of distinct anumbers that were reported in either Nomorobo or in Yellow Pages by the total number of all anumbers detected as anomalies. The best Precision score observed was 73.87%, on Jan. 3, 2019. During business days, the Precision score is usually observed to be around 60%, while on Sundays, it is usually observed to be less than 40%.

The above has described the anomaly detection as attempting to detect robocalls and/or debt collector/telemarketer calls. It will be appreciated that other anomalous behaviours may also be detected. For example, profile based, or caller behavior based, anomaly detection is possible. In profile based anomaly detection, for example, one might first establish a profile for each caller in the data. The profile may be established by looking at all available call history for each caller, or a subset of it. By analyzing these profiles, it is possible to find unusually deviant behavior, such as sudden spikes/drops in the number of calls, a sudden increase in calls to a specific destination number, etc. This may help, for example, in detecting spoofed numbers. To build up each caller's profile, time series analysis can be used; more specifically, a moving average of each attribute may be computed, and matrix profiling may be used to discover motif patterns of spammers and hence to detect abnormalities.
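As a minimal sketch of the moving-average idea, the following flags days on which a caller's call count departs sharply from its trailing average. The window length, the ratio used to define a spike or drop, and the function name are hypothetical choices for illustration; matrix profiling is not shown.

```python
from collections import deque
from typing import List, Sequence


def moving_average_deviations(daily_calls: Sequence[float],
                              window: int = 7,
                              ratio: float = 3.0) -> List[int]:
    """Return indices of days whose call count is more than `ratio` times
    the trailing moving average (spike) or less than 1/`ratio` of it (drop)."""
    flagged: List[int] = []
    history: deque = deque(maxlen=window)  # the last `window` daily counts
    for day, count in enumerate(daily_calls):
        if len(history) == window:
            average = sum(history) / window
            if average > 0 and (count > ratio * average or count < average / ratio):
                flagged.append(day)
        history.append(count)
    return flagged
```

The same pattern could be applied to any attribute in the caller's profile, for example the number of calls to a specific destination number.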

Returning to the Wangiri detection functionality 142 depicted in FIG. 1, the detection may be provided in various ways. For example, a simple approach may involve using handcrafted rules/heuristics, using knowledge of the scam characteristics. This may not, however, be the best approach because it typically leads to a proliferation of rules over time, exceptions to the rules and so on. Additionally, any rules may have to be frequently tuned manually to account for changes in scammer behaviour. Further still, the developed approach may not be easily applicable to other kinds of scams, potentially necessitating the development of a highly tailored solution for each type of scam.

A machine learning approach may be used to automatically “learn” the characteristics of a particular scam by using labelled examples of the scam. Such an approach can semi-automatically tune itself over time to account for changes in input data, representing, in this case, scammer behavior.

In a non-limiting example, in order to mathematically model the behaviour of Wangiri scammers, the following features may be used, which can be prepared or derived from the call logs.

- 1. dt_from: The lower bound of the time interval within which the Wangiri detection was performed
- 2. dt_to: The upper bound of the time interval within which the Wangiri detection was performed
- 3. anumber: The calling party's number, for which call records are summarized and all the metrics below are computed
- 4. num_outgoing_calls: The number of outgoing calls from the anumber
- 5. num_incoming_calls: The number of incoming calls to the anumber
- 6. incoming_call_rate: The proportion of incoming calls, relative to outgoing calls. This is computed as num_incoming_calls/num_outgoing_calls
- 7. num_callees: The number of unique destination numbers called by this anumber
- 8. callee_rate: The proportion of unique callees, relative to outgoing calls. This is computed as num_callees/num_outgoing_calls
- 9. inter_arrival_time_mean: The average of the inter-arrival time between calls. The inter-arrival time is the interval of time between two successive calls. Note: This is measured in minutes.
- 10. inter_arrival_time_stddev: The standard deviation of the inter-arrival time between calls, measured in minutes
- 11. call_duration_mean: The average of the call duration of all outgoing calls made by this anumber. Note: This is measured in milliseconds. This may be replaced by incoming_call_duration and outgoing_call_duration
- 12. call_duration_stddev: The standard deviation of the call duration of all outgoing calls made by this anumber, measured in milliseconds

Metrics #4 to #12 above are the predictors (aka features) in the Wangiri model, while the response is a class label that can take on one of two values—“Wangiri” or “Not Wangiri”. It will be appreciated that this is an example of a binary classification problem.
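By way of a non-limiting sketch, metrics #4 to #12 could be derived from processed call records as follows. The record layout (dictionaries with `anumber`, `bnumber`, `start` in epoch seconds and `duration_ms`), the use of the population standard deviation, and the computation of inter-arrival times over outgoing calls only are assumptions of this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, List


def _mean(values: List[float]) -> float:
    return mean(values) if values else 0.0


def _stddev(values: List[float]) -> float:
    # Assumption: population standard deviation; 0.0 when there is no data.
    return pstdev(values) if values else 0.0


def wangiri_features(anumber: str, calls: List[Dict]) -> Dict[str, float]:
    """Summarize call records for one anumber into the model predictors."""
    outgoing = sorted((c for c in calls if c["anumber"] == anumber),
                      key=lambda c: c["start"])
    incoming = [c for c in calls if c["bnumber"] == anumber]
    n_out = len(outgoing)
    callees = {c["bnumber"] for c in outgoing}
    # Inter-arrival time in minutes between successive outgoing calls.
    gaps = [(b["start"] - a["start"]) / 60.0
            for a, b in zip(outgoing, outgoing[1:])]
    durations = [c["duration_ms"] for c in outgoing]
    return {
        "num_outgoing_calls": n_out,
        "num_incoming_calls": len(incoming),
        "incoming_call_rate": len(incoming) / n_out if n_out else 0.0,
        "num_callees": len(callees),
        "callee_rate": len(callees) / n_out if n_out else 0.0,
        "inter_arrival_time_mean": _mean(gaps),
        "inter_arrival_time_stddev": _stddev(gaps),
        "call_duration_mean": _mean(durations),
        "call_duration_stddev": _stddev(durations),
    }
```

The resulting dictionary corresponds to one row of the feature matrix, to which the class label would be attached for training.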

The approach used to solve this problem is to estimate one or more mathematical functions that describe the relationship(s) between the predictors and the response. The function(s) may typically be estimated from a set of manually labelled data that provides examples of each class. These functions, which constitute a model, may then be used to predict the class label (“Wangiri” or “Not Wangiri”) of future data. Labelled training data for the Wangiri class may be obtained using the investigator interface, which may initially present investigators with anomalous phone numbers to be investigated. The calls that the investigator considers to be Wangiri can be labelled and used for the training data. The non-Wangiri class training data may be obtained from random sampling of the call data, since the vast majority of call data passing over a telephone network will not be Wangiri calls.

The Wangiri detection model may use a Random Forest classifier. This particular classifier was determined to be preferable after comparing the performance of several different classifiers on the labelled data. Model hyperparameters (number of estimators and maximum number of features) were chosen using a Grid Search with 10-fold cross-validation, with the objective of choosing the parameter combination that maximized the F1-Score. The rationale for choosing to optimize the F1-Score, rather than the Precision or Recall, is to provide a balance between false positives and false negatives for the initial model. Originally, the selection criterion consisted solely of maximizing the Precision, but a quick ad-hoc analysis showed that some models with slightly lower precisions (−2%) had significantly higher recalls (+20%). The slightly lower precision, which can result in legitimate numbers being incorrectly identified as Wangiri numbers, can be addressed by developing additional rules or filters to filter out the legitimate numbers from the Wangiri numbers. When a business logic layer is used to protect legitimate customers from accidentally being blocked, optimizing on the F1-Score provides significant recall while mitigating any consequence of a slightly lower precision.
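The trade-off described above can be illustrated with the standard definition of the F1-Score as the harmonic mean of Precision and Recall. The sketch below is illustrative only; the function name is an assumption, and the figures in the note that follows are hypothetical rather than measured results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a hypothetical model with precision 0.97 and recall 0.70 has an F1-Score of roughly 0.81, whereas a hypothetical model with precision 0.95 and recall 0.90 has an F1-Score of roughly 0.92, so selecting on the F1-Score favours the second model despite its slightly lower precision.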

The best estimator chosen from the Grid Search has the following scores (over 10 folds):

- Mean F1-Score=0.94; std=0.03
- Mean Precision=0.96; std=0.04
- Mean Recall=0.93; std=0.04

The Precision-Recall curve is shown in FIG. 6. The curve indicates that the chosen estimator has good classification performance on the test set.

The labelled dataset used in this modelling process is fairly large and imbalanced (232,477 examples in total; the positive class makes up 1.73% of total). Due to this, training certain ML algorithms turned out to be infeasible due to very large runtimes. In particular, finding the best estimator using a grid search (or even a random search) for the Support Vector Machine (SVM) with a non-linear kernel and >5 fold cross-validation took unreasonably long. The Random Forest (RF) classifier was chosen mainly for its computational advantages (as well as good classification performance in general), such as the fact that it is inherently parallelizable. Further, RF is relatively less sensitive to the choice of initial values of hyperparameters.

Those skilled in the art will appreciate that it is particularly desirable to have an end-to-end automated system in place that detects and blocks Wangiri scammers, as well as other scams, with minimal human intervention. This blocking may be done automatically; however depending on the level of false positives that are acceptable to be blocked in error, additional logic may be used to further filter out possible legitimate phone numbers that were incorrectly identified as Wangiri numbers. As an example, this logic may, for each suspected Wangiri number, verify that:

- The number has not been detected as a Wangiri number above some threshold number of times, since typically a Wangiri scammer will not re-use phone numbers;
- The phone number has some threshold number of international calls since Wangiri calls typically originate from overseas numbers;
- The phone number is similar to other phone numbers recently detected as Wangiri numbers, since Wangiri scammers often use blocks of sequential numbers.

It will be appreciated that the above logic may be weighted so that the importance of one test compared to another may be varied as desired. Further, additional or alternative logic may be used to ensure any incorrectly identified Wangiri numbers are not blocked.
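One possible way to weight the above tests is sketched below. The weights, the threshold values, the definition of "similar" as sharing all but the last two digits, and all function and parameter names are hypothetical and chosen only to make the example concrete.

```python
from typing import Iterable, Tuple


def shares_block(number: str, recent: Iterable[str], suffix_digits: int = 2) -> bool:
    """True if `number` matches a recently detected number in every digit
    except the last `suffix_digits` (a crude proxy for a sequential block)."""
    prefix = number[:-suffix_digits]
    return any(
        other != number
        and len(other) == len(number)
        and other[:-suffix_digits] == prefix
        for other in recent
    )


def verification_score(number: str,
                       times_detected: int,
                       international_calls: int,
                       recent_wangiri: Iterable[str],
                       max_detections: int = 3,
                       min_international: int = 5,
                       weights: Tuple[float, float, float] = (0.2, 0.4, 0.4)) -> float:
    """Sum the weights of the verification tests that the number passes."""
    checks = (
        times_detected <= max_detections,          # not repeatedly re-used
        international_calls >= min_international,  # mostly overseas traffic
        shares_block(number, list(recent_wangiri)),
    )
    return sum(w for w, passed in zip(weights, checks) if passed)
```

A number might then be blocked automatically only when its score exceeds a chosen cut-off, with lower-scoring numbers routed to further filtering or review.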

A semi-automated approach may be used to block Wangiri phone numbers, or other scam numbers. In a non-limiting example, the semi-automated approach may automatically block verified Wangiri numbers; however, a human investigator may be used to verify that Wangiri numbers predicted by the detection model are in fact Wangiri numbers. For example, the predicted Wangiri numbers may be presented to an investigator, possibly along with additional useful information for verifying that the call is a Wangiri call, and the investigator may then either verify or refute the prediction. In addition, the verified/refuted predictions may also be used as training data to further train the prediction models.

The Wangiri detection model may divide model predictions for predicted Wangiri calls into two buckets, “Wangiri” and “Manual Review”, for display to human analysts. This division is based on a general rule that applies to most Wangiri scam calls, namely that the calling number is typically an overseas number. To quantify this, it is possible to compute:

- I=the ratio of international call records to all call records

The division into the two buckets is then performed by applying thresholds on the value of I. The “Wangiri” bucket includes numbers that are with high confidence Wangiri scammers. The “Manual Review” bucket includes numbers that, although identified by the ML model as Wangiri, are less certain, taking into account the value of I.
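A minimal sketch of this bucketing is given below. The `is_international` flag on each call record, the single threshold value of 0.8, and the function names are assumptions for illustration; in practice the threshold or thresholds would be discovered from labelled data.

```python
from typing import Dict, List


def international_ratio(call_records: List[Dict]) -> float:
    """I: ratio of international call records to all call records."""
    if not call_records:
        return 0.0
    international = sum(1 for r in call_records if r["is_international"])
    return international / len(call_records)


def assign_bucket(i_ratio: float, wangiri_threshold: float = 0.8) -> str:
    """Place a number predicted as Wangiri into one of the two buckets."""
    return "Wangiri" if i_ratio >= wangiri_threshold else "Manual Review"
```

Only numbers already predicted as Wangiri by the model would be passed to `assign_bucket`; the value of I affects the confidence bucket, not the model prediction itself.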

The items tagged for manual review are intended to be manually investigated and labelled by human analysts. With this process in place, it is possible to create an automatic feedback loop where:

- The thresholds used for computing I are re-discovered from these newly labelled data
- The model occasionally retrains by including these newly labelled data

Alternatively, it is possible to make I a feature for the model itself, rather than post-processing and using thresholds on it.

To avoid overfitting during the automatic training, it is possible to use normal business users' numbers and common users' numbers that have never been flagged (or numbers that are known to be good).

FIG. 3 is a flowchart depicting a method for identifying and blocking phone numbers associated with robocalls and/or scam calls. The method 300 begins with pre-processing raw call log records (e.g. call detail records generated by a network switch) to identify different records that are associated with the same call (302). Depending upon how often the raw call log data is processed, there may be associated call records that are not in the batch of raw call logs currently being processed and as such the associated call records may have already been pre-processed. The pre-processing may also include determining the features used by the different detection models, or the feature calculation or extraction may be performed after the pre-processing of the raw call log data. Once the raw call log data has been processed, the processed call records may be processed using an anomaly detection model (304) to identify phone numbers exhibiting anomalous behaviour. Phone numbers that are identified as being anomalous may be stored or otherwise identified, for example in an anomalous numbers list 306. The call records may also be processed by one or more models for detecting specific behaviours, such as a Wangiri detection model to identify numbers associated with Wangiri behaviours (308). The phone numbers identified by the model as being Wangiri numbers may be stored or identified, for example in a Wangiri numbers list 310. It is possible to run the anomaly detection model, the Wangiri detection model, and any other detection models either sequentially or concurrently. The anomalous numbers and Wangiri numbers may be verified as scam or undesirable numbers (312). The verification may be done using additional rules or logic or may be done by an investigator or analyst. After numbers have been verified as scam or associated with undesirable behaviour, they may be blocked (314). It will be appreciated that after blocking the numbers, they may be unblocked (316). For example, it may be desirable to unblock numbers, either in the case of incorrectly blocking a legitimate number or if the scammers have stopped using the number.

FIG. 4 depicts a method of pre-processing raw call log records according to a non-limiting aspect of the invention. The method 400 begins with receiving raw call log records (402). The raw call log records may be received and processed in real-time or in batches, for example every 5 minutes. For each raw call log record (404) the raw call log record may be formatted as a call record (406), for example by placing the raw call log record into a standard format. The call records associated with the same call, which may include previously processed call records, are identified (408). The call records identified as being associated with the same call may be aggregated together (410). The call records may be aggregated into a single record, or a label or other indicator may be added to all of the associated call records in order to easily identify which call records are associated with the same call. Once the call records are aggregated, the next call record may be processed (412). After processing the call records, they may be stored and/or passed on to another process (414), such as the prior described detection models.

FIG. 5 depicts a method for unblocking blocked phone numbers according to a non-limiting aspect of the invention. The method 500 may be used to unblock numbers that are no longer being used by scammers so that they may be used for legitimate purposes. It is possible to have other processes to unblock numbers, such as through a customer support interface that allows incorrectly blocked numbers to be easily unblocked. The method 500 may be performed periodically, for example every day, by retrieving a list of blocked numbers (502). For each blocked number (504), it is determined if the number has been blocked for a threshold number of days (506), for example 5 days may be used as the threshold. If it has not yet been blocked long enough (i.e. No at 506), the number remains blocked and the next number is processed (512). If the blocked number has been blocked for a threshold number of days (i.e. Yes at 506), it is determined if the blocked number has had no or zero call traffic for the past threshold number of days (508). If there has been call traffic in the past threshold number of days (i.e. No at 508), the number remains blocked and the next number is processed (512). If however there has been no call traffic for the past threshold number of days (i.e. Yes at 508), the number may be unblocked.
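The decision logic of the method 500 may be sketched as follows. The data structures (a mapping from each blocked number to the date it was blocked, and a mapping to the date of its most recent call traffic), the boundary conditions, and the function name are assumptions of this sketch.

```python
from datetime import date
from typing import Dict, List, Optional


def numbers_to_unblock(blocked_since: Dict[str, date],
                       last_call: Dict[str, Optional[date]],
                       today: date,
                       threshold_days: int = 5) -> List[str]:
    """Return blocked numbers that have been blocked for at least
    `threshold_days` and have had no call traffic in that period."""
    unblock: List[str] = []
    for number, since in blocked_since.items():
        # 506: has the number been blocked long enough?
        if (today - since).days < threshold_days:
            continue  # No at 506: the number remains blocked
        # 508: has there been any call traffic in the past threshold days?
        last = last_call.get(number)
        if last is not None and (today - last).days < threshold_days:
            continue  # No at 508: the number remains blocked
        unblock.append(number)
    return unblock
```

Running such a function once per day over the list of blocked numbers corresponds to the periodic execution described above.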

As described above, call detail records (CDRs) (i.e. raw call log data records described above) are pre-processed so that features can be derived from the processed CDRs (i.e. the processed call records described above) and input into a trained model to identify anomalous call behavior and/or specific types of undesirable call behavior. The CDRs are generated by network elements such as ISUP SS7 network switches as described above. However, these CDRs generally require pre-processing to be input into the models and for performing subsequent analysis. For example, as described with reference to FIGS. 3 and 4 above, raw call log records are pre-processed to identify different records that are associated with the same call (302), and call records associated with the same call, which may include previously processed call records, are identified (408) and may be aggregated together (410).

Specifically, for multi-hop calls, which are calls that involve multiple switches during connection, one CDR is generated per switch. Therefore, a multi-hop call results in multiple CDRs that are associated with a single call event. Table I shows an example of a single multi-hop call event for which multiple CDRs were generated. The fields in these CDRs, which are defined by the SS7 protocol/architecture, are described in Table II.

TABLE I

Example CDRs generated from a multi-hop call event.

| anumber | bnumber | cnumber | gap | gaptype | dialled_bnumber | startdate | endtime | ringtime | conversationtime |
|---|---|---|---|---|---|---|---|---|---|
| 01121** | 41679** | NULL | 41683** | 192 | 41683** | 1590082256 | 1590082273 | 1035 | 15359 |
| 01121** | 41683** | NULL | NULL | | 41683** | 1590082256 | 1590082273 | 1000 | 14314 |
| 01121** | 41683** | NULL | NULL | | 41683** | 1590082256 | 1590082273 | 985 | 14382 |
| 01121** | 41683** | NULL | NULL | | 41683** | 1590082256 | 1590082273 | 980 | 14670 |
| 01121** | 41679** | NULL | 41683** | 192 | 41683** | 1590082256 | 1590082273 | 1020 | 15052 |
| 01121** | 41683** | NULL | NULL | | 41683** | 1590082256 | 1590082273 | 990 | 14718 |
| 01121** | 41679** | NULL | 41683** | 192 | 41683** | 1590082256 | 1590082273 | 1025 | 15068 |
| 01121** | 41679** | NULL | 41683** | 192 | 41683** | 1590082256 | 1590082273 | 1025 | 15295 |

TABLE II

Fields in the CDRs of Table I.

| Name | Description |
|---|---|
| anumber | Calling party number |
| bnumber | Called party number, also called callee number |
| cnumber | Depending on its nature, could be connected number, dialled number, location number, redirected number, or additional calling number; herein referred to as “additional number field”, i.e. which is a number field in addition to the anumber and bnumber field |
| gap | Generic address parameter |
| gapType | Type of the generic address parameter |
| startdate | The start time of the call |
| endtime | The end time of the call |
| ringtime | The ringing time of the call, i.e. the amount of time that a call spends ringing on the callee's device |
| conversationtime | The conversation time of the call, i.e. the amount of time for which a call is connected between the caller and the callee |

As seen in Table I, multiple CDRs may be generated from a single multi-hop call event (one per switch). The multiple records generated for multi-hop call events give rise to the following issues:

- (1) In multi-hop call event CDRs, the nominal called party (i.e. the callee) number, or the bnumber, may not be the dialed callee number (i.e. the number dialed by the caller). Instead, for example, the nominal callee number may be the number of an intermediate network component. The dialed callee number (i.e., the dialed number of the called party) may be stored in different fields in the CDR depending on how the call is routed in the network. Note that in Table I, the field “dialed_bnumber” is not present in the raw CDR but is generated by the methods described below and denotes the dialed callee number for this call event.
- (2) As shown in Table I, the ringing time (represented by the ringtime field) and conversation time (represented by the conversationtime field) may differ across the records for the same call event.

To detect anomalous call behavior and to identify specific types of undesirable calls, it is important to identify and label all records associated with the same call event so that call events are analyzed and not just individual call records. For example, as described above, the general anomaly detection model may consider features of CDRs including the number of unique outgoing calls, the number of unique callees, how long a conversation lasts for a given call record, etc. Likewise, for Wangiri detection functionality, features considered may include the number of outgoing calls, the number of unique destination numbers called, an average call duration, the standard deviation of call duration, etc. Further, since Wangiri calls are often “one ring” calls, ringing time may be considered as well. It will thus be appreciated that inputting features derived from respective CDRs to the trained model(s), without consideration of whether there are multiple CDRs corresponding to the same call event, will negatively affect the accuracy of the model predictions and classification of caller behavior. For example, the data may suggest that there are multiple calls made to the same callee number when in fact there was only a single call event. Additionally, where two or more CDRs for the same call event differ (e.g. in conversation time and/or ringing time), it is important that the correct value is input to the trained model(s).

FIG. 7 shows a method 700 of processing call detail records (CDRs). The method 700 may for example be performed by log pre-processing functionality 134 as described above with reference to FIG. 1.

The method 700 comprises receiving a plurality of CDRs (702). Each of the CDRs comprises a calling party number, a callee number, a gap value, a gap type value, a start time, an end time, a ringing time, and a conversation time. The CDRs may be generated from a plurality of network switches, in particular ISUP SS7 network switches. The plurality of CDRs comprises at least some CDRs that are generated from a multi-hop call event.

A dialed callee number is determined for each of the CDRs (704). As described above, the callee number that appears in the CDR may not be the number that is actually dialed by the caller. Instead, the callee number appearing in the CDR may be a number of an intermediate component, a local routing number, etc. Accordingly, to identify CDRs associated with a same call event, the dialed callee number is determined for each of the CDRs. A method of determining a dialed callee number in the CDRs is described in more detail with reference to FIG. 8.

At least two CDRs associated with a call event are identified based on one or more similarity thresholds being met between the at least two CDRs (706). Having identified the dialed callee number, two or more CDRs generated for calls from the caller number to the dialed callee number can be identified as being associated with the same call event when one or more similarity thresholds are met between the two or more CDRs. Accordingly, all CDRs associated with a same call event are identified, and may be labelled or otherwise associated together. A method of identifying CDRs associated with a same call event is described in more detail with reference to FIG. 9.

A maximum conversation time is determined among all CDRs associated with the call event (708). As described above, some values in the CDRs may not be the same across all CDRs for the same call event. To resolve ambiguities in conversation time values, the correct value of the conversation time is considered to be the maximum conversation time among all CDRs associated with the call event. Namely, for a call event with ID i and n redundant CDRs, the conversation duration Di is considered as Di=Max(Di1, Di2, . . . , Din). It will also be appreciated that depending on the parameter in the CDR, different statistical methods (maximum, average, etc.) can be considered for providing the most appropriate value.

A processed CDR for the call event is generated comprising at least the calling party number, the dialed callee number, and the maximum conversation time (710). In some embodiments, a maximum ringing time among all CDRs associated with the call event may be determined, and the processed CDR may also comprise the maximum ringing time. Accordingly, features from the processed CDR, together with other processed CDRs, can be input to one or more trained models to identify anomalous call behavior and/or specific types of undesirable call behavior. Moreover, features from the processed CDRs can be analyzed for other purposes as well, such as customer churn prediction, call volume reporting, etc.
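Steps 708 and 710 may be sketched as follows, assuming that the CDRs have already been labelled with a call event identifier. The `event_id` and `dialed_bnumber` keys and the function name are assumptions of this sketch; the other field names follow Table II.

```python
from collections import defaultdict
from typing import Dict, List


def build_processed_cdrs(cdrs: List[Dict]) -> List[Dict]:
    """Collapse all CDRs of each call event into one processed CDR that
    carries the maximum conversation time and maximum ringing time."""
    groups: Dict[object, List[Dict]] = defaultdict(list)
    for cdr in cdrs:
        groups[cdr["event_id"]].append(cdr)

    processed = []
    for event_id, group in groups.items():
        first = group[0]
        processed.append({
            "event_id": event_id,
            "anumber": first["anumber"],
            "dialed_bnumber": first["dialed_bnumber"],
            # D_i = max(D_i1, ..., D_in) over the redundant CDRs
            "conversationtime": max(c["conversationtime"] for c in group),
            "ringtime": max(c["ringtime"] for c in group),
        })
    return processed
```

Applied to the eight CDRs of Table I, which all belong to one call event, this would yield a single processed CDR with a conversation time of 15359 and a ringing time of 1035.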

FIG. 8 depicts a method 800 of determining a dialed callee number in the call detail records. As described with reference to the method 700, to identify CDRs associated with a call event it is important to determine the dialed callee number for each CDR. As mentioned above, the dialed callee number may be stored in different fields in different CDRs for the same call event, depending on how the call is routed. The method 800 may be implemented as an algorithm to determine the dialed callee number for each CDR. While the method makes specific reference to fields present in the ISUP SS7 protocol, it will be appreciated how the method may be applied to other protocols which may for example use different field names and values.

Raw CDRs are received (802). A determination is made as to whether the additional number field (i.e. the cnumber field) is not blank (i.e. has a value) and whether that value differs from the anumber value (i.e. the calling party number) (804). If there is a value in the additional number field and that value is not the same as the calling party number (YES at 804), the dialed callee number (i.e. dialed_bnumber) is set as the value in the additional number field (806).

If the additional number field (i.e. the cnumber field) is blank or null, or if the additional number field has a value corresponding to the calling party number (NO at 804), a determination is made as to whether a gaptype value is one of {1, 192, 253} (808). In the ISUP SS7 protocol, a gaptype value of 1 denotes destination number, a gaptype value of 192 denotes ported number, and a gaptype value of 253 denotes transfer number. Accordingly, the determination at 808 determines whether the gaptype value denotes one of a destination number, a ported number, or a transfer number.

When it is determined that the gaptype value denotes one of a destination number, a ported number, or a transfer number (YES at 808), the dialed callee number (i.e. dialed_bnumber) is set as the gap value (810).

If the gaptype value does not denote one of a destination number, a ported number, or a transfer number (NO at 808), the dialed callee number (i.e. dialed_bnumber) is set as the callee number (i.e. the bnumber) (812).

Note that while the method 800 is shown as evaluating the additional number field first at 804 and then the gaptype value at 808, it is also possible that these determinations may be performed in a different order.
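The decision sequence of the method 800 may be sketched as below. Representing each CDR as a dictionary, representing a blank or NULL field as `None` (or an empty string), and the function name are assumptions of this sketch.

```python
from typing import Dict, Optional

# gaptype values denoting a destination, ported, or transfer number
DIALED_GAP_TYPES = {1, 192, 253}


def dialed_callee_number(cdr: Dict) -> Optional[str]:
    """Determine dialed_bnumber for one raw CDR."""
    cnumber = cdr.get("cnumber")
    # 804: additional number field present and different from the anumber
    if cnumber and cnumber != cdr["anumber"]:
        return cnumber                      # 806
    # 808: gaptype denotes a destination, ported, or transfer number
    if cdr.get("gaptype") in DIALED_GAP_TYPES:
        return cdr["gap"]                   # 810
    return cdr["bnumber"]                   # 812
```

For the first row of Table I (cnumber NULL, gaptype 192, gap 41683**), this returns the gap value 41683**; for the second row (cnumber, gap, and gaptype all blank or NULL), it returns the bnumber 41683**.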

FIG. 9 depicts a method 900 of identifying call detail records associated with a same call event. In accordance with the method 800, the dialed callee number can be determined for each CDR. Accordingly, CDRs generated for calls from the caller number to the dialed callee number can be identified, and as described in the method 700, two or more CDRs can be identified as being associated with the same call event when one or more similarity thresholds are met between the CDRs.

In the method 900, CDRs are received (902), which may correspond to the raw CDRs with the dialed callee number determined from method 800 added to or associated with each of the CDRs.

To identify CDRs as being associated with the same call event, it is possible that each CDR (with dialed callee number determined) can be evaluated against all other CDRs to identify CDRs that have the same caller number and dialed callee number and that meet one or more similarity thresholds. To improve computational efficiency and reduce processing time, the method of identifying CDRs as being associated with the same call event may instead comprise identifying consecutive CDRs based on caller number, dialed callee number, and start time, and comparing each pair of consecutive CDRs against the one or more similarity thresholds. To still further improve computational efficiency, a sorted list of the CDRs may be generated for use in identifying the consecutive CDRs.

Accordingly, the method 900 may comprise generating a sorted list of CDRs (904) by sorting the plurality of CDRs based on the calling party number, the dialed callee number, and the start time. The method 900 may comprise evaluating every two consecutive CDRs (906).

Two CDRs are evaluated against one or more similarity thresholds. A determination is made as to whether the one or more similarity thresholds are met (908). The similarity thresholds may comprise one or more of: a difference of start time is less than w seconds; a difference of end time is less than x seconds; a difference of ringing time is less than y seconds; and a difference of conversation time is less than z seconds, wherein each of w, x, y, and z are predetermined threshold values. As an example, the threshold value w may be set as 5 seconds; the threshold value x may be set as 2 seconds; the threshold value y may be set as 1 second; and the threshold value z may be set as less than 1 second. It will be appreciated that different threshold values may be set without departing from the scope of this disclosure. In some embodiments, all of the similarity thresholds may be required to be met. In other embodiments, only some of the similarity thresholds may be evaluated or need to be met.

When two records are determined to satisfy one or more of the similarity thresholds (YES at 908), each record may be labelled or associated with one another to indicate that they belong to the same call event (910). The method 900 proceeds to evaluate the next two consecutive CDRs (912). If the one or more similarity thresholds are not met (NO at 908), the method likewise proceeds to evaluate the next two consecutive CDRs (912).
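A minimal sketch of the method 900 follows, for an embodiment in which all four similarity thresholds must be met. It assumes that the time fields of each CDR are expressed in the same unit as the thresholds w, x, y, and z, that `dialed_bnumber` has already been added to each CDR, and that call events are labelled with a sequential `event_id`; these names and the default threshold values are illustrative only.

```python
from typing import Dict, List


def same_event(a: Dict, b: Dict, w: float, x: float, y: float, z: float) -> bool:
    """908: do two consecutive CDRs meet all of the similarity thresholds?"""
    return (
        a["anumber"] == b["anumber"]
        and a["dialed_bnumber"] == b["dialed_bnumber"]
        and abs(a["startdate"] - b["startdate"]) < w
        and abs(a["endtime"] - b["endtime"]) < x
        and abs(a["ringtime"] - b["ringtime"]) < y
        and abs(a["conversationtime"] - b["conversationtime"]) < z
    )


def label_call_events(cdrs: List[Dict], w: float = 5, x: float = 2,
                      y: float = 1, z: float = 1) -> List[Dict]:
    """904-912: sort the CDRs, compare consecutive pairs, label call events."""
    ordered = sorted(
        cdrs, key=lambda c: (c["anumber"], c["dialed_bnumber"], c["startdate"])
    )
    labelled: List[Dict] = []
    event_id = 0
    previous = None
    for cdr in ordered:
        if previous is None or not same_event(previous, cdr, w, x, y, z):
            event_id += 1  # start of a new call event
        labelled.append(dict(cdr, event_id=event_id))  # copy; input is untouched
        previous = cdr
    return labelled
```

Because the list is sorted once and each CDR is compared only with its predecessor, the comparison step is linear in the number of CDRs, rather than quadratic as in the all-pairs evaluation.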

Although certain components and steps have been described above, it is contemplated that individually described components, as well as steps, may be combined together into fewer components or steps or the steps may be performed sequentially, non-sequentially or concurrently. Further, although described above as occurring in a particular order, one of ordinary skill in the art having regard to the current teachings will appreciate that the particular order of certain steps relative to other steps may be changed. Similarly, individual components or steps may be provided by a plurality of components or steps. One of ordinary skill in the art having regard to the current teachings will appreciate that the components and processes described herein may be provided by various combinations of software, firmware and/or hardware, other than the specific implementations described herein as illustrative examples.

The techniques of various embodiments may be implemented using software, hardware and/or a combination of software and hardware. Various embodiments are directed to apparatus, e.g. a node which may be used in a communications system or data storage system. Various embodiments are also directed to non-transitory machine, e.g., computer, readable medium, e.g., ROM, RAM, CDs, hard discs, etc., which include machine readable instructions for controlling a machine, e.g., processor to implement one, more or all of the steps of the described method or methods.

Some embodiments are directed to a computer program product comprising a computer-readable medium comprising code for causing a computer, or multiple computers, to implement various functions, steps, acts and/or operations, e.g. one or more or all of the steps described above. Depending on the embodiment, the computer program product can, and sometimes does, include different code for each step to be performed. Thus, the computer program product may, and sometimes does, include code for each individual step of a method, e.g., a method of operating a communications device, e.g., a wireless terminal or node. The code may be in the form of machine, e.g., computer, executable instructions stored on a computer-readable medium such as a RAM (Random Access Memory), ROM (Read Only Memory) or other type of storage device. In addition to being directed to a computer program product, some embodiments are directed to a processor configured to implement one or more of the various functions, steps, acts and/or operations of one or more methods described above. Accordingly, some embodiments are directed to a processor, e.g., CPU, configured to implement some or all of the steps of the method(s) described herein. The processor may be for use in, e.g., a communications device or other device described in the present application.

Numerous additional variations on the methods and apparatus of the various embodiments described above will be apparent to those skilled in the art in view of the above description. Such variations are to be considered within the scope.

	Number	Date	Country
Parent	17560555	Dec 2021	US
Child	18216044		US
