The invention relates generally to analyzing event data, and more particularly to a system and method of providing one or more functions for providing a statistical summarization of event data.
There exist numerous applications in which analysis of event data may be required. For example, data events may be collected in a financial setting to identify potentially fraudulent activity, in a network setting to track network usage, in a business setting to identify business opportunities or problems, etc. Established practices in statistical analysis of data exist for processing and analyzing data events. Much of this has been based around two concepts for “typical” data, the mean and the median. Slightly more extensive analysis has also considered the spread of data around this typical point; that is at least partly captured by the standard deviation (used in conjunction with mean) and percentile values (used in conjunction with median).
There are problems with both the mean and median based methods, both in their mathematical behavior and in their match to ‘common sense’ analysis. For example, in the mean/standard deviation approach, there is often too much dependency on outliers, although there are (somewhat arbitrary) techniques for ignoring them. Furthermore, computations are somewhat difficult when dealing with non-center data points. Additionally, assumptions must be made about a Gaussian distribution that may not be appropriate for all conditions.
In the median/percentiles approach, there may be too much dependency on data that is just to one side of the median value. This means that median calculations are often fairly unstable depending on the exact samples taken. As with the mean/standard deviation approach, computational costs may be high.
In traditional statistics, the above approaches are utilized in a fairly static manner against a fairly static body of data. Where it is necessary to work on data ‘on the fly’, a typical solution is a moving window over recent past history. More recent work has also permitted computation of a running estimate of all these basic statistical values.
Accordingly, a need exists for analysis techniques that can be applied not only to static and running window data sets, but also to running estimates.
The present invention addresses the above-mentioned problems, as well as others, by providing a system and method of applying a function to a difference between a previous statistical summary and a current data value. In a first aspect, the invention provides a system for processing a set E of data event values Ei, comprising: a system for selecting a function F(D); a system for estimating a value of X such that the sum of F(X−Ei) for all data event values Ei in the set E is zero, wherein the value X provides a general statistical property of the set of data event values E; and an analysis system for analyzing the general statistical property.
In a second aspect, the invention provides a computer program product stored on a computer readable medium, which when executed, processes a set E of data event values Ei, the computer program product comprising: program code configured for estimating a value of X for a function F such that the sum of F(X−Ei) for all data event values Ei in the set E is zero, wherein the value X provides a general statistical property of the set of data event values E; and program code configured for analyzing the general statistical property.
In a third aspect, the invention provides a method of processing data events, comprising: determining a difference between a previous statistical summary and a new data event value; inputting the difference into a selected function and generating an output; adding the previous statistical summary to the output of the selected function to obtain a new statistical summary; and analyzing the new statistical summary.
These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
Disclosed are techniques for processing data events. In the illustrative embodiments discussed with regard to
In
Accordingly, in the illustrative embodiment shown in
Analysis system 14 provides mechanisms (e.g., algorithms, programs, heuristics, modeling, etc.) for examining each running estimate Xi and providing some analysis, e.g., identifying potentially fraudulent activities, identifying trends and patterns, identifying risks, problems, opportunities, etc. For example, a high running estimate 34 may indicate an unusually large withdrawal from an ATM, an unusual amount of bandwidth usage in a network, etc. In a simple application, analysis system 14 might compare the running estimate to a threshold value. If the running estimate is above (or below) the threshold value, analysis system 14 may issue a warning as the analysis output 36.
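The simple threshold comparison described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `check_estimate`, the two-sided `low`/`high` thresholds, and the warning strings are assumptions, not part of the disclosure.

```python
from typing import Optional

def check_estimate(estimate: float, low: float, high: float) -> Optional[str]:
    """Return a warning string if the running estimate leaves [low, high]."""
    if estimate > high:
        return "warning: running estimate above upper threshold"
    if estimate < low:
        return "warning: running estimate below lower threshold"
    return None  # estimate is within the expected range; no analysis output
```

A caller would invoke this once per updated running estimate and forward any non-empty result as the analysis output.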
Because the running estimate 34 can be captured in a single value, few computational resources are required, thus allowing real or near real time processing. Accordingly, data event processing system 10 allows for an immediate action or response to be made to unusual or potentially problematic data event values, without the need to process large amounts of data.
In this illustrative embodiment, running estimate update system 12 includes: a function selection system 16 for allowing a user 38 to select a function F from the function library 22; a function implementation system 18 for implementing the selected function F to a selected event data stream 40; and a function management system 20 for allowing user 38 to create, modify, and delete functions from function library 22.
Illustrative types of functions stored in function library 22 may include, e.g., median and mean generation functions 24, hybrid functions 26, user defined functions 28, outlier handling functions 30, biased functions 32, and tables 34. The functions described herein are not intended to be limiting to the scope of the invention, and other types of functions not described herein fall within the scope of the invention.
As noted above, running estimate update system 12 first calculates a difference D between a previously calculated running estimate Xn-1 and a current data event value En. The difference D is then plugged into a selected function F, the result of which is then used to modify (e.g., added to or subtracted from) the previous running estimate Xn-1 to generate a new running estimate Xn. Thus, in such an embodiment, a new running estimate Xn is calculated according to the general form:
X_n = X_{n-1} + (1 − k) * F(E_n − X_{n-1}),
where k is a damping factor. In implementation, the factor (1−k) may be combined into a scaled function F. Keeping them uncombined separates the damping effect of the running computation from the behavioral effect of a particular function F.
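The general update form can be sketched in Python as follows. The names `update_estimate`, `f`, and `k` are illustrative assumptions; the difference is taken as D = En − Xn-1, matching the formula above.

```python
from typing import Callable

def update_estimate(prev: float, event: float,
                    f: Callable[[float], float], k: float) -> float:
    """Return X_n = X_{n-1} + (1 - k) * F(E_n - X_{n-1})."""
    d = event - prev                 # difference D between event and previous estimate
    return prev + (1.0 - k) * f(d)   # damping (1 - k) kept separate from F
```

Keeping `k` as a separate argument mirrors the uncombined form, so the same function `f` can be reused with different amounts of damping.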
Illustrative functions are described below as graphs shown in
X_n = X_{n-1} + (1 − k) * (E_n − X_{n-1}) = k * X_{n-1} + (1 − k) * E_n,
which is the conventional function for exponential smoothing.
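The equivalence between the mean generation update (F(D)=D) and exponential smoothing can be checked numerically with a short Python sketch. The function names and the sample values used to exercise them are hypothetical.

```python
import math

def mean_update(prev: float, event: float, k: float) -> float:
    """Running-estimate form with F(D) = D."""
    return prev + (1.0 - k) * (event - prev)

def exp_smooth(prev: float, event: float, k: float) -> float:
    """Conventional exponential smoothing form."""
    return k * prev + (1.0 - k) * event

def agree(events, start: float, k: float) -> bool:
    """Check that both forms track each other over a stream of events."""
    a = b = start
    for e in events:
        a = mean_update(a, e, k)
        b = exp_smooth(b, e, k)
        if not math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True
```

The two forms are algebraically identical, so any disagreement would come only from floating-point rounding, which is why the comparison uses a tolerance.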
In the case of the median generation function 52, the result of function F is either +1 or −1, depending on whether the difference D is positive or negative, and 0 for D=0. Thus, in the above example, a difference D of −2 would result in a −1 being added to the previous running estimate of 29, resulting in a new running estimate value of 28.
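A minimal Python sketch of the median generation update follows. The names are illustrative, and the default `k=0.0` (no damping) is an assumption chosen so that the result of F is added to the previous estimate unscaled, as in the worked figures above. The event value 27 used in the usage note is inferred from D = En − Xn-1 = −2.

```python
def sign(d: float) -> int:
    """Return +1 for positive d, -1 for negative d, and 0 for d == 0."""
    return (d > 0) - (d < 0)

def median_update(prev: float, event: float, k: float = 0.0) -> float:
    """Move the estimate one (damped) unit step toward the event value."""
    return prev + (1.0 - k) * sign(event - prev)
```

For example, `median_update(29, 27)` takes D = −2, so F returns −1 and the new running estimate is 28.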
General principles of the mean and median generation functions include:
A second class of functions comprises hybrids of the mean and median generation functions. For example,
F=sign(D)*abs(D)^Q.
The superegg gives a range of functions between mean (Q=1) and median (Q=0). The graph in
F=D/(Q−D), where D<=0
F=D/(Q+D), where D>0
Again, varying Q can force this function to look both like a median, and locally (for “small” values of D) like a mean. In the example shown in
F=D/sqrt(D^2+Q).
In this example, a first curve 72 shows the function with Q=4, and the second curve 74 shows the function with Q=0.5.
A further class of functions involves biased functions in which the result is biased either in the positive or negative direction. For instance,
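The three hybrid functions given above can be sketched in Python as follows. The function names are illustrative assumptions. The sketch also assumes Q>0 for the second and third functions, so that neither divides by zero at D=0, and it treats D=0 explicitly in the first function so that the result is 0 even when Q=0.

```python
import math

def superegg(d: float, q: float) -> float:
    """F = sign(D) * abs(D)^Q; Q=1 gives the mean function, Q=0 the median function."""
    if d == 0:
        return 0.0
    return math.copysign(abs(d) ** q, d)

def rational_hybrid(d: float, q: float) -> float:
    """F = D/(Q-D) for D<=0 and D/(Q+D) for D>0, i.e. D/(Q+abs(D)). Assumes q > 0."""
    return d / (q + abs(d))

def sqrt_hybrid(d: float, q: float) -> float:
    """F = D/sqrt(D^2+Q). Assumes q > 0."""
    return d / math.sqrt(d * d + q)
```

The second and third functions both saturate toward ±1 for large differences, like the median function, and are approximately linear in D near zero, like a scaled mean function.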
F=−Q, where D<0
F=1−Q, where D>0
F=0, where D=0.
In
F=Q*D, where D<0
F=(1−Q)*D, where D>=0.
Again, a first region 82 is provided for cases where the difference D is less than 0, and a second region 80 is provided for cases where the difference D is greater than or equal to 0. Note that in general it may be desirable to have biased curves that do not have a discontinuity in the first derivative at 0.
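The two biased functions above can be sketched in Python as follows. The names `biased_step` and `biased_linear` are illustrative assumptions, and Q is assumed to lie between 0 and 1.

```python
def biased_step(d: float, q: float) -> float:
    """F = -Q for D<0, 1-Q for D>0, and 0 for D=0 (a biased median-like function)."""
    if d < 0:
        return -q
    if d > 0:
        return 1.0 - q
    return 0.0

def biased_linear(d: float, q: float) -> float:
    """F = Q*D for D<0 and (1-Q)*D for D>=0 (a biased mean-like function)."""
    return q * d if d < 0 else (1.0 - q) * d
```

With Q=0.5 the two branches are symmetric, and the functions reduce to half-scaled versions of the median and mean generation functions respectively. For the step function, the outputs balance to zero when the proportion of events above the estimate is Q, so other values of Q shift the resting point away from the median.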
The disclosed embodiments thus provide an enhanced approach for using mean and median. However, as noted above, the techniques described herein are not limited to “running estimate” applications, but can also apply to static data sets. Accordingly, the invention can be explained in a more comprehensive approach as follows. Consider all the data points Ei as objects in one-dimensional space, with the mean or median to be computed as another center object X. The defined function F provides a force field F between each data object Ei acting on this center object X. The combination of these force fields will pull the center object X to some stable center position. F is thus defined as a function F(D) of the (directional) distance D=Ei−X.
The force field (i.e., function) F can therefore be tailored to give the required “center” effect by estimating a value of X such that the sum of F(Ei−X) for all elements Ei in the set E is zero. The resulting value X will thus provide a general statistical property of the set of values.
There are two generic implementations of this. For static data sets, standard iterative optimization techniques can be used. Of course, these may be very much optimized for particular functions. An example of an iterative approach for estimating X is provided below for the data set E1 . . . E6. An initial guess of 11.3 for X gives an initial sum of F(Di) of 8.00171, where F(Di)=sign(Di)*abs(Di)^0.5.
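One possible iterative technique for a static data set is bisection, sketched below in Python. This is a substituted illustration and not necessarily the iteration used in the example above. It assumes that F is non-decreasing with F(0)=0, so that the sum of F(Ei−X) does not increase as X increases. The name `solve_center`, the tolerance, and the iteration limit are hypothetical, as is any data passed to it.

```python
import math
from typing import Callable, Sequence

def solve_center(events: Sequence[float], f: Callable[[float], float],
                 tol: float = 1e-9, max_iter: int = 200) -> float:
    """Estimate X such that sum(F(E_i - X)) is approximately zero, by bisection."""
    lo, hi = min(events), max(events)   # the balance point lies within the data range
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        total = sum(f(e - mid) for e in events)
        if total > 0:
            lo = mid                    # net pull is upward: X is too small
        else:
            hi = mid                    # net pull is downward or zero: X is large enough
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0

def root_half(d: float) -> float:
    """F = sign(D) * abs(D)^0.5, a hybrid between the mean and median functions."""
    return math.copysign(math.sqrt(abs(d)), d) if d != 0 else 0.0
```

With F(D)=D the routine recovers the arithmetic mean, which gives a simple check; passing `root_half` instead yields a center that is less sensitive to outliers than the mean.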
For dynamic datasets, techniques using a running estimate with the appropriate force field function can be used, as described in detail above with reference to the
Accordingly, in either case, a force field that is a compromise between a mean and median can be obtained. The exact function may be tailored for different requirements. The precise form of the function is not likely to have a great effect on overall results in a business application, with the differences being swamped by the effect of imprecise modeling and noisy data. It will generally be desirable to choose a function that has the correct general shape for the features required, and which can be efficiently implemented.
In general, data event processing system 10 may be implemented using any type of computing device, and may be implemented as part of a client and/or a server. Such a computing system generally includes a processor, input/output (I/O), memory, and a bus. The processor may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server. Memory may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Moreover, memory may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.
I/O may comprise any system for exchanging information to/from an external resource. External devices/resources may comprise any known type of external device, including a monitor/display, speakers, storage, another computer system, a hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc. Bus provides a communication link between each of the components in the computing system and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc. Additional components, such as cache memory, communication systems, system software, etc., may be incorporated into the computing system.
Access to data event processing system 10 may be provided over a network such as the Internet, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), etc. Communication could occur via a direct hardwired connection (e.g., serial port), or via an addressable connection that may utilize any combination of wireline and/or wireless transmission methods. Moreover, conventional network connectivity, such as Token Ring, Ethernet, WiFi or other conventional communications standards could be used. Still yet, connectivity could be provided by conventional TCP/IP sockets-based protocol. In this instance, an Internet service provider could be used to establish interconnectivity. Further, as indicated above, communication could occur in a client-server or server-server environment.
It should be appreciated that the teachings of the present invention could be offered as a business method on a subscription or fee basis. For example, a computer system comprising a data event processing system 10 could be created, maintained and/or deployed by a service provider that offers the functions described herein for customers. That is, a service provider could offer to provide event processing as described above.
It is understood that the systems, functions, mechanisms, methods, engines and modules described herein can be implemented in hardware, software, or a combination of hardware and software. They may be implemented by any type of computer system or other apparatus adapted for carrying out the methods described herein. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention could be utilized. In a further embodiment, part or all of the invention could be implemented in a distributed manner, e.g., over a network such as the Internet.
The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods and functions described herein, and which—when loaded in a computer system—is able to carry out these methods and functions. Terms such as computer program, software program, program, program product, software, etc., in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.