The application relates generally to the field of transaction data, and more specifically to methods and systems to complete transaction data, and to a machine-readable medium comprising instructions to perform such methods.
Automatic Call Distribution (ACD) centers often use forecasting models to forecast transactions (e.g., calls or other communication requests) during certain periods of time. The forecasting models may be useful in determining adequate and efficient staff scheduling, for instance. Parameters for a forecasting model are often updated with new data to improve forecasting accuracy. Often, such updating is tedious and time consuming for an administrator of the forecasting model.
According to an aspect of the invention, there is provided a method and system to receive transaction data; determine a gap in the transaction data; and use an algorithm to generate data to fill in the gap. The algorithm is selected from a group including a first algorithm and a second algorithm. The first algorithm is to determine a dominant pattern in the transaction data; identify a region within the dominant pattern that corresponds to the gap in the transaction data; and adopt data associated with the corresponding region into the gap to minimize impact on the dominant pattern. The second algorithm includes a Moore-Penrose pseudo-inverse algorithm to choose a set of substitute data from among a group of substitute data sets, and adopts the set of substitute data into the gap.
An example embodiment of the present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
The system 100 may include a transaction gap module 110, an external data source 120, a forecasting module 125, and a database 130. The transaction gap module 110 may include an interface 135 to receive transaction data from the database 130 regarding, for example, a particular forecast group and/or a particular period of time. The interface 135 may receive transaction data from the external data source 120 through a network 160, such as the Internet.
The database 130 includes data regarding frequency of transactions or calls during periods of time. The database 130 (and/or the external data source 120) may include invalid, missing or incomplete data 165.
The transaction gap module 110 determines if there is a gap (e.g., incomplete data 165) in the transaction data. The gap may be invalid data, such as a data error and/or missing/omitted data (null). The gap may be during a period of time, such as a day or a set of days in a monthly data set. A month (series of weeks) of (possibly incomplete) daily data and a list of dates of invalid data may be included in the transaction data. For each valid date in the month, the data may be a non-negative number.
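The gap determination described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `find_gaps`, the dictionary layout (dates mapped to counts, with `None` for a missing value), and the sample dates are all assumptions made for the example.

```python
from datetime import date, timedelta

def find_gaps(daily_counts, start, end, invalid_dates=()):
    """Return the dates in [start, end] whose data is missing or invalid.

    daily_counts: dict mapping datetime.date -> count (None if missing).
    invalid_dates: dates already flagged as invalid by some other check.
    """
    flagged = set(invalid_dates)
    gaps = []
    day = start
    while day <= end:
        value = daily_counts.get(day)  # None when the date is absent
        is_valid = (
            day not in flagged
            and isinstance(value, (int, float))
            and not isinstance(value, bool)
            and value >= 0  # valid data is a non-negative number
        )
        if not is_valid:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

# Hypothetical sample: one null, one negative, one flagged, one absent date.
counts = {
    date(2024, 3, 1): 120,
    date(2024, 3, 2): None,
    date(2024, 3, 3): -5,
    date(2024, 3, 4): 98,
}
gaps = find_gaps(counts, date(2024, 3, 1), date(2024, 3, 5),
                 invalid_dates=[date(2024, 3, 4)])
```

Here every date other than the first is reported as a gap, each for a different reason (null, negative, externally flagged, absent).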
The transaction gap module 110 may also include a selection module 140 used in determining which algorithm, a first algorithm 145 and/or a second algorithm 150, to use to fill in the gap or gaps in transaction data. An algorithm may replace the invalid, incomplete or missing data 165 in the forecast group with plausible and/or likely values to render a complete output. Several algorithm embodiments are described herein. For example, the first algorithm 145 may include a pattern recognition code 155. A month of daily data, where the data for each day in the month is a non-negative number, may be the output of the algorithm of the transaction gap module 110.
The transaction gap module 110 then sends the output, complete data 170 including the filled-in data, to the forecasting module 125 to forecast transactions.
At block 210, transaction data is received, as discussed herein.
At block 220, a gap in the transaction data is determined, as discussed herein.
At block 230, the algorithm used to fill in the gap is determined. The determined algorithm may depend on the size of the dataset. Additionally, and/or alternatively, the determined algorithm may depend on the desired accuracy of the filled-in data. Additionally, and/or alternatively, the determined algorithm may depend on the desired speed to fill in the missing or invalid data.
The algorithm described in
However, the algorithm described in
At block 310, transaction data is received, as discussed herein.
At block 320, a gap in the transaction data is determined, as discussed herein.
At block 330, a dominant pattern in the transaction data is determined, using the algorithm, as discussed herein. The dominant pattern may be determined by the pattern recognition code 155.
At block 340, a region within the dominant pattern that corresponds to the gap in the transaction data may be identified, using the algorithm, as discussed herein.
At block 350, data associated with the corresponding region may be adopted into the gap to minimize impact on the dominant pattern, using the algorithm, as discussed herein.
Using the algorithm, invalid and/or missing data may be replaced with values that are consistent with the arrangement of the valid data. The algorithm and/or the transaction gap module 110 may also take into consideration any restrictions of the forecasting module 125. A forecasting module restriction may be that the number of calls during each week has the same pattern throughout the month, for example.
The algorithm of the embodiment of
Two examples of how the algorithm of
In the examples below, the first algorithm determines a dominant pattern in the data and adopts it to fill in the gap (e.g., null data sets). Here $(i,j)$ refers to the $j$th day of the $i$th week, for $n$ weeks with $m$ days in each week; $x_{ij}$ holds the valid numerical data for $(i,j)$, and $x_{ij}=\text{null}$ if the data is not valid on $(i,j)$.
$v_{ij}=x_{ij}$, unless $x_{ij}=0$, in which case $v_{ij}=\text{null}$. $w_{ij}=\ln(v_{ij})$ whenever $v_{ij}$ is not null, and $w_{ij}=\text{null}$ whenever $v_{ij}=\text{null}$.
The matrix of column differences, $c_{ij}$, is $c_{ij}=w_{i,j+1}-w_{ij}$ whenever both $w_{i,j+1}$ and $w_{ij}$ are not null, and $c_{ij}=\text{null}$ otherwise.
The matrix of row differences, $r_{ij}$, is $r_{ij}=w_{i+1,j}-w_{ij}$ whenever both $w_{i+1,j}$ and $w_{ij}$ are not null, and $r_{ij}=\text{null}$ otherwise.
If the $j$th column of $c_{ij}$ includes at least one non-null entry, $c_{*j}$ is the average of the non-null entries in the $j$th column of $c_{ij}$; otherwise, $c_{*j}=0$.
If the $i$th row of $r_{ij}$ includes at least one non-null entry, $r_{i*}$ is the average of the non-null entries in the $i$th row of $r_{ij}$; otherwise, $r_{i*}=0$.
$C_{j+1}=C_j+c_{*j}$, where $C_1=0$; $R_{i+1}=R_i+r_{i*}$, where $R_1=0$; and $u_{ij}=R_i+C_j$.
$K$ is the average of $w_{ij}-u_{ij}$ over each $(i,j)$ entry where $w_{ij}$ is not null.
$y_{ij}=w_{ij}$ whenever $w_{ij}$ is not null; otherwise, $y_{ij}=K+u_{ij}$.
The output is $z_{ij}=\mathrm{Round}(\exp(y_{ij}))$, in which each date and time period includes valid data. $z_{ij}$ is the matrix that is sent on to the forecasting model or module. $z_{ij}$ may be sent through a sequence of one or more modules to be analyzed. Results may then be sent to a module that updates parameters of the forecasting module.
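The steps of the first algorithm can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the patented implementation: the function name `fill_by_dominant_pattern`, the use of `None` for null, and the sample data are assumptions, and the code uses zero-based indices where the text uses one-based indices.

```python
import math

def fill_by_dominant_pattern(x):
    """Fill gaps in x (n weeks by m days, None = invalid) and return z."""
    n, m = len(x), len(x[0])
    # w_ij = ln(x_ij) for valid non-zero data, otherwise None (null)
    w = [[math.log(v) if v is not None and v > 0 else None for v in row]
         for row in x]

    def mean_or_zero(values):
        # Average of the non-null entries, or 0 when there are none
        return sum(values) / len(values) if values else 0.0

    # c_star[j]: average log difference between day j+1 and day j
    c_star = [
        mean_or_zero([w[i][j + 1] - w[i][j] for i in range(n)
                      if w[i][j] is not None and w[i][j + 1] is not None])
        for j in range(m - 1)
    ]
    # r_star[i]: average log difference between week i+1 and week i
    r_star = [
        mean_or_zero([w[i + 1][j] - w[i][j] for j in range(m)
                      if w[i][j] is not None and w[i + 1][j] is not None])
        for i in range(n - 1)
    ]
    # Cumulative column and row effects, starting from zero
    C = [0.0]
    for step in c_star:
        C.append(C[-1] + step)
    R = [0.0]
    for step in r_star:
        R.append(R[-1] + step)

    # K: average offset between observed logs and the pattern u = R + C
    K = mean_or_zero([w[i][j] - (R[i] + C[j])
                      for i in range(n) for j in range(m)
                      if w[i][j] is not None])

    # y keeps observed logs and uses K + u elsewhere; z = Round(exp(y))
    return [
        [round(math.exp(w[i][j] if w[i][j] is not None else K + R[i] + C[j]))
         for j in range(m)]
        for i in range(n)
    ]

# Hypothetical data: the second week is twice the first; one day is missing.
weeks = [[10, 20, 40],
         [20, None, 80]]
filled = fill_by_dominant_pattern(weeks)
```

In this sample the data is exactly multiplicative, so the missing entry is filled with 40, the value implied by the day and week pattern.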
Logarithms may be taken of particular values so that multiplicative effects between day-of-the-week and week-of-the-month may be conveniently expressed as additive effects. In some implementations, it may be more convenient for the algorithm to work with additive effects than directly with the multiplicative effects. For example, multiplicative effect: m_effect=affect1*affect2; additive effect: a_effect=affect3+affect4; log(m_effect)=log(affect1*affect2)=log(affect1)+log(affect2). By taking logs, a multiplicative effect can be treated as an additive effect where log(m_effect)=a_effect, log(affect1)=affect3, log(affect2)=affect4.
A first example of how the above-recited functions of the algorithm of
In another embodiment, the method is similar to “Fill in Days” for monthly updates described above; however, day-of-the-week is replaced by time-period and week-of-the-month is replaced by comparable date. In a particular embodiment, n becomes the number of comparable dates, m becomes the number of time-periods within a day, i becomes an index for comparable dates and j becomes an index for time-period of a day. The calculations are completed using the above described functions in the algorithm of
At block 410, transaction data may be received, as discussed herein.
At block 420, a gap in the transaction data may be determined, as discussed herein.
At block 430, a set of substitute data may be chosen from among a group of substitute data sets using a Moore-Penrose pseudo-inverse algorithm.
At block 440, the set of substitute data may be adopted into the determined gap.
In an embodiment, the Moore-Penrose pseudo-inverse algorithm may be more accurate as compared with the algorithm of
In an embodiment, the Moore-Penrose pseudo-inverse algorithm may fill in null or invalid data by producing an optimal “fill in”.
Let wij be the same as defined above with regard to the algorithm of
For $p=1,2,\ldots,n+m$ and $q=1,2,\ldots,n+m$, let $f_{pq}$ denote the elements of an $(n+m)$ by $(n+m)$ matrix, $F$, called the “filler”. The filler is a symmetric matrix, defined in the following way:
For $p=1,2,\ldots,n$ and $q=1,2,\ldots,n$, let $f_{pp}$ be the number of non-null entries in the $p$th row of $W$ and let $f_{pq}=0$ when $p\neq q$. For $p=n+1,n+2,\ldots,n+m$ and $q=n+1,n+2,\ldots,n+m$, let $f_{pp}$ be the number of non-null entries in the $(p-n)$th column of $W$ and let $f_{pq}=0$ when $p\neq q$. For $p=1,2,\ldots,n$ and $q=n+1,n+2,\ldots,n+m$, let $f_{pq}=1$ when $w_{p,q-n}$ is not null and $f_{pq}=0$ when $w_{p,q-n}$ is null. For $p=n+1,n+2,\ldots,n+m$ and $q=1,2,\ldots,n$, let $f_{pq}=1$ when $w_{q,p-n}$ is not null and $f_{pq}=0$ when $w_{q,p-n}$ is null.
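The construction of the filler can be illustrated with a short sketch. The function name `build_filler`, the use of `None` for null entries, and the sample matrix are assumptions for the example; indices are zero-based here, whereas the text counts from one.

```python
def build_filler(W):
    """Build the (n+m) by (n+m) filler matrix F from W (None = null)."""
    n, m = len(W), len(W[0])
    F = [[0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(m):
            if W[i][j] is not None:
                F[i][i] += 1          # non-null count for row i
                F[n + j][n + j] += 1  # non-null count for column j
                F[i][n + j] = 1       # row/column block indicator
                F[n + j][i] = 1       # mirrored entry keeps F symmetric
    return F

# Hypothetical 2 by 3 matrix of logarithms with two null entries
W = [[2.3, None, 3.1],
     [2.9, 3.0, None]]
F = build_filler(W)
```

The first n diagonal entries count the valid values in each row of W, the last m count the valid values in each column, and the off-diagonal blocks record which cells are valid.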
If $A$ is a real matrix and $B$ is a real matrix such that $ABA=A$, $BAB=B$, $AB$ is symmetric, and $BA$ is symmetric, then $B$ is called a Moore-Penrose pseudoinverse of $A$. It is a theorem that every real matrix has a unique Moore-Penrose pseudoinverse. Let $F^+$ denote the pseudoinverse of $F$. $F^+$ may be computed from $F$ using, for example, Greville's Theorem.
Let b denote the average of the non-null values of W.
For $i=1,2,\ldots,n$ and $j=1,2,\ldots,m$, define $\tilde{w}_{ij}$ by the rule $\tilde{w}_{ij}=w_{ij}-b$ when $w_{ij}$ is not null and $\tilde{w}_{ij}=\text{null}$ otherwise. Let $\tilde{W}$ denote the $n$ by $m$ matrix of the $\tilde{w}_{ij}$.
Define a real vector, $g$, with $n+m$ components $g_k$, for $k=1,2,\ldots,n+m$, by the following rules: For $k=1,2,\ldots,n$, let $g_k$ equal the sum of the non-null elements in the $k$th row of $\tilde{W}$ when at least one such element is not null and let $g_k=0$ when every element in the $k$th row of $\tilde{W}$ is null.
For $k=1+n,2+n,\ldots,m+n$, let $g_k$ equal the sum of the non-null elements in the $(k-n)$th column of $\tilde{W}$ when at least one such element is not null and let $g_k=0$ when every element in the $(k-n)$th column of $\tilde{W}$ is null.
Define a real vector, $h$, with $n+m$ components $h_k$, for $k=1,2,\ldots,n+m$, by the following rule: $h=F^+g$. The components of $h$ are used to determine values to replace the null data in $W$ as follows: For $i=1,2,\ldots,n$, let $R_i=h_i$. For $j=1,2,\ldots,m$, let $C_j=h_{j+n}$. Define $u_{ij}$ by the rule $u_{ij}=R_i+C_j$. Let $y_{ij}=w_{ij}$ whenever $w_{ij}$ is not null and otherwise, let $y_{ij}=u_{ij}+b$.
The real matrix of the $y_{ij}$, $Y$, can be thought of as the matrix, $W$, with the null values filled in with data that is considered “valid”. As described above, $W$ may be obtained by taking logarithms of the original data, $x_{ij}$. Now let $z_{ij}=x_{ij}$ wherever $x_{ij}$ has valid data and let $z_{ij}=\mathrm{Round}(\exp(y_{ij}))$ otherwise.
Output the zij.
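The whole pseudo-inverse fill can be sketched end to end in standard-library Python. This is a hedged illustration only: the function names, the `None` convention for null, the tolerance `tol`, and the sample data are assumptions. The pseudoinverse is computed here with a column-by-column Greville recursion, which is one way to obtain it; any routine that returns the Moore-Penrose pseudoinverse could be substituted.

```python
import math

def greville_pinv(A, tol=1e-10):
    """Moore-Penrose pseudoinverse of A (list of rows), built column by column."""
    rows = len(A)
    cols = [list(c) for c in zip(*A)]
    n2 = sum(v * v for v in cols[0])
    P = [[v / n2 if n2 > tol else 0.0 for v in cols[0]]]
    for k in range(1, len(cols)):
        a = cols[k]
        d = [sum(p * v for p, v in zip(prow, a)) for prow in P]
        # Residual of the new column after projecting onto the earlier ones
        c = [a[r] - sum(cols[t][r] * d[t] for t in range(k)) for r in range(rows)]
        c2 = sum(v * v for v in c)
        if c2 > tol:   # the new column adds a new direction
            b = [v / c2 for v in c]
        else:          # the new column is a combination of earlier columns
            s = 1.0 + sum(v * v for v in d)
            b = [sum(d[t] * P[t][r] for t in range(k)) / s for r in range(rows)]
        P = [[P[t][r] - d[t] * b[r] for r in range(rows)] for t in range(k)] + [b]
    return P

def fill_with_pseudoinverse(x):
    """Fill None entries of x (n by m counts); needs at least one positive value."""
    n, m = len(x), len(x[0])
    w = [[math.log(v) if v is not None and v > 0 else None for v in row]
         for row in x]
    known = [(i, j) for i in range(n) for j in range(m) if w[i][j] is not None]
    b = sum(w[i][j] for i, j in known) / len(known)  # average of non-null logs
    F = [[0.0] * (n + m) for _ in range(n + m)]
    g = [0.0] * (n + m)
    for i, j in known:
        F[i][i] += 1.0
        F[n + j][n + j] += 1.0
        F[i][n + j] = F[n + j][i] = 1.0
        g[i] += w[i][j] - b        # row sums of the centered logs
        g[n + j] += w[i][j] - b    # column sums of the centered logs
    Fp = greville_pinv(F)
    h = [sum(Fp[p][q] * g[q] for q in range(n + m)) for p in range(n + m)]
    # Keep valid data; elsewhere use Round(exp(R_i + C_j + b))
    return [[x[i][j] if x[i][j] is not None
             else round(math.exp(h[i] + h[n + j] + b))
             for j in range(m)] for i in range(n)]

# Hypothetical data: the second week is twice the first; one day is missing.
z = fill_with_pseudoinverse([[10, 20, 40],
                             [20, None, 80]])
```

Because the sample data follows an exact row-times-column pattern, the missing cell is filled with 40. Valid entries, including zeros, are passed through unchanged, following the rule that $z_{ij}=x_{ij}$ wherever $x_{ij}$ is valid.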
In an example embodiment, the algorithm of
For the first matrix, $W$, $b=2.8$, and the elements of $\tilde{W}$ include:
g is given by
Finding $F^+$ by Greville's Theorem, computing $h=F^+g$, and solving for the $y_{ij}$ in terms of the components of $h$ recovers a matrix that is identical to the $y_{ij}$ matrix generated by the algorithm of
The component of the algorithm described here acts upon the logarithms of the raw data, in the instances where that raw data is not null and not zero. The logarithms may be placed in an $n$ by $m$ matrix, $W$, whose elements are either real numbers or null (so $W$ is not a real matrix), where at least one entry is not null.
In an embodiment of the algorithm of
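The set, $S$, may be defined by the rule $S=\{(i,j)\mid w_{ij}\neq\text{null}\}$.
$\mu$ may be defined by the rule $\mu=\frac{1}{o(S)}\sum_{(i,j)\in S}w_{ij}$, where $o(S)$ is the number of elements of $S$, so that $\mu$ is the average of the non-null values of $W$.
$y_{ij}$ may be defined by the rule $y_{ij}=w_{ij}-\mu$ when $(i,j)\in S$ and $y_{ij}=\text{null}$ otherwise.
$Y$ may be defined to be the matrix of $y_{ij}$.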
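$V$ may be defined to be a real-valued function of $n+m$ real variables so that $V=V(r_1,\ldots,r_n,c_1,\ldots,c_m)$ where $V=\sum_{(i,j)\in S}(y_{ij}-r_i-c_j)^2$.
$V$ is a non-negative quadratic function, so $V$ may have a global minimum value, but there may be many values of $(r_1,\ldots,r_n,c_1,\ldots,c_m)$ that achieve this minimum value of $V$. To find a minimum of $V$, points where $V$ is stationary are sought. That is, where $\partial V/\partial r_k=0$ for $k=1,\ldots,n$ and $\partial V/\partial c_l=0$ for $l=1,\ldots,m$.
Therefore a minimum satisfies: $\sum_{j:(k,j)\in S}(y_{kj}-r_k-c_j)=0$ for $k=1,\ldots,n$, and $\sum_{i:(i,l)\in S}(y_{il}-r_i-c_l)=0$ for $l=1,\ldots,m$.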
The first n sums may be over “non-null” elements in the kth row of Y. The second m sums may be over the “non-null” elements in the lth column of Y.
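Let $P_k=\{j\mid(i,j)\in S\text{ and }i=k\}$ and let $Q_l=\{i\mid(i,j)\in S\text{ and }j=l\}$. The system of equations may be written as $o(P_k)\,r_k+\sum_{j\in P_k}c_j=\sum_{j\in P_k}y_{kj}$ for $k=1,\ldots,n$, and $\sum_{i\in Q_l}r_i+o(Q_l)\,c_l=\sum_{i\in Q_l}y_{il}$ for $l=1,\ldots,m$.
Note that $o(P_k)$ is the number of non-null elements in the $k$th row of $Y$ and that $o(Q_l)$ is the number of non-null elements in the $l$th column of $Y$. Also note that $\sum_{j\in P_k}y_{kj}$ is the sum of the non-null elements in the $k$th row of $Y$ and $\sum_{i\in Q_l}y_{il}$ is the sum of the non-null elements in the $l$th column of $Y$.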
The system of equations shown above comprises $n+m$ simultaneous linear equations in $n+m$ variables. As such, the system of equations may be expressed as a vector-matrix equation in $\mathbb{R}^{n+m}$ of the form $Fh=g$, where $F$ is an $(n+m)$ by $(n+m)$ real matrix and both $g$ and $h$ are vectors in $\mathbb{R}^{n+m}$.
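In order to describe $F$, the symbol $\varepsilon_{ij}$ may be used, where $\varepsilon_{ij}=1$ when $y_{ij}$ is not null and $\varepsilon_{ij}=0$ when $y_{ij}$ is null.
The matrix $F$ is a symmetric matrix. The elements on the diagonal of the matrix $F$ may be expressed in terms of the $\varepsilon_{ij}$ term, as follows: $f_{kk}=\sum_{j=1}^{m}\varepsilon_{kj}$ for $k=1,\ldots,n$, and $f_{n+l,n+l}=\sum_{i=1}^{n}\varepsilon_{il}$ for $l=1,\ldots,m$. Consistent with the definition of the filler given above, the off-diagonal elements are $f_{k,n+l}=f_{n+l,k}=\varepsilon_{kl}$, and all remaining elements are zero.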
The equation $Fh=g$ includes at least one solution, and possibly an infinite number of solutions. An infinite number of values may minimize $V=V(r_1,\ldots,r_n,c_1,\ldots,c_m)$. The solution chosen to use for the fill in may be the solution that leads to a most conservative approximation of the $y_{ij}$ by the values of $r_i+c_j$. Such a solution, $h$, is one for which $\|h\|$ is minimum. In other words, find an $h$ such that $Fh=g$ and $\|h\|$ is minimum. Such an $h$ may be found by means of the pseudoinverse of $F$. The pseudoinverse of $F$ is a mathematically unique matrix, denoted $F^+$. The solution for $h$, such that $\|h\|$ is minimum, may be given by $h=F^+g$.
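Why infinitely many solutions can arise may be seen from a short worked step. It is offered as an illustration, and it assumes that $V$ depends on the unknowns only through the sums $r_i+c_j$, as in a sum of squared residuals over the non-null entries. The vector $v$ and the scalar $t$ are symbols introduced here for the illustration.

$$(r_i+t)+(c_j-t)=r_i+c_j\quad\text{for every real }t,$$

so shifting every $r_i$ up by $t$ and every $c_j$ down by $t$ leaves $V$ unchanged. In matrix terms, with $v=(1,\ldots,1,-1,\ldots,-1)^T$ ($n$ ones followed by $m$ minus ones), each row of $F$ gives a count of non-null entries minus the same count, so

$$Fv=0,\qquad F(h+tv)=Fh=g.$$

Among these shifted solutions the squared norm is

$$\|h+tv\|^2=\|h\|^2+2t\,(h,v)+t^2(n+m),$$

which is smallest when $(h,v)=0$, that is, when $\sum_i r_i=\sum_j c_j$. The minimum-norm solution $h=F^+g$ satisfies this balance condition.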
This result follows from the definition of the pseudoinverse, where $FF^+F=F$, $F^+FF^+=F^+$, $FF^+=(FF^+)^T$, and $F^+F=(F^+F)^T$.
The above-recited relations imply that $(F^+F)(F^+F)=F^+F$ and $(FF^+)(FF^+)=FF^+$, so that, in virtue of their symmetries, $F^+F$ and $FF^+$ are both projections. For any $x$ in $\mathbb{R}^{n+m}$, either of these projections determines a decomposition of $x$ into orthogonal components:
$x=(I-F^+F)x+(F^+F)x$ or $x=(I-FF^+)x+(FF^+)x$,
so that $(x,x)=((I-F^+F)x,(I-F^+F)x)+((F^+F)x,(F^+F)x)$
or
$(x,x)=((I-FF^+)x,(I-FF^+)x)+((FF^+)x,(FF^+)x)$, respectively.
Hence $(F^+Fx,F^+Fx)\le(x,x)$ and $(FF^+x,FF^+x)\le(x,x)$ for any $x$ in $\mathbb{R}^{n+m}$. Also, if $(F^+Fx,F^+Fx)=(x,x)$ or $(FF^+x,FF^+x)=(x,x)$, respectively, then $((I-F^+F)x,(I-F^+F)x)=0$ or $((I-FF^+)x,(I-FF^+)x)=0$, respectively, so that $(I-F^+F)x=0$ or $(I-FF^+)x=0$, respectively. This forces $F^+Fx=x$ or $FF^+x=x$, respectively. Therefore, if $(F^+Fx,F^+Fx)=(x,x)$ then $F^+Fx=x$, and if $(FF^+x,FF^+x)=(x,x)$ then $FF^+x=x$.
$\tilde{h}$ may be defined by the rule $\tilde{h}=F^+g$. Then $F\tilde{h}=FF^+g$ and $F^+F\tilde{h}=F^+FF^+g=F^+g=\tilde{h}$, so that $F^+F\tilde{h}=\tilde{h}$.
Suppose there is an $h$ such that $Fh=g$. Then $F\tilde{h}=FF^+g=FF^+Fh=Fh=g$, so that $\tilde{h}$ also satisfies $F\tilde{h}=g$.
The components of $\tilde{h}$ give the values of $r_i$ and $c_j$ used to fill in the null values of $W$ as follows: If $(i,j)\notin S$, then $w_{ij}=r_i+c_j+\mu$. Otherwise, the value of $w_{ij}$ remains unchanged.
The automated update algorithms described herein may make consistent judgments about enormous quantities of numerical data, and may reduce the risk that clerical errors associated with manual update activities may deform the forecast model. Automated introduction of the new data may avoid inappropriate changes in the day of week patterns that are extracted from the data, which may reduce deformation of the forecast model.
The example computer system 600 includes a processor 602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 604 and a static memory 606, which communicate with each other via a bus 608. The computer system 600 may further include a video display unit 610 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 600 also includes an alphanumeric input device 612 (e.g., a keyboard), a user interface (UI) navigation device 614 (e.g., a mouse), a disk drive unit 616, a signal generation device 618 (e.g., a speaker) and a network interface device 620.
The disk drive unit 616 includes a machine-readable medium 622 on which is stored one or more sets of instructions and data structures (e.g., software 624) embodying or utilized by any one or more of the methodologies or functions described herein. The software 624 may also reside, completely or at least partially, within the main memory 604 and/or within the processor 602 during execution thereof by the computer system 600, the main memory 604 and the processor 602 also constituting machine-readable media.
The software 624 may further be transmitted or received over a network 626 via the network interface device 620 utilizing any one of a number of well-known transfer protocols (e.g., HTTP).
While the machine-readable medium 622 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention, or that is capable of storing, encoding or carrying data structures utilized by or associated with such a set of instructions. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals. Although an embodiment of the present invention has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.