This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2017-191145, filed on Sep. 29, 2017, the entire contents of which are incorporated herein by reference.
The present invention relates to a financial analysis apparatus, a financial analysis method, and a financial analysis program that use AI (artificial intelligence) techniques.
AI techniques such as machine learning and deep learning are used in various fields. For example, Japanese Patent No. 6085888 discloses a technique to classify text data into various types of topics and learn a relational structure between the classified topics and other attribute information to quantitatively obtain change in attribute information at the time of topic change, change in topic at the time of attribute information change, etc. Using this technique, matters requiring business improvements, user needs, etc. can be extracted.
The technique of the patent document described above can be applied to financial audit, and risk evaluation can be performed for audit purpose by converting text data included in various documents, which have not been fully utilized, into topics, in order to treat the topics as variables in the same manner as financial information.
However, a company's accounting data can be enormous, depending on the size of the company, and hence it is not easy to find signs of fraud in the accounting data. The patent document described above does not disclose any practical solution for finding signs of fraud in an enormous amount of accounting data.
A financial analysis apparatus according to one embodiment has: a first vector generator to generate a first vector containing, as components, fluctuations of account items for a first term of accounting data; an estimator to estimate the fluctuation of account items for each first term belonging to a second term, based on the first vectors within the second term; a residual calculator to calculate a residual between the fluctuation estimated by the estimator and the actual fluctuation; an anomalous fluctuation identifier to identify an account item for which a value derived from the residual exceeds a predetermined threshold for a specific first term; a journal entry pre-extractor to identify journal entries of the accounting data for the specific first term belonging to the second term and to generate second vectors containing, as components, the amounts of account items in each journal entry; a journal entry extractor to extract, from the second vectors, journal entries that include account items for which the values derived from the residuals exceed the threshold for the specific first term; an anomaly detector to detect whether there are anomalies among the journal entries extracted by the journal entry extractor; and an anomalous journal entry extractor to extract a journal entry having the anomaly detected by the anomaly detector.
Hereinbelow, embodiments of the present invention will be explained with reference to the accompanying drawings. In the present specification and the accompanying drawings, for easy understanding and simplicity of drawings, the explanation and the drawings are partially omitted, modified or simplified. However, the technical contents to the extent that a similar function can be expected will be interpreted to be included in the embodiments.
The first vector generator 2 generates a first vector that contains, as components, the fluctuations of account items for the first term of accounting data. The first term is, for example, one day or one month. In a more practical example, the first vector generator 2 calculates the daily (monthly) fluctuation in a trial balance (TB) for each account item on the debit side and the credit side separately and generates a first vector including the fluctuations as components. Hereinafter, the first vector is also referred to as a TB fluctuation vector. The TB fluctuation vector therefore includes, as components, the fluctuations (changes in dollar amount) of account items such as cash and deposits, merchandise, accounts payable, common stock, sales, and cost of goods sold.
The first matrix generator 3 generates a first matrix of first vectors aligned in a row direction for a second term that includes a plurality of first terms. The respective rows in the first matrix are, for example, first vectors with different dates. Hereinafter, the first matrix is also referred to as a daily (monthly) TB fluctuation matrix on the respective debit and credit sides. The duration of the second term is an integer multiple of the first term. The second term is, for example, three months, a half year or one year.
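The generation of TB fluctuation vectors and their alignment into a TB fluctuation matrix can be sketched as follows. This is a minimal illustration only, assuming that the accounting data is available as a list of journal lines; the tuple layout `(date, account, side, amount)`, the sample data, and the function name `tb_fluctuation_matrix` are hypothetical and do not appear in the embodiment.

```python
from collections import defaultdict

# Hypothetical journal lines: (date, account item, side, amount).
journal_lines = [
    ("2017-04-01", "cash", "debit", 1000.0),
    ("2017-04-01", "sales", "credit", 1000.0),
    ("2017-04-02", "merchandise", "debit", 400.0),
    ("2017-04-02", "accounts_payable", "credit", 400.0),
    ("2017-04-02", "cash", "debit", 250.0),
    ("2017-04-02", "sales", "credit", 250.0),
]


def tb_fluctuation_matrix(lines, side):
    """Build one TB fluctuation vector per date for the given side.

    Returns (dates, accounts, matrix), where matrix[r][c] is the total
    amount posted to accounts[c] on dates[r] on that side.
    """
    totals = defaultdict(float)
    for date, account, line_side, amount in lines:
        if line_side == side:
            totals[(date, account)] += amount
    dates = sorted({d for d, _, _, _ in lines})
    accounts = sorted({a for _, a, s, _ in lines if s == side})
    # Each row is a first vector (TB fluctuation vector) for one first term.
    matrix = [[totals.get((d, a), 0.0) for a in accounts] for d in dates]
    return dates, accounts, matrix


dates, debit_accounts, debit_matrix = tb_fluctuation_matrix(journal_lines, "debit")
for date, row in zip(dates, debit_matrix):
    print(date, dict(zip(debit_accounts, row)))
```

Calling the same function with `"credit"` would give the credit-side matrix, which reflects the separate handling of the debit and credit sides described above.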
Based on the first matrix, the estimator 4 estimates the fluctuation of each account item for each first term belonging to the second term. Accordingly, for example, the fluctuation of each account item is estimated daily (monthly), separately on the debit and credit sides. Hereinafter, an estimated fluctuation is also referred to as an estimation, as required. As described later, the estimator 4 estimates the fluctuation of each account item based on a fluctuation model. As a practical example, the estimator 4 estimates the fluctuation of each account item, based on the first matrix (TB fluctuation matrix), by minimizing the total of the sum of squared errors in the fluctuation of the account item and the sum of the absolute values of the regression coefficients associated with the account items.
The residual calculator 5 calculates a residual between a fluctuation estimated by the estimator 4 and an actual fluctuation (actual value). For example, the residual calculator 5 calculates a residual of each account item on the debit and credit sides separately on a daily (monthly) basis.
The anomalous fluctuation identifier 6 identifies account items for which a value derived from the residual exceeds a predetermined threshold for a specific first term. The value derived from the residual can be the residual itself calculated by the residual calculator 5 or a value obtained by normalizing the residual. The threshold can be any value and can be determined in accordance with the value derived from the residual. The fluctuation of the specific account item for the specific first term, identified by the anomalous fluctuation identifier 6, is an anomaly suspect. Here, an anomaly means a fluctuation of an account item that would not be expected from the regular transaction tendency. An anomaly can indicate accounting fraud, although regular transactions can also cause anomalies.
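The residual calculation and threshold comparison can be sketched for a single account item as follows. This is a minimal illustration under stated assumptions: the residual is normalized as a z-score over the second term, the comparison uses the absolute value, and the threshold of 2.0, the sample figures, and the function name `flag_anomalous_terms` are all hypothetical. The embodiment leaves the normalization and the threshold open.

```python
from statistics import mean, pstdev


def flag_anomalous_terms(actual, estimated, threshold):
    """Return indices of first terms whose normalized residual exceeds the threshold.

    The z-score normalization is an assumption; the residual itself could
    be compared against the threshold instead.
    """
    residuals = [a - e for a, e in zip(actual, estimated)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # all residuals identical, nothing stands out
    return [
        i for i, r in enumerate(residuals)
        if abs((r - mu) / sigma) > threshold
    ]


# Hypothetical daily fluctuations of one account item and the model's estimates.
actual = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 500.0, 100.0]
estimated = [100.0] * 8
flagged_days = flag_anomalous_terms(actual, estimated, threshold=2.0)
print(flagged_days)
```

In this sketch only the day with the fluctuation of 500.0 is flagged as an anomaly suspect; running the same function per account item and per side would give the set of specific account items and specific first terms passed on to the later stages.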
The journal entry pre-extractor 7 identifies journal entries of accounting data for the specific first term in the second term identified by the anomalous fluctuation identifier 6, generates, for each journal entry, a second vector containing the dollar amounts of account items as components, and generates a second matrix that consists of the second vectors aligned in a row direction. Hereinafter, the second vector and the second matrix are also referred to as a journal entry vector and a journal entry matrix, respectively, as required. Each row in the journal entry matrix is a journal entry vector. For example, the journal entry pre-extractor 7 generates journal entry vectors for all journal entries entered on a specific day identified by the anomalous fluctuation identifier 6 and generates a journal entry matrix with the journal entry vectors aligned in the row direction.
The journal entry extractor 8 extracts, from the second matrix (journal entry matrix), journal entries that include an account item for which the value derived from the residual exceeds the threshold. The journal entry matrix also contains journal entries in which no account item has a value derived from the residual that exceeds the threshold. Therefore, the journal entry extractor 8 extracts only those journal entries that include account items for which the values derived from the residuals exceed the threshold. More practically, the journal entry extractor 8 extracts journal entries that include account items for which the absolute values of the values derived from the residuals exceed the threshold, irrespective of whether those values are positive or negative. If the sign of the value derived from the residual were taken into consideration, two distinct thresholds, one positive and one negative, would be required to extract journal entries involving account items whose residuals swing upward or downward, which would complicate the processing. Because the journal entry extractor 8 compares absolute values against a single threshold, irrespective of whether the residuals swing upward or downward, the processing burden on the journal entry extractor 8 can be mitigated.
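The generation of the journal entry matrix and the extraction based on absolute values can be sketched as follows. This is an illustration only: the entry identifiers, the sample amounts, the normalized residual values, and the function names are hypothetical, and the separate handling of debit and credit sides is omitted for brevity.

```python
# Hypothetical journal entries on the flagged day: entry id -> {account item: amount}.
entries = {
    "JE-001": {"cash": 1000.0, "sales": 1000.0},
    "JE-002": {"merchandise": 400.0, "accounts_payable": 400.0},
    "JE-003": {"accounts_receivable": 9000.0, "sales": 9000.0},
}
accounts = sorted({a for amounts in entries.values() for a in amounts})


def journal_entry_matrix(entries, accounts):
    """Map each entry id to its journal entry vector (one component per account item)."""
    return {
        entry_id: [amounts.get(a, 0.0) for a in accounts]
        for entry_id, amounts in entries.items()
    }


def extract_entries(matrix, accounts, normalized_residuals, threshold):
    """Keep entries that touch an account item whose |value| exceeds the threshold."""
    flagged = {a for a, r in normalized_residuals.items() if abs(r) > threshold}
    columns = [i for i, a in enumerate(accounts) if a in flagged]
    return {
        entry_id: vector
        for entry_id, vector in matrix.items()
        if any(vector[i] != 0.0 for i in columns)
    }


# Hypothetical values derived from the residuals for the flagged day.
normalized_residuals = {"cash": 0.3, "sales": 4.2, "accounts_receivable": -3.5}
matrix = journal_entry_matrix(entries, accounts)
extracted = extract_entries(matrix, accounts, normalized_residuals, threshold=3.0)
print(sorted(extracted))
```

A single threshold applied to `abs(r)` picks up both the upward swing of sales and the downward swing of accounts receivable in this sample, which is the simplification described above.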
The anomaly detector 9 detects anomalies among the journal entries extracted by the journal entry extractor 8. As described later, the anomaly detector 9 uses the k-nearest neighbor algorithm, the LOF (Local Outlier Factor) algorithm, etc. to detect anomalies. The anomaly detector 9, for example, outputs a degree of anomaly for each journal entry, which is a quantified value such that the larger the value, the greater the suspicion of anomaly. The anomalous journal entry extractor 10 extracts journal entries for which anomalies are detected by the anomaly detector 9. When the anomaly detector 9 outputs the degree of anomaly, the anomalous journal entry extractor 10 compares the degree of anomaly with a predetermined threshold to extract journal entries for which the degree of anomaly is larger than the threshold.
When step S1 in
Subsequently, the estimator 4 develops a daily (monthly) fluctuation model for each account item on the debit and credit sides separately (step S3). The fluctuation model estimates the fluctuation of an account item based on other account items that are automatically selected as likely to be related to the account item to be estimated. Sparse modeling can be used to build such a model. In sparse modeling, only relevant explanatory variables are selected as the variables to be used for each account item. Lasso (Least Absolute Shrinkage and Selection Operator) is a representative and frequently used sparse modeling method.
In ordinary regression analysis, a regression coefficient vector β is obtained so as to minimize the total sum $\sum_i \varepsilon_i^2$ of the squares of the components εi of an error vector ε in the regression formula of the following formula (1). Here, Y represents a target variable vector and X represents an explanatory variable matrix.
Y=Xβ+ε (1)
Y002=β1×X001+β3×X003+β4×X004+β5×X005+β6×X006+β7×X007+ε (2)
When the regression formula is expressed with the formula (2), ordinary least squares (OLS) obtains the regression coefficients β1 to β7 so as to minimize the sum of squared errors, each error being the difference between the value obtained by the regression formula and the actual value.
In contrast to OLS, Lasso obtains the regression coefficients β that minimize a value Sλ(β) in the following formula (3). Here, λ is a complexity parameter.
$S_\lambda(\beta) = \lambda \sum_j |\beta_j| + \sum_i \varepsilon_i^2$ (3)
Unlike OLS, Lasso does not minimize the sum of squared errors alone, but adds a regularization term so that the total of the sum of squared errors and the sum of the absolute values of the regression coefficients β, weighted by λ, becomes minimum. As a result, the regression coefficients β of account items that have low relevance to the explained variable tend to be estimated as zero, so that only account items of high relevance are mechanically extracted from many account items.
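The effect of the regularization term in formula (3) can be illustrated with a small sketch that minimizes λΣ|βj| + Σεi² by cyclic coordinate descent. The choice of coordinate descent, the toy explanatory variables, the noise values, and λ = 2.0 are assumptions made for illustration; the embodiment does not specify how the minimization is carried out.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly zero."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize lam * sum(|beta_j|) + sum(eps_i ** 2), with eps = y - X beta."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            rho = 0.0
            for i in range(n):
                # Prediction from every variable except the j-th one.
                pred = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred)
            # The factor lam / 2 follows from differentiating formula (3).
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta


# Toy data: four explanatory account items; only items 0 and 2 drive the target.
X = [
    [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, -1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0], [1.0, -1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, 1.0],
]
noise = [0.3, -0.1, 0.2, 0.0, -0.2, 0.1, 0.0, -0.3]
y = [3.0 * row[0] - 2.0 * row[2] + e for row, e in zip(X, noise)]

beta = lasso_coordinate_descent(X, y, lam=2.0)
print(beta)
```

With this toy data, OLS would assign small non-zero coefficients to the two irrelevant items because of the noise, whereas the sketch sets them to exactly zero and slightly shrinks the two relevant coefficients. This corresponds to the mechanical selection of account items of high relevance described above.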
As described above, in the fluctuation model estimation of step S3 in
When a fluctuation is estimated in step S4 of
When step S5 in
When step S6 in
When step S7 in
For anomaly detection, 1) methods based on statistical distribution, 2) methods based on distance or 3) methods based on density can be used.
As a method based on statistical distribution, for example, Hotelling's theory (Hotelling's T-squared method) can be used. Hotelling's theory calculates the degree of anomaly by calculating the mean and standard deviation of the given data, normalizing the deviation of an observation value from the mean by dividing it by the standard deviation, and taking the square of the normalized value, which corresponds to the squared Mahalanobis distance. However, since this method assumes that the data follows a normal distribution, it may not fit journal entries well, because their distribution is irregular and highly varied.
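For a single variable, the degree of anomaly described above reduces to the square of the normalized deviation. The following is a minimal sketch of that one-dimensional case only; the sample amounts and the function name `hotelling_scores` are hypothetical, and the multivariate case, which would use a covariance matrix, is not shown.

```python
from statistics import mean, pstdev


def hotelling_scores(values):
    """Degree of anomaly per observation: ((x - mean) / standard deviation) ** 2."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # no spread, so no observation deviates
    return [((v - mu) / sigma) ** 2 for v in values]


# Hypothetical amounts posted to one account item in the extracted journal entries.
amounts = [100.0, 110.0, 90.0, 105.0, 95.0, 1000.0]
scores = hotelling_scores(amounts)
most_anomalous = scores.index(max(scores))
print(most_anomalous, scores)
```

Note that the single large amount also inflates the mean and standard deviation used for normalization, which is one practical reason this approach can fit irregular journal entry data poorly.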
In such circumstances, methods based on distance are used. A representative example is the well-known k-nearest neighbor method. The k-nearest neighbor method measures the degree of anomaly according to the distance from an observation value to the k-th closest data, k being a number determined beforehand. More practically, the extracted journal entries are compared with one another to find the journal entry that is the k-th closest to a given journal entry in Euclidean distance, and the distance to that journal entry is used as the degree of anomaly of the given journal entry. The Euclidean distance between two journal entries is obtained by comparing the two journal entry extraction vectors corresponding to the two journal entries. Based on the assumption that outliers diverge from the other journal entry extraction vectors, those that are far from their k-th closest journal entry extraction vector are regarded as anomalies. However, when this method is applied to journal entry extraction vectors, the density of the vectors is not uniform, and it is therefore difficult to determine a value of k that is appropriate over the whole area.
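The k-nearest neighbor scoring can be sketched as follows, taking the Euclidean distance to the k-th closest vector as the degree of anomaly. The sample vectors, k = 2, and the function name `knn_anomaly_scores` are hypothetical; the sketch assumes that k is smaller than the number of vectors.

```python
import math


def knn_anomaly_scores(vectors, k):
    """Degree of anomaly per vector: Euclidean distance to its k-th closest other vector."""
    scores = []
    for i, v in enumerate(vectors):
        distances = sorted(
            math.dist(v, w) for j, w in enumerate(vectors) if j != i
        )
        scores.append(distances[k - 1])
    return scores


# Hypothetical journal entry extraction vectors (two account items each).
vectors = [
    [100.0, 100.0],
    [110.0, 110.0],
    [90.0, 90.0],
    [105.0, 105.0],
    [5000.0, 5000.0],
]
scores = knn_anomaly_scores(vectors, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous, scores)
```

The last vector is far from all of the others and therefore receives the largest score. A fixed k and a fixed distance threshold work here because the sample has uniform density, which is the limitation noted above for real journal entry data.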
When the density of the journal entry extraction vectors is not uniform, methods based on density are used, of which the local outlier factor (LOF) is well known. The LOF compares the distance from an observation value to its nearest data point (nearest neighbor point) with the distance from that nearest neighbor point to its own nearest data point, and defines as an anomaly a point whose local density is lower than that of its neighborhood.
The LOF can be extended to a form that takes the ratio of distances using multiple neighbor points, instead of comparing the observation value with the nearest neighbor point alone. Therefore, anomaly detection using the LOF on vectors representing journal entries is an effective method.
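The extended form using multiple neighbor points can be sketched as follows. This sketch follows the commonly used LOF formulation based on reachability distance and local reachability density, which is an assumption, since the embodiment describes the method only in outline. The sample vectors, k = 2, and the function name `lof_scores` are hypothetical, ties are broken by input order, and the sketch assumes there are not more than k exact duplicates of any vector.

```python
import math


def lof_scores(vectors, k):
    """Local outlier factor per vector; values well above 1 suggest an anomaly."""
    n = len(vectors)
    dist = [[math.dist(a, b) for b in vectors] for a in vectors]
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])               # the k nearest neighbor points
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th of them

    def local_reachability_density(i):
        # Reachability distance from i to j is at least j's own k-distance.
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # Ratio of the neighbors' average density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (k * lrd[i])
        for i in range(n)
    ]


# Hypothetical journal entry extraction vectors: a tight group and one distant point.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
scores = lof_scores(vectors, k=2)
print(scores)
```

Points inside the tight group obtain a score of about 1, because their density matches that of their neighbors, while the distant point obtains a much larger score. Since the score is a ratio of local densities, the same threshold can be applied to regions of different density.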
However, since the number of account items used in journal entries in real practice ranges from hundreds to thousands, the dimension of the journal entry vectors and journal entry extraction vectors becomes very large. In this case, the distances between data points can become nearly equal to one another, and hence anomaly detection may be difficult. For this reason, it is necessary to devise an anomaly detection method for high-dimensional spaces, to perform anomaly detection after dimensionality reduction in some way, and so on.
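One simple way to reduce the dimension, shown here purely as an assumption, is to sum the amounts of account items into a smaller number of account groups before computing distances. The embodiment does not specify a dimensionality reduction method; the grouping table `ACCOUNT_GROUP`, the group names, and the function name `reduce_dimensions` are hypothetical, and other techniques such as principal component analysis could be used instead.

```python
# Hypothetical mapping from account items to coarser account groups.
ACCOUNT_GROUP = {
    "cash": "assets",
    "merchandise": "assets",
    "accounts_receivable": "assets",
    "accounts_payable": "liabilities",
    "sales": "revenue",
    "cost_of_goods_sold": "expenses",
}


def reduce_dimensions(entry, groups):
    """Sum a journal entry's amounts per account group.

    Account items missing from the mapping are collected under "other".
    """
    reduced = {}
    for account, amount in entry.items():
        group = groups.get(account, "other")
        reduced[group] = reduced.get(group, 0.0) + amount
    return reduced


# Hypothetical journal entry: account item -> amount.
entry = {"cash": 1000.0, "accounts_receivable": 500.0, "sales": 1500.0}
reduced_entry = reduce_dimensions(entry, ACCOUNT_GROUP)
print(reduced_entry)
```

The reduced vectors have one component per group rather than one per account item, at the cost of no longer distinguishing between account items within the same group.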
As described above, in the present embodiment, a fluctuation model of each account item is developed, the fluctuation of each account item is estimated based on the fluctuation model, and a specific account item on a specific day, for which a value derived from the residual between the estimated fluctuation (estimation) and the actual value is larger than a threshold, is extracted. Subsequently, journal entry vectors having the account items of all journal entries on the specific day aligned in a column direction are generated, journal entry vectors including the specific account item are extracted from the generated journal entry vectors, and anomaly detection is performed to determine whether there are anomalies among the extracted journal entry vectors.
Accordingly, an irregular journal entry can be identified as an anomaly and extracted automatically and effectively from journal entries involving many account items. Therefore, an irregular pattern of accounting activity that cannot be identified based on assumptions made by accountants or other people can be detected. This strength is useful for preventing accounting fraud.
At least part of the financial analysis apparatus 1 explained in the embodiment may be configured with hardware or software. When it is configured with software, a program that performs at least part of the functions of the financial analysis apparatus 1 may be stored in a storage medium such as a flexible disk or a CD-ROM, and then installed in a computer to run thereon. The storage medium is not limited to a detachable one such as a magnetic disk or an optical disk but may be a fixed type such as a hard disk or a memory.
Moreover, a program that achieves the function of at least part of the financial analysis apparatus 1 may be distributed via a communication network (including wireless communication) such as the Internet. The program may also be distributed, in an encrypted, modulated or compressed state, via an online network such as the Internet or a wireless network, or stored in a storage medium and distributed in that state.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2017-191145 | Sep 2017 | JP | national |