The present invention relates to database technology and data handling in a distributed computer environment.
One object of database systems is to maintain data integrity. Data entries should fulfill rules of accuracy and consistency, which depend on the field of application in which the corresponding data are generated. Inaccurate and inconsistent data can arise, e.g., from erroneous input and miscalculations, but also from fraudulent input and data manipulation. Detecting fraudulent data input and manipulation can be challenging, particularly in environments where the nature of the data makes it difficult to determine whether a manipulation stems from fraudulent behavior of the inputting user or from erroneous input. An example of fraudulent data manipulation is the manipulation of pictures, such as photographs, which may then be spread over the Internet. Another example is fraud committed by employees of big corporations and organizations when they submit their business trip expenses for reimbursement, such as handing in expenses for private purposes as part of the expenses incurred on a business trip.
Detecting fraudulent data input and manipulation, for example in the field of corporate travel expense accounting, usually requires an individual check of every invoice by a person. This person judges whether the invoice might be fraudulent or erroneous based on experience gained from checking past invoices. Although this method is rather effective at detecting erroneous and even fraudulent travel invoices, it consumes many resources in terms of working time and personnel, especially in large corporations and organizations.
Therefore, it would be desirable to provide a method for detecting erroneous and fraudulent data input for large data systems which requires only a limited amount of resources.
In a first aspect of the invention, a computer-implemented fraud detection method in a distributed computing environment is provided. The method comprises a machine learning activity and a fraud detection activity. The machine learning activity comprises receiving training data entries and receiving classification data and defining a plurality of classification criteria based on the classification data. The machine learning activity further comprises classifying the training data entries according to a first subset of classification criteria and thereby obtaining classified training data entries, grouping the classified training data entries into training data tuples according to a second subset of classification criteria, grouping training data tuples into a set of training data, applying a machine learning algorithm to the set of training data to obtain a model based on the set of training data and storing the set of training data and/or the model in one or more databases. The fraud detection activity comprises receiving additional data entries obtained from one or more documents and classifying the additional data entries according to a first subset of classification criteria, thereby obtaining additional classified data entries. 
The fraud detection activity further comprises grouping the additional classified data entries into additional data tuples according to a second subset of classification criteria, comparing the additional data tuple with the model obtained by the machine learning activity, thereby determining a set of values indicating the results of the comparison, evaluating the set of values indicating the results of the comparison relative to at least one fraud detection rule, wherein different levels of violation of the fraud detection rule are associated with different corresponding predefined actions, and executing the respective predefined action according to the level of fraud detection rule violation, wherein the predefined action comprises displaying a symbol on a computer screen indicating the level of fraud detection rule violation.
According to a second aspect of the invention, a fraud detection system within a distributed computer environment is provided, which comprises at least one computing system comprising a machine learning module and a fraud detection module and at least one database connected to the at least one computing system. The machine learning module is configured to receive training data entries, receive classification data and define a plurality of classification criteria based on the classification data, classify the training data entries according to a first subset of classification criteria to obtain classified training data entries, group the classified training data entries into training data tuples according to a second subset of classification criteria, group training data tuples into a set of training data, apply a machine learning algorithm to the set of training data to obtain a model based on the set of training data, and store the set of training data and/or the model in one or more databases. 
The fraud detection module is configured to receive additional data entries obtained from one or more documents, classify the additional data entries according to a first subset of classification criteria to obtain additional classified data entries, group the additional classified data entries into additional data tuples according to a second subset of classification criteria, compare the additional data tuple with the model obtained by the machine learning activity to determine a set of values indicating the results of the comparison, evaluate the set of values indicating the results of the comparison relative to at least one fraud detection rule, wherein different levels of violation of the fraud detection rule are associated with different corresponding predefined actions, and execute the respective predefined action according to the level of fraud detection rule violation, wherein the predefined action comprises displaying a symbol on a computer screen indicating the level of fraud detection rule violation.
According to a third aspect of the invention, a non-transitory computer-readable medium which causes a computer to execute a machine learning activity and a fraud detection activity is provided. The machine learning activity comprises receiving training data entries and receiving classification data and defining a plurality of classification criteria based on the classification data. The machine learning activity further comprises classifying the training data entries according to a first subset of classification criteria and thereby obtaining classified training data entries, grouping the classified training data entries into training data tuples according to a second subset of classification criteria, grouping training data tuples into a set of training data, applying a machine learning algorithm to the set of training data to obtain a model based on the set of training data and storing the set of training data and/or the model in one or more databases. The fraud detection activity comprises receiving additional data entries obtained from one or more documents and classifying the additional data entries according to a first subset of classification criteria, thereby obtaining additional classified data entries. 
The fraud detection activity further comprises grouping the additional classified data entries into additional data tuples according to a second subset of classification criteria, comparing the additional data tuple with the model obtained by the machine learning activity, thereby determining a set of values indicating the results of the comparison, evaluating the set of values indicating the results of the comparison relative to at least one fraud detection rule, wherein different levels of violation of the fraud detection rule are associated with different corresponding predefined actions, and executing the respective predefined action according to the level of fraud detection rule violation, wherein the predefined action comprises displaying a symbol on a computer screen indicating the level of fraud detection rule violation.
The above summary may present a simplified overview of some embodiments of the invention in order to provide a basic understanding of certain aspects of the invention discussed herein. The summary is not intended to provide an extensive overview of the invention, nor is it intended to identify any key or critical elements, or delineate the scope of the invention. The sole purpose of the summary is merely to present some concepts in a simplified form as an introduction to the detailed description presented below.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate various embodiments of the invention and, together with a general description of the invention given above and the detailed description of the embodiments given below, serve to explain the embodiments of the invention. In the drawings, like reference numerals refer to like features in the various views.
Contemporary decision making, e.g., in big corporations and organizations and on almost all organizational levels, relies on respective data. In order to make proper decisions, these data have to be correct and trustworthy. In the case of travel expenses, for example, the decision by the corresponding department to reimburse travel expenses to an employee requires a well-founded assumption that the bills and invoices the employee has handed over to the travel department are correct, i.e., that the documents do not contain any errors or fraudulent manipulations. This may be done by using specially trained and experienced employees, who normally review each of the bills and invoices handed in by the corresponding employees and, based on their experience and training, assess the validity of each bill and invoice.
When performing business trips, each employee typically incurs costs within a certain range, resulting from the destinations and the length of the business trips, the types of hotels in which the employee stays overnight, the types of meals the employee consumes, etc. These costs normally vary only within a certain range. Acceptable and legitimately explainable exceptions can occur, e.g., an overseas business trip or a lengthy business trip. However, higher travel expenses can also result from fraudulently handing in invoices which, e.g., cover costs for explicitly private purposes. It has to be kept in mind, however, that the nature of business trips for a single employee can change, e.g., when the employee has received a status upgrade and, as a result, now meets with high-ranking executives of other companies. The costs for hotels and meals may therefore increase, which would be in full agreement with the guidelines of the corporation or organization.
The travel departments are trained to recognize erroneous and fraudulent submissions of invoices, and they are also informed when, e.g., an employee has received a status upgrade and is now entitled to higher travel expenses. This requires competent staff in the travel department, whose training always has to be up to date, and, when the corporation is a big corporation, numerous members of that staff are required. All this requires the corporation to allocate human resources, costs, etc.
Therefore, a method and a system which automatically assess the validity of travel expenses handed in by the employees, but which are also flexible enough to take into account legitimate changes in travel expenses (e.g., when the employee has received a status upgrade), are desirable. The automated system should also allow deployment in a distributed corporate environment, so that an employee is able to hand in a travel expense at any location and the processing of the travel expense can occur at a central location designated by the corporation.
The method and the system should know, for each employee in a corporation or organization, the typical amount of expenses for a business trip, which may depend on the field of activity and the status of the employee. This “knowledge” of the system is derived from a set of training data which is obtained from scanned documents and which is composed of the expenses of earlier business trips of the respective employee. When the employee hands in the bills and invoices for a new business trip, the system is then capable of assessing whether the expenses claimed by the employee are erroneous and/or even fraudulent. The new data also form a new data set for the training data, and a self-learning algorithm, which is applied to the training data, may calculate new values for the typical travel expenses the employee normally incurs. The self-learning algorithm includes in its processing additional information such as promotions of the employee and changes in the fields of activity. In some embodiments, the self-learning algorithm also keeps track of any attempts of fraud or erroneous inputs of the employee in order to recommend or apply, e.g., stricter rules for auditing with respect to the corresponding employee. Such a method and system may be referred to as fraud detection (or: fraud estimation) method and system, respectively.
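As a purely illustrative sketch of how the typical expenses of an employee could be updated whenever a new trip is handed in, the following Python fragment keeps a running mean and standard deviation per expense category using Welford's method. The class name `RunningStats`, the function `is_unusual` and the three-sigma limit are assumptions made for this example and are not prescribed by the embodiments; any other self-learning algorithm could take their place.

```python
from dataclasses import dataclass
import math


@dataclass
class RunningStats:
    """Running mean and variance of one expense category (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, amount: float) -> None:
        """Fold one new expense amount into the running statistics."""
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; 0.0 until two values are available.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def is_unusual(stats: RunningStats, amount: float, max_sigma: float = 3.0) -> bool:
    """Flag an amount lying more than max_sigma standard deviations from the mean.

    The 3-sigma default is an assumed value chosen only for illustration.
    """
    if stats.count < 2 or stats.stddev == 0.0:
        return False  # not enough history to judge
    return abs(amount - stats.mean) > max_sigma * stats.stddev


# Hotel costs of earlier business trips (invented sample values).
hotel = RunningStats()
for past_cost in [120.0, 130.0, 110.0, 125.0]:
    hotel.update(past_cost)
```

In such a sketch, a newly claimed amount would first be checked with `is_unusual` and afterwards passed to `update`, so that accepted expenses gradually shift the typical values, e.g., after a status upgrade.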
The self-learning algorithm may use, in some embodiments, additional data input for the calculation of the typical travel expenses, which may be provided not by the employee but, e.g., by a corporate administrator. The additional data input may comprise information such as a change of the field of activity of the employee, a change of the location of activity of the employee, various promotions, etc., which may result in increased expenses for various travels.
The self-learning algorithm or machine learning algorithm can be realized by various systems, such as artificial neural networks, support vector machines, Bayesian networks and genetic algorithms, using approaches such as supervised, semi-supervised or unsupervised learning, or approaches such as reinforcement learning, feature learning, sparse dictionary learning, etc.
The computer 1 may be constituted of one or several hardware machines depending on performance requirements. The computer 1 is embodied e.g. as stationary or mobile hardware machines comprising computing machines 100 as illustrated in
The scanning devices 2 are embodied, e.g., as hardware components for scanning paper-based documents and/or as hardware components for taking a photographic image of paper-based documents, such as mobile devices with integrated cameras (e.g., smartphones or the like). The scanned images taken from the paper-based documents are converted into a computer-readable format using techniques such as Optical Character Recognition (OCR). The conversion can be performed, e.g., at the scanning devices 2 or at the computer 1. In some embodiments, the scanning devices 2 may form part of the computer 1. Photographic images of the paper-based documents taken with mobile devices are, e.g., in formats like JPEG or RAW, or the like.
The computer 1 is connected to a database 3. In some embodiments, the database 3 may be formed as a relational SQL (Structured Query Language) database. In some further embodiments, the database 3 may form part of the computer 1.
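A minimal sketch of how data could be kept in a relational SQL database is given below, using the SQLite module of the Python standard library with an in-memory database. The table name `predictive_data`, its columns and the helper functions are hypothetical and chosen only for this example; the embodiments do not prescribe a particular schema or database product.

```python
import sqlite3

# In-memory database for illustration; a file or database server would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictive_data (
        employee_id TEXT NOT NULL,
        category    TEXT NOT NULL,
        mean_amount REAL NOT NULL,
        PRIMARY KEY (employee_id, category)
    )
    """
)


def store_average(conn, employee_id, category, mean_amount):
    """Insert or overwrite the average amount for one employee and category."""
    conn.execute(
        "INSERT OR REPLACE INTO predictive_data VALUES (?, ?, ?)",
        (employee_id, category, mean_amount),
    )
    conn.commit()


def load_average(conn, employee_id, category):
    """Return the stored average, or None if nothing has been stored yet."""
    row = conn.execute(
        "SELECT mean_amount FROM predictive_data"
        " WHERE employee_id = ? AND category = ?",
        (employee_id, category),
    ).fetchone()
    return row[0] if row else None


store_average(conn, "john.doe", "hotel", 121.25)
```

The composite primary key is a design choice of this sketch: it ensures that each employee has at most one stored value per expense category, so that a retrained model simply overwrites the earlier value.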
The computer 1, the scanning devices 2 and the databases 3 are interconnected by the communication interfaces 5. Each of the communication interfaces 5 utilizes a wired or wireless Local Area Network (LAN), a wired or wireless Metropolitan Area Network (MAN), a wired or wireless Wide Area Network (WAN) such as the Internet, or a combination of the aforementioned network technologies, and is implemented by any suitable communication and network protocols.
A flow diagram for an example method according to some embodiments is presented in
Once the model has been developed, a set of predictive data is defined in an activity 12. These predictive data may be used to assess the validity of e.g., the travel expenses of a specific (e.g., recent) trip of an employee. As an example, the predictive data may comprise average travel expenses, which should be expected for each employee of a corporation. The travel expenses of a specific trip are received in an activity 20 as additional data entries and processed together with the predictive data in an activity 13, where a result of the assessment of the employee's travel expenses is obtained. This result may in an activity 14 trigger the execution of a predefined action according to the level of fraud detection rule violation, wherein the predefined action comprises displaying a symbol on a computer screen 1020, as shown in
In some embodiments, the computer 1 may comprise a system of distributed computing entities such as servers, wherein the machine learning system may operate at one computing entity. In some embodiments, training data entries obtained from one or more documents scanned by a device for scanning documents are received at the computer 1. In some further embodiments, wherein the computer 1 comprises a system of distributed computing entities, the individual distributed computing entities receive the training data entries and send them to the computing entity hosting the system for the machine learning activity.
In further embodiments, the documents are paper-based documents. The scanning devices for scanning documents are, in further embodiments, comprised in the distributed computing environment. After the paper-based document has been scanned with, e.g., a scanner or a camera integrated in a mobile phone, the scanned characters are converted into an electronically processable data structure. This can be performed, e.g., by OCR conversion of a scanned paper document such as an invoice for overnight stays in a hotel, which may form part of a collection of expenses for a business trip of an employee.
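The conversion of recognized characters into an electronically processable data structure can be illustrated by the following sketch, which starts from text that an OCR step has already produced. The line layout assumed here (description, amount, three-letter currency code), the names `DataEntry`, `LINE_PATTERN` and `parse_ocr_text`, and the sample receipt are all hypothetical; real invoices vary widely and would require a more robust extraction.

```python
import re
from typing import NamedTuple


class DataEntry(NamedTuple):
    description: str
    amount: float
    currency: str


# Assumed line layout: "<description>  <amount with two decimals> <CUR>"
LINE_PATTERN = re.compile(
    r"^(?P<desc>.+?)\s+(?P<amount>\d+[.,]\d{2})\s+(?P<cur>[A-Z]{3})$"
)


def parse_ocr_text(text: str) -> list[DataEntry]:
    """Turn OCR output into data entries; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # header, address line, or unreadable OCR output
        amount = float(match["amount"].replace(",", "."))
        entries.append(DataEntry(match["desc"], amount, match["cur"]))
    return entries


# Invented sample text standing in for the OCR result of a hotel invoice.
ocr_text = """Hotel Example
Overnight stay 2 nights   240.00 EUR
Breakfast   18,50 EUR
Thank you for your visit"""

entries = parse_ocr_text(ocr_text)
```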
The processing activity 11 of the example method is shown in more detail in
The flow diagram of
In some further embodiments, the classified data entries and/or the classified training data entries are arranged as input vectors and/or feature vectors, whereby the features may be in a purely numeric format.
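One possible way of arranging a classified data entry as a purely numeric feature vector is sketched below. The category list, the dictionary keys `category`, `amount` and `trip_days`, and the one-hot encoding are assumptions made for illustration only; the embodiments leave the concrete encoding open.

```python
# Hypothetical classification criteria used to encode the expense type.
CATEGORIES = ["hotel", "meal", "taxi", "flight"]


def to_feature_vector(entry: dict) -> list[float]:
    """One-hot encode the category, then append amount and trip length in days."""
    one_hot = [1.0 if entry["category"] == name else 0.0 for name in CATEGORIES]
    return one_hot + [float(entry["amount"]), float(entry["trip_days"])]


vector = to_feature_vector({"category": "taxi", "amount": 23.4, "trip_days": 3})
```

The one-hot part keeps the categories free of any artificial ordering, which matters for learning algorithms that interpret numeric distances between feature values.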
The training data tuples are then grouped in an activity 112 into a set of training data. To cite the aforementioned example, the set of training data may comprise the entirety of the travel expenses an employee of a corporation has handed in so far. In an activity 113, the computer 1 applies a machine learning algorithm to the set of training data to obtain a model based on the set of training data. Such a model may comprise predictive data such as average travel expenses, which should be expected for each employee of a corporation. The model could also yield information on which types of costs a specific employee typically incurs or does not incur, e.g., the model yields the information that employee John Doe does not use airplanes during his business trips, since he performs such trips only in the same city in which he performs his business duties. In some embodiments, the predictive data is stored in an activity 114 in a database.
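The grouping and model-building activities can be illustrated by the following sketch, in which plain per-category averages stand in for the machine learning algorithm. The record layout (employee, trip, category, amount), the function names and the sample values are hypothetical; an actual embodiment may use any of the learning algorithms mentioned herein.

```python
from collections import defaultdict
from statistics import mean

ALL_CATEGORIES = {"hotel", "meal", "taxi", "flight"}  # assumed classification criteria

# Classified training data entries: (employee, trip, category, amount), invented values.
classified_entries = [
    ("john.doe", "trip-1", "hotel", 120.0),
    ("john.doe", "trip-1", "meal", 35.0),
    ("john.doe", "trip-2", "hotel", 140.0),
    ("john.doe", "trip-2", "taxi", 20.0),
]


def group_into_tuples(entries):
    """Group entries into one training data tuple per (employee, trip)."""
    tuples = defaultdict(list)
    for employee, trip, category, amount in entries:
        tuples[(employee, trip)].append((category, amount))
    return dict(tuples)


def build_model(training_tuples, employee):
    """Average amount per category, plus the categories the employee never used."""
    amounts = defaultdict(list)
    for (name, _trip), items in training_tuples.items():
        if name != employee:
            continue
        for category, amount in items:
            amounts[category].append(amount)
    averages = {category: mean(values) for category, values in amounts.items()}
    return {"averages": averages, "unused": ALL_CATEGORIES - set(averages)}


training_tuples = group_into_tuples(classified_entries)
model = build_model(training_tuples, "john.doe")
```

In this sketch, the `unused` set corresponds to the kind of information given in the John Doe example, namely which types of costs an employee does not normally incur.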
The initial activities of the example method discussed so far are illustrated in
Referring to
Referring back to
In
In some embodiments, the machine learning algorithm applied by the machine learning system 402 is based on learning approaches which may comprise supervised learning, semi-supervised learning, unsupervised learning, reinforcement learning, feature learning, sparse dictionary learning, anomaly detection, decision tree learning, association rule learning, etc. In some further embodiments, the machine learning algorithm applied by the machine learning system 402 is based on support vector machines, Bayesian networks, genetic algorithms, etc.
As aforementioned, the predictive data defined in activity 12 (cf.
Computer 1 compares in an activity 132 the additional data tuple with the model obtained by the machine learning activity, thereby determining a set of values indicating the results of the comparison. In some embodiments, the comparison of the additional data tuple is carried out with the predictive data 13 obtained from the machine learning activity 113 of
Subsequently, computer 1 evaluates in an activity 134 the set of values indicating the results of the comparison relative to at least one fraud detection rule, whereby different violations of the fraud detection rule are associated with different corresponding predefined actions. To cite an example, when the difference between the hotel costs of the most recent business trip of the employee and the average hotel costs the employee has incurred so far does not exceed a certain threshold, then no fraud or erroneous input would be assumed. On the other hand, when the difference exceeds the threshold, then a potential fraud or erroneous input might have occurred. In some embodiments, computer 1 executes a predefined action according to the fraud detection rule violation, wherein the predefined action comprises displaying a symbol on a computer screen 1020, as shown in
In some embodiments, the levels of fraud detection rule violation are determined based on probabilities that a fraud has occurred and/or on a confidence score, wherein the probability that a fraud has occurred and/or the confidence score are based on a predefined set of confidence thresholds. To cite an example, the chosen confidence thresholds can be formulated as follows (see also
P(ŷ=1) >= 0.99: Auto-approve the reimbursement
P(ŷ=1) >= 0.9: Green (slight probability of fraud, needs review by an auditor)
P(ŷ=1) >= 0.5: Yellow (higher probability of fraud, needs review by an auditor)
P(ŷ=1) < 0.5: Red (fraud seems to occur, requires correction by an auditor)
In this example, three different confidence thresholds are predefined (i.e. 0.99, 0.9, and 0.5). However, the number of predefined thresholds may be larger than three, e.g. four or five confidence thresholds may be predefined. In other examples, the number of predefined thresholds may be smaller than three, e.g. two or only one threshold may be predefined.
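The mapping from a confidence value to a predefined action in the example above can be sketched as a short lookup. The names `CONFIDENCE_LEVELS` and `action_for` and the returned action labels are assumptions of this sketch; the threshold values are those of the example, and further thresholds can be added to or removed from the list without changing the function.

```python
# (lower bound on P(y_hat = 1), resulting action), ordered from highest to lowest.
CONFIDENCE_LEVELS = [
    (0.99, "auto-approve"),
    (0.90, "green"),
    (0.50, "yellow"),
]


def action_for(confidence: float, levels=CONFIDENCE_LEVELS) -> str:
    """Return the action of the first threshold the confidence reaches, else 'red'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    for lower_bound, action in levels:
        if confidence >= lower_bound:
            return action
    return "red"  # below every threshold: correction by an auditor is required
```

For instance, `action_for(0.95)` yields `"green"` and `action_for(0.2)` yields `"red"`; the symbol displayed on the computer screen would then be chosen according to the returned label.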
The predefined set of confidence thresholds can either be entered by a corporate administrator or auditor together with the supplementary data in activity 30 and be used to define in activity 31 (shown in
In some embodiment and shown in
In some embodiments, the training data entries received by the machine learning activity comprise original data entries obtained from scanned documents, as already described in the preceding paragraphs. In some further embodiments, the training data entries further comprise and/or modified data entries provided by a feedback mechanism as feedback training data entries, wherein the feedback mechanism is determined by the classification data, as shown in
In
In some embodiments, the classification data are obtained using the training data entries. If, as an example, the travel receipts of an employee regularly comprise fares for transport by taxi, a corresponding classification could be added to the classification data. The machine learning system 402 may therefore create, as an example, a classification <taxi fare>.
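How a classification such as <taxi fare> might be derived from recurring training data entries is sketched below with a simple keyword count. The candidate keyword set, the minimum count of three and the function name `derive_classifications` are hypothetical; the machine learning system 402 may derive classifications in an entirely different manner.

```python
from collections import Counter


def derive_classifications(descriptions, candidates, known, min_count=3):
    """Propose candidate keywords that recur in the entries and are not yet known."""
    counts = Counter()
    for text in descriptions:
        words = set(text.lower().split())
        counts.update(words & candidates)  # count each keyword once per entry
    return sorted(
        keyword
        for keyword, number in counts.items()
        if number >= min_count and keyword not in known
    )


# Invented receipt descriptions of one employee.
descriptions = [
    "Taxi to airport",
    "Taxi to hotel",
    "Dinner with client",
    "taxi downtown",
    "Parking garage",
]

new_classes = derive_classifications(
    descriptions,
    candidates={"taxi", "parking", "train"},
    known={"hotel", "meal"},
)
```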
Computing machine 100 also hosts the cache 107. The cache 107 within the present embodiments may be composed of hardware and software components that store the data entries and the machine learning algorithm so that the methodologies or parts of the methodologies discussed herein can be carried out. There can be hardware-based caches such as CPU caches, GPU caches, digital signal processors and translation lookaside buffers, as well as software-based caches such as page caches, web caches (Hypertext Transfer Protocol, HTTP, caches), etc. Computer 1, scanning devices 2 and databases 3 may each comprise a cache 107.
A set of computer-executable instructions (i.e., computer program code) embodying any one, or all, of the methodologies described herein resides completely, or at least partially, in or on a machine-readable medium, e.g., the main memory 106. Main memory 106 hosts computer program code for functional entities such as database request processing 108, which includes the functionality to receive and process database requests, and data processing functionality 109. The instructions may further be transmitted or received as a propagated signal via the Internet through the network interface device 103.
Communication within the computing machine 100 is performed via bus 104. Basic operation of the computing machine 100 is controlled by an operating system which is also located in the main memory 106, the at least one processor 101 and/or the static memory 105.
In general, the routines executed to implement the embodiments, whether implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions, or even a subset thereof, may be referred to herein as “computer program code” or simply “program code”. Program code typically comprises computer-readable instructions that are resident at various times in various memory and storage devices in a computer and that, when read and executed by one or more processors in a computer, cause that computer to perform the operations necessary to execute operations and/or elements embodying the various aspects of the embodiments of the invention. Computer-readable program instructions for carrying out operations of the embodiments of the invention may be, for example, assembly language or either source code or object code written in any combination of one or more programming languages.
The program code embodied in any of the applications/modules described herein is capable of being individually or collectively distributed as a program product in a variety of different forms. In particular, the program code may be distributed using a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to carry out aspects of the embodiments of the invention.
Computer-readable storage media, which are inherently non-transitory, may include volatile and non-volatile, and removable and non-removable tangible media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media may further include random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, portable compact disc read-only memory (CD-ROM), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and which can be read by a computer. A computer-readable storage medium should not be construed as transitory signals per se (e.g., radio waves or other propagating electromagnetic waves, electromagnetic waves propagating through a transmission medium such as a waveguide, or electrical signals transmitted through a wire). Computer-readable program instructions may be downloaded to a computer, another type of programmable data processing apparatus, or another device from a computer-readable storage medium or to an external computer or external storage device via a network.
Computer-readable program instructions stored in a computer-readable medium may be used to direct a computer, other types of programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that implement the functions/acts specified in the flowcharts, sequence diagrams, and/or block diagrams. The computer program instructions may be provided to one or more processors of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the one or more processors, cause a series of computations to be performed to implement the functions and/or acts specified in the flowcharts, sequence diagrams, and/or block diagrams.
In certain alternative embodiments, the functions and/or acts specified in the flowcharts, sequence diagrams, and/or block diagrams may be re-ordered, processed serially, and/or processed concurrently without departing from the scope of the embodiments of the invention. Moreover, any of the flowcharts, sequence diagrams, and/or block diagrams may include more or fewer blocks than those illustrated consistent with embodiments of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “includes”, “having”, “has”, “with”, “comprised of”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
While all of the invention has been illustrated by a description of various embodiments and while these embodiments have been described in considerable detail, it is not the intention of the Applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. The invention in its broader aspects is therefore not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the Applicant's general inventive concept.