There is a long-felt need for an artificial intelligence expert system which can detect an anomaly in an application process.
The summary of the invention is provided as a guide to understanding the invention. It does not necessarily describe the most generic embodiment of the invention or the broadest range of alternative embodiments.
An artificial intelligence expert system detects an anomaly between an application and a publication provided by an applicant in a given class. The system calculates an application score using the application and then uses a decision tree to determine a publication score based on the presence or absence of tokens in the publication. An anomaly is detected when the ratio of the publication score to the application score is greater than a threshold value. The decision tree is trained using prior applications in the same class. Subsets of the prior applications are assigned to the leaf nodes of the decision tree. The publication score of each leaf node is set equal to an average application score of the prior applications associated with the leaf node. The threshold value is based on the upper tail of the distribution of anomaly ratios calculated for the prior applications.
The detailed description describes non-limiting exemplary embodiments. Any individual features may be combined with other features as required by different applications for at least the benefits described herein.
As used herein, the term “about” means plus or minus 10% of a given value unless specifically indicated otherwise.
As used herein, a “computer-based system” or “computer implemented system” comprises an input device for receiving data, an output device for outputting data in tangible form (e.g. printing or displaying on a computer screen), a permanent memory for storing data as well as computer code, and a microprocessor for executing computer code wherein said computer code resident in said permanent memory will physically cause said microprocessor to read-in data via said input device, process said data within said microprocessor and output said processed data via said output device.
As used herein, “web site” may comprise any information published by an entity on the World Wide Web. This includes company sponsored domains (e.g. www.companyA.com), Facebook® pages, LinkedIn® pages, Twitter® feeds or any other publication whose content is controlled by said entity. Web sites are considered public if their content can be retrieved via the Internet by a member of the public using a citation to the web site, such as a URL.
As used herein, a “decision tree” is a set of logical operations that determine a leaf node for an entity based on data associated with the entity. Commercially available decision tree modeling software develops a decision tree based on a set of training data. AnswerTree™ software is adequate for developing a decision tree. AnswerTree™ is a trademark of SPSS Inc.
In order to detect an anomaly between an application by an applicant and a publication by an applicant, a decision tree can be generated that relates the presence or absence of prospective tokens in publications produced by a training set of prior applicants and prior application scores calculated for said prior applicants. The prior application scores are produced by an application scoring algorithm using data from prior applications. The application scores are indicative of a loss function for said prior applicants. The training set is limited to prior applicants in the same class as the applicant. The class of an applicant is part of a classification scheme of applicants. The classification scheme may be hierarchical. The class of the applicant may be a terminal class within the hierarchical classification scheme.
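By way of non-limiting illustration, the following sketch shows how a trained decision tree might be applied to the set of tokens found in an applicant's publication in order to look up a publication score. The `Node` structure, the token names and the leaf scores are hypothetical and are not taken from any tree described herein.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Node:
    token: Optional[str] = None       # token tested at a branch node; None at a leaf
    present: Optional["Node"] = None  # child followed when the token is present
    absent: Optional["Node"] = None   # child followed when the token is absent
    score: Optional[float] = None     # publication score stored at a leaf node


def publication_score(tree: Node, tokens_found: Set[str]) -> float:
    """Walk from the root to a leaf using token presence or absence."""
    node = tree
    while node.token is not None:
        node = node.present if node.token in tokens_found else node.absent
    return node.score


# Hypothetical two-branch tree; all values are illustrative only.
EXAMPLE_TREE = Node(
    token="tree_work",
    present=Node(score=22.0),
    absent=Node(
        token="lawn_care",
        present=Node(score=8.5),
        absent=Node(score=12.0),
    ),
)
```

Each call to `publication_score` follows exactly one path through the tree, so every applicant in the class is assigned to exactly one leaf node.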
The number of branch nodes in the decision tree should be limited to avoid overfitting. A suitable limitation is not more than one branch node per 50 prior applications in the training database. This helps ensure that even the more sparsely populated leaf nodes will still have enough associated prior application scores to have a meaningful average. The more sparsely populated leaf nodes should have at least 10 associated prior application scores.
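The size limits above reduce to simple arithmetic, sketched below. The function and parameter names are hypothetical; only the figures of 50 prior applications per branch node and 10 scores per leaf node come from the description.

```python
def max_branch_nodes(n_prior_applications, applications_per_branch=50):
    """Upper limit on branch nodes: at most one per 50 prior applications."""
    return n_prior_applications // applications_per_branch


def leaves_adequately_populated(leaf_counts, min_leaf_size=10):
    """True if every leaf node has at least the minimum number of scores."""
    return all(count >= min_leaf_size for count in leaf_counts)
```

For a training set of 1,544 prior applications, `max_branch_nodes(1544)` evaluates to 30.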
The ratio of a publication score to an application score is defined herein as an “anomaly ratio”. The distribution of anomaly ratios for the prior applications gives a measure of how good the fit is between the publication scores and the application scores. A distribution where 80% or more of the anomaly ratios fall between 0.75 and 1.25 is considered a good fit. The distribution can also be used to set a threshold value for when the decision tree is applied to an applicant. If the threshold is set at the anomaly ratio where the upper tail of the distribution is 0.1 or less (i.e., no more than 10% of the prior applications have a higher anomaly ratio), then the number of false positives for detecting an anomaly will be acceptably low.
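The anomaly ratio, the goodness-of-fit check and the upper-tail threshold can be sketched as follows. This is a minimal illustration under the assumption that the threshold is taken to be the smallest observed anomaly ratio for which the fraction of strictly larger ratios does not exceed the tail size; other quantile conventions would serve equally well. The function names and the sample ratios are hypothetical.

```python
def anomaly_ratio(publication_score, application_score):
    """Anomaly ratio: publication score divided by application score."""
    return publication_score / application_score


def fraction_in_band(ratios, low=0.75, high=1.25):
    """Fraction of anomaly ratios falling between low and high, inclusive."""
    return sum(low <= r <= high for r in ratios) / len(ratios)


def upper_tail_threshold(ratios, tail=0.1):
    """Smallest observed ratio with at most `tail` of the ratios above it."""
    if not ratios:
        raise ValueError("at least one anomaly ratio is required")
    ordered = sorted(ratios)
    n = len(ordered)
    for r in ordered:
        if sum(x > r for x in ordered) / n <= tail:
            return r


def is_anomaly(ratio, threshold):
    """An anomaly is indicated only when the ratio exceeds the threshold."""
    return ratio > threshold


# Hypothetical anomaly ratios for ten prior applications.
PRIOR_RATIOS = [0.8, 0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 1.2, 1.3, 1.6]
```

With these sample ratios, eight of the ten fall between 0.75 and 1.25, which meets the 80% criterion for a good fit, and the threshold for an upper tail of 0.1 is 1.3.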
The process for generating decision trees and associated sets of confirmed tokens can be repeated for different classes within the classification scheme. The decision trees can be stored in a decision tree database and retrieved for different applicants depending upon the class of the applicant. Decision trees may be limited to classes with a broad range of loss ratios for prior applicants. A spread in loss ratios of 1.5× or higher is considered adequately large.
The representation anomaly system may indicate that there is no anomaly 268 if the anomaly ratio is less than or equal to the threshold. The decision tree database may comprise a plurality of decision trees each associated with a unique class in the classification scheme. Each decision tree may also be associated with a set of confirmed tokens also associated with a class in the classification scheme.
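A decision tree database keyed by class might be organized as in the following sketch. The `ClassModel` record, the scoring function standing in for a stored decision tree, the token names and the threshold of 1.3 are all hypothetical.

```python
from typing import Callable, NamedTuple, Set


class ClassModel(NamedTuple):
    confirmed_tokens: frozenset                      # tokens used by this class's tree
    score_from_tokens: Callable[[Set[str]], float]   # stands in for a stored decision tree
    threshold: float                                 # anomaly ratio threshold for the class


def check_applicant(database, class_code, application_score, tokens_found):
    """Retrieve the model for the applicant's class and test for an anomaly."""
    model = database.get(class_code)
    if model is None:
        return "no decision tree for class"
    confirmed = set(tokens_found) & model.confirmed_tokens
    ratio = model.score_from_tokens(confirmed) / application_score
    return "anomaly" if ratio > model.threshold else "no anomaly"


def landscaping_score(tokens):
    # Hypothetical one-branch tree for a single class.
    return 22.0 if "tree_work" in tokens else 9.0


DATABASE = {
    "561730": ClassModel(frozenset({"tree_work", "lawn_care"}), landscaping_score, 1.3),
}
```

For example, `check_applicant(DATABASE, "561730", 10.0, {"tree_work"})` yields an anomaly ratio of 2.2 and returns `"anomaly"`, whereas the same application score with only `"lawn_care"` found returns `"no anomaly"`.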
The citations may be URLs. The URLs may be to publications by an applicant, such as the applicant's commercial web site or social media page. The web pages or social media pages are considered public if they can be reached through a search engine, such as Google®, Yahoo® or Bing®.
An individual token may severally comprise a set of words, phrases, word stems, images, sounds or other media that are synonyms of each other. As used herein, “severally comprise” means that if any one of the words, phrases, word stems, images, sounds or other media in an individual token set is detected in a publication, then the token is considered to be present in the publication.
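For textual publications, the "severally comprise" rule can be sketched as a case-insensitive substring search, which also accommodates word stems. The token names and the grouping of phrases into token sets below are hypothetical, and images, sounds and other media are outside the scope of this sketch.

```python
def tokens_present(text, token_sets):
    """Return the names of tokens having at least one member found in the text."""
    lowered = text.lower()
    return {
        name
        for name, members in token_sets.items()
        if any(member.lower() in lowered for member in members)
    }


# Hypothetical token sets; each token is present if ANY member is found.
TOKEN_SETS = {
    "tree_work": ["tree removal", "crane service"],
    "lawn_care": ["lawn mowing", "leaf removal"],
}
```

A page reading "We offer Tree Removal and stump grinding." would therefore be reported as containing the `tree_work` token only.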
The anomaly detection systems described herein may be applied to any application process where an applicant provides an application to an examiner for approval, the application can be rated with a loss function, and the applicant provides publications about itself that can be reviewed. Said publications can be intentional or incidental. An intentional publication might be a web site authored by an applicant. An incidental publication might be a credit history, arrest record or a phone conversation that is monitored. Said application processes include but are not limited to applications for employment, insurance coverage, university admission, club membership, citizenship, loans, insurance claims adjudication, and grants. Applicants may be natural persons, automata, or juristic entities, such as corporations. Similarly, examiners can be natural persons, automata, or juristic entities.
The anomaly detection systems described herein were applied to the process of a company applying for workers' compensation coverage for its employees. The application is referred to herein as a “submission”. The loss function is “premium rate”. A premium rate is a premium divided by the payroll of the covered employees in a company. A premium may be a pure premium. A pure premium is the expected cost of benefits provided to injured employees due to their injuries. An application score is referred to herein as a “submission-based premium rate”. A publication score is referred to herein as a “web-based premium rate”. The classification scheme for applicants is an industrial classification code. Prior application records are referred to as “historical submissions”. The publications reviewed were web sites. The citations were URLs of web sites.
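The premium rate calculation is a single division, sketched below with the result expressed in dollars per $100 of payroll, the unit in which premium rates are quoted elsewhere in this description. The function name and the sample figures are hypothetical.

```python
def premium_rate_per_100(premium, payroll):
    """Premium divided by payroll, expressed per $100 of payroll."""
    return 100.0 * premium / payroll
```

A premium of $20,000 on a covered payroll of $100,000 corresponds to a premium rate of $20 per $100 payroll.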
An anomaly detection decision tree was generated for the industrial classification code NAICS 561730 (Landscaping) using training data from 1,544 historical workers' compensation submissions from companies in said industrial classification. 561730 is a terminal class of the NAICS classification scheme. A set of prospective tokens was developed by individually reviewing the web sites of said submissions to identify words, phrases and word stems that in a reviewer's judgment seemed to be related to the level of injury hazard the employees of said companies might face while on the job. High risk phrases such as “tree removal” and “crane service” were commonly found on the web sites of companies with high submission-based premium rates (e.g. greater than $20 per $100 payroll). Low risk phrases such as “lawn mowing” and “leaf removal” were commonly found on web sites of companies with low submission-based premium rates (e.g. less than $10 per $100 payroll). After the set of prospective phrases was developed, the phrases were grouped into 6 sets of synonyms. In this example, synonyms were words that indicated a common level of expected loss function. They were not necessarily literal synonyms of each other. Each of the 6 sets was then defined as a prospective token. A web scraping program written in the Python® programming language was used to thoroughly examine the entire contents of the companies' web site domains to determine the presence or absence of said tokens on said domains. The token data and submission-based premium rates associated with said historical insurance submissions were read into the decision tree generating program AnswerTree™ running on an appropriate computer-based system. The decision tree generating program calculated the nodes and branches of a decision tree that grouped the historical submissions into leaf nodes that minimized the total standard deviation of the submission-based premium rates of the submissions grouped into each leaf node.
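The splitting criterion can be illustrated with a simplified, single-level sketch that chooses the token whose presence or absence divides the submissions into two groups with the lowest summed standard deviation of premium rates. This is not a reproduction of the algorithms used by AnswerTree™; the function name, the token names and the sample data are hypothetical.

```python
from statistics import pstdev


def best_split_token(records, tokens):
    """Pick the token whose present/absent split minimizes total standard deviation.

    records: list of (set_of_tokens_found, submission_based_premium_rate)
    """
    best_token, best_cost = None, None
    for token in tokens:
        present = [rate for found, rate in records if token in found]
        absent = [rate for found, rate in records if token not in found]
        if not present or not absent:
            continue  # this token does not divide the submissions
        cost = pstdev(present) + pstdev(absent)
        if best_cost is None or cost < best_cost:
            best_token, best_cost = token, cost
    return best_token


# Hypothetical submissions: (tokens found on web site, premium rate per $100 payroll).
RECORDS = [
    ({"tree_work"}, 24.0),
    ({"tree_work"}, 22.0),
    ({"lawn_care"}, 8.0),
    ({"lawn_care", "irrigation"}, 9.0),
    ({"irrigation", "tree_work"}, 23.0),
    (set(), 10.0),
]
```

With this sample data the `tree_work` token is selected, since it separates the high premium rates (22 to 24) from the low premium rates (8 to 10). A full tree would apply the same selection recursively within each resulting group, subject to the limits on branch nodes and leaf population.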
The web-based premium rate of each leaf node was then set to the average of the submission-based premium rates of the historical submissions in that leaf node, weighted by the total payroll of each historical submission. More details of the algorithms used by AnswerTree™ are found in “AnswerTree™ 2.0 User's Guide”, by SPSS Inc., 1998. Said guide is incorporated herein by reference.
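The payroll-weighted average for a leaf node can be sketched as follows. The function name and the sample figures are hypothetical.

```python
def payroll_weighted_rate(submissions):
    """Average premium rate of a leaf node, weighted by payroll.

    submissions: list of (submission_based_premium_rate, total_payroll)
    """
    total_payroll = sum(payroll for _, payroll in submissions)
    weighted_sum = sum(rate * payroll for rate, payroll in submissions)
    return weighted_sum / total_payroll
```

For a leaf node holding one submission at $20 per $100 payroll on a $100,000 payroll and another at $10 per $100 payroll on a $300,000 payroll, the weighted rate is $12.50 per $100 payroll, compared with an unweighted average of $15.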
The decision tree generating process can then be repeated for other sets of historical submissions in other terminal or non-terminal classes of a classification scheme. Not all industries in the classification scheme need to have an anomaly detection decision tree. Certain industries tend to have a wider spread in premium rate at their classification code than others. A spread of a factor of 1.5 or greater in a terminal class indicates the need for a decision tree. “Spread” as used herein refers to the ratio of a high value to a low value for a given parameter. Table 1 provides examples of industries that have a spread in workers' compensation premium rates of 1.5 or greater. The list is not exhaustive.
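The spread test can be sketched as below, under the assumption that the "high value" and "low value" are simply the largest and smallest premium rates observed in the class; percentiles or other robust measures could be substituted. The function names are hypothetical.

```python
def spread(values):
    """Ratio of the highest value to the lowest value."""
    return max(values) / min(values)


def needs_decision_tree(premium_rates, min_spread=1.5):
    """True if the class has a spread of 1.5 or greater in premium rate."""
    return spread(premium_rates) >= min_spread
```

A class whose premium rates range from $8 to $24 per $100 payroll has a spread of 3.0 and would warrant a decision tree, whereas a class ranging from $10 to $12 has a spread of 1.2 and would not.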
While the disclosure has been described with reference to one or more different exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the disclosure. In addition, many modifications may be made to adapt to a particular situation without departing from the essential scope or teachings thereof. Therefore, it is intended that the disclosure not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention.
Number | Name | Date | Kind |
---|---|---|---|
5642504 | Shiga | Jun 1997 | A |
6691105 | Virdy | Feb 2004 | B1 |
7315826 | Guheen | Jan 2008 | B1 |
7376618 | Anderson et al. | May 2008 | B1 |
7613687 | Nye | Nov 2009 | B2 |
7788536 | Qureshi | Aug 2010 | B1 |
7813944 | Luk et al. | Oct 2010 | B1 |
7831451 | Morse et al. | Nov 2010 | B1 |
7899762 | Iyengar et al. | Mar 2011 | B2 |
8073789 | Wang et al. | Dec 2011 | B2 |
8412645 | Ramaswamy et al. | Apr 2013 | B2 |
8412656 | Baboo et al. | Apr 2013 | B1 |
8538784 | Witkowski et al. | Sep 2013 | B2 |
8572013 | Nash | Oct 2013 | B1 |
8645170 | Boone et al. | Feb 2014 | B2 |
8660864 | Krause et al. | Feb 2014 | B2 |
8676612 | Helitzer et al. | Mar 2014 | B2 |
8725661 | Goldman et al. | May 2014 | B1 |
8805754 | Zhou et al. | Aug 2014 | B2 |
8850550 | Dalzell | Sep 2014 | B2 |
8972325 | Varghese | Mar 2015 | B2 |
9088596 | Ciocarlie et al. | Jul 2015 | B2 |
9141914 | Viswanathan et al. | Sep 2015 | B2 |
9171158 | Akoglu et al. | Oct 2015 | B2 |
9189370 | Lee | Nov 2015 | B2 |
9349103 | Eberhardt, III et al. | May 2016 | B2 |
9378465 | Stewart et al. | Jun 2016 | B2 |
9424745 | Kagoshima et al. | Aug 2016 | B1 |
9430952 | Bohra et al. | Aug 2016 | B2 |
9612812 | Arcilla | Apr 2017 | B2 |
20020055862 | Jinks | May 2002 | A1 |
20020111835 | Hele et al. | Aug 2002 | A1 |
20030125990 | Rudy et al. | Jul 2003 | A1 |
20050091227 | McCollum | Apr 2005 | A1 |
20050187865 | Grear | Aug 2005 | A1 |
20060143175 | Ukrainczyk | Jun 2006 | A1 |
20070094302 | Williamson | Apr 2007 | A1 |
20080086433 | Schmidtler | Apr 2008 | A1 |
20080183508 | Harker et al. | Jul 2008 | A1 |
20090006616 | Gore | Jan 2009 | A1 |
20110015948 | Adams et al. | Jan 2011 | A1 |
20120290330 | Coleman et al. | Nov 2012 | A1 |
20130013345 | Wallquist et al. | Jan 2013 | A1 |
20130018823 | Masood | Jan 2013 | A1 |
20130226623 | Diana et al. | Aug 2013 | A1 |
20130339220 | Kremen et al. | Dec 2013 | A1 |
20130340082 | Shanley | Dec 2013 | A1 |
20140114694 | Krause et al. | Apr 2014 | A1 |
20140129261 | Bothwell et al. | May 2014 | A1 |
20140180974 | Kennel et al. | Jun 2014 | A1 |
20140214734 | Ozonat et al. | Jul 2014 | A1 |
20150220862 | DeVries et al. | Aug 2015 | A1 |
Number | Date | Country |
---|---|---|
WO 2004088476 | Oct 2004 | WO |
Entry |
---|
Nowak, Boguslaw; Nowak, Maciej; Multi-Criteria Decision Aiding in Project Planning Using Decision Trees and Simulation; International Workshop on Multiple Criteria Decision Making. 2010/2011, p. 163-187. 25p. |
Wikipedia, Classification scheme, https://en.wikipedia.org/wiki/Classification_scheme, last viewed Mar. 10, 2016. |
Wikipedia, Loss function, https://en.wikipedia.org/wiki/Loss_function, last viewed Oct. 21, 2015. |
Wikipedia, Tree (data structure), https://en.wikipedia.org/wiki/Tree_(data_structure)#Terminology, last viewed Mar. 10, 2016. |
De Vries, et al., U.S. Appl. No. 61/935,922 for System and Method for Automated Detection of Insurance Fraud dated Feb. 5, 2014. |
SPSS Inc., AnswerTree™ 2.0 User's Guide pp. 1-99, 1998. |
SPSS Inc., AnswerTree™ 2.0 User's Guide pp. 100-203, 1998. |
Nasser Hadidi, Ph.D., Classification Ratemaking Using Decision Trees, 2003. |
Hendrix, Leslie; Elementary Statistics for the Biological and Life Sciences, course notes University of South Carolina, Spring 2012. |
Insurance Fund Manual, National Council on Compensation Insurance (NCCI) Classification of Industries pp. 218-219, Rev. Jul. 2012. |
SAS, Combating Insurance Claims Fraud, How to Recognize and Reduce Opportunistic and Organized Claims Fraud, White Paper. |
Lanzkowsky, Marc, The Claims SPOT, 3 Perspectives on the Use of Social Media in the Claims Investigation Process, http://theclaimsspot.com/2010/10/25/3-perspectives-on-the-use-of-social-media-in-the-claims-investigation-process/, dated Oct. 25, 2010. |
Networked Insurance Agents, Workers' Compensation California Class Codes, Nov. 2010. |
NAICS, 2012 Definition File Sector 11—Agriculture, Forestry, Fishing and Hunting. |
NAICS, 2012 Definition File Sector 5617 Services to Buildings and Dwellings. |
Wikipedia.com, Spokeo, https://en.wikipedia.org/wiki/Spokeo, Mar. 10, 2014. |
WCIRB California, California Workers' Compensation Uniform Statistical Reporting Plan—1995 Title 10, California Code of Regulations Section 2318.6, Effective Jan. 1, 2014. |
Wikipedia, Decision tree learning, May 12, 2014. |
NCCI, Scopes Manual, Posted Mar. 1, 2014, https://www.ncci.com/manuals/scopes/scopes/scopes-r00399.htm. |
WSJ Blogs, Gary Kremen's New Venture, Sociogramics, Wants to Make Banking Human Again, http://blogs.wsj.com/venturecapital/2012/02/24/gary-kremens-new-venture-sociogramics-raises-2m-to-make-banking-human-again/, dated Feb. 24, 2012. |
Wikipedia.com, Hierarchy, https://en.wikipedia.org/wiki/Hierarchy, Mar. 1, 2016. |
Wikipedia.com, Tree structure, https://en.wikipedia.org/wiki/Tree_structure, Mar. 1, 2016. |
Number | Date | Country |
---|---|---|
61950921 | Mar 2014 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14469632 | Aug 2014 | US |
Child | 15081991 | | US |