The disclosure relates generally to a system and method for matching and merging disparate documents.
Data about an entity, such as a subject, company, idea or the like, may be stored in a plurality of disparate data sources. In order to be able to assemble the data about the entity from the disparate sources into a single data store, it is necessary to try to gather the various data from the various data sources and then determine a way to combine the data from the disparate data sources for the particular entity into the single data store.
In the healthcare industry, information/data about each healthcare provider, such as a doctor, a therapist, a nurse, a hospital, a medical practice and the like, may be stored in a plurality of disparate data sources. The information/data about the healthcare provider may include, for example, reviews, directions, rates and the like. The disparate data sources for the data/information for the healthcare provider may range from publicly available Centers for Medicare and Medicaid Services' (CMS) National Plan and Provider Enumeration System (NPPES) data to privately curated and licensed data from the American Medical Association (AMA), among others.
The issues that must be confronted in order to successfully integrate the data from these various data sources into a single data store may include:
Thus, it is desirable to provide a system and method for dynamic data identification and combining so that, for example, data from disparate data sources for a healthcare provider may be combined into a single data store.
The disclosure is particularly applicable to a healthcare system in which healthcare provider data is matched and merged and it is in this context that the disclosure will be described. It will be appreciated, however, that the system and method has greater utility since the system and method may be used with any type of entity for which it is desirable to be able to match and merge data about the entity from disparate data sources. Furthermore, the system and method may be used in any industry for which it is desirable to be able to match and merge data about the entity from disparate data sources. For purposes of this disclosure, an entity may be a subject, an idea, a professional, a person, a corporation, a business entity and the like.
In an example healthcare embodiment, a healthcare system may have a goal of providing healthcare pricing transparency and connecting consumers directly to healthcare providers. To provide that healthcare pricing transparency, the healthcare system needs to maintain a comprehensive and up-to-date directory of healthcare providers. In order to build this provider directory, data are combined from disparate sources ranging from the publicly available Centers for Medicare and Medicaid Services' (CMS) National Plan and Provider Enumeration System (NPPES) data to privately curated and licensed data from the American Medical Association (AMA), among others. These data take the form of structured records on a per-provider basis, referred to herein as provider documents.
The system and method provide a computational process to match provider documents from disparate sources which refer to the same provider and merge those documents into a single comprehensive view while taking into account the relative trustworthiness of the data sources for each available data field. The generated single comprehensive view facilitates a more accurate services purchasing and recommendation experience for the healthcare consumer as well as the practitioner in the application domain. The ability to dynamically match disparate data sources with a data hygiene metric is crucial in evaluating the behaviors and ratings for ranking practitioners that will be listed in a marketplace of the healthcare system. This improved matching model further facilitates a faceted search paradigm, much like the one a user would encounter when searching for a camera at an internet marketplace site.
The system may include one or more of the following components:
For example, the system may use Bayesian Identity Resolution in which comparators and weight ranges are specified for a subset of the fields in the documents which are determined to be the best features for determining matches. When document pairs are evaluated, each field in the documents is compared using the specified comparator and the result is scaled to the specified weight range, resulting in a weighted match score for the field. These weighted field match scores are combined using Bayes' theorem to provide an overall match score for the two documents. If this document match score is above a designated threshold, then the two documents are considered to be a match; otherwise they are considered not to match.
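The scoring described above may be sketched as follows. This is a minimal illustration and not the system's actual implementation: the `exact` comparator, the field names, the weight ranges and the neutral prior of 0.5 are all assumptions made for the example, and the combination rule is the standard two-hypothesis form of Bayes' theorem under an independence assumption.

```python
from typing import Callable, Dict, Tuple

# Hypothetical per-field configuration: (comparator, low weight, high weight).
Comparator = Callable[[str, str], float]
FieldConfig = Dict[str, Tuple[Comparator, float, float]]


def exact(a: str, b: str) -> float:
    """Return 1.0 for a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def scale(similarity: float, low: float, high: float) -> float:
    """Map a similarity in [0, 1] onto the field's weight range [low, high]."""
    return low + similarity * (high - low)


def combine(p1: float, p2: float) -> float:
    """Combine two match probabilities with Bayes' theorem.

    Assumes independent evidence; weights must lie strictly between
    0 and 1 so the denominator cannot be zero.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_score(doc_a: Dict[str, str], doc_b: Dict[str, str],
                config: FieldConfig) -> float:
    """Return the overall match score for two documents."""
    score = 0.5  # neutral prior (an assumption of this sketch)
    for field, (comparator, low, high) in config.items():
        if field not in doc_a or field not in doc_b:
            continue  # fields missing from either document contribute nothing
        similarity = comparator(doc_a[field], doc_b[field])
        score = combine(score, scale(similarity, low, high))
    return score


def is_match(doc_a: Dict[str, str], doc_b: Dict[str, str],
             config: FieldConfig, threshold: float) -> bool:
    """Apply the designated threshold to the document match score."""
    return match_score(doc_a, doc_b, config) > threshold
```

In this sketch, a field whose weight range is, for example, 0.1 to 0.95 pulls the score strongly toward a match when the values agree and strongly away when they differ, while a narrower range such as 0.3 to 0.8 has a weaker effect.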
As another example, the system may use Elasticsearch. Elasticsearch is a distributed, RESTful, free/open source search server based on Apache Lucene, an open source information retrieval software library. To perform document matching using Elasticsearch, a collection of documents is first “indexed” using the Elasticsearch API. Then a collection of documents is iterated upon, and a precise boolean query is constructed from select fields of each iterated document. If the necessary fields are present in this “query” document, the query is issued against the Elasticsearch index, and any results indicate a positive match, which is saved into a results collection. For collection deduplication, the iterated collection may be the same collection that was indexed. Alternatively, for record “linkage”, an entirely different collection may be iterated upon.
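The query construction step may be sketched as follows. The sketch only builds the request body using the Elasticsearch query DSL (`bool`, `must` and `match` clauses); it does not contact a server. The field names in `REQUIRED_FIELDS` and the helper name `build_bool_query` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical fields that a "query" document must carry to be matched.
REQUIRED_FIELDS = ("last_name", "first_name", "zip_code")


def build_bool_query(doc: Dict[str, Any],
                     fields: Iterable[str] = REQUIRED_FIELDS
                     ) -> Optional[Dict[str, Any]]:
    """Build a boolean query body from select fields of a document.

    Returns None when any necessary field is absent or empty, in which
    case no query would be issued for this document.
    """
    fields = tuple(fields)
    if any(not doc.get(name) for name in fields):
        return None
    # Every selected field must match for an indexed document to be a hit.
    must_clauses = [{"match": {name: doc[name]}} for name in fields]
    return {"query": {"bool": {"must": must_clauses}}}
```

A body built this way would be sent to the search endpoint of the index holding the indexed collection, and each returned hit would be recorded in the results collection as a proposed match for the query document.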
Prior to running the ensemble of matcher algorithms, each of the source documents (raw files in
Following the initial data cleansing, each matcher algorithm may be run (matcher processes 106) against the entire set of N provider documents from all sources (our search space). This may be viewed as a sequence of queries using M canonical data source documents as the query documents for which we wish to find corresponding matches in the search space, resulting in M match sets (see
The generated match sets do not contain the actual matching documents, but rather contain references to the matching documents' storage locations and unique identifiers as shown in
A statistical model may be constructed using the results of human evaluation of a random sample of match sets produced by the matcher ensemble. The human evaluator may be presented with the query document and each pair-wise combination with the matching documents represented by a match set. The evaluator determines whether the two documents refer to the same provider, and the determination (or score) is stored for future reference. It is possible that a match set contains both correct and incorrect matches.
The collection of match scores forms the basis of the training data for building the statistical model, along with the feature vector for each document in the training data. For example, a feature vector may be:
A sparse representation of the feature vector for one record in the training data set. This shows that a provider document in the ppd_quarterly_startup source was correctly matched with a provider document in the nppes_npi source. The presence of a field name in the field_distances data structure indicates that the field was present in both documents, and the associated number is the Levenshtein distance between the field values in the two documents. These field names are based on the example in
These features (and all the features of the entire training data set) are the predictors for the Bayesian classifier.
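A sparse feature vector of the kind described above could be computed as in the following sketch. Only the `field_distances` name and the two source names come from the description; the other dictionary keys, the function names and the sample field names are assumptions made for illustration.

```python
from typing import Any, Dict


def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def feature_vector(query_doc: Dict[str, Any], match_doc: Dict[str, Any],
                   query_source: str, match_source: str,
                   is_match: bool) -> Dict[str, Any]:
    """Build a sparse feature vector for one evaluated document pair.

    Only fields present in both documents appear in field_distances.
    """
    shared = sorted(set(query_doc) & set(match_doc))
    return {
        "query_source": query_source,   # hypothetical key name
        "match_source": match_source,   # hypothetical key name
        "is_match": is_match,           # the human evaluator's determination
        "field_distances": {
            name: levenshtein(str(query_doc[name]), str(match_doc[name]))
            for name in shared
        },
    }
```

A distance of 0 for a field indicates identical values in the two documents, and a field that is absent from either document is simply omitted, which keeps the representation sparse.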
The feature vectors may comprise individual data points such as document sources, available document fields, and the similarity of fields between query and matching documents. Bayesian inference may then be used to determine whether a proposed match as presented in a match set is predicted to be valid. By taking each match in a match set into consideration individually, it is possible to accept or reject subsets of a match set. Because the same set of query documents is used across matchers, the accepted matches for each query document across all matchers can be combined into complete match sets.
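The per-match acceptance and the combination across matchers may be sketched as follows. The classifier is represented only by an `accept` callable, since the trained Bayesian model itself is outside the scope of this sketch; the matcher names and document identifiers in any usage are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple

# One proposed match: (query document id, matching document id).
ProposedMatch = Tuple[str, str]


def combine_accepted(
    matcher_results: Dict[str, Iterable[ProposedMatch]],
    accept: Callable[[str, str, str], bool],
) -> Dict[str, Set[str]]:
    """Combine the accepted matches from all matchers per query document.

    `accept(matcher, query_id, match_id)` stands in for the statistical
    model's prediction of whether a single proposed match is valid.
    """
    combined: Dict[str, Set[str]] = defaultdict(set)
    for matcher, proposals in matcher_results.items():
        for query_id, match_id in proposals:
            # Each match is judged individually, so part of a match set
            # may be accepted while the rest is rejected.
            if accept(matcher, query_id, match_id):
                combined[query_id].add(match_id)
    return dict(combined)
```

Because the result is keyed by query document identifier, a match proposed by more than one matcher appears only once in the complete match set.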
At this point in the process, the combined match sets are still represented by references to the documents of interest. The next step is to merge the referenced documents (108) (including both the query and match documents) into a single document, with values from all fields present in each. Provenance is maintained for all field/value combinations to track their origins. New unique identifiers are assigned to the resulting merged documents, even if the merge resulted from a singleton match set.
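A merge that preserves provenance could look like the following sketch. The document layout (`source`, `id` and `fields` keys) and the use of a random UUID as the new unique identifier are assumptions of the example, not details given in the description.

```python
import uuid
from typing import Any, Dict, List


def merge_documents(docs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge query and match documents into one document with provenance.

    Each input document is assumed to look like
    {"source": ..., "id": ..., "fields": {name: value}}.
    """
    merged_fields: Dict[str, List[Dict[str, Any]]] = {}
    for doc in docs:
        for name, value in doc["fields"].items():
            # Keep every value and record which document it came from.
            merged_fields.setdefault(name, []).append({
                "value": value,
                "source": doc["source"],
                "source_id": doc["id"],
            })
    # A new identifier is assigned even for a singleton match set.
    return {"id": str(uuid.uuid4()), "fields": merged_fields}
```

Keeping every value alongside its origin, rather than overwriting, is what allows conflicting values to be resolved and ranked in a later step.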
The merged documents may have conflicting values for any given field. The process may thus have a resolve process 110 to resolve such conflicts and rank the values according to confidence in each value's correctness. The resolve process 110 may be accomplished using a combination of heuristics including majority rule (value support), predetermined confidence for data sources (e.g., trusting state medical boards for practitioner licensing data), or once again a statistical model built from human feedback. For example, a “majority rule” resolver would determine the most consistent data value for a given field by selecting the value that occurs most often for that field. At least three sources would be needed to determine a “winner”. For the merged document in
The canonical documents, which were used as the query documents by the matcher ensemble, now have the field/value combinations from matching documents folded in, along with rankings for each. The consumer of these new documents, as outputted by the system, may choose to utilize the ranked values as appropriate, the simplest case being only to take the highest ranking values. Alternately, the combined documents with ranked values may be preserved as is for display in a faceted browsing system for exploration by the user. The combined documents, in the healthcare example scenario, may be stored in a master directory 112 for healthcare providers.
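The majority rule resolver and the ranking of values described above may be sketched as follows. Treating a tie for the highest support as having no winner is an assumption of this sketch; the description only states that at least three sources are needed.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def rank_by_support(values: Sequence[Hashable]) -> List[Tuple[Hashable, int]]:
    """Rank the values of one field by how many sources support each."""
    # most_common() orders by descending count; equal counts keep the
    # order in which the values were first encountered.
    return Counter(values).most_common()


def majority_winner(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most frequent value, or None if no winner can be named."""
    if len(values) < 3:
        return None  # fewer than three sources cannot determine a winner
    ranked = rank_by_support(values)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place (assumed to mean no winner)
    return ranked[0][0]
```

A consumer that only needs the highest ranking value could call `majority_winner`, while a faceted browsing system could display the full output of `rank_by_support` for each field.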
The one or more data sources 202 may be geographically dispersed or co-located, but each may have a connection to a communication path 204 and may be implemented as a software or hardware based data store or database. The one or more data sources 202 may have data obtained from them over a communication path 204 by a backend unit 206. The communication path 204 may be any wired or wireless network that allows the backend unit 206 to collect data from the data sources, such as the Internet, a wireless data or computer network, a wired data network and the like.
The backend unit 206 may be implemented using one or more cloud computing resources or one or more server computing resources such as at least a processor and a memory. The backend unit may further comprise a plurality of components wherein each component performs one or more processes to implement the matching and merging functionality of the system. Each component may be a plurality of lines of computer code that may be resident in the memory of the cloud computing resources or one or more server computing resources and executed by the processor of the cloud computing resources or one or more server computing resources. Alternatively, each component may be a piece of hardware that implements the operations and processes described. For example, each component may be a programmable logic device, a microprocessor or microcontroller with microcode, an application specific integrated circuit and the like.
The components of the backend unit 206 may include an import and transform component 206A that may perform the import and transform processes 102,104 described above with reference to
In addition to the components, the backend unit 206 may be coupled to a repository 208 that may store the match sets, the merged documents and the merged documents with rank values. In the healthcare example scenario, the repository 208 may also store the healthcare provider directory based on the merged documents with rank values.
Once the human review process is completed, the combined, accepted matches and their match sets may be merged together with the provenance from the match sets. An example of an excerpt from such a document is shown in
While the foregoing has been with reference to a particular embodiment of the invention, it will be appreciated by those skilled in the art that changes in this embodiment may be made without departing from the principles and spirit of the disclosure, the scope of which is defined by the appended claims.
This application claims the benefit of, and priority under 35 USC 119(e) to, U.S. Provisional Patent Application Ser. No. 61/929,787, filed Jan. 21, 2014 and entitled “System and Method for Dynamic Document Matching and Merging”, the entirety of which is incorporated herein by reference.
Number | Date | Country |
---|---|---|
20150205846 A1 | Jul 2015 | US |

Number | Date | Country |
---|---|---|
61929787 | Jan 2014 | US |