Utility Patent Application: A Method and Computer Program Product for Detecting and Identifying Erroneous Medical Abstracting and Coding and Clinical Documentation Omissions; Inventor: Daniel T. Heinze, San Diego, Calif.; Assignee: Gnoetics, Inc., San Diego, Calif. (hereafter referred to as “RELATED APPLICATION”)
The following disclosure relates to methods and computerized tools for performing Natural Language Processing (NLP) tasks on source documents indirectly using novel indexed content, a novel set of query operators and a novel method of composing the linguistic surface forms of query concepts using a natural language surface form ontology.
Free-text Information Retrieval (IR), the location and retrieval of stored free-text documents, is typically performed simultaneously on a large collection of stored or source documents via the intermediary of an index that maps the terms occurring in the source documents to their locations. The mapping of a query concept, composed of query terms, onto a set of zero or more source documents can be very fast if the index terms are searchable by rapid search methods such as hashing and inverted-indexing. IR is designed to produce rapid search results for a single or limited set of query concepts against potentially very large document sets.
Natural Language Processing (NLP), the detailed analysis of free-text documents typically consisting of lexical, syntactic, semantic and pragmatic analysis, is typically performed by direct operation on one source document at a time but will produce results related to many concepts during a single analysis pass on that document.
By comparison to indexed IR, NLP is very slow, but NLP can provide a much higher degree of analytical accuracy and a greater depth of analytical detail in terms of identifying or extracting specific information contained in source documents. In addition to being slow, NLP is less flexible in that if even one of potentially thousands of concepts in the system needs to be changed or updated, the time to reanalyze a document set is the same as if all the concepts had changed.
It is desirable, therefore, to have a method, here referred to as indexed NLP employing techniques of grammatical indexing, that achieves the analytical power of NLP with the computational efficiency and speed of indexed IR. This is particularly true when the number of documents to be analyzed far exceeds the number of concepts to be mapped. For example, in the field of medicine, it is frequently desirable to run an NLP engine that includes tens of thousands of medical finding, diagnosis and procedure concepts against tens of thousands of documents. However, with the rise of “Big Data”, in which multiple sources are consolidated or connected, the need arises to perform rapid, in-depth analysis of millions or even billions of documents. Also needed is the flexibility to rapidly update analysis results for frequent changes to small numbers of the tens of thousands of query and extraction concepts.
In applications where the number of concepts for regular (vs. ad hoc) query and extraction is very large (e.g. medicine), and where the linguistic surface forms expressing a concept are complex and varied, the need arises for a method, here referred to as a surface form ontology, that represents the concepts in a structure that captures the cognitive grammar composition of the linguistic surface forms representing each concept, can be mapped to an indexed NLP query form, and is straightforward to develop and maintain.
A method and computer program product for indexed NLP is disclosed that combines novel grammatical indexing and a surface form ontology with traditional IR indexing and search methods, thus producing rapid, flexible and deep analysis that maps concepts to source documents for retrieval and information extraction.
Techniques for performing NLP via the intermediary of an index on a source document set (from here on, the “source document set” or a “source document” will be referred to respectively as “documents” or “document”) of arbitrary size are disclosed. While the following describes techniques in the context of medical coding and abstracting, and exemplifies them particularly with respect to coding medical documents, some or all of the disclosed techniques can be implemented to apply to any text or language processing system in which it is desirable to perform NLP analysis tasks against a set of documents.
In one aspect, documents in electronic form are indexed on the document terms, parts-of-speech, phrases, clauses, sentences, paragraphs, sections and document source/type. Terms may be single-word or multi-word and are indexed to the begin/end byte offsets within each document in which they occur and to their part-of-speech per occurrence. Phrases, according to their type (e.g. prepositional phrase, noun phrase, verb phrase, etc.), clauses, according to their type (e.g. dependent, independent, etc.), sentences (and sentence fragments), according to their type (e.g. declaration, question, etc.), paragraphs, and sections, according to their type (e.g. subjective, objective, assessment, plan, etc.) are indexed to the begin/end byte offsets within each document in which they occur. Further, principles of cognitive grammar are applied to delimit and index the scope over which a term, phrase, clause, sentence, or paragraph may have influence. Indexed scopes may be nested or overlapping. Document source/type (e.g. lab reports, office visits, discharge summaries, etc.) are indexed to the documents of that source/type.
A query is a construct of concepts that can be mapped onto documents via the index. The constructors for a query are set operators that can be satisfied against the index. Traditional query operators include but are not limited to Boolean, Fuzzy Set, term order and term proximity operators. To these we here add the novel query operators of phraseConstraint, clauseConstraint, sentenceConstraint, paragraphConstraint, sectionConstraint, source/typeConstraint, and scopeConstraint, each relating to the indexing of location (begin/end byte offset and document) and, as applicable, being indexed to the grammatical type (e.g. syntactic category, cognitive grammar category, etc.) of the occurrences in the documents. In this way, query terms can be subjected to syntactic, semantic and pragmatic grammatical constraints (the operators and grammatical constraints will hereafter be referred to as “grammatical operators”). For example, the query “source/typeConstraint(radiology, #sectionConstraint(assessment, phraseConstraint(null, and(rib fracture))))” would require that both the terms “rib” and “fracture” occur within the same phrase (phrase grammatical type not specified), within the assessment section of a radiology document.
The concepts that are constructed to form a query may themselves be complex. The construction of an effective query from a complex concept can be difficult. If the surface forms of the concepts are represented in a surface form ontology as described in RELATED APPLICATION, they can be directly mapped to an indexed NLP query form according to the method here disclosed.
Implementation can optionally include one or more of the following features. In the RELATED APPLICATION ontology, the surface forms that describe the concepts are composed of a finite set of semantic categories. For example, “rib fracture due to blunt trauma” would be composed of diagnosis(diagnosis(anatomicLocation(rib) and morphologicalAbnormality(fracture)) and environmentalCause(environmentalCause(trauma) modifier(blunt))). Using the unconstrained surface form “rib fracture due to blunt trauma” would likely produce low accuracy results in terms of recall (retrieving all the documents containing the concept) and precision (retrieving only the documents containing the concept). The RELATED APPLICATION surface form ontology representation can, however, be automatically translated into an indexed NLP query consisting of grammatical operators and/or traditional operators by assigning to each surface form ontology component a mapping to one or more grammatical operators and/or traditional operators. Mapping types include but are not limited to: 1) surface form A must occur within a single phrase of optional type X; 2) surface form A.1 to A.n must each appear within a single phrase of optional type X and must all occur within a single clause of optional type Y; 3) surface form A must follow/precede/co-occur with surface form B within a clause; 4) surface form A must follow/precede/co-occur with surface form B within a clause without the occurrence of surface form C between; 5) surface form A and surface form B must occur within N contiguous sentences within the same paragraph; 6) surface form A and surface form B must occur within the same paragraph; 7) surface form A and surface form B must occur within the same section; 8) surface form A and surface form B must occur within the same document; 9) surface form A and surface form B must occur within documents that are both indexed to surface form C; where surface form may be a surface form component, a set of surface form components, a surface form, or a set of surface forms as specified by the ontology, or a construct of surface form components, surface forms or surface form sets constructed with the grammatical operators and/or traditional operators.
By associating surface form components and surface forms in the ontology with particular grammatical operators and traditional operators, surface form ontology representations may be automatically translated to indexed NLP queries for information retrieval and extraction.
These aspects can be implemented using an apparatus, a method, a system, or any combination of apparatuses, methods, and systems. The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Techniques for performing NLP via the intermediary of an index on a source document set of arbitrary size are disclosed. While the following describes techniques in the context of medical coding and abstracting, and exemplifies them particularly with respect to coding medical documents, some or all of the disclosed techniques can be implemented to apply to any text or language processing system in which it is desirable to perform NLP analysis tasks against a set of documents.
Various implementations of indexed NLP are possible. The implementation of techniques for grammatical operators used in the method for indexed NLP is based on and includes, but is not limited to, the use of under-specified syntax as embodied in NLP software systems developed by Gnoetics, Inc. and in commercial use since 2009, and the L-space semantics as published in Daniel T. Heinze, “Computational Cognitive Linguistics”, doctoral dissertation, Department of Industrial and Management Systems Engineering, The Pennsylvania State University, 1994 (Heinze-1994). Extending the techniques embodied or described in these sources, novel techniques for indexed NLP are disclosed.
In one aspect, documents in electronic form are indexed by an inverted-index of the document terms, parts-of-speech, phrases, clauses, sentences, paragraphs, sections and document source/type. In addition to an inverted-index, any competent method for indexing or mapping may be employed without departing from the spirit and scope of the claims. Terms may be single-word or multi-word and are indexed to the begin/end byte offsets within each document in which they occur and to their part-of-speech per occurrence. Phrases, according to their type (e.g. prepositional phrase, noun phrase, verb phrase, etc.), clauses, according to their type (e.g. dependent, independent, etc.), sentences (and sentence fragments), according to their type (e.g. declaration, question, etc.), paragraphs, and sections, according to their type (e.g. subjective, objective, assessment, plan, etc.) are indexed to the begin/end byte offsets within each document in which they occur. Document source/type (e.g. lab reports, office visits, discharge summaries, etc.) are indexed to the documents of that source/type. Applying principles of cognitive grammar (Heinze-1994), the scope over which a term, phrase, clause, sentence, paragraph, section or document exercises influence or within which it may bind is indexed. Indexed scopes may be nested or overlapping.
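By way of illustration only, the following Python sketch shows one possible in-memory layout for such an index, with terms and grammatical spans keyed to begin/end byte offsets and documents keyed to their source/type; the class and field names (GrammaticalIndex, Posting, term_index, span_index, doc_type_index) and the sample offsets are hypothetical and are not prescribed by this disclosure.

    # Minimal sketch of a grammatical index keyed by term, by span kind/type,
    # and by document source/type.  All names are illustrative.
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Posting:
        doc_id: str
        begin: int                    # begin byte offset in the document
        end: int                      # end byte offset in the document
        pos: Optional[str] = None     # part-of-speech for this occurrence (terms only)

    class GrammaticalIndex:
        def __init__(self):
            self.term_index = defaultdict(list)     # term -> [Posting, ...]
            self.span_index = defaultdict(list)     # (kind, type) -> [Posting, ...]
            self.doc_type_index = defaultdict(set)  # source/type -> {doc_id, ...}

        def add_term(self, term, doc_id, begin, end, pos):
            self.term_index[term.lower()].append(Posting(doc_id, begin, end, pos))

        def add_span(self, kind, span_type, doc_id, begin, end):
            # kind: "phrase", "clause", "sentence", "paragraph", "section" or "scope";
            # span_type: e.g. "noun phrase", "assessment", or None if unspecified.
            self.span_index[(kind, span_type)].append(Posting(doc_id, begin, end))

        def add_doc_type(self, source_type, doc_id):
            self.doc_type_index[source_type].add(doc_id)

    # Example: index "rib fracture" occurring inside a noun phrase within the
    # assessment section of a radiology report (offsets are made up).
    idx = GrammaticalIndex()
    idx.add_doc_type("radiology", "doc-1")
    idx.add_span("section", "assessment", "doc-1", 100, 300)
    idx.add_span("phrase", "noun phrase", "doc-1", 150, 170)
    idx.add_term("rib", "doc-1", 150, 153, "NN")
    idx.add_term("fracture", "doc-1", 154, 162, "NN")
    print(idx.term_index["rib"])   # [Posting(doc_id='doc-1', begin=150, end=153, pos='NN')]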
A query is a construct of concepts that can be mapped onto documents via the index. The constructors for a query are set operators that can be satisfied against the index. Traditional query operators include but are not limited to Boolean, Fuzzy Set, term order and term proximity operators. To these here are added the novel query operators of phraseConstraint, clauseConstraint, sentenceConstraint, paragraphConstraint, sectionConstraint, source/typeConstraint, and scopeConstraint, each relating to the indexing of location (begin/end byte offset and document) and, as applicable, being indexed to the grammatical type (e.g. syntactic category, etc.) of the occurrences in the documents. In this way, query terms can be subjected to syntactic, semantic and pragmatic grammatical constraints (the operators and grammatical constraints will hereafter be referred to as “grammatical operators”). For example, the query “source/typeConstraint(radiology, #sectionConstraint(assessment, phraseConstraint(null, and(rib fracture))))” would require that both the terms “rib” and “fracture” occur within the same phrase (phrase grammatical type not specified), within the assessment section of a radiology document.
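As a minimal sketch of how such a query could be satisfied purely by containment checks against the index, assuming postings are represented as (document, begin offset, end offset) tuples; the helper names and the sample postings below are illustrative, not a prescribed implementation.

    # Evaluate the example query by nested containment checks over postings.
    def within(inner, outer_spans):
        """Occurrences of `inner` that fall inside some span in `outer_spans`."""
        return [(d, b, e) for (d, b, e) in inner
                if any(d == od and ob <= b and e <= oe for (od, ob, oe) in outer_spans)]

    def and_same_span(term_a, term_b, spans):
        """Spans that contain at least one occurrence of each term."""
        hits = []
        for (d, b, e) in spans:
            in_a = any(d == ad and b <= ab and ae <= e for (ad, ab, ae) in term_a)
            in_b = any(d == bd and b <= bb and be <= e for (bd, bb, be) in term_b)
            if in_a and in_b:
                hits.append((d, b, e))
        return hits

    # Made-up postings for one radiology document.
    rib        = [("doc-1", 150, 153)]
    fracture   = [("doc-1", 154, 162)]
    phrases    = [("doc-1", 150, 170)]    # phrase spans, type unspecified (null)
    assessment = [("doc-1", 100, 300)]    # section spans of type "assessment"
    radiology  = {"doc-1"}                # doc ids of source/type radiology

    # phraseConstraint(null, and(rib fracture))
    phrase_hits = and_same_span(rib, fracture, phrases)
    # sectionConstraint(assessment, ...)
    section_hits = within(phrase_hits, assessment)
    # source/typeConstraint(radiology, ...)
    result_docs = {d for (d, b, e) in section_hits if d in radiology}
    print(result_docs)   # {'doc-1'}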
The concepts that are constructed to form a query may themselves be complex. The construction of an effective indexed NLP query from a complex concept can be difficult. If the surface forms of the concepts are represented in a surface form ontology based on the cognitive grammar in Heinze-1994, they can be directly mapped to indexed NLP query form according to the here disclosed method.
Implementation can optionally include one or more of the following features. In the Heinze-1994 and RELATED APPLICATION ontology, the surface forms that describe the concepts are composed of a finite set of semantic categories. For example, “rib fracture due to blunt trauma” would be decomposed to diagnosis(diagnosis(anatomicLocation(rib) and morphologicalAbnormality(fracture)) and environmentalCause(environmentalCause(trauma) modifier(blunt))). Using the unconstrained surface form “rib fracture due to blunt trauma” would likely produce low accuracy results in terms of recall (retrieving all the documents containing the concept) and precision (retrieving only the documents containing the concept). The RELATED APPLICATION surface form ontology representation can, however, be automatically translated into an indexed NLP query consisting of grammatical operators and/or traditional operators by assigning to each surface form ontology component a mapping to one or more grammatical operators and/or traditional operators. Mapping types include but are not limited to: 1) surface form A must occur within a single phrase of optional type X; 2) surface form A.1 to A.n must each appear within a single phrase of optional type X and must all occur within a single clause of optional type Y; 3) surface form A must follow/precede/co-occur with surface form B within a clause; 4) surface form A must follow/precede/co-occur with surface form B within a clause without the occurrence of surface form C between; 5) surface form A and surface form B must occur within N contiguous sentences within the same paragraph; 6) surface form A and surface form B must occur within the same paragraph; 7) surface form A and surface form B must occur within the same section; 8) surface form A and surface form B must occur within the same document; 9) surface form A and surface form B must occur within documents that are both indexed to surface form C; where surface form may be a surface form component, a set of surface form components, a surface form, or a set of surface forms as specified by the ontology, or a construct of surface form components, surface forms or surface form sets constructed with the grammatical operators and/or traditional operators.
By associating surface form components and surface forms in the ontology with particular grammatical operators and traditional operators, surface form ontology representations are automatically translated to indexed NLP queries for information retrieval and extraction.
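The following sketch illustrates one such automatic translation for the “rib fracture due to blunt trauma” example, applying mapping type 2 above (leaf components grouped into phrases, composite components grouped into a clause). The nested-tuple ontology representation, the structural rule used here in place of a per-component mapping table, and the emitted query syntax are simplifying assumptions made for the sketch.

    # Translate a surface form ontology representation into a grammatical-operator query.
    surface_form = ("diagnosis", [
        ("diagnosis", [("anatomicLocation", "rib"),
                       ("morphologicalAbnormality", "fracture")]),
        ("environmentalCause", [("environmentalCause", "trauma"),
                                ("modifier", "blunt")]),
    ])

    def to_query(node):
        category, body = node          # category is carried but, in this simplified
        if isinstance(body, str):      # rule, the operator is chosen by structure only
            return body                # leaf: a surface form term
        leaves_only = all(isinstance(child[1], str) for child in body)
        op = "phraseConstraint" if leaves_only else "clauseConstraint"
        return f"{op}(null, and({', '.join(to_query(child) for child in body)}))"

    print(to_query(surface_form))
    # clauseConstraint(null, and(phraseConstraint(null, and(rib, fracture)),
    #                            phraseConstraint(null, and(trauma, blunt))))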
Indexed Natural Language Processing System Design
Computer system 150 includes a central processing unit (CPU) 152 executing a suitable operating system (OS) 154 (e.g., Windows® OS, Apple® OS, UNIX, LINUX, etc.), storage device 160 and memory device 162. The computer system can optionally include other peripheral devices, such as input device 164 and display device 166. Storage device 160 can include nonvolatile storage units such as a read only memory (ROM), a CD-ROM, a programmable ROM (PROM), erasable programmable ROM (EPROM) and a hard drive. Memory device 162 can include volatile memory units such as random access memory (RAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM) and double data rate-synchronous DRAM (DDRAM). Input device 164 can include a keyboard, a mouse, a touch pad and other suitable user interface devices. Display device 166 can include a Cathode-Ray Tube (CRT) monitor, a liquid-crystal display (LCD) monitor, or other suitable display devices. Other suitable computer components such as input/output devices can be included in or attached to computer system 150.
In some implementations, indexed NLP system 100 is implemented as a web application (not shown) maintained on a network server (not shown) such as a web server. Indexed NLP system 100 can be implemented as other suitable web/network-based applications using any suitable web/network-based computer programming languages. For example, Java, C/C++, Active Server Pages (ASP), or Java Applets can be used. When implemented as a web application, multiple end users are able to simultaneously access and interface with indexed NLP system 100 without having to maintain individual copies on each end user computer. In some implementations, indexed NLP system 100 is implemented as a local application executing in a local end user computer or as client-server modules, either of which may be implemented in any suitable programming language, environment or as a hardware device with the application's logic embedded in the logic circuit design or stored in memory such as PROM, EPROM, Flash, etc.
In some implementations, indexed NLP system 100 is implemented as a distributed system across multiple instances of computer system 150, each of which may contain zero or more of source document index unit 130, query unit 109, source data storage 140, ontology data storage 120, and source data index 145, in which implementation communications links 113, 114, 115, 116 and 118 will, as needed, be web application communications links between the required instances of computer system 150.
Traditional Operator Indexing Application
Traditional operator indexing application 132 may be any competent indexing application or set of applications that may include but are not limited to inverted-indexing, tree or graph search, hashing, etc. and may include, but is not limited to, features such as term indexing, multi-word indexing, stop wording, stemming, lemmatization, and case normalization.
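For illustration, a minimal traditional inverted-index builder with case normalization, stop-word removal and a crude suffix-stripping stemmer might look as follows; a deployed system would more likely use an established IR library, and the token pattern, stop-word list and stemmer here are placeholder assumptions.

    # Build a simple inverted index: term -> [(doc_id, begin offset, end offset), ...]
    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}

    def crude_stem(token):
        # Placeholder stemmer: strips a few common English suffixes.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def build_inverted_index(docs):
        """docs: {doc_id: text}.  Case-normalized, stop-worded, stemmed index."""
        index = defaultdict(list)
        for doc_id, text in docs.items():
            for match in re.finditer(r"[A-Za-z0-9]+", text):
                token = match.group(0).lower()
                if token in STOP_WORDS:
                    continue
                index[crude_stem(token)].append((doc_id, match.start(), match.end()))
        return index

    docs = {"doc-1": "Healing rib fractures noted in the assessment."}
    print(build_inverted_index(docs)["fracture"])   # [('doc-1', 12, 21)]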
Grammar Operator Indexing Application
Grammatical Analysis System Algorithm
The input text is first normalized, which includes converting the character encoding to Unicode. The normalization process also includes annotating the byte offsets of the beginning and ending of document sections, headings, white space, terms and punctuation so that any mappings to ontology data 122, or specifically to surface form data 124, can be mapped back to the original location in documents 142.
The normalized input text is morphologically processed at 204 by morphing the words, numbers, acronyms, etc. in the input text to one or more predetermined standardized formats. Morphological processing can include stemming, normalizing units of measure to desired standards (e.g. SAE to metric or vice versa) and contextually based expansion of acronyms. The normalized and morphologically processed input text is processed to identify and normalize special words or phrases at 206. Special words or phrases that may need normalizing can include words or phrases of various types such as temporal and spatial descriptions, medication dosages, or other application dependent phrasing. In medical texts, for example, a temporal phrase such as “a week ago last Thursday” can be normalized to a specific number of days (e.g., seven days) and an indication that it is past time.
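A minimal sketch of such temporal normalization, covering only two phrase patterns and assuming a supplied reference date, is shown below; the patterns, the returned (day count, tense) convention and the function name are assumptions made for the sketch.

    # Normalize a small set of temporal phrases to a signed day count.
    import re
    from datetime import date

    WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]

    def normalize_temporal(phrase, today):
        """Return (days_before_today, 'past') for the handled phrase patterns."""
        p = phrase.lower().strip()
        m = re.fullmatch(r"(\d+|a|one|two|three) weeks? ago", p)
        if m:
            counts = {"a": 1, "one": 1, "two": 2, "three": 3}
            n = counts.get(m.group(1)) or int(m.group(1))
            return n * 7, "past"
        m = re.fullmatch(r"last (\w+)", p)
        if m and m.group(1) in WEEKDAYS:
            target = WEEKDAYS.index(m.group(1))
            back = (today.weekday() - target) % 7 or 7   # at least one day back
            return back, "past"
        return None

    print(normalize_temporal("a week ago", date(2013, 5, 13)))      # (7, 'past')
    print(normalize_temporal("last thursday", date(2013, 5, 13)))   # (4, 'past')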
At 208, the grammatical analysis system 134 is implemented to perform syntax parse 208 of the normalized input text and identify the part-of-speech of each term and punctuation, the scope of phrases, the scope of clauses, and the syntactic features of each including but not limited to phrase heads and dependencies. The syntax parse data are stored as annotations for use in ensuing processes. In some implementations, the data structure for representing the annotations includes arrays, trees, graphs, stacks, heaps or other suitable data structure that maintains a view of the generated annotations that can be mapped back to the location of the annotated item in source documents 142. Annotation data 147 produced by grammatical analysis system 134 are stored in annotation data storage 145.
As a refinement to the annotations produced by syntax parse 208, identify scope 210 produces further annotation data 147 that identifies the syntactic scope within which terms and punctuation may be combined for attempted mapping to the ontology data 122 as surface form data 124 by grammar operator query application 110 and traditional operator query application 111.
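One possible annotation record, keeping begin/end byte offsets so every annotation can be mapped back to its location in source documents 142, is sketched below; the field names, the feature dictionary and the example “negation” scope are illustrative assumptions only.

    # Annotation records for syntax parse and scope output, mappable back to source text.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Annotation:
        doc_id: str
        kind: str                 # "token", "phrase", "clause", "scope", ...
        gram_type: str            # e.g. part-of-speech or phrase/clause/scope type
        begin: int                # begin byte offset in the source document
        end: int                  # end byte offset in the source document
        features: Dict[str, str] = field(default_factory=dict)

    annotations: List[Annotation] = [
        Annotation("doc-1", "token",  "NN",          150, 153, {"head_of": "np-1"}),
        Annotation("doc-1", "token",  "NN",          154, 162),
        Annotation("doc-1", "phrase", "noun phrase", 150, 170, {"id": "np-1"}),
        Annotation("doc-1", "scope",  "negation",    140, 170),
    ]

    def covering(annotations, doc_id, offset):
        """Annotations covering a byte offset, e.g. for mapping a hit back to text."""
        return [a for a in annotations
                if a.doc_id == doc_id and a.begin <= offset < a.end]

    print([a.kind for a in covering(annotations, "doc-1", 155)])
    # ['token', 'phrase', 'scope']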
Grammar Indexing System Algorithm
Annotations produced by grammatical analysis system 134 are converted to indexes by grammar indexing system 138 and are stored in source data index 145 as index data 147. Index data 147 may be implemented with any competent indexing or look-up methodology, including but not limited to inverted-index, hashing, or graph or tree structures. Grammar indexing system 138 uses the annotations from grammatical analysis system 134 to create index data 147 of one or more of the following grammar constraint types in source data index 145:
1. tokenConstraint,
2. phraseConstraint,
3. clauseConstraint,
4. sentenceConstraint,
5. paragraphConstraint,
6. sectionConstraint,
7. source/typeConstraint,
8. scopeConstraint
each (1-8) relating to the indexing of location (begin/end byte offset and document of documents 142) in index data 147 and, as applicable, being constrained by being indexed in index data 147 to the grammatical type (e.g. part-of-speech, syntactic category, etc.) of each occurrence in documents 142.
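A minimal sketch of how grammar indexing system 138 might convert annotation tuples into index data keyed by the constraint types listed above is given below; the annotation tuple layout and the kind-to-constraint mapping are assumptions made for the sketch.

    # Convert annotations into index data keyed by grammar constraint type.
    from collections import defaultdict

    KIND_TO_CONSTRAINT = {
        "token":     "tokenConstraint",
        "phrase":    "phraseConstraint",
        "clause":    "clauseConstraint",
        "sentence":  "sentenceConstraint",
        "paragraph": "paragraphConstraint",
        "section":   "sectionConstraint",
        "doc_type":  "source/typeConstraint",
        "scope":     "scopeConstraint",
    }

    def build_constraint_index(annotations):
        """index[constraint][gram_type] -> [(doc_id, begin, end), ...]"""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, kind, gram_type, begin, end in annotations:
            index[KIND_TO_CONSTRAINT[kind]][gram_type].append((doc_id, begin, end))
        return index

    annotations = [
        ("doc-1", "token",   "NN",          150, 153),
        ("doc-1", "phrase",  "noun phrase", 150, 170),
        ("doc-1", "section", "assessment",  100, 300),
    ]
    idx = build_constraint_index(annotations)
    print(idx["sectionConstraint"]["assessment"])   # [('doc-1', 100, 300)]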
Traditional Operator Query Application Algorithms
Traditional operator query application 111 algorithms include but are not limited to Boolean, Fuzzy Set, term order and term proximity operators.
Traditional operator query application 111 algorithms are implemented such that traditional operator query application 111 and grammar operator query application 110 can interact in a manner that permits the intermingling and interaction of traditional and grammar operators in query unit 109.
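For illustration, two traditional operators, document-level Boolean and() and a byte-offset proximity operator, can be sketched over the same posting representation used by the grammar operators above, which is what allows the two operator families to intermingle; the function names and sample postings are hypothetical.

    # Traditional operators over postings lists of (doc_id, begin, end) tuples.
    def boolean_and(postings_a, postings_b):
        """Documents containing at least one occurrence of each term."""
        return {d for (d, _, _) in postings_a} & {d for (d, _, _) in postings_b}

    def proximity(postings_a, postings_b, max_gap):
        """Pairs where B starts within max_gap bytes after A ends (A before B)."""
        return [((d1, b1, e1), (d2, b2, e2))
                for (d1, b1, e1) in postings_a
                for (d2, b2, e2) in postings_b
                if d1 == d2 and 0 <= b2 - e1 <= max_gap]

    rib      = [("doc-1", 150, 153), ("doc-2", 40, 43)]
    fracture = [("doc-1", 154, 162)]

    print(boolean_and(rib, fracture))    # {'doc-1'}
    print(proximity(rib, fracture, 5))   # [(('doc-1', 150, 153), ('doc-1', 154, 162))]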
Grammar Operator Query Application Algorithm
Grammar operator query application 110 implements grammatical operators that include but are not limited to:
1. surface form A must occur within a single phrase;
2. surface form A.1 to A.n must each appear within a single phrase and must all occur within a single clause;
3. surface form A must follow/precede/co-occur with surface form B within a clause;
4. surface form A must follow/precede/co-occur with surface form B within a clause without the occurrence of surface form C between;
5. surface form A and surface form B must occur within N contiguous sentences within the same paragraph;
6. surface form A and surface form B must occur within the same paragraph;
7. surface form A and surface form B must occur within the same section;
8. surface form A and surface form B must occur within the same document;
9. surface form A and surface form B must occur within documents that are both indexed to surface form C;
1. Where surface form may be
1.a. a surface form component,
1.b. a surface form,
1.c. a set of surface forms as specified in ontology surface form data 124, or
1.d. a construct of surface form components, surface forms or sets of surface forms constructed with grammatical operators, traditional operators, or some combination of the two, and
2. Where surface form is mapped to specific locations in documents 142 by query unit 109 using index data 147, and
3. Where surface form may be constrained by specification of some grammar constraint type as indexed in index data 147 by grammar indexing system 138.
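As an illustrative sketch, grammatical operator 4 above (surface form A must precede surface form B within a clause without surface form C occurring between them) can be evaluated over clause spans and occurrence spans as follows; the helper name and the sample spans are assumptions.

    # Operator 4: A precedes B within a single clause, with no C between them.
    def a_before_b_without_c(a_occ, b_occ, c_occ, clauses):
        """Occurrences are (doc_id, begin, end) tuples; clauses are clause spans."""
        hits = []
        for (cd, cb, ce) in clauses:
            a_in = [(b, e) for (d, b, e) in a_occ if d == cd and cb <= b and e <= ce]
            b_in = [(b, e) for (d, b, e) in b_occ if d == cd and cb <= b and e <= ce]
            c_in = [(b, e) for (d, b, e) in c_occ if d == cd and cb <= b and e <= ce]
            for (ab, ae) in a_in:
                for (bb, be) in b_in:
                    if ae <= bb and not any(ae <= xb and xe <= bb for (xb, xe) in c_in):
                        hits.append((cd, ab, be))
        return hits

    clauses  = [("doc-1", 100, 300)]
    denies   = [("doc-1", 110, 116)]   # surface form C, e.g. "denies"
    rib      = [("doc-1", 150, 153)]
    fracture = [("doc-1", 154, 162)]

    print(a_before_b_without_c(rib, fracture, denies, clauses))
    # [('doc-1', 150, 162)] -- "rib" precedes "fracture" with no "denies" between them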
Query Generator Application Algorithm
Query generator application 112 receives surface form data 124 and relational data 128 from ontology data 122. In ontology data 122, surface form data 124 is composed of a finite set of surface form semantic categories that may optionally be organized in a taxonomy. The surface form semantic categories that are chosen are application-specific. For clinical medicine, the surface form semantic categories include but are not limited to:
1. Finding
1.a. Disease
1.b. Abnormality
1.c. Measurement
1.d. Substance
1.d.i. Medication
1.d.ii. Environmental substance or artifact
1.d.iii. Bodily substance
1.d.iv. Medical artifact
1.e. Procedure
2. Anatomic entity
3. Modifier
3.a. Spatial relation modifier
3.b. Other modifiers
3.b.i. Certainty
3.b.ii. Severity
3.b.iii. Reporting source
3.b.iv. Timing
3.b.v. Ordinal
3.b.vi. Cardinality
such that each term in each surface form in surface form data 124 is designated by the surface form semantic category in which said term functions in each surface form, and
each surface form semantic category is linked in relational data 128 to some grammar constraint type, and
each term in each surface form in surface form data 124 is related in relational data 128 to each other term in the same surface form with which it shares one or more relations as specified by grammar constraint type.
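A minimal sketch of how the query generator might consume such data is given below: each term carries its semantic category designation, pairwise relations between terms are labelled with a grammar constraint type, and one constraint is emitted per relation. The data layout, the particular constraint assignments and the emitted query syntax are assumptions made for the sketch.

    # Sketch of surface form data (term -> category) and relational data
    # (term pairs labelled with a grammar constraint type) feeding query generation.
    SURFACE_FORM_TERMS = {             # category designation per term (surface form data)
        "rib":      "Anatomic entity",
        "fracture": "Abnormality",
        "blunt":    "Modifier",
        "trauma":   "Abnormality",
    }

    RELATIONAL_DATA = [                # (term A, term B, grammar constraint type)
        ("rib",      "fracture", "phraseConstraint"),
        ("blunt",    "trauma",   "phraseConstraint"),
        ("fracture", "trauma",   "clauseConstraint"),
    ]

    def generate_query(relations):
        """Emit one grammatical constraint per relation and conjoin them."""
        parts = [f"{constraint}(null, and({a}, {b}))" for a, b, constraint in relations]
        return "and(" + ", ".join(parts) + ")"

    print(generate_query(RELATIONAL_DATA))
    # and(phraseConstraint(null, and(rib, fracture)),
    #     phraseConstraint(null, and(blunt, trauma)),
    #     clauseConstraint(null, and(fracture, trauma)))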
Computer Implementations
In some implementations, the techniques for implementing indexed NLP as described in the foregoing are implemented as computer executable code.
In some implementations, the computer executable code may include multiple portions or modules, with each portion designed to perform a specific function described in the foregoing.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer, including graphics processors, such as a GPU. Generally, the processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the claims. Accordingly, other embodiments are within the scope of the following claims.
This application claims priority under 35 USC §119(e) to U.S. Patent Application Ser. No. 61/822,597, filed on May 13, 2013, the entire contents of which are hereby incorporated by reference.