The invention relates to the field of automated scoring of patient notes.
Patient notes are used in the medical field to record information received from a patient during consultations with medical professionals. Developing patient note taking skills and assessing student readiness to enter medical practice are vital. There is a need for more efficient ways to develop patient note taking skills.
Aspects of the invention include methods of generating models for scoring patient notes. The methods include receiving a sample of patient notes, extracting a plurality of ngrams from the sample of patient notes, clustering the plurality of extracted ngrams that meet a similarity threshold into a plurality of lists, identifying a feature associated with each of the plurality of lists based on the ngrams in that list, and designating at least one ngram in each list as evidence of the feature associated with that list. The identified features and designated ngrams are stored in models for scoring patient notes.
Further aspects of the invention include methods of scoring patient notes. The methods include producing a scoring model based on at least one feature and acceptable evidence that indicates the at least one feature in a patient note, determining for each patient note in a set of patient notes whether at least one piece of acceptable evidence that indicates the at least one feature is present in each patient note, and scoring each patient note in the set of patient notes based upon the determined presence of acceptable evidence in each patient note.
The invention is best understood from the following detailed description when read in connection with the accompanying drawings, with like elements having the same reference numerals. It is emphasized that, according to common practice, the various features of the drawings are not drawn to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures:
Writing patient notes is one of the basic activities that medical students, residents, and physicians perform. Patient notes are composed by medical professionals after a patient encounter in a clinic, office, or emergency department. They are often considered to comprise four parts, denoted Subjective, Objective, Assessment, and Plan (as a result they are often called SOAP notes). Patient notes are the main tools used by medical professionals to communicate information about patients to colleagues. The ability to write patient notes effectively is one of the key skills assessed before medical students are licensed to practice medicine. The Step 2 Clinical Skills (“CS”) assessments, which form part of the United States Medical Licensing Examination (“USMLE”), place test takers in simulated encounters with standardized patients and require them to take a history, perform a physical examination, determine differential diagnoses, and then write a patient note based on their observations. The patient notes compiled by examinees are expected to contain four elements, following the structure of SOAP notes: Pertinent History (subjective), Physical Examination (objective), Differential Diagnoses (assessments), and potential Diagnostic Studies (plan). The characteristics of these elements are detailed below.
The notes composed by test takers are then rated by trained physicians. For each case, a scoring guide is made available to raters to facilitate manual scoring of these notes. In the current context, scoring involves assigning a score from 1 to 9 to each note. So far, only trained physicians have been involved in scoring notes. Scoring is a time-intensive process, not only in terms of the rating itself, but also due to the case-specific training required. As a result, a Computer-Assisted Patient Note Scoring system (“CAPTNS”) has been developed. CAPTNS uses Natural Language Processing (“NLP”) technology to facilitate rating by suggesting important features that should be present in a note, together with lists of textual evidence (ngrams) whose occurrence in a note implies that the test taker has included a particular feature in that note. The term acceptable evidence is used to denote this type of linguistic material. Information on the occurrence of features suggested by CAPTNS and the acceptable evidence for those features may then be used to produce feature vectors for each note, which, in turn, may be used to build automated rating models. The final score assigned to a patient note may be derived from information about the presence or absence of the identified features in the note.
Patient notes refer to records of individual physician-patient encounters. They are parts of problem-oriented medical records, whose purpose is to preserve the data in an easily accessible format that encourages ongoing assessment and revision of health care plans by all members of the health care team. They are used to communicate information on individual encounters among practitioners. Patient notes are sometimes called SOAP notes because they should consist of four elements: subjective, objective, assessment, and plan. These four elements may have different names, depending on the requirements. In USMLE Step 2 CS, they are translated into History, Physical Examination, Differential Diagnoses, and Diagnostic Studies. USMLE states that such notes are similar to those which physicians have to compose in different settings.
In general, for patient notes completed by physicians, the History part is expected to contain the chief complaint(s) of the patient and further information on the chief complaint(s), which can include information about its onset, anatomical location, duration, character, alleviating factors, or radiation (other locations that it affects). The History should also include a review of any significant positives and negatives (the presence or absence of other factors relevant to the diagnosis), details of medications taken, and pertinent family, social, and medical history. The rule of thumb is that physicians should record any information that they deem important for the diagnosis and treatment plan of the case. This allowance means that several alternative structures of the History element may be acceptable, and, in the case of USMLE, the scoring guidelines do not channel test takers toward any specific structure, as long as important information is recorded. Nor do the guidelines prescribe the vocabulary and writing styles to be used in the notes, as long as they are understandable by fellow physicians. Although this is a flexible approach, similar in philosophy to the Unified Medical Language System (“UMLS”), which allows multiple world views, it creates challenges in processing patient notes. Specifically, the processing engines have to be flexible enough to accept considerable variability in vocabulary and style.
The language used in patient notes written by USMLE Step 2 CS test takers was analyzed. In the current context, the test takers are training to be licensed physicians, and test administrators consider the notes to be similar to those composed by physicians after their encounters with patients in different clinical settings. It is thus hypothesized that these notes belong to the same sub-language as a wide variety of other types of patient note. The aim of this analysis is to understand the linguistic features of the notes and to guide the development of the NLP engine accordingly.
Within the four elements of the note, examinees employ free text. The History field is used to summarize information received by interviewing the patient during the encounter. It contains information about different aspects of the history, such as the history of the present illness; medical, surgical, family, and sexual histories; or a summary of the patient's social life/lifestyle. It has been observed that test-takers use several ways to structure these aspects. For the purpose of linguistic analysis, the discourse structure of the History field is considered to comprise 11 possible segments:
Each of these segments can be recognized by a set of discourse markers. For example, the chief complaint often appears under the heading “CC:”, or in the pattern “the patient/he/she complains of”. Significant negatives (indications of the absence of factors) are often marked with “Denies”, or “(−)”. These discourse markers will be utilized to classify features that will help provide some context to reviewers. It is noted that the presence and the order of occurrence of the eleven listed segments of the History varies over notes of equal quality.
Information presented in the Physical examination, which summarizes findings obtained during the physical examination of standardized patients by test takers, is more structured. Each of the procedures carried out and the findings observed are often preceded in the note by a heading, such as Vital (for Vital signs), GA (for general appearance), and so on. For the purpose of linguistic analysis, the Physical Examination element is considered to comprise ten parts:
As in the case of the History, the presence and the order of occurrence of the ten listed segments of the Physical Examination varies even among notes of equal quality.
Differential Diagnoses and Workups are lists of free-text items indicating possible diagnoses derived from findings observed in the History and Physical Examination, together with tests intended to investigate those possible diagnoses. As a result, each element of these lists is considered separately rather than as comprising discourse sub-units within the segments.
In patient notes, sentences with the standard structure typical of other types of written English rarely appear. When they do, they usually occur in the History part of the note. Other “sentences” often appear incomplete, with frequent elision of information that readers are expected to infer pragmatically. As a result, such sentences often comprise lists of significant positives or negatives (1), short statements (2), or a combination of different structures (3). These examples are all extracted from the same note.
(1) Denies melena, hematochezia, hematuria, dysuria, easy bruising/bleeding, difficulty initiating urination, fever, or weight loss
(2) PSH: kidney operation 26 yrs ago
(3) SH: lives with wife, worked at same company 30 yrs, no tobacco, occasionally drinks 6 pack/day.
Empirical analysis indicates that the syntactic form in which these facts are expressed does not affect the scores; as a result, syntactic analysis is not performed by CAPTNS and does not contribute to the computer-assisted scoring methodology. Instead, the use of simple heuristics and pattern matching is motivated for the purpose of extracting information from the notes.
The vocabulary of the patient notes can be considered the most difficult problem to address, from the NLP point of view. There are three main issues that the NLP engine needs to address: the use of abbreviations, the use of both specialized medical terms and everyday language, and the occurrence of typographical errors.
Abbreviations are used extensively and there is variation in the choice of abbreviation among authors. The notes assessed demonstrate the use of both conventional abbreviations, and the ad-hoc invention of abbreviated forms.
To illustrate, consider (4) and (5), which present the findings of two types of examination presented in patient notes composed by two different examinees.
(4) Cardiac: RRR, no MRG pulm: ctab.
(5) Pulm: clear to auscultation bilaterally
CVS: regular rate and rhythm, no murmurs, rubs, or gallops
The two notes differ in the choice of reference to the examination. The reference in (4) is more specific than that in (5), denoting cardiac examination as opposed to examination of the cardiovascular system (CVS). In both cases, identical findings are noted. In (4), regular rate and rhythm is abbreviated as RRR, murmurs rubs and gallops is abbreviated as MRG, and clear to auscultation bilaterally is abbreviated as ctab. In general, for notes composed in this context, the guidelines provided to Step 2 CS test takers encourage the use of abbreviation in patient notes (the guidelines provide a table of standard abbreviations that examinees might use and imply that more may be employed where necessary), but strict adherence to a particular methodology for abbreviation is not expected. This kind of heterogeneity in the language of patient notes is a characteristic of those produced by medical professionals in their daily work.
Further evidence is provided by the variety of means used to express the fact that a patient is alert and oriented to person, place and time (i.e. is aware of who they are, where they are, and the current day/date/time). It can be entered as: A/O * 3; A/O x 3; AxOx3; AAOx3; aao x 3; aaox3; AAO X 3; A&Ox3; A+Ox3; aao x 3; ao x 3; or AO to person place date, etc. Similar heterogeneity is apparent in references to examinations. For example, examination of the extremities may be indicated by Extrem-; Extrems; extreme; Extr; ext; Ext; LE; etc.
One additional method of abbreviation is the use of the hyphen in patient notes. It is often used to separate values from attributes (e.g. bp-; VS-WNL; abd-soft); ranges of anatomical locations (e.g. CN II-XII; CN's 2-12; CN2-12), measures (4-5 lbs), or frequencies (e.g. 1-2×/month); to modify findings (CTA-B); and finally in its traditional role to create compound modifiers and compound nouns (y-o; 6-pack).
Spelling mistakes and typographical errors are very common in patient notes (e.g. asymetry; apparetn; alteast for at least; AAAx3 for AAOx3; bowie movements; esophogastroduodenoscopy for esophagogastroduodenoscopy; hennt for heent; surgicla; quite smoking; anemai for anemia, etc.), and scorers are instructed to tolerate these typos as long as the meaning of the note is clear. As a result, the NLP engine is expected to process this type of ill-formed input.
Finally, it is noted that medical and everyday terms are used interchangeably in the sample of patient notes (e.g. rhinorrhea: runny nose, Blood in stool: melena or hematochezia (clinically melena and hematochezia are different, with melena being a sign of hematochezia, but the analysis of the notes and scores indicate that both of them are acceptable), etc.). The NLP engine is therefore designed to be equally effective when presented with such stylistic variation.
The empirical analysis suggests that the language used in patient notes is unique, and as well as offering a challenge to existing tokenization systems, other linguistic tools such as part-of-speech taggers, and parsers are unlikely to be effective. Further, the linguistic features commonly used in other computer-assisted scoring systems will not be suitable for the task. Instead, a new method needs to be developed to deal effectively with the language of patient notes.
Investigation of the contents of notes and their corresponding scores revealed that there is a correlation between the scores assigned to notes and the occurrence of a number of distinctive features (important facts) present in them. For example, given the case of a 52 year-old man with significant depression, the distinctive features are weight loss, appetite, low energy and interest, low libido, diarrhea, constipation, no stool blood, normal family history, no chronic illness, medication, ETOH, CAGE, and suicidal ideation. These features, in turn, can be expressed in various ways, for example weight loss as wt loss, weigh loss, etc. The empirical analysis of such example cases led to the formulation of a strategy for computer-assisted patient note scoring in which scores are assigned automatically on the basis of the occurrence in the notes of a set of features specific to each case. The requirement to build a specific set of features for each case can pose a barrier to computer-assisted scoring, because the benefit of computer-assisted scoring is inversely proportional to the time required to build each set of features. One method by which this obstacle can be overcome is to automate, as far as possible, the process of determining both the distinctive features and the sets of acceptable evidence that confirm the presence of those features in a patient note. The invention automatically extracts a suggested list of features (e.g. weight loss) and acceptable evidence (e.g. wt loss, weigh loss, etc.) that confirms the presence of each feature in a note. This list is then reviewed by human experts. The process of automatically producing the suggested lists of features and acceptable evidence, as well as the human review of these lists, is described below.
Referring to
At block 100, a sample of patient notes is received. As used herein, a “sample” of patient notes refers to a number of patient notes that are used to generate a scoring model for scoring patient notes. In an embodiment, the sample of patient notes is not scored in subsequent scoring steps. The number of patient notes in the sample may be any number of patient notes adequate to generate the scoring model. In an embodiment, the sample of patient notes includes about 300 patient notes. In one embodiment, the sample of patient notes includes about 30% of all patient notes generated in a study. The sample of patient notes may be received by a computer processor to be further analyzed and processed in generating the scoring model.
At block 102, ngrams are extracted from the sample of patient notes. The generation of lists of features and evidence associated with the features is based on identification of ngrams rather than other linguistic units due to the unique nature of patient notes. As used herein, “ngrams” refers to units of language in a patient note. Patient notes typically include abbreviations, spelling mistakes, incomplete sentences, etc. These characteristics can create difficulty for other linguistic processes such as part-of-speech tagging, parsing, etc.
At block 104, tokenization is performed to separate the contents of the patient notes into units (excluding sentence tagging), and the units that can be reliably identified are ngrams. In an embodiment, the unit length of each ngram is 1 ≤ n ≤ 5. Other suitable unit lengths will be understood by one of skill in the art from the description herein. In one embodiment, the text of patient notes is split into units/chunks using a set of boundary markers that often signify the boundary of a fact, including semicolons, commas, and slashes. Each unit/chunk is treated as an ngram. Referring to
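The following is a minimal Python sketch of the chunking and ngram extraction described above. The boundary-marker set and the deduplication behaviour are assumptions for illustration; the operational marker list may differ.

```python
import re

BOUNDARY_MARKERS = r"[;,/]"  # assumed marker set: semicolons, commas, and slashes

def extract_ngrams(note_text, max_n=5):
    """Split a note into chunks at boundary markers, then emit word ngrams with 1 <= n <= max_n.
    Each whole chunk is also treated as an ngram, per the description above."""
    ngrams = []
    for chunk in re.split(BOUNDARY_MARKERS, note_text):
        words = chunk.split()
        if not words:
            continue
        ngrams.append(" ".join(words))  # the chunk itself
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                ngrams.append(" ".join(words[i:i + n]))
    return list(dict.fromkeys(ngrams))  # drop duplicates, keep order

# Example: extract_ngrams("Denies melena, hematochezia, hematuria")
```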
At block 106, linguistic filters are applied to the ngrams. The filtering process is intended to ensure that the lists of features and evidence presented to reviewers are as readable as possible, so that the review process is not adversely affected. Such filters are applied to remove ngrams that consist of sequences of function words, punctuation, incomplete syntactic constituents, etc. In one embodiment, the ngrams are cross-referenced with a list of ngrams known to be not useful (e.g., “and”, “or”, “not”, “patient”, etc.). The ngrams may be cross-referenced with a list of starting words known to indicate incomplete syntactic constituents. For example, ngrams that start with “or” or “and” tend to be incomplete syntactic constituents (e.g., “or smoking,” “and fever,” etc.). The ngrams may also be cross-referenced with a list of endings known to indicate incomplete syntactic constituents. For example, ngrams that end with “and better” or “described as” tend to be incomplete syntactic constituents (e.g., “exercise and better,” “pain described as,” etc.). Other suitable linguistic filtering processes will be understood by one of skill in the art from the description herein. Referring back to
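A short sketch of the filtering step, assuming illustrative stoplists; the operational lists of uninformative ngrams, starts, and endings would be larger and derived empirically.

```python
NOT_USEFUL = {"and", "or", "not", "patient"}          # assumed stoplist
BAD_STARTS = ("or ", "and ")                          # starts indicating incomplete constituents
BAD_ENDS = (" and better", " described as")           # endings indicating incomplete constituents

def passes_filters(ngram):
    """Keep only ngrams that are not on the stoplist and do not look like incomplete constituents."""
    g = ngram.lower().strip()
    if g in NOT_USEFUL:
        return False
    if g.startswith(BAD_STARTS) or g.endswith(BAD_ENDS):
        return False
    return True

# [g for g in ["wt loss", "or smoking", "pain described as"] if passes_filters(g)] -> ["wt loss"]
```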
At block 108, the remaining ngrams are extracted and then clustered into a plurality of lists. Blocks 110-122 describe sub-steps of the clustering step of block 108 in accordance with aspects of the invention. Ngrams that may be grouped into a single conceptual unit, feature, and/or other information are identified and clustered by calculating the similarity between the ngrams. The similarity between two ngrams may be calculated using a combination of character-based edit distance (to cover typos and spelling variations), Wordnet edit distance (to account for general language variation), and UMLS edit distance (to deal with medical language variation).
In an embodiment, character-based edit distance is calculated at block 112. Character-based edit distance may be calculated as the Levenshtein distance between two ngrams, in which edits are performed at the character level.
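A minimal sketch of the character-level Levenshtein computation; the dynamic-programming formulation below is standard and uses unit costs for insertion, deletion, and substitution.

```python
def char_edit_distance(a, b):
    """Levenshtein distance at the character level (insert/delete/substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# char_edit_distance("hematochezia", "hematochesia") -> 1, covering a single-character typo
```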
At block 114, edit distance (e.g., Wordnet edit distance) between ngrams is calculated. Wordnet similarity may be calculated based on Levenshtein distance between two ngrams, in which edits are performed at the word level, and the cost of replacement of one word by another is equal to the normalised distances between the two words according to Wordnet.
In one embodiment, Wordnet similarity SWN(N1,N2), between ngrams N1 and N2, is calculated as:
SWN(N1,N2) = WED(N1,N2) / Max(Length(N1), Length(N2))
Here, Length(Ni) is the number of words in ngram Ni. WED(N1,N2) is calculated as the (weighted) number of word-level edit operations (insertion, deletion, or substitution) needed to convert N1 into N2. The weight of the substitution operation of word W1 by word W2 is calculated as 1−Ws(W1,W2). The weights of the insertion and deletion operations are both 1. The Wordnet similarity, Ws (W1,W2), between words W1 and W2, is calculated as:
Ws(W1,W2) = 1 if W1 = W2 or if W1 and W2 are synonyms
Otherwise:
Ws(W1,W2) = 1 / (1 + Min(Ds(W1,W2)))
Here, Min(Ds(W1,W2)) is the minimum path length from W1 to W2 that is formed by relations between senses encoded in WordNet.
Example:
N1=“sick contacts”; N2=“ill contacts”
Min(Ds(“sick”, “ill”))=0; because “sick” and “ill” share the sense 302541302 (affected by an impairment of normal physical or mental function). This means that Ws(“sick”, “ill”)=1; WED(“sick contacts”, “ill contacts”)=0; and SWN(“sick contacts”, “ill contacts”)=0.
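The sketch below follows the SWN definition above, using NLTK's WordNet interface; the synonym test (shared synsets) and the use of Synset.shortest_path_distance for Min(Ds) are assumptions about how the path length would be obtained. Note that, as defined, SWN behaves as a normalised distance: it evaluates to 0 for the “sick contacts”/“ill contacts” pair in the worked example.

```python
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Ws(W1,W2): 1 for identical words or WordNet synonyms, else 1 / (1 + min path length)."""
    if w1 == w2:
        return 1.0
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0  # assumption: treat words unknown to WordNet as unrelated
    if set(s1) & set(s2):
        return 1.0  # the words share a sense, i.e. they are synonyms
    paths = [a.shortest_path_distance(b) for a in s1 for b in s2]
    paths = [p for p in paths if p is not None]
    return 1.0 / (1 + min(paths)) if paths else 0.0

def wordnet_distance(n1, n2):
    """S_WN(N1,N2): word-level edit distance with substitution cost 1 - Ws, normalised by length."""
    a, b = n1.split(), n2.split()
    prev = list(range(len(b) + 1))
    for i, w1 in enumerate(a, 1):
        curr = [i]
        for j, w2 in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (1 - word_similarity(w1, w2))))
        prev = curr
    return prev[-1] / max(len(a), len(b))

# wordnet_distance("sick contacts", "ill contacts") -> 0.0, as in the worked example above
```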
In an embodiment, the Unified Medical Language System (UMLS) similarity between ngrams is calculated at block 116. Calculation of UMLS similarity exploits the concept unique identifiers (CUI) assigned to every entry in the UMLS Metathesaurus vocabulary database, the contents of which are incorporated herein by reference.
In one embodiment, UMLS similarity, SUMLS(N1,N2), between ngrams N1 and N2, is calculated as follows:
SUMLS(N1,N2) = 1 if either N1 or N2 does not contain any string that can be found in UMLS.
Otherwise:
SUMLS(N1,N2) = Min(NEDCUI(WholeCUI(N1), WholeCUI(N2)), NEDCUI(SplitCUI(N1), WholeCUI(N2)), NEDCUI(WholeCUI(N1), SplitCUI(N2)), NEDCUI(SplitCUI(N1), SplitCUI(N2)))
The term SplitCUI(N1) is derived by replacing each word in N1 with its CUI; any word that does not have a CUI remains unchanged, and if a word has multiple CUIs, it is replaced by that list of CUIs.
The term WholeCUI(N1) is produced by replacing each word in the longest matching string in N1 with its CUI. NEDCUI(S1,S2) is the normalised edit distance between the two strings S1 and S2, and is calculated as:
NEDCUI(S1,S2) = ECUI(S1,S2) / (2 × Max(Length(S1), Length(S2)))
Here, ECUI(S1,S2) is calculated as the weighted number of word-level operations (insertion, deletion, and substitution) needed to convert S1 into S2. The weights of insertion and deletion operations are both 1. The weight of the substitution of word W1 by word W2 is 0 if W1 contains CUIs of W2 or vice versa (which means that W1 can mean W2). This weight will be set to 2 if both W1 and W2 are lists of CUIs but they do not overlap. This is an empirical setting, the rationale being that if both W1 and W2 are medical terms but do not share CUIs, they are presumed to be more dissimilar than would be the case if W1 were a medical term and W2 were not (or vice versa), in which case the weight is set to 1.
Example:
N1=“viral pneumonia”
N2=“atypical pneumonia”
WholeCUI(N1) = C0032310 (the CUI of “viral pneumonia” is C0032310)
WholeCUI(N2) = C1412002 (the CUI of “atypical pneumonia” is C1412002)
SplitCUI(N1) = C0521026 C0032285 (the CUI of “viral” is C0521026; of “pneumonia”, C0032285)
SplitCUI(N2) = C0205182 C0032285 (the CUI of “atypical” is C0205182)
NEDCUI(WholeCUI(N1),WholeCUI(N2))=1
NEDCUI (WholeCUI(N1),SplitCUI(N2))=0.75
NEDCUI (SplitCUI(N1),WholeCUI(N2))=0.75
NEDCUI (SplitCUI(N1),SplitCUI(N2))=0.5
SUMLS(N1,N2)=0.5
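A partial sketch of the CUI-based computation. The dictionary below stands in for a UMLS Metathesaurus lookup (the CUIs shown are those quoted in the example); only the NEDCUI(SplitCUI(N1), SplitCUI(N2)) term is computed here, and the full SUMLS would take the minimum over the four WholeCUI/SplitCUI combinations given above.

```python
# Stand-in for a UMLS Metathesaurus lookup; a real system would query the vocabulary database.
CUI = {
    "viral pneumonia": {"C0032310"}, "atypical pneumonia": {"C1412002"},
    "viral": {"C0521026"}, "pneumonia": {"C0032285"}, "atypical": {"C0205182"},
}

def is_cui(token):
    return token.startswith("C") and token[1:].isdigit()

def split_cui(ngram):
    """SplitCUI: replace each word with its set of CUIs; unknown words stay as themselves."""
    return [CUI.get(w, {w}) for w in ngram.split()]

def sub_weight(a, b):
    """0 if the CUI sets overlap, 2 if both are medical terms that do not overlap, else 1."""
    if a & b:
        return 0
    return 2 if (any(is_cui(x) for x in a) and any(is_cui(x) for x in b)) else 1

def ned_cui(s1, s2):
    """NEDCUI(S1,S2) = ECUI(S1,S2) / (2 x Max(Length(S1), Length(S2)))."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        curr = [i]
        for j, b in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub_weight(a, b)))
        prev = curr
    return prev[-1] / (2 * max(len(s1), len(s2)))

# ned_cui(split_cui("viral pneumonia"), split_cui("atypical pneumonia")) -> 0.5, as in the example
```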
At block 118, syntactic similarity may be calculated using bag-of-words similarity. For example, there is similarity between “decreased appetite” and “appetite is decreased.”
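A bag-of-words comparison could be realised, for example, as Jaccard overlap of word sets; the specific measure is an assumption, as the description only requires an order-insensitive comparison.

```python
def bag_of_words_similarity(n1, n2):
    """Jaccard overlap between the word sets of two ngrams (order-insensitive)."""
    a, b = set(n1.lower().split()), set(n2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

# bag_of_words_similarity("decreased appetite", "appetite is decreased") -> 2/3
```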
The combination of similarities between ngrams calculated at blocks 112-118 enables recognition of orthographic, lexical, syntactic, and semantic variants, and the clustering together of ngrams whose surface forms are different but whose meanings are identical. For example, decreased appetite may be grouped with decrease in appetite (syntactic variation), misspelled or abbreviated forms of the same phrase (orthographic variation), poor appetite (semantic variation), and anorexia (lexical variation). One of skill in the art will recognize from the description herein that many combinations of calculations can be performed to assess the similarity between ngrams.
At block 120, it is determined whether the similarity between ngrams meets a similarity threshold. It is contemplated that such a determination against a similarity threshold may be made after each of the calculations performed at blocks 112, 114, 116, and/or 118. In an embodiment, the similarity threshold is adjustable. The extent of the adjustment will depend on the nature of the application. For example, to determine what percentage of patient notes contain “chest pain” as a symptom, the similarity threshold for two ngrams to be considered similar, such that they are clustered, may be set lower. As a result, the system may identify “cp,” “chestpain”, or “ch pain” as synonyms of “chest pain” but not “back pain.” For applications of the system that require higher sensitivity, the similarity threshold may be increased. For example, to identify patient notes containing “thyroid disease” as a diagnosis, one would set a higher cut-off point of similarity so that “hypothyroidism” would be grouped with “thyroid disease.” Other suitable similarity threshold adjustments will be understood by one of skill in the art from the description herein.
At block 122, ngrams that meet the set similarity threshold are grouped into lists, thus completing the clustering step of block 108.
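As one possible realisation of blocks 108-122, the sketch below greedily attaches each ngram to the first cluster whose representative is within the threshold; the specific clustering strategy, and the way the character, Wordnet, and UMLS distances are combined into a single distance function, are assumptions.

```python
def cluster_ngrams(ngrams, distance, threshold):
    """Group ngrams into lists: attach each ngram to the first cluster whose first member
    is within the similarity threshold, otherwise start a new cluster."""
    clusters = []
    for g in ngrams:
        for cluster in clusters:
            if distance(g, cluster[0]) <= threshold:
                cluster.append(g)
                break
        else:
            clusters.append([g])
    return clusters

# e.g. clusters = cluster_ngrams(filtered_ngrams, combined_distance, threshold=0.3),
# where combined_distance mixes the character, Wordnet, and UMLS distances described above.
```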
At block 124, a feature associated with each list of clustered ngrams is identified. The feature associated with the list may be identified on the basis of the number of times an ngram appears in the list. In one embodiment, the ngram that occurs most frequently is identified to denote the feature associated with the list. In an embodiment, the feature associated with an ngram that appears frequently throughout the sample of patient notes may be identified as important to the case, and subsequently selected as will be described below with respect to
At block 126, ngrams in each list are designated as potential evidence of the feature associated with the list. The designated ngrams may exactly match the feature, be an alternative term for the feature, an abbreviation of the feature, a typographical error close to the feature, etc. Referring to
At block 128, each identified feature is categorized by the type of information each feature provides. As described above, the type of information may be related to the pertinent history, physical examination, differential diagnoses, and/or diagnostic studies. Each feature may be further categorized by information on the chief complaint, significant negatives, social history, etc. Referring back to
At block 130, the categorized features are merged. In this step, features that belong to the same type of information, which share multiple patterns, may be merged into a single feature. In this way, the list of features is further reduced and refined.
At block 132, the features, the evidence, and the categorization are stored to be presented to a human reviewer for review. The data assembled in blocks 100-130 may be stored in a file for review. In one embodiment, the data is exported to an html file. In an embodiment, the data is stored in an accessible database file.
Referring next to
At block 402, features and evidence acceptable to indicate the feature are selected. In one embodiment, the feature and acceptable evidence are selected by the reviewer. Referring back to
At block 404, comments are optionally added. As can be seen in the comments column 316, a Chief Complaint for this study has been identified as “strange episode,” and comments are added to indicate which evidence is acceptable to indicate the Chief Complaint.
At block 406, an automatic key file is generated. The file may be generated when the reviewer has completed the selections and comments at blocks 402 and 404. The combination of data assembled in accordance with the steps of flowchart 10 along with the selections and comment received in accordance with the steps of flowchart 40 completes the scoring model, which may be stored as a key file. The key file may be stored, e.g., in memory, for use in scoring subsequently analyzed patient notes.
Referring next to
At block 502, a set of patient notes is received. As used herein, a “set” of patient notes refers to a number of patient notes that are scored using the scoring model. In one embodiment, the “set” of patient notes does not include patient notes from the “sample” of patient notes used to generate the scoring model at
At block 504, it is determined whether the feature and acceptable evidence of the feature is present in each note in the set of patient notes. The determination may be performed according to steps 506-514.
At block 506, ngrams are extracted from each patient note in the set of patient notes. The ngram extraction may be performed in a manner similar to the extraction step at block 102 of
After ngrams are extracted from each patient note, it is determined whether the ngrams match the feature or acceptable evidence of the feature at block 508. The program may be configured to determine exact matches as in block 510, fuzzy matches as in block 512, or both. In embodiments where the program is configured to perform exact matching at block 510, the program confirms the presence of a feature if any acceptable evidence of it is present in the note exactly as specified in the automatic key file. In embodiments where the program is configured to perform fuzzy matching at block 512, the program confirms the presence of a feature if a variant of the acceptable evidence, allowing for a limited number of typographical errors, general vocabulary variations, and/or medical vocabulary variations, is present in the patient note. Other suitable matching techniques will be understood by one of skill in the art from the description herein.
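A sketch of the matching step, reusing char_edit_distance from the earlier sketch; treating fuzzy matching as a small character edit distance is a simplification of the typographical and vocabulary variations described above.

```python
def feature_present(note_ngrams, acceptable_evidence, fuzzy=False, max_typos=1):
    """Confirm a feature if any acceptable evidence matches a note ngram exactly,
    or, when fuzzy matching is enabled, within a small character edit distance."""
    for evidence in acceptable_evidence:
        for g in note_ngrams:
            if g == evidence:
                return True
            if fuzzy and char_edit_distance(g, evidence) <= max_typos:
                return True
    return False

# feature_present(note_ngrams, {"weight loss", "wt loss"}, fuzzy=True)
```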
In one embodiment, features traditionally used in automated scoring such as word counts, presentational scores, and readability scores will also be produced. Experiments undertaken while developing aspects described herein showed that presentational scores and readability scores were non-contributory in this particular application. They were therefore not exploited.
At block 514, each patient note in the set of patient notes is scored based on the presence of the feature or acceptable evidence indicating the feature. The scoring step of block 514 may be performed by representing each patient note as a vector of binary values as in block 516 and/or computing the score of each patient note with the binary vector values and a regression model at block 518.
At block 516, a file is outputted that includes each patient note represented as a vector of binary values indicating for each feature selected in the scoring model whether the feature is present or not in the note. At block 518, the outputted file is used by a linear regression program to build models based on scores assigned by human raters. Such models may be used to automatically predict scores.
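A sketch of blocks 516-518, assuming scikit-learn's LinearRegression; the key-file structure (a mapping from feature names to sets of acceptable evidence) and the exact-match vectorization are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def note_vector(note_ngrams, key):
    """key: {feature_name: set_of_acceptable_evidence}. Returns a binary vector indicating,
    for each feature in the key file, whether any acceptable evidence appears in the note."""
    ngrams = set(note_ngrams)
    return [int(bool(ngrams & evidence)) for evidence in key.values()]

def build_scoring_model(train_vectors, human_scores):
    """Fit a linear regression model mapping binary feature vectors to rater-assigned scores (1-9)."""
    return LinearRegression().fit(np.array(train_vectors), np.array(human_scores))

def score_note(model, vector):
    """Predict a score for an unseen note, clipped to the 1-9 scale."""
    return float(np.clip(model.predict(np.array([vector]))[0], 1, 9))
```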
According to one embodiment and in addition to linear regression models, experiments applying hierarchical learning were undertaken. In this context, raters' scores are first grouped into three classes (e.g. scores 1-2-3, 4-5-6, and 7-8-9), or five classes (e.g. 1-2, 3-4, 5-6, 7-8, and 9). Various machine learning approaches were used to classify the notes into the designated classes. The resulting classifications were used to guide the next stage of the approach in which linear regression was employed to distinguish between class members within each class that includes sufficient instances for training. These experimental settings were chosen because they provide more training data for each of the classes, and the strengths of alternate classifiers can be exploited, rather than relying solely on linear regression.
The scoring step of block 514 and an evaluation of the scoring step is now described according to one embodiment of the invention. The following embodiment is exemplary and not exclusive. Other suitable scoring techniques for each patient note in a set of patient notes will be understood by one of skill in the art from the description herein.
The evaluation setting was designed to mimic the operational context in which the amount of available training data is limited. As a result, a sample of 300 notes was first used to automatically suggest features and acceptable evidence for these features. The reviewers then examined and filtered the list, and a CAPTNS key file (as opposed to the key files intended to be used by human raters) was produced for each case. In total, CAPTNS key files for 14 cases have been produced. Each of these CAPTNS key files is then used by the automatic annotation program to produce a vector representation of each patient note characterizing a case. These vectors serve as input to linear regression and other machine learning algorithms. For each case, 8 different settings have been used (see also Table 1 for their descriptions and abbreviations), most of them relying on linear regression (only Hi makes use of other machine learning algorithms). Setting C uses all of the features and acceptable evidence suggested by CAPTNS. This helps establish the added value of human reviewers. In W, only the word counts of the History and Physical Examination parts are used. This is to establish the baseline of the scoring systems. In F, only the features and acceptable evidence selected by reviewers are used to produce the vectors for linear regression. In F+P, instead of using a generic engine that recognises Physical Examinations, specific Physical Examination finding patterns were used. In W+F, word counts of History and Physical Examinations are added to the vectors produced by F. In W+F+P, word counts of History and Physical Examinations are added to the vectors produced by F+P. In Hi, the best setting of hierarchical learning is used to produce the final scores. In SF, the best features from the W+F+P set are selected using feature selection on all available data. This is to test how much we could gain if we knew beforehand the best possible set of features.
All of the models presented above were built using 30% of all the notes that are not used to produce the feature list, and the correlations r are calculated between the scores produced by the models and those produced by human raters on the remaining 70% of the notes that were not used in building the automated scoring models, or in producing the feature files. This stringent ratio of training/testing data (30% training, 70% testing) is used to simulate the expected conditions of future system usage rather than test the regression models. We consider, as a baseline, the models built using word counts only (W). We compare the correlations between the scores produced by the models and those produced by human raters on the unseen testing data, and the correlations between the scores of two human raters (Hu) for each case. The overall results are presented in
Alternative methods of producing automated scores (such as hierarchical learning), the statistical significance of the differences in r produced by each setting, the learning curves indicating the relation between the amount of training data and r, and the hypothetical effect of having access to an “ideal” set of features are described below, along with the effect of combining features and the reliability of using certain parameters to predict r.
In addition to linear regression, the scoring problem has been approached from a classification perspective. Firstly, each potential score was considered as a class: scoring patient notes was equivalent to classifying them according to a scheme involving nine possible classes corresponding to scores ranging from 1 to 9. Since the intention was to simulate a scenario in which models are trained on a relatively low number of patient notes (30% of the available data for training), the amount of training data proved to be insufficient for a nine-class classification problem with instances represented using approximately 40 features (the number of features was somewhat case-dependent). This explains the poor results achieved by several classifiers trained on 30% of the data and evaluated on the remaining 70%, with the output of the best classifier indicating an average correlation of 0.185 across all cases. In an attempt to overcome the difficulty of this high-dimensional classification problem, we experimented with hierarchically decomposing the problem by grouping neighbouring scores together. The following score groupings were experimented with:
Grouping A:
Grouping B:
Grouping C:
Grouping D: (this grouping was chosen to proportionally distribute the number of notes into classes)
Grouping E:
Various classifiers (i.e. BayesNet, SVM, SMO, JRip, AdaBoost, J48) were evaluated on all the previously mentioned groupings with a 30/70% split between training and testing data. After classification, experiments were conducted in which the classes included in each grouping were mapped to their median scores. For example, in the case of Grouping A, we mapped the class A1 to 8, the class A2 to 5, and the class A3 to 2. The correlation between scores originally provided by human raters and the scores resulting from mapping the classification output was then measured. In this context, 0.2 ≤ r ≤ 0.3, which is considerably lower than the correlation between human raters. However, given that only one score was assigned to all the instances belonging to a particular class, the correlation was unexpectedly high. Having made the initial coarse-grained classification, the second level of the hierarchical classification process was invoked in order to distinguish between scores belonging to the different classes.
After the classifiers learn to distinguish among the coarse classes situated at the top level of the hierarchy, for each class in a particular grouping, the corresponding instances are filtered on the basis of the top level classifier's prediction. The SMO (Sequential Minimal Optimization) classifier was found to be most accurate in distinguishing between the top level classes included in various groupings, and was therefore selected for the top level classification. For the lower level distinctions, Linear Regression yielded the best accuracy. However, at the second hierarchical level, Linear Regression is only applied to those classes in which there are enough instances to support an additional 30/70% split between training and testing data. Typically these are the mid-range classes (i.e. A2, B2, C2, E3), which are the most frequent ones. In this approach, the classifications provided by Linear Regression for these classes are then combined with those obtained by the SMO model applied to less frequent classes (which had previously been mapped to a single score per class, as described earlier). The correlation between human annotated scores and the scores yielded by the hierarchical classification process were calculated.
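A sketch of the two-stage process under Grouping B, using scikit-learn's SVC in place of the SMO classifier and LinearRegression for the within-class stage; the class medians and the minimum-instance rule are illustrative simplifications of the training/testing split requirement described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

GROUPS = {0: (8, 9), 1: (5, 6, 7), 2: (1, 2, 3, 4)}   # Grouping B
MEDIAN = {0: 8.5, 1: 6.0, 2: 2.5}                     # illustrative fallback scores per class

def fit_hierarchical(X, y, min_instances=50):
    """Stage 1: classify notes into coarse score groups.
    Stage 2: fit a regression model inside each group that has enough training instances."""
    group_of = {s: g for g, scores in GROUPS.items() for s in scores}
    labels = np.array([group_of[s] for s in y])
    top = SVC().fit(X, labels)
    regressors = {g: LinearRegression().fit(X[labels == g], np.array(y)[labels == g])
                  for g in GROUPS if (labels == g).sum() >= min_instances}
    return top, regressors

def predict_hierarchical(top, regressors, X):
    """Use the within-class regressor where available, otherwise fall back to the class median."""
    preds = []
    for x, g in zip(X, top.predict(X)):
        if g in regressors:
            preds.append(float(regressors[g].predict(x.reshape(1, -1))[0]))
        else:
            preds.append(MEDIAN[g])
    return preds
```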
Closest correlation with human annotated scores was observed for Grouping B (Class B1: 8-9, Class B2: 5-6-7; Class B3: 1-2-3-4). These results appear in column Hi in
Table 2 presents a statistical significance matrix indicating differences between the correlation of different scoring methods with human raters. If the difference between (i) the closeness of correlation of system A to human raters and (ii) the closeness of correlation of system B to human raters is sufficiently large, then the difference between A and B is considered to be statistically significant. The statistical significance scores are calculated on the basis of the Fisher transformation of r into z, followed by application of standard two tailed t-tests to those zs. The most important observations are detailed below. Firstly, models built using word count only (W) are not statistically significantly different from models built using only features and acceptable evidence suggested by CAPTNS (C). Models built using features (including features present in the PHYSICAL EXAMINATION segments of the notes) are statistically significantly different from models built using only word count, W (see column 3). This confirms that the content-based scoring methodology, when applied in isolation, is better than word count-based scoring. The next observation is that the best configuration, taking into account operational requirements (limited number of training instances) is W+F+P, which is slightly better correlated with human raters than other models, but the differences are only statistically significant for C (CAPTNS only), W (word count only), and F (features only, in which physical examinations are generic). These results are presented in row 7. This means that although use of the W+F+P setting is to be recommended, W+F is another viable option. The final observation noted here is that when “ideal” sets of features are available, the produced average z is not statistically significantly different from those of W+F, W+F+P, and Hu. This indicates that although the methodology would benefit from careful selection of features, a cost-benefit analysis comparing the time taken to find ideal sets versus the improvement obtained in r, indicates that this strategy may not be economical. Human average z, although slightly better than those of all of the models, except for W+F+P and SF (“ideal” set of features), shows statistically significant differences only with C, W, and F, and differences that are statistically significant at 0.05<p<0.1 from F+P (see row 8).
The dependency of the magnitude of correlation coefficients between different rating methods and human experts on the number of training samples available to those rating methods is now discussed. This dependency was investigated by running the same experiment, using 30, 40, 50, and 60 percent of the available data as training data, and validating on the rest of the data. This process indicated that for word count only, the average correlation coefficients stabilize at 0.32, whereas when only content features are used, the average r increases from 0.40 to 0.43. This indicates that content-based scoring benefits from more training data, whereas word-count based scoring does not (see
A set of “ideal” features for each case is selected by applying linear regression to all the data. The rationale behind this is to discover the degree of improvement in performance of the rating methods that could be obtained by selecting the best set of features. It was observed that the (weighted) average r increases to 0.49 (Δr=0.02) using this approach, although this difference is not statistically significant. The case for which performance improves most when the best set of features is selected is the one for which there is the least amount of training data (case 5124, which has around 200 training samples). This suggests that it is not necessary to optimize the set of features any further, apart from in those cases where little data is available.
Intuitively, it may be thought that the presence of certain pairs of features in the same note should be penalized. For example, xray and CT should not be used together (due to concern about overexposure of the patient to ionizing radiation), although the occurrence of either xray or CT alone would be acceptable. To capture this, from any two features (A and B), we produced three derivative features, F1 = A and B, F2 = A or B, F3 = A xor B, and investigated their effect on the correlation coefficients. The results indicate that the derivative features do not have any effect on the final r across the 14 cases.
For workflow management, it will be useful to predict r (or zr) before any human intervention in the selection of features and acceptable evidence. Analysis of the outputs of different raters was used to determine whether the r produced by CAPTNS alone can be used to predict the final r. It was noted that the zr produced by C and zr produced by W+F+P are highly correlated (r=0.91). This finding has a practical application: before asking human experts to review the feature list, linear regression exploiting all the features suggested by CAPTNS can be used to predict the final zr using the formula: zr
The correlation between the number of features selected by the reviewers and the final r was also investigated. Correlations between zr and the number of features selected by human experts were calculated: the results are presented in Table 3. Essentially, the results suggest that selection of a greater number of features occurring in Physical Examinations and Work Up segments is beneficial, whereas the number of History features selected is almost inconsequential. It should be noted that because these correlations are calculated for a small sample size (14), it cannot be said that these zr are significantly different from zero at 95% confidence. The results also indicate that the number of expressions selected as acceptable evidence of the occurrence of a feature (represented by Line count) does not have any significant effect on the performance of the linear regression models.
The results allow several observations to be made. Most important is that it is possible to build regression models that assign scores to patient notes whose correlation with scores assigned by human raters is comparable with that of other human raters, as long as humans are involved in the selection of important features and the acceptable evidence confirming the presence of those features. This is shown by the (weighted) average r of the optimized setting (W+F+P), 0.47, which is the same as the r between human raters. When only the features suggested automatically by CAPTNS are used (C), performance is worse than that of the baseline methods (although not with statistical significance). This indicates that humans should remain key actors in computer-assisted scoring, and their role should be to indicate potentially important features that serve as the basis for scoring.
The next important observation is that, as shown in Table 2 (column 3), models built using only content features outperform those built using only word count (W). This indicates that the methodology is complementary to those based only on word counts, which is one of the main objectives of computer-assisted scoring. Nevertheless, when word counts are used in addition to content-related features, the results approach the correlations observed between human raters (see also row 8, table 2, which indicates that the difference between human raters' correlations and settings that do not use word count are statistically significant at p<0.1). This is not surprising, as word count also correlates with the amount of information contained in notes. Furthermore, the results show that raters may have a positive bias toward the length of a note, as opposed to its content.
The strength of this methodology is that the reasoning behind the scores is fully accountable and customizable. Human raters can easily modify the plain text CAPTNS key files used by the automatic scoring system.
Rather than calculating the mean r directly, we use Fisher's z transformation. This is the most appropriate way to calculate the average correlation coefficient because the sampling distribution of r is highly skewed, especially when r>0.5. It should also be noted that, thanks to Fisher's z transformation, it is possible to calculate the confidence interval of r in each case, which means that differences between settings can be assessed.
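For concreteness, a minimal sketch of averaging correlation coefficients via Fisher's z transformation (z = arctanh(r), average in z-space, then transform back); the (weighted) variant mentioned above would additionally weight each z by its case's sample size.

```python
import math

def average_correlation(rs):
    """Average correlation coefficients via Fisher's z transformation."""
    zs = [math.atanh(r) for r in rs]      # z = arctanh(r) = 0.5 * ln((1 + r) / (1 - r))
    return math.tanh(sum(zs) / len(zs))   # back-transform the mean z to an average r

# average_correlation([0.47, 0.52, 0.41]) -> the average r after the z transformation
```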
Referring to
The analysis of the correlations between zr and other factors, such as the amount of available data, the number of features selected by reviewers, or the number of patterns (acceptable evidence) selected by reviewers, suggests several interesting hypotheses, but adequate testing will require access to a larger sample of cases. The first is that the more training data that is available, the higher the level of zr that can be obtained. In the current analysis, the obtained level of r=0.46 is not statistically significantly different from zero, given that N=14. The second hypothesis is that reviewers should prioritize selection of features of the WORKUP category (r=0.35), rather than selection of acceptable evidence of the occurrence of those features (r=−0.15). Increasing the number of cases (N) will lead to a better understanding of these correlations.
The revised Step 2 CS patient note features a new component wherein examinees are asked to explicitly list the history and physical examination findings that support each diagnosis they list. This new section of the patient note is called Data Interpretation (DI) and represents a segment of the larger clinical reasoning construct. The history and physical examination sections are grouped together under the Data Gathering (DG) component. The DI and DG sections have separate scoring rubrics, and raters therefore produce two separate scores for each patient note.
Efforts to adapt and modify CAPTNS to accommodate the new patient note began immediately after its operational adoption. To address data in the new DI section, where the supporting history and physical examination findings are presented, the same principle described above (splitting the texts into ngrams, finding similar ngrams, and grouping them together) was applied. Nevertheless, due to the concise way information is commonly presented in the supporting findings sections, and the fact that these sections are considerably shorter than the History and Physical Examination sections within the DG component, the ngrams are collected using a specific method. Firstly, the text is split into chunks using a set of boundary markers that often signify the boundary of a fact in this section, including semicolons, commas, and slashes. Then each chunk is treated as an ngram.
Supporting Hx: “HTN, borderline cholesterol, alcohol intake”
Supporting Ngrams: “HTN”, “borderline cholesterol”, “alcohol intake”.
For each diagnosis, in order to be suggested as potential supporting evidence, an ngram has to be present in the supporting evidence section of that same diagnosis in more than one note. This threshold is determined using empirical observations, and could be modified if needed. Similar ngrams are collected using the same method in other sections of the note.
With regard to the marking of supporting evidence, in addition to noting “yes” or “no” as the presence or absence of supporting evidence, CAPTNS is also able to return a number to indicate how many supporting pieces of information are presented for a given diagnosis.
One or more of the steps described herein can be implemented as instructions performed by a computer system. The instructions may be embodied in a non-transitory computer readable medium such as a hard drive, solid-state memory device, or computer disk. Suitable computers and computer readable media will be understood by one of skill in the art from the description herein.
Referring to
Memory 802 stores information for system 80. For example, memory 802 stores data comprising information to be outputted with the output 806. Memory 802 may further store data comprising patient notes, scores, features, evidence, selections, etc. Suitable memory components for use as memory 802 will be known to one of ordinary skill in the art from the description herein.
Processor 800 controls the operation of system 80. Processor 800 is operable to control the information outputted on the output 806. Processor 800 is further operable to store and access data in memory 802. In particular, processor 800 may be programmed to implement one or more of the methods for scoring patient notes and/or generating scoring models for patient notes described herein.
It will be understood that system 80 is not limited to the above components, but may include alternative components and additional components, as would be understood by one of ordinary skill in the art from the description herein. For example, processor 800 may include multiple processors, e.g., a first processor for controlling information outputted on the output 806 and a second processor for controlling storage and access of data in memory 802.
Although aspects of the invention are illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention.
This application claims priority to U.S. Provisional Application Ser. No. 61/779,321, entitled “PATIENT NOTE SCORING METHODS, SYSTEMS, AND APPARATUS,” filed on Mar. 13, 2013, the contents of which are incorporated fully herein by reference.