This disclosure relates to configuration and training of neural networks to improve the accuracy of neural networks used for multi-label document classification (MLDC), for instance, in Computer Assisted Coding (CAC) where multiple billing codes are assigned to a clinical document.
MLDC is a natural language processing technique for assigning one or more labels from a collection of labels to a particular document in a document corpus based on the text contained within the document. It differs from multi-class classification techniques, in which each document is tagged with exactly one label.
MLDC has a great number of practical applications. One prominent use case is automatic medical coding, in which medical records are assigned multiple appropriate medical codes. A huge number of medical records must be coded for billing purposes every day. Professional clinical coders often use rule-based or simple machine-learning-based systems to assign the right billing codes to a patient encounter, which usually contains multiple documents with thousands of tokens. The large billing code space (e.g., the ICD-10 code system with over seventy thousand codes) and long documents are especially challenging for machine learning models implementing traditional MLDC approaches. Consequently, effective models capable of handling these challenges can have an immense impact in the medical domain, helping to reduce coding cost, improve coding accuracy, and increase customer satisfaction.
Deep learning methods have been demonstrated to produce state-of-the-art outcomes on benchmark MLDC tasks, but demand remains for more effective and accurate solutions. For example, existing convolutional methods rely on overly simple architectures, and standard transformer-based models can only process a few hundred input tokens.
The present disclosure describes systems and techniques for configuring and training a neural network to improve the accuracy of the network used for MLDC. In a first aspect, a system can include one or more computer processors, a non-transitory computer-readable storage medium communicatively coupled to the one or more computer processors, and a machine learning model for multi-label clinical document classification stored on the storage medium. In the first aspect, the model can include an input layer that automatically transforms one or more documents into a plurality of word embeddings, a deep convolutional-based encoder that automatically combines information of adjacent words present in the documents and learns one or more representations of the words present in the documents, an attention component that automatically selects one or more document features and generates label-specific representations for a plurality of identified labels, and an output layer that produces zero or more final classification predictions. The word embeddings included in the system can be pretrained word embeddings, and the pretrained word embeddings can be determined using a skip-gram technique. The pretrained word embeddings can be context insensitive or context sensitive.
In the first aspect, the documents can be unstructured documents. The final classification prediction can include one or more probabilities that one or more respective labels in the plurality of labels are present in the text of the one or more documents. The deep convolutional-based encoder can further include a plurality of squeeze-and-excitation (SE) and residual convolutional modules. The SE and residual convolutional blocks can form an SE/residual convolutional block pair, and the SE and residual convolutional blocks in each pair are independent but receive the same data. The SE convolutional module can include an SE network followed by a layer normalization component. The SE network can include one or more one-dimensional convolutional layers, a global average pooling component, a squeeze component, and an excitation component. The deep convolutional-based encoder can include a plurality of encoding blocks. The attention component can be configured to extract all outputs from the plurality of encoding blocks. The attention component can automatically select the most important text features. The model can identify both frequently occurring and rarely occurring labels in the documents using a first determination for the frequently occurring labels and a second determination for the rarely occurring labels. The first determination can be a binary cross entropy loss determination and the second determination can be a focal loss determination. The one or more documents can be represented by a word embedding matrix that includes each of the transformed word embeddings.
In a second aspect, a method for training the neural network is disclosed, including receiving a plurality of documents, providing the received documents to a neural network model comprising a deep convolutional-based encoder including a plurality of squeeze-and-excitation (SE) and residual convolutional modules that form a plurality of SE/residual convolutional block pairs, determining a word embedding matrix for the plurality of documents, providing one or more word embeddings in the word embedding matrix to the encoder, generating one or more label-specific representations based on the output of the plurality of SE/residual convolutional block pairs, computing a probability of a label being present in the one or more documents given the one or more label-specific representations, and using a first loss function to train the model for frequently occurring labels and a second loss function to train the model for rarely occurring labels, wherein the first loss function is used until the model performance saturates, after which the model is trained using the second loss function.
In the second aspect, the first loss function is a binary cross entropy loss function and the second loss function is a focal loss function. Providing each word embedding can include providing each word embedding to each SE/residual module pair. Providing each word embedding to each SE/residual module pair can also include providing each word embedding to the SE module, which can include computing one or more channel-dependent coefficients using a two-stage computation and normalizing the computed channel-dependent coefficients to generate an SE module output, and simultaneously providing each word embedding to the residual module, which can include transforming the word embeddings using a filter-size-1 convolutional layer and adding the transformed word embeddings to the SE module output, generating the output of the SE/residual convolutional block pair. The two-stage computation can include, in a first stage, compressing each channel into a single numeric value and, in a second stage, processing each channel using a dimensionality-reducing computation with a reduction ratio followed by a dimensionality-increasing computation that returns the channel to its original dimension.
In the second aspect, instead of providing each word embedding to an SE/residual convolutional block pair, an output from a previous layer in the neural network model can be provided to the SE/residual convolutional block.
Systems and techniques are described that utilize and train a neural network in accordance with various aspects of the disclosure. In general, the neural network is arranged into four primary components: an input layer, a deep convolution-based encoder, an attention component, and an output layer. The disclosure also describes additional techniques that are used to enhance the results of the specified model. For example, to obtain high-quality representations of the document texts, we incorporate the squeeze-and-excitation (SE) network and the residual network into the convolution-based encoder. Furthermore, the convolution-based encoder consists of multiple encoding blocks so as to enlarge the receptive field and capture text patterns with different lengths. This can be important, for example, when used in processing documents in the medical field, and particularly as it relates to unstructured text (e.g., represented by a dictation in the medical record) where the text patterns are unlikely to have a uniform length.
Another example of an enhancement described herein pertains to the manner in which attention is performed. Specifically, according to various aspects of the disclosure, instead of only using the last encoder layer output, the systems and techniques that implement the instant disclosure are configured to extract all encoding layer outputs and to apply the attention computation on all outputs to select the most informative features for each label. In other words, the systems and techniques disclosed herein can be used to select one or more phrases describing the specific clinical conditions that can be used to generate a code or label. For instance, according to particular implementations, the phrase “hypertension” can be selected as the most informative feature for assigning the code I10 to a particular document. Yet another example of an enhancement described herein, used to improve the accuracy of the underlying model, is using a combination of one or more loss functions to train the model. For example, the model can first be trained using a binary cross entropy loss function. After the model has saturated (e.g., the accuracy of the model has not improved over one or more epochs), the model is then trained using a focal loss function. By training the model in this way, it has been observed that systems and techniques implementing aspects of the disclosure perform better on both frequent and rare labels that may appear in a document corpus than existing models trained using different techniques.
Referring now to the input layer, which is represented by block 102, the input layer 102 takes a word sequence as input. According to particular implementations, the word sequences can originate from structured documents in a document corpus, unstructured documents in the document corpus, structured or unstructured portions of documents in a document corpus, or combinations thereof. For example, the input layer 102 could receive a sequence of words identified in a medical record containing both structured and unstructured regions. In some implementations, the sequence of words could be an entire encounter between a physician and a patient, or it could be some subset of words related to the encounter, to name two examples. Other sequences are also possible, for instance, portions of the medical record pertaining to a particular diagnosis or procedure over a number of different physician/patient encounters. According to aspects of the present disclosure, the input layer is configured to determine a word embedding for each word in the received sequence of words. As used herein, a word embedding is a learned representation whereby words having the same or similar meaning have the same or similar representation. For instance, the words “yes” and “affirmative” may have similar representations as word embeddings because the definitions may be the same or similar in certain contexts. Conversely, the words “not” and “knot” may have different representations as word embeddings because the meanings of the two words are different regardless of context.
In some implementations, the word embeddings may be pretrained word embeddings. In general, pretrained word embeddings are word embeddings that have been learned or used in one context but may be applied in a different but related context. This may improve the speed at which the underlying model 100 is trained. For example, word embeddings pretrained on general English texts (e.g., online encyclopedia pages) could be suitable for topic classification on social media. In some implementations, the pretrained word embeddings are determined using a conventional technique, such as a skip-gram technique. This means, in general, that the pretrained word embeddings are learned without taking into consideration the surrounding words, so the resulting embeddings are not dependent on context. For example, the embedding for “dog” will be the same whether it appears in the context “the dog wagged its tail” or “a dog day afternoon.” The benefit of context-insensitive word embeddings is that they can be represented with a simple lookup table, which is highly efficient. Likewise, in some implementations, the pretrained word embeddings are context sensitive, which means that the pretrained word embeddings are learned taking into consideration the surrounding words, so the embedding for “dog” in the previous example would differ between the two contexts. For example, the word “knot” may have different meanings depending on the context present in the document in which the word appears, so it would not be unexpected for the context sensitive word embedding for “knot” to differ from the context insensitive word embedding. While context sensitive word embeddings generally lead to more accurate predictions, they are also computationally expensive, often requiring as much or more computational effort than a convolutional neural network (CNN), such as the one described herein. That said, depending on the application, context sensitive word embeddings do not always outperform context insensitive word embeddings. For example, while it has been observed that contextualized embeddings are much better at sentence-level tasks, that is not the case at the document level. Consequently, the choice of whether to use context sensitive or context insensitive embeddings may vary according to particular implementations, and in some implementations the more computationally expensive models may not be justified.
In one implementation, each word is represented by (or transformed into) a word embedding of size d_e. In general, an embedding can be represented by a vector of numbers, and the size defines how many numbers are in the vector. A vector of size 200 can encode more information about the word than a vector of size 100, for example. Furthermore, in one implementation, the one or more documents to be analyzed may contain N_w words. In such an implementation, the word embedding matrix generated by the input layer can be represented by X_e = [x_1, . . . , x_{N_w}] ∈ R^{N_w×d_e}, where each x_i is the d_e-dimensional embedding of the ith word.
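As a concrete illustration, the following is a minimal sketch (not taken verbatim from the disclosure) of an input layer that maps a tokenized word sequence to the embedding matrix X_e using a pretrained, context-insensitive lookup table. The vocabulary, the random stand-in for skip-gram vectors, and all names (`vocab`, `embed_document`, `pretrained_vectors`) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained skip-gram vectors: one row per vocabulary word, d_e columns.
vocab = {"<pad>": 0, "<unk>": 1, "patient": 2, "denies": 3, "hypertension": 4}
d_e = 100
pretrained_vectors = torch.randn(len(vocab), d_e)  # stand-in for real skip-gram vectors

# Context-insensitive embeddings reduce to a simple lookup table.
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False, padding_idx=0)

def embed_document(tokens: list[str]) -> torch.Tensor:
    """Transform a word sequence into the word embedding matrix X_e of shape (N_w, d_e)."""
    ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
    return embedding(ids)

X_e = embed_document(["patient", "denies", "hypertension"])
print(X_e.shape)  # torch.Size([3, 100])
```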
The convolutional encoder is represented by the Res-SE blocks 104a-104n. In general, the convolutional encoder is configured to transform information received by one or more layers of the model. For instance, the input word embeddings X_e generated by the input layer 102 can be provided to the one or more Res-SE blocks 104a-104n. In addition, higher layers in the model 100 can receive input from lower layers of the model 100 to aggregate information; for instance, a Res-SE block after the first can receive the output of the preceding Res-SE block as its input.
According to one implementation, each Res-SE block is made up of a residual squeeze-and-excitation convolutional block. The Res-SE blocks 104a-104n present in the convolutional encoder, and the manner in which the Res-SE blocks 104a-104n transform information, are described in more detail below.
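To make the stacking concrete, the following is a minimal sketch of an encoder that stacks several Res-SE blocks and retains every block's output for the attention component. It assumes a `ResSEBlock` module (one possible implementation is sketched later in this description); the class name, constructor signature, and batch-first tensor layout are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Stack of Res-SE blocks; every block's output is kept for multi-layer attention."""
    def __init__(self, d_e: int, d_conv: int, kernel_size: int, num_blocks: int):
        super().__init__()
        dims = [d_e] + [d_conv] * num_blocks
        self.blocks = nn.ModuleList(
            [ResSEBlock(dims[i], dims[i + 1], kernel_size) for i in range(num_blocks)]
        )

    def forward(self, X_e: torch.Tensor) -> list[torch.Tensor]:
        # X_e: (batch, N_w, d_e)
        outputs, H = [], X_e
        for block in self.blocks:
            H = block(H)          # each block receives the previous block's output
            outputs.append(H)     # all layer outputs are exposed to the attention layer
        return outputs
```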
According to particular implementations, the outputs 106a-106n are provided to the attention layer, which is represented by components 108a-108n.
According to one implementation, to attend to the ith Res-SE layer output H_i, the attention component computes attention weights over the word positions for each label in the label space and uses those weights to combine the positions of H_i into a label-specific representation V_i for that layer. Repeating this computation for each of the Res-SE layer outputs yields one set of label-specific representations per encoding block, so that features captured at different receptive-field sizes are available for every label.
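The disclosure describes this attention computation at a high level, so the following is a minimal sketch under the assumption of a standard per-label (label-query) attention over one encoder layer output. The `LabelAttention` class, the learnable label-query matrix `U`, and the softmax-over-positions formulation are illustrative assumptions rather than the exact formulation of the model.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Per-label attention over one encoder layer output H_i (assumed formulation)."""
    def __init__(self, d_conv: int, num_labels: int):
        super().__init__()
        self.U = nn.Linear(d_conv, num_labels, bias=False)  # one query vector per label

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, N_w, d_conv) -- output of one Res-SE block
        scores = self.U(H)                      # (batch, N_w, L)
        alpha = torch.softmax(scores, dim=1)    # attention weights over word positions
        V = alpha.transpose(1, 2) @ H           # (batch, L, d_conv) label-specific reps
        return V

# Multi-layer attention: apply an attention module to every encoder layer output, e.g.
# V_list = [attn_i(H_i) for attn_i, H_i in zip(attention_layers, encoder_outputs)]
```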
In other implementations, where the application domain has a large label space but insufficient data points, a multi-layer attention model may be difficult to train. In such implementations, it may be advantageous to instead use a sum-pooling attention in model 100, where each convolutional layer output is first transformed to have the same dimension as the last layer, all layers are then summed, and attention is applied to the summed output. In such implementations, the resulting label-specific representation V′ takes the place of the per-layer representations in the output computation described below.
The output layers 114 and 116 receive the label-specific representations and use those label-specific representations to compute the probability that each label appears in the document corpus. In one implementation, the probability determinations leverage the fully connected layer (e.g., represented by block 114) and perform both a sum-pooling operation and a sigmoid transformation (represented by block 116). The probability p can be represented as p = sigmoid(pooling(W_f Σ_i V_i + b_f)), where the sum runs over the label-specific representations V_i produced for the encoding layers, and W_f and b_f are the weights and bias of the fully connected layer 114.
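As one possible reading of this output computation, the sketch below sums the per-layer label-specific representations, applies a per-label linear scoring (standing in for the fully connected layer 114), and applies a sigmoid. The per-label parameterization of `W_f` and `b_f` and all names are assumptions for illustration, not the disclosure's exact output layer.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Sum-pool label-specific representations across layers, then score each label."""
    def __init__(self, d_conv: int, num_labels: int):
        super().__init__()
        self.W_f = nn.Parameter(torch.empty(num_labels, d_conv))  # one weight vector per label
        self.b_f = nn.Parameter(torch.zeros(num_labels))
        nn.init.xavier_uniform_(self.W_f)

    def forward(self, V_list: list[torch.Tensor]) -> torch.Tensor:
        V_sum = torch.stack(V_list, dim=0).sum(dim=0)       # (batch, L, d_conv) sum over layers
        logits = (V_sum * self.W_f).sum(dim=-1) + self.b_f  # (batch, L) per-label scores
        return torch.sigmoid(logits)                        # probability that each label is present
```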
Referring back to the SE module 204, the module 204 is configured to improve the performance of neural network configuration 100 by, e.g., enhancing the learning of document representations that ultimately improve prediction tasks (such as predictions made by the output layer 116) of the neural network configuration 100. At a high level, the SE module 204 receives input 202, such as input X, and according to various computations, produces an output denoted as X̃. The manner in which the SE module 204 is configured to produce output X̃ is further described below.
The Res-SE block 104 also simultaneously provides input 202, such as input X, to the residual module 208. The residual module 208 is configured to take input 202 and transform the input 202 using a filter-size-1 convolutional layer. For instance, using the filter-size-1 convolutional layer, input X can be transformed into X′. The Res-SE block 104 can then add the SE module 204's output X̃ to the residual module 208's output X′, as represented by block 210. Finally, a GELU (Gaussian Error Linear Unit) activation function can be applied to the sum of X̃ and X′ to produce the output of the Res-SE block 104, which is denoted as H (e.g., outputs 106a-106n).
As depicted, the SE module 204 includes a 1-dimensional convolutional layer 301 with a convolutional filter W_c ∈ R^{k×d_e×d_conv}, where k is the filter size, d_e is the in-channel size (i.e., the size of the input embeddings) and d_conv is the out-channel size (i.e., the size of the output embeddings). Applying this 1-dimensional convolutional filter to input 202 represented as input X, the 1-dimensional convolution is computed as c_i = W_c * x_{i:i−k+1} + b_c, where * is the convolutional operation and b_c is the bias. In accordance with aspects of this disclosure, the output convolutional features can be represented as C = [c_1, . . . , c_{N_w}] ∈ R^{N_w×d_conv}.
After C is determined by the 1-dimensional convolutional layer 301, the SE module 204 uses a two-stage process, “squeeze” and “excitation,” to compute the channel-dependent coefficients that enhance the convolutional features. According to one implementation, the “squeeze” stage is represented by the GAP component 302, and the “excitation” stage is represented by the two linear components 304 and 306, respectively.
In the “squeeze” stage represented by the GAP component 302, each channel (or single dimension in the multi-dimensional feature map produced by the 1-dimensional convolutional layer 301) is compressed into a single numeric value via global average pooling (GAP). GAP can be defined as z_c = GAP(C), where z_c ∈ R^{d_conv} is the resulting channel descriptor.
In the “excitation” stage, the channel descriptor z_c is provided first to a dimensionality-reduction layer 304, which has a reduction ratio r. The reduction ratio r is a hyperparameter that can be tuned. In one particular implementation, r is 20 based on empirical evaluation of results of the model 100, although r can be any value according to desired outcomes. After the dimensionality reduction performed by layer 304, the channel descriptor is next provided to a dimensionality-increasing layer 306. Layer 306 increases the channel dimension back to the channel dimension of C. This two-stage process of first providing the channel descriptor z_c to dimensionality-reduction layer 304 and then to dimensionality-increasing layer 306 can be defined as s_c = sigmoid(W_2 δ(W_1 z_c + b_1) + b_2), where δ is a non-linear activation, W_1 ∈ R^{(d_conv/r)×d_conv} and W_2 ∈ R^{d_conv×(d_conv/r)} are the weights, and b_1 and b_2 are the biases of the fully-connected layers 304 and 306.
After the SE module 204 computes s_c, component 308 rescales the convolutional features C by the value s_c, or X̃ = scale(C, s_c), where scale denotes the channel-wise multiplication between C and s_c. Finally, X̃ can be normalized (e.g., via layer normalization) and used as the output of the SE module 204. As described above, the output X̃ of the SE module 204 is then added to the residual module 208's output X′ to produce the output of the Res-SE block 104.
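The following is a minimal sketch, assuming PyTorch and the shapes introduced above, of one way the SE module 204 and the surrounding Res-SE block could be implemented. The ReLU inside the excitation stage, the layer-normalization placement, the odd kernel size, and the class and variable names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """1-D convolution followed by squeeze-and-excitation and layer normalization."""
    def __init__(self, d_in: int, d_conv: int, kernel_size: int, r: int = 20):
        super().__init__()
        # Same-length output assumes an odd kernel size.
        self.conv = nn.Conv1d(d_in, d_conv, kernel_size, padding=kernel_size // 2)
        self.squeeze = nn.AdaptiveAvgPool1d(1)            # global average pooling (GAP)
        self.excite = nn.Sequential(                      # two-stage excitation
            nn.Linear(d_conv, d_conv // r), nn.ReLU(),    # dimensionality reduction (ratio r)
            nn.Linear(d_conv // r, d_conv), nn.Sigmoid(), # back to the original dimension
        )
        self.norm = nn.LayerNorm(d_conv)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, N_w, d_in) -> C: (batch, d_conv, N_w)
        C = self.conv(X.transpose(1, 2))
        z_c = self.squeeze(C).squeeze(-1)                 # (batch, d_conv) channel descriptor
        s_c = self.excite(z_c).unsqueeze(-1)              # (batch, d_conv, 1) channel weights
        X_tilde = (C * s_c).transpose(1, 2)               # channel-wise rescaling of C
        return self.norm(X_tilde)                         # (batch, N_w, d_conv)

class ResSEBlock(nn.Module):
    """SE module plus a filter-size-1 residual path, combined with a GELU activation."""
    def __init__(self, d_in: int, d_conv: int, kernel_size: int, r: int = 20):
        super().__init__()
        self.se = SEModule(d_in, d_conv, kernel_size, r)
        self.residual = nn.Conv1d(d_in, d_conv, kernel_size=1)  # filter-size-1 convolution

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        X_tilde = self.se(X)                                        # SE path
        X_prime = self.residual(X.transpose(1, 2)).transpose(1, 2)  # residual path
        return nn.functional.gelu(X_tilde + X_prime)                # block output H
```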
In order for the neural network model 100 as described above to accurately predict the presence of labels in the text of a document corpus, the model 100 must be trained.
In step 404, the system provides the received documents to a neural network model. For instance, according to particular implementations, the received documents can be provided to neural network model 100.
In step 406, a word embedding matrix is determined. For example, as described above in connection with the input layer 102, the word embedding matrix X_e can be determined for the plurality of received documents.
In step 408, the system provides the word embeddings to an encoder. For example, as described above in reference to the convolutional encoder, the word embeddings in the word embedding matrix can be provided to the Res-SE blocks 104a-104n.
Also, as described above, instead of providing each word embedding directly to an SE/residual convolutional block pair, the output of a previous layer in the neural network model 100 can be provided to the SE/residual convolutional block pair.
In step 410, the system generates one or more label-specific representations. For instance, as described above in reference to the attention components 108a-108n, label-specific representations can be generated based on the outputs of the plurality of SE/residual convolutional block pairs.
In step 412, the system computes a probability of a label being present in the one or more documents. For instance, as described above in reference to the output layers 114 and 116, the probability can be computed from the one or more label-specific representations.
In step 414, the system uses a loss function to train the model 100. According to particular implementations, the system uses a first loss function to train the model 100 for frequently occurring labels and a second loss function to train the model 100 for rarely occurring labels. For instance, the system can first use a binary cross entropy loss function to train the model 100. One such binary cross entropy loss function can be represented mathematically as L_BCE(p_t) = −log p_t, where p_t is defined as p_t = p if y = 1 and p_t = 1 − p otherwise, where p is the predicted probability and y represents whether human inspection indicates that a label is or is not present in a particular document.
Technique 400 can be run over a number of epochs, or iterations. As the model 100 is trained, further training may yield diminishing returns. At some point, the training is likely to saturate, which means that only negligible improvements in the accuracy of model 100 can be realized through further training. After the model 100 training saturates, the loss function can be changed, and the model 100 can continue to be trained. In one implementation, the second loss function that is used is intended to train the model 100 to identify rare labels (i.e., labels that occur infrequently in a document corpus). For instance, the focal loss function can be used to handle the long-tail distribution of the labels (i.e., rare labels in the text of the document corpus). In general, the focal loss function dynamically reduces the loss weights assigned to well-classified labels, which makes the model more sensitive to labels that are less well-classified. According to one implementation, the focal loss function can be defined by the following mathematical representation: L_FL(p_t) = −(1 − p_t)^γ log p_t, where γ is a tunable parameter to adjust the strength (or influence) by which the loss weights are reduced. In other words, the weight term (1 − p_t)^γ suppresses the loss from well-classified labels (where p_t tends to be high) and “focuses” the loss on labels that receive less confident predictions.
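As a concrete illustration of this two-stage objective, the following is a minimal sketch, assuming PyTorch and the p_t definition given above, of the two loss functions and a simple rule for switching from binary cross entropy to focal loss once validation performance stops improving. The γ value, the patience threshold, and the function names (`pick_loss`, etc.) are illustrative assumptions, not the disclosure's exact training procedure.

```python
import torch
import torch.nn.functional as F

def bce_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """L_BCE(p_t) = -log(p_t), i.e. binary cross entropy on predicted probabilities p."""
    return F.binary_cross_entropy(p, y.float())

def focal_loss(p: torch.Tensor, y: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """L_FL(p_t) = -(1 - p_t)^gamma * log(p_t), with p_t = p if y == 1 else 1 - p."""
    p_t = torch.where(y == 1, p, 1 - p).clamp(min=1e-8)
    return -((1 - p_t) ** gamma * torch.log(p_t)).mean()

def pick_loss(epochs_without_improvement: int, patience: int = 3):
    """Use BCE until validation performance has not improved for `patience` epochs
    (saturation), then continue training with focal loss to emphasize rare labels."""
    return bce_loss if epochs_without_improvement < patience else focal_loss

# Example with dummy predictions/labels for a batch of 2 documents and 5 labels:
p = torch.sigmoid(torch.randn(2, 5))   # predicted probabilities from the output layer
y = torch.randint(0, 2, (2, 5))        # 0/1 ground-truth label indicators
print(bce_loss(p, y).item(), focal_loss(p, y).item())
```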
Ultimately, using the neural network configuration 100 and training techniques described herein, improved predictive characteristics over other models and other training techniques have been observed. For instance, depending on the benchmark data used to conduct benchmark testing, a significant improvement in accuracy has been realized. Improvements to the underlying accuracy of the system can yield numerous technical and financial benefits. For instance, by improving accuracy without the need to implement more computationally expensive models, computational resources of the computing environment are conserved.
In this example, processing unit 502 includes processing circuitry that may include one or more processors 504 and memory 506 that, in some examples, provide a computer platform for executing an operating system 516, which may be a real-time multitasking operating system, for instance, or other type of operating system. In turn, operating system 516 provides a multitasking operating environment for executing one or more software components such as application 518. Processors 504 are coupled to one or more I/O interfaces 514, which provide I/O interfaces for communicating with devices such as a keyboard, controllers, display devices, image capture devices, other computing systems, and the like. Moreover, the one or more I/O interfaces 514 may include one or more wired or wireless network interface controllers (NICs) for communicating with a network. Additionally, processors 504 may be coupled to electronic display 508.
In some examples, processors 504 and memory 506 may be separate, discrete components. In other examples, memory 506 may be on-chip memory collocated with processors 504 within a single integrated circuit. There may be multiple instances of processing circuitry (e.g., multiple processors 504 and/or memory 506) within processing unit 502 to facilitate executing applications in parallel. The multiple instances may be of the same type, e.g., a multiprocessor system or a multicore processor. The multiple instances may be of different types, e.g., a multicore processor with associated multiple graphics processor units (GPUs). In some examples, processor 504 may be implemented as one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or equivalent discrete or integrated logic circuitry, or a combination of any of the foregoing devices or circuitry.
The architecture of processing unit 502 illustrated and described herein is provided by way of example only; other architectures are possible.
Storage units 534 may be configured to store information within processing unit 502 during operation. Storage units 534 may include a computer-readable storage medium or computer-readable storage device. In some examples, storage units 534 include at least a short-term memory or a long-term memory. Storage units 534 may include, for example, random access memories (RAM), dynamic random access memories (DRAM), static random access memories (SRAM), magnetic discs, optical discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable memories (EEPROM).
In some examples, storage units 534 are used to store program instructions for execution by processors 504. Storage units 534 may be used by software or applications running on processing unit 502 to store information during program execution and to store results of program execution. For instance, storage units 534 can store the neural network configuration 100 as it is being trained using technique 400.