This disclosure generally relates to methods that apply machine learning techniques for modifying or otherwise processing electronic content. More specifically, but not by way of limitation, this disclosure relates to using machine learning techniques to generate section identifiers for one or more sections of unstructured or unformatted text data.
The volume of digital content written as text documents is growing every day, at an unprecedented rate. In particular, section identifiers (e.g., heading, list) in plain text could act as a precursor to many document-processing applications, such as auto-stylizing unformatted text, font/style suggestion, text summarization, and topic detection. However, a large number of documents are unstructured. Identifying the section identifiers in the unstructured documents is largely a manual process that is time consuming, labor intensive, and costly. Existing techniques, such as Natural Language Processing (NLP) and other deep learning techniques, have been applied to identify logical structures in a document. However, existing techniques rely on formatting information of the original text to identify structures therein. For instance, these techniques often involve identifying features such as text case (e.g., lowercase, uppercase) and features derived from the font (e.g., size, color, a font type distinct from other sections of the document) that are applied to the text. As such, existing techniques are ineffective for processing documents that do not include these or other types of formatting information.
Certain embodiments involve automatically detecting section identifiers (e.g., identifiers of a heading, a body, a list, etc.) in an unformatted and unstructured document. For instance, a document-processing application identifies, with a feature-prediction layer of a machine-learning model, a feature representation that represents a semantic structure of a given text section (e.g., a paragraph) within the unformatted and unstructured document. The document-processing application enhances the feature representation with additional paragraph-level features (e.g., number of words) to generate an enhanced feature representation of the text section. The document-processing application generates, with a sequence-prediction layer of the machine-learning model, a section identifier (e.g., heading, body, list) for a corresponding text section by applying the sequence-prediction layer to the enhanced feature representation and using contextual information of neighboring text sections.
These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings.
Certain embodiments described herein can address one or more of the problems identified above by using machine learning techniques to generate section identifiers for one or more sections of unstructured or unformatted text data. For instance, a document-processing application uses a feature-prediction layer of a machine-learning model to generate a representation of a semantic structure for each text section within the unformatted and unstructured document, augments that representation with paragraph-level features, and uses a sequence-prediction layer to generate a section identifier from the augmented representation. A section identifier identifies a type of text section (e.g., heading, body, list) associated with the corresponding text section. The document-processing application applies the section identifiers to the corresponding text sections to generate a formatted text document for subsequent text-processing operations.
In an illustrative example, a document-processing application accesses unstructured and unformatted input text data having multiple text sections. For instance, the document-processing application accesses a recipe document that includes a heading section, a first body section, a list section, and a second body section. In this example, the text document does not include any information (e.g., metadata, section identifiers) to indicate a type of text section for each of the four text sections of the text document.
Continuing with this example, the document-processing application identifies various text sections in the recipe document by identifying a sequence of text tokens ending with a newline character. The document-processing application generates a first feature that represents a first one of the identified text sections and a second feature that represents a second one of the identified text sections. To generate the first feature and the second feature, the document-processing application applies a feature-prediction layer of a machine-learning model to a first input embedding derived from the first text section and to a second input embedding derived from the second text section. The document-processing application or another software tool generates each input embedding by translating sparse vectors that represent words of a corresponding text section into a relatively low-dimensional vector that is the input embedding. This input embedding represents at least some of the semantics of the corresponding text section.
In this example, the document-processing application generates section identifiers for the first and second text sections of the recipe document based on a predicted contextual relationship between the first and second text sections. For instance, the document-processing application determines a predicted contextual relationship by applying a sequence-prediction layer of the machine-learning model to the first feature and the second feature. A contextual relationship indicates transitions between text sections and establishes a role of a particular text section in determining the prediction of the previous and subsequent section identifiers. The sequence-prediction layer is able to predict a contextual relationship by, for example, identifying one or more words (e.g., “following”) and/or characters (e.g., colon character) in a text section that are indicative of a transition to a different type of text section for a subsequent text section (e.g., list). The document-processing application generates a heading-section identifier for the first text section of the recipe document and a body-section identifier for the second text section of the recipe document based on the determined contextual relationship between the first and second sections. The document-processing application also generates the heading-section identifier and the body-section identifier based on the relationship between the two sections and other remaining sections of the recipe document.
The document-processing application generates a text document having the input text data augmented with section identifiers. For instance, the document-processing application applies the first section identifier to the first text section and applies the second section identifier to the second text section. As a result, the generated text document includes metadata (e.g., the section identifiers) that facilitate navigating to or otherwise identifying different text sections within the text document. Furthermore, the section identifiers also facilitate various other text-processing operations that can be performed on the formatted text document, including auto-stylizing of text sections, font/style suggestion, text summarization, table-of-contents generation, and topic detection. For instance, in the example in which the document-processing application generates a formatted recipe document, a text editing tool could modify visual characteristics of each text section in the recipe document by using the section identifiers to select a certain section (e.g., the first text section having a heading-section identifier, the second text section having a body-section identifier, etc.) and apply specific auto-formatting rules to that section (e.g., italicizing and enlarging the first text section because it is a header).
As described herein, certain embodiments provide improvements to software tools that use machine-learning models for modifying or otherwise processing digital text content. For instance, existing software tools might rely on document metadata specifying certain formatting cues (e.g., font attributes specifying size or formatting) to identify sections of a document, which makes those tools unsuitable for input text without such metadata (e.g., text derived from an optical character recognition process, plain text entered into an electronic form, etc.). Relying on these existing technologies could decrease the utility of editing tools that use section identifiers to modify or transform text, such as auto-formatting tools or text summarization tools. Embodiments described herein can facilitate an automated process for distinguishing and identifying text sections that avoids this reliance on ineffective technologies. For instance, the feature-prediction layer that extracts semantic features of text sections and the sequence-prediction layer that utilizes contextual information to supplement the extracted features allow sections to be identified based on the semantic content of the text and the relationships among text sections, without regard to the formatting of the text. These features allow various embodiments herein to segment a wider variety of electronic document types than existing tools, thereby reducing the manual, subjective effort involved in segmenting unformatted or unstructured text more effectively than conventional techniques.
The document-processing application 102 then uses an embedding-matrix generator 110 to generate an embedding matrix for each of the sequence of text sections 108a-108n. Specifically, the embedding-matrix generator 110 generates the embedding matrix for each of the sequence of text sections 108a-108n, such that the embedding matrix encodes one or more tokens of the text section. In some instances, the embedding matrix includes an identifier usable to identify a position of the text section (e.g., an index value) within the sequence of text sections 108a-108n. In each embedding matrix, the embedding-matrix generator 110 generates an input embedding for each token of the text section, in which the input embedding includes one or more values that encode a semantic definition of the token.
The document-processing application 102 then applies one or more layers of a machine-learning model 112 to process the embedding matrices representing the plurality of text sections 108a-108n and generate the section identifiers 106. The machine-learning model 112 includes a feature-prediction layer 114 and a sequence-prediction layer 116. The feature-prediction layer 114 transforms the embedding matrix of each of the plurality of text sections 108a-108n into a feature representation that identifies one or more semantic characteristics of the text section. In some instances, the feature-prediction layer 114 includes a convolutional neural network (CNN) that receives the embedding matrix and applies one or more convolutional layers to extract a feature representation of the text section. The document-processing application 102 uses the feature representations of the plurality of text sections 108a-108n to identify the section identifiers 106.
The sequence-prediction layer 116 processes the feature representation to generate a set of output features used by a fully-connected layer (for example) for identifying the section identifiers 106 for the plurality of text sections 108a-108n. In some instances, the sequence-prediction layer 116 identifies a section identifier for a particular text section by using a predicted contextual relationship between the text section and other text sections (e.g., a previous text section, a subsequent text section) of the unformatted text data 104. In some instances, the sequence-prediction layer 116 includes a recurrent neural network ("RNN") for using the predicted contextual relationship to identify the section identifier. Additionally or alternatively, the sequence-prediction layer includes a long short-term memory ("LSTM") network, a type of RNN, for using the predicted contextual relationship to identify the section identifier. The LSTM network can be a bidirectional LSTM network.
In some instances, a learned set of parameters from a given layer is used to train the other layer of the machine-learning model. For example, each iteration of the training process for the sequence-prediction layer 116 includes feeding the loss backwards through the network (e.g., backpropagation) to fine-tune parameters of the feature-prediction layer 114. For example, word tokens such as "following" may not be much of a factor for the feature-prediction layer in identifying a "body" identifier for a particular text section. However, such word tokens can be a strong indicator in identifying a "list" identifier for a subsequent text section. Features (e.g., "following") learned in a particular layer (e.g., a sequence-prediction layer) can be propagated to optimize parameters of the other layer (e.g., a feature-prediction layer). Thus, the trained feature-prediction layer is likely to predict a "list" identifier for the text section that follows a text section having the word token "following."
In some instances, a section-statistics generator 118 of the document-processing application 102 concatenates the feature representation with statistical features (e.g., length-frequency, syntax characteristics) corresponding to a text section. In some instances, the statistical features additionally include a frequency of uppercase characters appearing in the text section or a ratio between a count of uppercase characters and a count of words in the text section. ASCII values corresponding to one or more characters of the text section can also be considered as part of the statistical features. The statistical features facilitate a more complete representation of the text section by identifying syntax and other characteristics of the text section. For example, the section-statistics generator 118 identifies one or more statistical features of a given text section:
The section identifiers 106 augment text data of an output text document. For example, the document-processing application 102 applies a heading identifier (“Abstract”) to a text section of the plurality of text sections 108a-108n and a sub-heading identifier (“1.1.1 Nearest Neighbor (NN)”) to another text section. As a result, the text document generated by the document-processing application 102 includes metadata (e.g., the section identifiers) that facilitate navigating to or otherwise identifying different text sections within the text document. Furthermore, the section identifiers also facilitate various other text-processing operations that can be performed on the formatted text document, including auto-stylizing of text sections, font/style suggestion, text summarization, table-of-contents generation, and topic detection.
At step 202, the document-processing application accesses unstructured and unformatted input text data having a first text section and a second text section. The unstructured and unformatted input text data does not include any information (e.g., metadata, section identifiers) to indicate a type of text section for each of a plurality of text sections in the input text data. In some instances, the document-processing application identifies the first text section and the second text section by identifying a sequence of text tokens ending with a newline character. Additionally or alternatively, a text section can be identified by specifying a fixed number of word tokens (e.g., 120 tokens) to be associated with the text section.
At step 204, the document-processing application generates a first feature that represents the first text section by, at least, applying a feature-prediction layer of a machine-learning model to a first input embedding derived from the first text section. In particular, the document-processing application applies the feature-prediction layer (e.g., a CNN) to the first input embedding to generate the first feature (e.g., a vector including a set of values) corresponding to the first text section. The document-processing application or another software tool generates the first input embedding by translating sparse vectors that represent words of the first text section into a relatively low-dimensional vector that is the input embedding. The first input embedding thus represents at least some of the semantics of the first text section. A detailed example for performing step 204 is described herein with respect to
In some instances, the document-processing application uses a pre-trained language model (e.g., word2vec, fastText) to encode each word token of the first text section into a corresponding input embedding. The document-processing application combines the input embeddings of the tokens into an embedding matrix that represents the word tokens of the first text section.
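By way of illustration only, the following Python sketch shows one way such an embedding matrix could be assembled with a pre-trained fastText model. The model file name, the 120-token section length, and the 300-dimension embeddings are illustrative assumptions rather than requirements of the embodiments described herein.

import numpy as np
import fasttext

SECTION_LEN = 120                                  # assumed fixed number of tokens per text section
model = fasttext.load_model("cc.en.300.bin")       # assumed pre-trained fastText model file
EMB_DIM = model.get_dimension()

def embedding_matrix(tokens):
    # Translate the tokens of one text section into a (SECTION_LEN, EMB_DIM) matrix.
    matrix = np.zeros((SECTION_LEN, EMB_DIM), dtype=np.float32)
    for i, token in enumerate(tokens[:SECTION_LEN]):      # truncate overly long sections
        matrix[i] = model.get_word_vector(token)          # dense, relatively low-dimensional embedding
    return matrix                                         # shorter sections remain zero-padded

print(embedding_matrix("Preheat the oven to 350 degrees .".split()).shape)   # (120, 300)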
At step 206, the document-processing application generates a second feature that represents the second text section by, at least, applying a feature-prediction layer of a machine-learning model to a second input embedding derived from the second text section. Similar to step 204, the document-processing application applies the feature-prediction layer to the second input embedding to generate the second feature corresponding to the second text section. The second input embedding represents at least some of the semantics of the second text section. A detailed example for performing step 206 is described herein with respect to
At step 208, the document-processing application identifies a first section identifier for the first text section and a second section identifier for the second text section based on a predicted contextual relationship between the first text section and the second text section. In some instances, the document-processing application determines the predicted contextual relationship by applying a sequence-prediction layer of the machine-learning model to the first feature and the second feature. The contextual relationship identifies transitions between the first and second text sections and identifies a role of a given text section (e.g., the second text section) in determining the prediction of the previous and subsequent section identifiers (e.g., the first text section). In some instances, the sequence-prediction layer includes an LSTM network. A detailed example for performing step 208 is described herein with respect to
At step 210, the document-processing application 102 generates a text document having the input text data, the first section identifier applied to the first text section, and the second section identifier applied to the second text section. In some instances, the document-processing application applies the first section identifier to the first text section and applies the second section identifier to the second text section. As a result, the generated text document includes metadata (e.g., the section identifiers) that facilitate navigating to or otherwise identifying different text sections within the text document. Process 200 terminates thereafter.
The document-processing application uses machine learning techniques to generate section identifiers for one or more sections of the unstructured or unformatted text data. In particular, the document-processing application uses a machine-learning model that includes a feature-prediction layer and a sequence-prediction layer for identifying a section identifier for a given text section of the text data. The document-processing application uses the feature-prediction layer (e.g., 1-dimensional convolutional neural network) to generate a feature representation of a text section. In some instances, the feature representation is concatenated with a set of values corresponding to length-frequency and syntax characteristics of the text section. The document-processing application also applies the sequence-prediction layer (e.g., a recurrent neural network) to a sequence of the feature representations generated by the feature-prediction layer, so as to correlate each text section to other text sections. A contextual relationship predicted between the text sections is used to determine a section identifier that identifies a type of text section (e.g., heading, body, list) associated with the corresponding text section.
Each of the layers of the machine-learning model can be trained using a training dataset. For example, a training dataset includes a set of documents (e.g., 7000 PDF documents), in which each document includes one or more text sections that are associated with their respective section identifiers. The section identifiers include a type of text section, such as Title, Heading-1, Sub-Heading (e.g., Heading-2), Sub-Sub-Heading (e.g., Heading-3), Paragraph, Table, List, Blockquotes, EndNotes, Footnotes, etc. To train the machine-learning models, the section identifiers are removed from each document, and outputs generated from the machine-learning models are compared with the removed section identifiers for backpropagation. In some instances, one or more text sections of text are omitted from the training dataset based on their respective section identifiers.
1. Feature-Prediction Layer
A document-processing application receives a text section 305 that includes a plurality of tokens (e.g., words, punctuation characters). The document-processing application encodes each token into an input embedding (e.g., a vector represented by a plurality of values) based on its semantic characteristics. In some instances, the document-processing application is configured to generate input embeddings with a predefined number of dimensions. For example, as shown in
The CNN accesses the embedding matrix 310 of the text section and applies a series of operations which form a single convolutional layer: (1) convolution; (2) batch normalization; and (3) max-pooling. To perform convolution, the CNN applies one or more filters, each including a matrix of values that can "slide over" the embedding matrix 310 so as to generate a set of feature maps 315. A filter includes a matrix of values that differs from the matrix of values of another filter, which allows each filter to extract different features from the embedding matrix 310. In some instances, a set of hyperparameters that correspond to the feature-map generation are predefined (e.g., based on manual input). Feature-extraction hyperparameters may identify (for example) a number of filters, a stride for each filter (e.g., 1-step, 2-step), a padding size, a kernel size, and/or a kernel shape. For example, as shown in
Continuing with the example of
After the batch normalization operation, the CNN performs a pooling operation on the set of normalized feature maps 320 in order to reduce the spatial size of each feature map and subsequently generate a set of pooled feature maps 325. In some embodiments, the CNN performs the pooling operation to reduce dimensionality of the set of normalized feature maps 320, while retaining the semantic features captured by the embedding matrix 310. In some instances, the CNN performs a max pooling operation that accesses a group of values within the feature map (e.g., 2 values within the feature map) and selects the element associated with the highest value. This operation can be iterated to traverse the entirety of each feature map of the set of normalized feature maps 320, at which point the max pooling operation completes the generation of the set of pooled feature maps 325. For example, as shown in
The CNN may alternatively or additionally perform an average pooling operation in place of the max pooling operation, which selects the average (or sum) of the elements captured in the area within the feature map. By performing the pooling operations, the CNN may achieve several technical advantages, including the capability of generating a representation of the embedding matrix 310 that reduces the number of parameters and computations within the CNN model.
Continuing with the example of
In addition, the CNN applies a global max pooling operation on the second set of feature maps 330 such that a maximum value for each feature map is selected to form a second set of pooled feature maps 335.
The CNN applies a fully connected layer (alternatively, a dense layer) to the second set of pooled feature maps 335 to generate a feature representation 340 of the text section 305. The fully connected layer includes a multi-layer perceptron network incorporating a softmax activation function or other type of linear or non-linear function at an output layer. In some instances, the CNN uses the fully connected layer to access the extracted features and generate an output that includes a feature representation identifying one or more semantic characteristics of the text section. For example, as shown in
The feature representation 340 can then be used as an input for the sequence-prediction layer, which then performs a series of operations for identifying a section identifier corresponding to the text section 305. In some instances, the output and the labels of the training dataset are used as input for loss functions to optimize the parameters in the CNN. An error value generated by the loss functions is used in backpropagation algorithms to adjust the parameters in the CNN and thus improve the accuracy of subsequent feature representations outputted by the CNN. The feature representation 340 is an example of a feature generated in steps 204 or 206 from process 200 in
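As one non-limiting illustration of a feature-prediction layer of the kind described above, the following tf.keras sketch stacks convolution, batch normalization, max pooling, a second convolution, global max pooling, and a fully connected layer. The filter counts, kernel sizes, input shape, and 128-value feature representation are assumptions chosen for the example, not the claimed architecture.

import tensorflow as tf
from tensorflow.keras import layers

def build_feature_prediction_layer(section_len=120, emb_dim=300, feature_dim=128):
    inputs = layers.Input(shape=(section_len, emb_dim))                    # embedding matrix of one text section
    x = layers.Conv1D(filters=64, kernel_size=3, padding="same")(inputs)   # convolution
    x = layers.BatchNormalization()(x)                                     # batch normalization
    x = layers.MaxPooling1D(pool_size=2)(x)                                # max pooling
    x = layers.Conv1D(filters=128, kernel_size=3, padding="same")(x)       # second convolutional layer
    x = layers.GlobalMaxPooling1D()(x)                                     # global max pooling
    outputs = layers.Dense(feature_dim, activation="relu")(x)              # fully connected layer -> feature representation
    return tf.keras.Model(inputs, outputs, name="feature_prediction_layer")

feature_layer = build_feature_prediction_layer()
feature_layer.summary()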
It will be appreciated that, while
2. Sequence-Prediction Layer
A sequence-prediction layer receives a feature representation for each of the text sections generated by the feature-prediction layer. The feature representation of a given text section includes a set of values that identify one or more semantic characteristics of the text section and may be combined with statistical features (e.g., length-frequency, syntax characteristics) of the text section. The sequence-prediction layer processes the feature representation to generate a set of output features used by a fully-connected layer (for example) for identifying a section identifier for the text section. In some instances, the sequence-prediction layer identifies a section identifier for a given text section by using a predicted contextual relationship between the text section and other text sections of the text data. In some instances, the sequence-prediction layer includes an RNN. Additionally or alternatively, the sequence-prediction layer includes an LSTM network, which is a type of an RNN. The LSTM network can be a bidirectional LSTM network.
With the sequence-prediction layer, the document-processing application detects transitions between text sections and establishes a role of a particular text section in determining the prediction of the previous and subsequent section identifiers. Thus, the sequence-prediction layer not only compares how similar or different text sections are to detect transitions, but also identifies which features in the particular text section are indicative of the prediction of section identifiers for other text sections in the text data.
The hidden state of the recurrent neural network at text section position t is expressed as:

s_t = φ(U x_t + W s_{t-1}),

where x_t is the input for the text section at position t and U and W are weight values.
The output of the recurrent neural network is expressed as:
o_t = softmax(V s_t),

where V is a weight value applied to the hidden state value s_t.
Thus, the hidden state s_t can be referred to as the memory of the network. In other words, the hidden state s_t depends on information associated with inputs and/or outputs used or otherwise derived from one or more previous text sections. The output o_t at step t is a set of values used to identify a section identifier for the text section, and is calculated based at least in part on the memory at text section position t.
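For illustration, a minimal NumPy sketch of this recurrence is shown below; the feature and hidden-state dimensions, the use of tanh for φ, and the four candidate section identifiers are assumptions made for the example.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

input_dim, hidden_dim, num_labels = 160, 64, 4        # e.g., heading, body, list, other (assumed)
U = np.random.randn(hidden_dim, input_dim) * 0.01     # input-to-hidden weights
W = np.random.randn(hidden_dim, hidden_dim) * 0.01    # hidden-to-hidden weights
V = np.random.randn(num_labels, hidden_dim) * 0.01    # hidden-to-output weights

s_prev = np.zeros(hidden_dim)                         # s_{t-1}, the memory of earlier text sections
for x_t in np.random.randn(5, input_dim):             # five text-section features in sequence
    s_t = np.tanh(U @ x_t + W @ s_prev)               # s_t = φ(U x_t + W s_{t-1})
    o_t = softmax(V @ s_t)                            # o_t = softmax(V s_t)
    s_prev = s_t
print(o_t)                                            # probabilities over section identifiers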
As shown in
In addition to maintaining and updating a hidden state s_t, the LSTM network maintains a cell state C_t. As used herein, a cell state encodes, at every step, information of the inputs that have been observed up to that step. In some embodiments, rather than using a single layer for a standard RNN such as the tanh layer shown in
The ⊗ symbol or the ∘ symbol represents the Hadamard product. Gates can allow or disallow the flow of information through the cell. As the sigmoid function results in a value between 0 and 1, the function's value affects how much of each feature of a previous text section should be allowed through a gate. Referring again to
An output value of the forget gate f_t may be expressed as:

f_t = σ(W_f[h_{t-1}, x_t] + b_f),

where W_f is a weight value, b_f is a bias term, and the brackets indicate concatenation of the input values.
Similarly, an output value of the input gate i_t may be expressed as:

i_t = σ(W_i[h_{t-1}, x_t] + b_i)
In addition, an output value of the tan h function {tilde over (C)}t may be expressed by the following equation
{tilde over (C)}
t=tan h(Wc[ht-1,xt]+bc)
The LSTM network then updates the cell state by combining these values:

C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t
For example, the LSTM network passes the input values h_{t-1}, x_t to a sigmoid function. The LSTM network applies a tanh function to the cell state C_t, which was modified by the forget gate and the input gate. The LSTM network then multiplies the output of the tanh function (e.g., a value between −1 and 1 that represents the cell state) with the output of the sigmoid function. The LSTM network retrieves the hidden state determined from the output gate (e.g., return_sequence=true), and assigns the hidden state as a set of output features used for identifying the section identifier for the text section. For example, a fully connected neural network processes a given output feature to identify a corresponding section identifier. The identified section identifier is an example of a feature generated in step 208 from process 200 in
The output gate value o_t and the updated hidden state h_t may be expressed as:

o_t = σ(W_o[h_{t-1}, x_t] + b_o)

h_t = o_t ⊗ tanh(C_t)
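A hedged NumPy sketch of a single LSTM step following the gate equations above is provided below; the dimensions and randomly initialized weights are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

feat_dim, hid_dim = 160, 64
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = (rng.standard_normal((hid_dim, hid_dim + feat_dim)) * 0.01 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hid_dim)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                      # forget gate
    i_t = sigmoid(W_i @ z + b_i)                      # input gate
    C_tilde = np.tanh(W_c @ z + b_c)                  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                # cell state update (elementwise/Hadamard products)
    o_t = sigmoid(W_o @ z + b_o)                      # output gate
    h_t = o_t * np.tanh(C_t)                          # hidden state used as output features
    return h_t, C_t

h, C = np.zeros(hid_dim), np.zeros(hid_dim)
for x_t in rng.standard_normal((3, feat_dim)):        # three enhanced feature representations in sequence
    h, C = lstm_step(x_t, h, C)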
The LSTM network as depicted in
3. Backpropagation Between Feature-Prediction and Sequence-Prediction Layers
The feature-prediction layer and the sequence-prediction layer can be trained together to optimize their respective parameters. During training, the output features from a given layer are used as input to train the other layer of the machine-learning model. For example, each iteration of the training process for a sequence-prediction layer includes feeding the loss backwards through the network (e.g., backpropagation) to fine-tune parameters of the feature-prediction layer. In other words, an error value generated by a loss function of the sequence-prediction layer is backpropagated to adjust the parameters in the feature-prediction layer. Thus, the features identified in the feature-prediction layer are used not only to predict a section identifier for a single text section, but also as features for predicting section identifiers for other text sections in the text data. Such a configuration is advantageous over conventional techniques because it increases the accuracy of predicting section identifiers in the text data.
As another illustrative example, word tokens such as "following" may not be much of a factor for the feature-prediction layer in identifying a "body" identifier for a particular text section. However, such word tokens can be a strong indicator in identifying a "list" identifier for a subsequent text section. Features (e.g., "following") learned in a particular layer (e.g., a sequence-prediction layer) can be propagated to optimize parameters of the other layer (e.g., a feature-prediction layer). Thus, the trained feature-prediction layer is likely to predict a "list" identifier for the text section that follows a text section having the word token "following."
It will be appreciated by one skilled in the art that various backpropagation algorithms can be used for training the feature-prediction layer and the sequence-prediction layer. Example algorithms include gradient techniques, such as gradient descent or stochastic average gradient, as well as other techniques of higher order such as conjugate gradient, Newton, quasi-Newton, or Levenberg-Marquardt.
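As a rough illustration of such joint training, the following tf.keras sketch composes a feature-prediction layer (the build_feature_prediction_layer sketch above) with a bidirectional LSTM so that the classification loss backpropagates through both layers; the TimeDistributed wrapping, window size, and label count are assumptions made for the example.

import tensorflow as tf
from tensorflow.keras import layers

window, section_len, emb_dim, num_labels = 5, 120, 300, 4

feature_layer = build_feature_prediction_layer(section_len, emb_dim)        # sketch defined earlier

inputs = layers.Input(shape=(window, section_len, emb_dim))                  # a window of embedding matrices
feats = layers.TimeDistributed(feature_layer)(inputs)                        # feature representation per text section
seq = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(feats)   # sequence-prediction layer
outputs = layers.TimeDistributed(layers.Dense(num_labels, activation="softmax"))(seq)

joint_model = tf.keras.Model(inputs, outputs)
joint_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Calling joint_model.fit(X, y) propagates the loss back through both the LSTM and the CNN parameters.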
An example of a section-identifier identification process includes a document-processing application performing the following operations: (i) receiving raw, unformatted text data as input; (ii) pre-processing, tokenizing, and converting the text data into a plurality of text sections having a fixed size (e.g., 120 tokens, 128 tokens); (iii) determining statistical features from each text section; (iv) generating input embeddings for all tokens in each text section; (v) combining the input embeddings to generate an embedding matrix for each text section; (vi) applying the feature-prediction layer 505 (e.g., a CNN) to identify a feature representation for each text section; (vii) enhancing the feature representation with the statistical features determined from the text section; and (viii) applying the sequence-prediction layer 515 (e.g., an LSTM network) to generate section identifiers for the text sections.
The techniques implemented in the machine-learning model for identifying section identifiers in unstructured and unformatted data are advantageous over several conventional techniques. For example, conventional techniques use a trained, supervised classifier to identify heuristic-based features, including the number of words, text casing (e.g., lowercase, uppercase), part-of-speech (POS) tags, and features derived from the font that are applied to the text. However, the conventional techniques in this example rely heavily on formatting information. Embodiments of the present disclosure are capable of identifying section identifiers even when the input text data is unformatted.
In another example, conventional techniques process unformatted text to identify whether a block of text corresponds to one of twenty types of section headings in legal documents. Text features including string length, percentage of capitalized characters, and presence of specific keywords are used to determine a type of section associated with a given text block. However, this conventional technique is limited to a specific type of document with previously known sections and thus cannot be applied across various types of unstructured documents.
In yet another example, the conventional techniques include using a text segmentation technique in which lexical similarity between neighboring paragraphs is measured. The lexical similarity is used to classify text into segments, such that each segment represents a specific topic. However, the segments cannot identify headings or sub-headings (for example) that indicate a degree of relatedness between the neighboring paragraphs.
In yet another example, conventional techniques include using a word-level Recurrent Neural Network (RNN) sentence modeling layer followed by a sentence-level bidirectional Long Short-Term Memory (LSTM) topic modeling layer to segment a text stream into a number of topics. The sentence-level RNN is used to detect a sudden transition in topics, which identifies when a particular topic has ended. The use of RNN and LSTM layers, however, is limited to sentence-level analysis, and, as a result, the conventional techniques fall short of identifying section identifiers for one or more text sections.
As such, the techniques implemented in the machine-learning model for identifying section identifiers in unstructured and unformatted data are advantageous over several conventional techniques, which rely on formatting information or known information over specific domains.
1. Generate an Embedding Matrix from Input Text Data
To generate an embedding matrix, the document-processing application preprocesses input text data to clean it and remove unwanted information. As noted above, the input text data can be unstructured and unformatted text data. The document-processing application may remove several punctuation and unknown characters from the input text data. However, some punctuation characters including ".", ",", "-", ":", ";", "?", and "!" are retained, as the presence or absence of some of these punctuation characters in a text section is indicative of a particular type of section identifier for the text section. For example, lists are often preceded by a text section ending with ":", headings generally do not contain punctuation, and a body section generally ends with ".". The document-processing application divides the tokenized text data into a plurality of text sections, in which each text section is defined by a sequence of tokens ending with a newline character. Additionally or alternatively, a text section can be identified by specifying a fixed number of word tokens (e.g., 120 tokens) to be associated with the text section. As shown in
In some instances, the document-processing application performs a sliding window operation to obtain a plurality of text sections, with a step size of one text section. The input text data may thus be represented as a plurality of text sections (equal to the window size), in which each text section includes a fixed number of tokens that denote particular words.
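A minimal Python sketch of this splitting and windowing is shown below; the whitespace-based tokenization, the retained punctuation set, and the window size of five are assumptions used only for illustration.

import re

KEEP = ".,-:;?!"                                       # punctuation retained, per the description above

def to_sections(raw_text):
    sections = []
    for line in raw_text.split("\n"):                  # a text section ends with a newline character
        cleaned = "".join(ch for ch in line if ch.isalnum() or ch.isspace() or ch in KEEP)
        tokens = re.findall(r"\w+|[" + re.escape(KEEP) + r"]", cleaned)
        if tokens:
            sections.append(tokens)
    return sections

def windows(sections, size=5):
    # Sliding window with a step size of one text section.
    return [sections[i:i + size] for i in range(len(sections) - size + 1)]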
In some instances, the document-processing application determines statistical features from each text section. The statistical features are typically determined before embedding matrices are generated for the text sections, as the statistical features depend on (for example) a syntax, part-of-speech, letter casing, and word-frequency of word tokens in the text section. For example, the statistical features include a ratio between a count of uppercase characters and a count of words in the text section. In some instances, the document-processing application uses a sliding window operation to represent the statistical features for the text section. The following diagram provides an example of statistical features that are determined from a text section “Sarah is giving a demo for feature extraction.”:
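For illustration only, the following sketch computes paragraph-level statistics of the kind described above; the particular feature names and choices are assumptions rather than the exact feature set of the referenced diagram.

def section_statistics(text):
    # Illustrative (assumed) paragraph-level statistical features of a text section.
    words = text.split()
    stripped = text.rstrip()
    n_upper = sum(ch.isupper() for ch in text)
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "upper_to_word_ratio": n_upper / max(len(words), 1),   # uppercase count vs. word count
        "last_char_ascii": ord(stripped[-1]) if stripped else 0,
        "ends_with_colon": stripped.endswith(":"),              # often precedes a list
        "ends_with_period": stripped.endswith("."),             # typical of a body section
    }

print(section_statistics("Sarah is giving a demo for feature extraction."))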
The document-processing application converts tokens of each text section into input embeddings, in which a definition of each unique word token is encoded into an input embedding. In some instances, the input embedding is a vector having one or more numerical values that represent a corresponding token. Each text section is thus represented by a sequence of vectors, in which each vector includes a number of numerical values representing a particular token. In some instances, because the length of each individual text section may be different, the document-processing application pads or truncates the text sections, such that each text section of the plurality of text sections corresponds to the same fixed length (e.g., 120 tokens).
The document-processing application then combines the input embeddings of the text section to generate an embedding matrix for the text section. In some instances, the document-processing application uses a pre-trained neural network (e.g., FastText), which can be further trained using unsupervised learning or supervised learning to obtain the input embeddings for the tokens of the text section.
2. Determine a Feature Representation of a Text Section by Using a Feature-Prediction Layer
The feature-prediction layer 505 then accesses the embedding matrix for each text section to generate a feature representation corresponding to the text section. In some instances, the feature-prediction layer includes a CNN that applies one or more convolutional layers to extract a feature representation of the text section. The feature representation identifies one or more semantic characteristics of each text section.
As shown in
3. Concatenate Statistical Values Associated with the Text Section
The section-statistics generator 510 then encodes the statistical features of each text section into a set of feature values and concatenates the set of feature values into the feature representation of the text section. As a result, the section-statistics generator 510 facilitates enhancement of the feature representation for the text section. In some instances, the section-statistics generator 510 applies a fully connected neural network to the statistical features to derive the set of feature values for each text section. For example, the section-statistics generator 510 applies the fully connected neural network to 13 statistical features corresponding to a text section "Sarah is giving a demo for feature extraction." (e.g., number of words, ASCII value of the last character), thereby generating a set of 32 feature values that represent the statistical features of the text section.
4. Identify a Section Identifier of a Text Section by Using a Sequence-Prediction Layer
The sequence-prediction layer 515 receives the enhanced feature representation for each of the text sections. As described above, the enhanced feature representation of a given text section includes a set of values that identify one or more semantic characteristics of the text section and the set of feature values derived from the statistical features (e.g., length-frequency, syntax characteristics) of the text section. The sequence-prediction layer 515 processes the enhanced feature representation to generate a set of output features used by a fully-connected layer (for example) for identifying the final output 520, i.e., a section identifier for the text section.
For example, as shown in
Continuing with the example, the sequence-prediction layer 515 processes the enhanced feature representations through two bidirectional LSTM networks, each of which has two layers with 256 hidden states (also treated as outputs, since the return-sequence option was set to "true"). As described above, each of the two bidirectional LSTM networks applies two LSTM network layers to the enhanced feature representation for each text section: (i) a first LSTM network layer trained to process input features of the text sections according to a forward sequence of text sections in the text data (e.g., first text section to last text section); and (ii) a second LSTM network layer trained to process input features of the text sections according to a reverse sequence of text sections in the text data (e.g., last text section to first text section).
After the two bidirectional LSTM layers are applied to the enhanced feature representations, the sequence-prediction layer applies a dropout layer with a rate of 0.5 to the output features to reduce overfitting of the machine-learning model. Each output feature of the sequence-prediction layer includes a set of 512 values. The document-processing application applies a fully connected neural network with a softmax activation function to the output features of the text sections and generates a section identifier for each text section. In some instances, the final output 520 includes, for each text section, a probability value for each possible section identifier, in which the section identifier with the highest probability value is identified as the section identifier for the text section. The final output 520 is an example of a feature generated in step 208 from process 200 described above.
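By way of example only, a tf.keras sketch of such a sequence-prediction layer is shown below: two bidirectional LSTM layers with 256 hidden states each, a dropout rate of 0.5, and a softmax output over the possible section identifiers. The enhanced-feature size of 160, the window size of 5, and the four candidate labels are assumptions made for the sketch.

import tensorflow as tf
from tensorflow.keras import layers

window, enhanced_dim, num_labels = 5, 160, 4

inputs = layers.Input(shape=(window, enhanced_dim))                          # enhanced feature representations
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)    # first bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)         # second bidirectional LSTM (512 output values)
x = layers.Dropout(0.5)(x)                                                   # dropout to reduce overfitting
outputs = layers.TimeDistributed(layers.Dense(num_labels, activation="softmax"))(x)   # probability per section identifier
sequence_prediction = tf.keras.Model(inputs, outputs, name="sequence_prediction_layer")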
In certain experiments, an example of the machine-learning model 112, implemented by a document-processing application according to embodiments described herein, was tested to evaluate its performance levels. The machine-learning model was trained on a training dataset extracted from 7000 PDF documents. A total number of data points after input preparation was 1,829,945. A window size of 5 was selected. The entire dataset was shuffled and split into a 70/30 train-test split. The machine-learning model was trained in batches of 200.
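A minimal sketch of such a shuffled 70/30 train-test split and batched training, with stand-in data, might look like the following; the array shapes and variable names are assumptions for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5, 160).astype("float32")    # stand-in windowed feature representations
y = np.random.randint(0, 4, size=(1000, 5))           # stand-in section-identifier labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=0)
# sequence_prediction.fit(X_train, y_train, batch_size=200, validation_data=(X_test, y_test))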
The machine-learning model was able to capture the variability of the training dataset, as shown by the results. The results and the confusion matrix between the predicted and actual labels are as follows:
The confusion matrix indicates that a large number of text sections are correctly assigned with their respective section identifiers. The false-positive rate for misclassifying a "body" text section with a "list" identifier was relatively higher than for other types of text section. Such results are also reflected in the F1 scores corresponding to identification of section identifiers for the respective text sections. The "body" text sections were classified most accurately, with an F1 score of 0.90, while the "list" text sections were classified least accurately, with an F1 score of 0.85. Nonetheless, the overall range of F1 scores is between 0.85 and 0.90, indicating that the machine-learning model can accurately identify the section identifiers for the text sections.
Any suitable computing system or group of computing systems can be used for performing the operations described herein. For example,
The example of
The memory device 604 includes any suitable non-transitory computer-readable medium for storing data, program code, or both. A computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions could include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.
The computing system 600 could also include a number of external or internal devices, such as a display device 610, or other input or output devices. For example, the computing system 600 is shown with one or more input/output (“I/O”) interfaces 608. An I/O interface 608 can receive input from input devices or provide output to output devices. One or more buses 606 are also included in the computing system 600. Each bus 606 communicatively couples one or more components of the computing system 600 to each other or to an external component.
The computing system 600 executes program code that configures the processing device 602 to perform one or more of the operations described herein. The program code includes, for example, code implementing the document-processing application 102 or other suitable applications that perform one or more operations described herein. The program code can be resident in the memory device 604 or any suitable computer-readable medium and can be executed by the processing device 602 or any other suitable processor. In some embodiments, all modules in the document-processing application 102 are stored in the memory device 604, as depicted in
In some embodiments, the computing system 600 also includes a network interface device 612. The network interface device 612 includes any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks. Non-limiting examples of the network interface device 612 include an Ethernet network adapter, a modem, and/or the like. The computing system 600 is able to communicate with one or more other computing devices (e.g., a computing device that receives inputs for document-processing application 102 or displays outputs of the document-processing application 102) via a data network using the network interface device 612.
An input device 614 can include any device or group of devices suitable for receiving visual, auditory, or other suitable input that controls or affects the operations of the processing device 602. Non-limiting examples of the input device 614 include a touchscreen, stylus, a mouse, a keyboard, a microphone, a separate mobile computing device, etc. An output device 616 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the output device 616 include a touchscreen, a monitor, a separate mobile computing device, etc.
Although
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter could be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages could be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein can be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values could, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, could readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude the inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.