The present disclosure relates to computing, and in particular, to character sequence recognition using neural networks.
Advances in computing technology have led to the increased adoption of machine learning (also referred to as artificial intelligence) across a wide range of applications. One challenge with machine learning is that data typically requires complex preprocessing steps to prepare it for analysis by a machine learning algorithm. However, for some types of data inputs, it may be desirable and more efficient to have a machine learning algorithm that can process batches of data inputs with minimal, or entirely without, computationally intensive preprocessing while still yielding accurate results. One example data set that could benefit from such a system is data corresponding to receipts.
Embodiments of the present disclosure pertain to character recognition using neural networks. In one embodiment, the present disclosure includes a computer-implemented method comprising processing a plurality of characters using a first recurrent machine learning algorithm, such as a neural network, for example. The first recurrent machine learning algorithm sequentially produces a first plurality of internal arrays of values, which are stored to form a stored plurality of arrays of values. The stored plurality of arrays of values are multiplied by a plurality of attention weights to produce a plurality of selection values. An attention array of values is generated from the stored arrays based on the selection values. The attention array of values is processed using a second recurrent machine learning algorithm, which produces values corresponding to characters of the plurality of characters, forming a recognized character sequence.
The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present disclosure.
In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. Such examples and details are not to be construed as unduly limiting the elements of the claims or the claimed subject matter as a whole. It will be evident to one skilled in the art, based on the language of the different claims, that the claimed subject matter may include some or all of the features in these examples, alone or in combination, and may further include modifications and equivalents of the features and techniques described herein.
Referring again to FIG. 1, a recurrent neural network is generally a type of neural network that employs one or more feedback paths (e.g., directed cycles). RNN 110 may have a single layer of weights that are multiplied by an input array and combined with the result of an internal state array multiplied by feedback weights, for example. An internal state may be updated by combining the weighted sums with a bias as described in more detail below, for example. Accordingly, RNN 110 may sequentially produce a plurality of internal arrays of values (e.g., one for each character received on the input). Features and advantages of the present disclosure include storing the plurality of internal arrays of values generated by RNN 110 during processing of the characters to form a stored plurality of arrays of values 112 in memory 111. For example, when a first encoded character array corresponding to a first character from character set 102 is provided to the input of RNN 110, a first resulting update will occur to an internal state of RNN 110. The first internal array of values, updated in response to receiving an encoded character array on the input of RNN 110, may be stored in memory 111. This first stored array may be denoted as being received at time t0. On a subsequent cycle, the next encoded character array is provided at the input of RNN 110. Accordingly, the internal array in RNN 110 is updated with a new set of values, and the new internal array of values may be stored in memory 111 as t1, for example. Similarly, as each encoded character array representing the characters of the corpus is received, the state of the internal array is stored in memory 111 until all the characters have been processed at tN, at which point N stored arrays of values 112 are in memory 111, where N is the number of characters in the corpus, for example.
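As an illustration, a minimal sketch of this first stage in Python with NumPy might look as follows, using hypothetical names (run_first_stage, W_in, W_fb) and the linear state update described above; the bias is subtracted here to match the example implementation described later, and any nonlinearity (e.g., tanh) that a practical RNN might apply is omitted:

    import numpy as np

    def run_first_stage(encoded_chars, W_in, W_fb, bias):
        # encoded_chars: N arrays of length M, one per character in the corpus
        # W_in, W_fb: (M x M) input and feedback weight matrices; bias: length M
        state = np.zeros(len(bias))      # internal array of values
        stored = []                      # stored arrays of values 112 in memory 111
        for x in encoded_chars:
            # weighted input combined with weighted feedback and a bias
            state = W_in @ x + W_fb @ state - bias
            stored.append(state.copy())  # store the internal array at t0, t1, ..., tN-1
        return np.array(stored)          # shape (N, M)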
Embodiments of the disclosure include multiplying the stored plurality of arrays of values 112 by a plurality of attention weights 113 to produce a plurality of selection values. Selection values may be used for selecting particular stored arrays of values 112 in memory 111 as inputs to RNN 120. The attention weights 113 may be configured (e.g., during training) to produce selection values comprising a plurality of zero (0), or nearly zero, selection values and one or more non-zero selection values. As an illustrative example, selection values may ideally be as follows: [0,0,0, . . . , 1, . . . , 0,0], where the position of the one (1) in the array is used to select one of the stored arrays of values 112. Accordingly, the number of selection values may be equal to the number of stored arrays of values 112 in memory 111. For example, an array of selection values of [0, 1, 0, . . . , 0] would select stored array t1 (e.g., the second array of values received from RNN 110). Accordingly, one or more of the stored arrays of values 112 may be selected based on the selection values to produce an attention array of values. In some embodiments, selection values may range continuously from 0 to 1, for example, where stored arrays 112 are weighted by their corresponding selection values and combined to produce attention arrays input to RNN 120, for example.
In some embodiments, multiplying each stored array of values 112 by attention weights 113 may produce a single dominant selection value (e.g., nearly 1), and one of the stored arrays 112 is selected as an input for RNN 120. For example, after N stored arrays 112 are multiplied by attention weights 113, each of the resulting N values may be zero or nearly zero, and only one selection value may be nearly one (1). For instance, an example of N selection values may be [0.001, 0.023, . . . , 0.95], where the last value in the array is substantially greater than the other near-zero values in the array. In this case, the last stored array tN is selected and provided as an input to RNN 120, for example. As another example, N selection values may be [0.001, 0.98, . . . , 0.003], where the second value in the array is substantially greater than the other near-zero values in the array. In this case, the second stored array t1 is selected and provided as an input to RNN 120, for example.
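Continuing the hypothetical sketch above, this single-selection case might be expressed as follows, for example:

    def select_single(stored, w_att):
        # stored: (N, M) stored arrays of values 112; w_att: length-M attention weights 113
        selection = stored @ w_att           # N selection values, one per stored array
        return stored[np.argmax(selection)]  # e.g., [0.001, 0.98, ..., 0.003] selects t1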
In other embodiments, multiplying each stored array of values 112 by attention weights 113 may produce multiple selection values across a range of values. In some embodiments, the largest selection values may be adjacent to one another and correspond to adjacent stored arrays of values 112. For instance, an example of N selection values may be [0.001, 0.023, . . . , 0.25, 0.5, 0.25], where the last 3 values are adjacent to each other in the array and substantially greater than the other near-zero values in the array, for example. In one embodiment, each selection value above a threshold is multiplied by a corresponding array of values in the stored arrays of values 112 to produce a plurality of weighted arrays. The weighted arrays may be added to produce an attention array of values 115, which is then provided as an input to RNN 120, for example. For example, for N selection values, where the i−1, i, and i+1 selection values are [ . . . , 0.25, 0.5, 0.25, . . . ], and where the corresponding i−1, i, and i+1 stored arrays are [Ati−1], [Ati], and [Ati+1] (where Ati is the ith stored array 112 and i=0 . . . N), the attention array is determined by matrix multiplication and addition as follows:
[attention array] = [Ati−1]*0.25 + [Ati]*0.5 + [Ati+1]*0.25
In one embodiment, the selection values add to one (1), and selection comprises multiplying each stored array by a corresponding selection value and adding the weighted arrays as above to produce the attention array of values. In this case, since many selection values may be very small, the sum of stored arrays weighted by corresponding selection values may produce an attention array that is approximately equal to one stored array, or to a sum of multiple stored arrays weighted by their selection values, for example. More specifically, in one embodiment, all the selection values are multiplied by their corresponding stored array vectors and added together to create a weighted sum of all the stored vectors. In some embodiments, the selection values will mostly be very near 0, and one selection value may be near one (1) or a few may have non-zero values that add to almost 1. Some embodiments may apply a threshold at this point to use a subset of selection values, for example. However, other embodiments may use all selection values as follows. If, for example, arrays T0-T4 are created by the input RNN 110, and the selection values calculated by the attention model applied to T0-T4 are [0.01, 0.05, 0.5, 0.4, 0.04], where the selection values sum to 1, then the output would be:
Tout=0.01*T0+0.05*T1+0.5*T2+0.4*T3+0.04*T4,
where Tout is the attention array and the above weighted sum is performed element-wise. If each array, T, is 3 elements and there are 5 arrays, T0-T4 may be concatenated here into a matrix:
[1 2 3 4 5]
[5 4 3 2 1]
[3 2 1 2 3]
Then Tout may be calculated as follows, where each row shows the calculation of the attention array on the left and the resulting value of the Tout vector on the right, for example:
0.01*1+0.05*2+0.5*3+0.4*4+0.04*5=3.41
0.01*5+0.05*4+0.5*3+0.4*2+0.04*1=2.59
0.01*3+0.05*2+0.5*1+0.4*2+0.04*3=1.55.
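This arithmetic may be checked directly, for example with a few lines of NumPy, where the matrix columns are T0-T4 concatenated as above:

    import numpy as np

    T = np.array([[1, 2, 3, 4, 5],    # first elements of T0..T4
                  [5, 4, 3, 2, 1],    # second elements
                  [3, 2, 1, 2, 3]])   # third elements
    s = np.array([0.01, 0.05, 0.5, 0.4, 0.04])  # selection values (sum to 1)
    T_out = T @ s                     # weighted sum of T0..T4, performed element-wise
    print(T_out)                      # [3.41 2.59 1.55]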
Attention array of values 115 may be processed using RNN 120 to produce values corresponding to characters from the character set 102, forming a recognized character sequence 150. In one embodiment, RNN 120 may include output layer weights 121. Output layer weights 121 may comprise a matrix of values (N×M) that operates on a second plurality of internal arrays of values in RNN 120, for example. Attention array 115 may be processed by RNN 120 to successively produce the internal arrays of values, which are then provided as inputs to the output layer weights, for example. In one embodiment, the attention array of values 115 is maintained as an input to RNN 120 for a plurality of cycles. The number of cycles may be arbitrary. The RNN may continue until the output is a STOP character. In one example implementation, a maximum possible output length may be selected (e.g., 5 characters for a date {DDMM} and 13 for an amount), and the RNN may always run for that many cycles, keeping only the output produced before the STOP character.
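As an illustration, a minimal sketch of this second stage, again with hypothetical names, might look as follows; the STOP character index and maximum length are parameters assumed for illustration:

    import numpy as np

    def run_second_stage(attention, W_in, W_fb, bias, W_out, max_len, stop_idx):
        # attention: the attention array of values, held as the input on each cycle
        # W_out: output layer weights mapping the internal array to likelihoods
        state = np.zeros(len(bias))
        outputs = []
        for _ in range(max_len):             # e.g., 5 for a date, 13 for an amount
            state = W_in @ attention + W_fb @ state - bias
            likelihoods = W_out @ state      # one output array of likelihoods per cycle
            if np.argmax(likelihoods) == stop_idx:
                break                        # keep only output before the STOP character
            outputs.append(likelihoods)
        return outputs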
RNN 120 produces a plurality of output arrays 130. The output arrays may comprise likelihood values, for example. In one embodiment, a position of each likelihood value in each of the output arrays may correspond to a different character found in the character set, for example. A selection component 140 may receive the output arrays of likelihoods and, for each output array, successively produce the character having the highest likelihood value, for example. The resulting characters form a recognized character sequence 150, for example.
For example, the character sequence “dad” may be encoded as: d=[0,0,0,1,0, . . . , 0]; a=[1,0, . . . , 0]; d=[0,0,0,1,0, . . . , 0].
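As an illustration, such an encoding might be sketched as follows, assuming a hypothetical alphabetical index into a length-128 array:

    import numpy as np

    def encode(ch, size=128):
        # all zeros with a single one at the character's index
        # (e.g., 'a' -> index 0, 'd' -> index 3, as above)
        v = np.zeros(size)
        v[ord(ch) - ord('a')] = 1
        return v

    encoded_chars = [encode(c) for c in "dad"]  # d, a, d as above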
In the example in FIG. 4, RNN 405 receives an input array of values (“Array_in”) 410 corresponding to successive characters. Input arrays 410 are multiplied by a plurality of input weights (“Wt_in”) 411 to produce a weighted input array of values at 415, for example. RNN 405 includes an internal array of values (“Aout”) 413, which is multiplied by a plurality of feedback weights (“Wt_fb”) 414 to produce a weighted internal array of values at 416. The weighted input array of values at 415 is added to the weighted internal array of values at 416 to produce an intermediate result array of values at 417. A bias array of values 412 may be subtracted from the intermediate result array of values at 417 to produce an updated internal array of values 413, for example. The internal array of values 413 is also stored in memory 450 to generate stored arrays of values 451.
Similarly, RNN 406 receives an input array of values (“Array_in”) 440 corresponding to successive characters received in reverse order relative to RNN 405. Input arrays 440 are multiplied by a plurality of input weights (“Wt_in”) 441 to produce a weighted input array of values at 445, for example. RNN 406 includes an internal array of values (“Aout”) 443, which is multiplied by a plurality of feedback weights (“Wt_fb”) 444 to produce a weighted internal array of values at 446. The weighted input array of values at 445 is added to the weighted internal array of values at 446 to produce an intermediate result array of values at 447. A bias array of values 442 may be subtracted from the intermediate result array of values at 447 to produce an updated internal array of values 443, for example. The internal array of values 443 is also stored in memory 450 with internal array of values 413 to generate stored arrays of values 451.
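Reusing the hypothetical run_first_stage sketch above, the two first stage RNNs and the concatenation of their internal arrays into 2*M-length stored arrays might be expressed as follows; the pairing of the forward and reverse arrays at each step is one plausible convention, not dictated by the figure:

    def run_bidirectional(encoded_chars, fwd_params, bwd_params):
        # fwd_params/bwd_params: hypothetical (W_in, W_fb, bias) tuples for RNN 405/406
        states_fwd = run_first_stage(encoded_chars, *fwd_params)        # (N, M)
        states_bwd = run_first_stage(encoded_chars[::-1], *bwd_params)  # reverse order
        # concatenate the two M-length internal arrays stored at each step
        return np.concatenate([states_fwd, states_bwd], axis=1)         # (N, 2*M) stored arrays 451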
Stored arrays of values 451 are multiplied by attention weights 452 to generate selection values. If each character in the corpus of characters 403 is represented as M values in each input array 410 and 440, then there are also M internal values in each internal array generated by RNN 405 and M internal values in each internal array generated by RNN 406. Accordingly, stored arrays 451 in memory 450 are of length 2*M, for example. For N characters in the corpus, each of RNNs 405 and 406 may generate N stored arrays of length M, for example. To select particular arrays from the N stored internal arrays of 2*M values, N selection values may be generated, for example, by determining the dot product of each stored array 451 with 2*M attention weights (“Wt_att”) 452, for example. More particularly, the dimensions of the attention weights 452 may be 2M×1, and each of the 2*M-length stored arrays is multiplied by the 2M×1 weight set to generate a single value for each of the 2*M-length arrays. The N selection values may be stored in another selection array, for example. After generating a single selection value for each of the N stored arrays 451, the array of N selection values may be used to select one or more of the N stored arrays 451.
In an ideal case, the N selection values may be all zeros except a single one (e.g., [0 . . . 1 . . . 0]) that selects the one stored array producing the non-zero selection value, for example. In one example implementation, all but one of the selection values may be near zero, and a single selection value is closer to one. The selection value closer to one corresponds to the desired stored array of values 451 that is sent to the second stage RNN 407. In other instances, multiple selection values may have high values and the remaining selection values nearly zero values. In this case, the selection values with higher values correspond to the desired stored arrays of values 451, each of which is multiplied by its corresponding selection value. The selected stored arrays from 451, having been weighted by their selection values, are added to form the attention array sent to the input of second stage RNN 407. In one embodiment, all of characters 403 are processed by one or more first stage RNNs and stored in memory before performing the selection step described above and before processing the attention array of values using a second stage RNN, for example.
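Continuing the hypothetical sketch, this selection step, including the optional threshold mentioned above, might be expressed as follows; the threshold value is an assumption for illustration:

    def attention_select(stored, w_att, threshold=0.1):
        # stored: (N, 2*M) stored arrays 451; w_att: length 2*M attention weights 452
        selection = stored @ w_att               # N selection values
        keep = selection > threshold             # optional thresholding of selection values
        # weight each selected stored array by its selection value and add them
        return (selection[keep][:, None] * stored[keep]).sum(axis=0)  # attention array, length 2*M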
In one embodiment pertaining to recognizing dates or amounts in a corpus of characters from receipts, the first RNN layer learns to simultaneously encode the date or amount in the stored output array as well as a signal to the attention layer indicating a confidence that the correct amount or date has been encoded. For example, the amount may be encoded in one part of the stored array and the signal to the attention layer may be encoded in an entirely separate part of the array, for example.
RNN 407 receives an attention array as an input array (“Array_in”) 420. Similar to RNNs 405 and 406, input arrays 420 are multiplied by a plurality of input weights (“Wt_in”) 421 to produce a weighted input array of values at 425, for example. RNN 407 includes an internal array of values (“Aout”) 423, which is multiplied by a plurality of feedback weights (“Wt_fb”) 424 to produce a weighted internal array of values at 426. The weighted input array of values at 425 is added to the weighted internal array of values at 426 to produce an intermediate result array of values at 427. A bias array of values 422 may be subtracted from the intermediate result array of values at 427 to produce an updated internal array of values 423, for example. The internal array of values 423 is then combined with output layer weights 428 to produce result output arrays 429. To produce multiple result arrays, the attention array forming the input array 420 to RNN 407 is maintained as an input to RNN 407 for a plurality of cycles. During each cycle, the weighted attention array at 425 may be combined with new weighted internal arrays at 426 and bias 422 to generate multiple different internal arrays 423. On successive cycles, new internal array values 423 may be operated on by output layer weights 428 to produce new result array values, for example. As mentioned above, the output RNN may run until it generates the STOP character or, for efficiency of the calculation, for an arbitrary number of cycles based on the expected maximum length of the output, for example.
As mentioned above, in this example implementation, there may be 2*M values in the attention array generated by the selection process and provided as an input to RNN 407. Accordingly, there may be 2*M internal values in internal array 423. In one embodiment, output layer weights 428 may be an M×2M matrix of weight values to convert the 2*M internal values in array 423 into M result values, where each character in the corpus of characters 403 is represented as M values. Thus, each of the M values in the result array corresponds to one character. In one embodiment, RNN 407 successively produces a plurality of result output arrays of likelihood values. For example, a position of each likelihood value in each of the result output arrays corresponds to a different character of the plurality of characters. Accordingly, the system may successively produce the character having the highest likelihood value in each of the output arrays. In this example, for each result output array of likelihood values, the character corresponding to the highest likelihood value in each array may be selected at 460. Accordingly, encoded character arrays generated from sequential result output arrays 429, as described above for the inputs of RNNs 405 and 406, may be decoded at 461 to produce a recognized character sequence 462, for example.
In one example embodiment, there may be N characters in a corpus. Each character may be represented by an encoded character array of length 128, where each character type in the character set has a corresponding array of all zeros and a single one, for example. Accordingly, the input arrays of each of RNNs 405 and 406 are multiplied by 128 input weights. Similarly, the internal values 413 and 443 are arrays of length 128, which are each multiplied by 128 feedback weights. The combined internal arrays of 128 values from RNNs 405 and 406 produce N stored arrays of 256 values each. These N stored arrays (one for each character) are multiplied by 256 attention weights to produce N selection values, for example. The selection process produces an attention array of length 256, which is provided as an input array to RNN 407. RNN 407 may have an internal array length of 256 values. Thus, the output layer weights are a 128×256 matrix, producing 128-length result output arrays of likelihoods, for example, where each position in the output array corresponds to a particular character.
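These dimensions may be summarized concretely as follows, where N = 40 is an arbitrary corpus size assumed for illustration:

    import numpy as np

    M, N = 128, 40                    # encoding length; characters in the corpus
    stored = np.zeros((N, 2 * M))     # N stored arrays of 256 values each
    w_att = np.zeros(2 * M)           # 256 attention weights
    selection = stored @ w_att        # N selection values
    attention = np.zeros(2 * M)       # attention array of length 256
    W_out = np.zeros((M, 2 * M))      # 128 x 256 output layer weights
    result = W_out @ attention        # 128-length result output array of likelihoods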
Computer system 610 may be coupled via bus 605 to a display 612 for displaying information to a computer user. An input device 611 such as a keyboard, touchscreen, and/or mouse is coupled to bus 605 for communicating information and command selections from the user to processor 601. The combination of these components allows the user to communicate with the system. In some systems, bus 605 represents multiple specialized buses for coupling various components of the computer together, for example.
Computer system 610 also includes a network interface 604 coupled with bus 605. Network interface 604 may provide two-way data communication between computer system 610 and a local network 620. Network 620 may represent one or multiple networking technologies, such as Ethernet, local wireless networks (e.g., WiFi), or cellular networks, for example. The network interface 604 may be a wireless or wired connection, for example. Computer system 610 can send and receive information through the network interface 604 across a wired or wireless local area network, an Intranet, or a cellular network to the Internet 630, for example. In some embodiments, a browser, for example, may access data and features on backend software systems that may reside on multiple different hardware servers on-prem 631 or across the Internet 630 on servers 632-635. One or more of servers 632-635 may also reside in a cloud computing environment, for example.
The above description illustrates various embodiments of the present disclosure along with examples of how aspects of the particular embodiments may be implemented. The above examples should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the particular embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents may be employed without departing from the scope of the present disclosure as defined by the claims.