The present invention and its advantages are now described in conjunction with the accompanying drawings.
A finite state machine (FSM) is a model of behaviour composed of states, transitions and actions. A state stores information about the past, i.e. it reflects the input changes from the start to the present moment. A transition indicates a state change and is described by a condition that must be fulfilled to enable the transition. An action is a description of an activity that is to be performed at a given moment. A specific input action is executed when certain input conditions are fulfilled at a given present state. For example, an FSM can provide a specific output (e.g. a string of binary characters) as an input action. An FSM can be represented using a set of (state) transition rules that describe a state transition function.
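The notions of state, transition and input action can be illustrated with a minimal sketch. The state names, input symbols and output bits below are hypothetical and chosen only for illustration; they do not come from the embodiments.

```python
from typing import Dict, Tuple

# Hypothetical transition rules: (current state, input symbol) -> (next state, output bit).
# Each entry is one transition rule; together they define the state transition function.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("idle", "a"): ("seen_a", "0"),
    ("idle", "b"): ("idle", "0"),
    ("seen_a", "a"): ("seen_a", "0"),
    ("seen_a", "b"): ("idle", "1"),  # input action: emit "1" when "ab" has been seen
}


def run_fsm(inputs: str, start: str = "idle") -> Tuple[str, str]:
    """Feed the input symbols to the FSM and return (final state, output string)."""
    state = start
    outputs = []
    for symbol in inputs:
        # The condition of a transition is the pair (present state, input symbol).
        state, out = TRANSITIONS[(state, symbol)]
        outputs.append(out)
    return state, "".join(outputs)
```

In this sketch the output is a string of binary characters produced as an input action, one character per consumed input symbol.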
The preferred embodiments of the invention involve a new approach for a so-called transition rule cache that is part of a co-processor or an accelerator engine based on a programmable B-FSM (BaRT-FSM) controller. BaRT (Balanced Routing-Table Search) is a specific hash table lookup algorithm described in a paper by one of the inventors: Jan van Lunteren, “Searching Very Large Routing Tables in Wide Embedded Memory”, Proc. of GLOBECOM '01, pp. 1615-1619. An example of such an accelerator is the ZuXA accelerator concept that is described in a paper co-authored by one of the inventors: Jan van Lunteren et al, “XML Accelerator Engine”, Proc. of First International Workshop on High Performance XML Processing, 2004.
A ZuXA controller is an accelerator that can be used to improve the processing of XML (eXtensible Markup Language) code. It is fully programmable and provides high performance in combination with low storage requirements and fast incremental updates. In particular, it offers a processing model optimized for conditional execution in combination with dedicated instructions for character and string-processing functions. The B-FSM technology describes a state transition function using a small number of state transition rules, which involve match and wildcard operators for the current state and input symbol values, and a next-state value. The transition rules are assigned priorities to resolve situations in which multiple transition rules match simultaneously.
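The selection of the highest-priority matching transition rule can be sketched as follows. This is an illustrative linear scan, not the BaRT-based rule selector; the `Rule` class, its field names and the example rules are assumptions made for this sketch, with `None` standing in for a wildcard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical transition rule; None in a test field acts as a wildcard."""
    priority: int
    state: Optional[int]
    symbol: Optional[str]
    next_state: int

    def matches(self, state: int, symbol: str) -> bool:
        state_ok = self.state is None or self.state == state
        symbol_ok = self.symbol is None or self.symbol == symbol
        return state_ok and symbol_ok


def select_rule(rules: List[Rule], state: int, symbol: str) -> Optional[Rule]:
    """Return the highest-priority rule matching (state, symbol), or None."""
    candidates = [r for r in rules if r.matches(state, symbol)]
    return max(candidates, key=lambda r: r.priority) if candidates else None


# Example rule set (illustrative values only).
RULES = [
    Rule(priority=2, state=0, symbol="<", next_state=1),
    Rule(priority=2, state=1, symbol=">", next_state=0),
    Rule(priority=1, state=1, symbol=None, next_state=1),   # any symbol in state 1
    Rule(priority=0, state=None, symbol=None, next_state=0),  # default rule
]
```

With these rules, the input `">"` in state 1 matches three rules at once, and the priority resolves the conflict in favour of the most specific one.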
In a ZuXA controller, the input to the rule selector 11 consists of a result vector provided by a component called the instruction handler, in combination with a general-purpose input value obtained, for example, from an input port. In each cycle, the rule selector 11 selects the highest-priority transition rule that matches the current state stored in the state register 12 and the input vector. The result part 21 of the transition rule vector selected from the transition rule memory 10 is then used to update the state register 12 and to generate an output value. The output value includes instructions that are dispatched for execution by the instruction handler component. The execution results are provided back to the rule selector 11 and used to select subsequent instructions to be executed by the instruction handler as described above.
The function of the rule selector 11 is based on the BaRT algorithm, which is a scheme for exact-, prefix- and ternary-match searches. BaRT is based on a chaining hash method with a hash function that has the property that the maximum number of collisions for any hash index can be limited by a configurable upper bound. This upper bound is selected to be N=4 in
The rule cache register 42 serves as the memory of the rule cache 40. Therefore the rule cache 40 comprises a single cache line only. A cached address register 43 stores the tag for the cache line. A comparator 44 compares the tag from the cached address register 43 with the address generated by the address generator 14. A valid address register 45 stores bit flags which indicate whether the cached address register contains a valid address and whether the rule cache register 42 contains a valid entry from the transition rule table 13.
The steps of the hash table lookup operation are implemented as follows: The steps S1 and S2 of the hash table lookup operation are performed by the address generator 14. These two steps perform a calculation of the hash index and the memory address, wherein the transition rule memory 10 serves the role of the main memory and the search key is built from a set of registers and an additional input vector. The steps S3a and S3b are implemented by the comparator 44 and controlled by the RCC 41. In these two steps the main memory address is compared with the cache tag. In step S4a the hash table entry is compared with the search key. This step is implemented by the rule selector 11. Each hash table entry can contain four possible matches, which are tested in parallel against the search key. In step S4b a hash table entry is selected in case of a match. This step is implemented by a MUX 46 component, which selects the first hash table entry that matches as the search result. The content loaded to the state register 12, the mask register 15, and the table address register 16 is updated by the RCC 41 based on the search result via the MUX 46. In particular, the search result output vector can be used to generate an instruction vector for the instruction handler.
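The sequence of steps S1 to S4b with a single-line rule cache can be sketched in software as follows. This is a minimal model under stated assumptions: the class name, the entry layout `(key, result)`, the masking hash and the `memory_reads` counter are hypothetical, the BaRT hash function itself is not reproduced, and the four candidates are tested sequentially here, whereas the rule selector tests them in parallel.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[int, str]   # hypothetical layout: (search key, result)
Line = List[Entry]        # one hash table entry holds up to four candidates


class SingleLineRuleCache:
    """Software model of a rule cache with one cache line (illustrative only)."""

    def __init__(self, memory: Dict[int, Line], table_base: int, index_mask: int):
        self.memory = memory                       # stands in for the transition rule memory
        self.table_base = table_base
        self.index_mask = index_mask
        self.cached_address: Optional[int] = None  # tag (cached address register)
        self.cached_line: Line = []                # cache line (rule cache register)
        self.valid = False                         # valid flag
        self.memory_reads = 0                      # counts main-memory accesses

    def lookup(self, key: int) -> Optional[str]:
        index = key & self.index_mask              # S1: hash index (placeholder hash)
        address = self.table_base + index          # S2: memory address
        # S3a/S3b: compare the memory address with the cache tag.
        hit = self.valid and self.cached_address == address
        if not hit:
            # Cache miss: read the hash table entry from main memory and cache it.
            self.cached_line = self.memory.get(address, [])
            self.cached_address = address
            self.valid = True
            self.memory_reads += 1
        # S4a: compare the (up to four) candidates with the search key.
        for entry_key, result in self.cached_line[:4]:
            if entry_key == key:
                return result                      # S4b: first match is the search result
        return None
```

Repeated lookups that map to the same memory address are served from the cached line without a further read of the main memory, which is the effect the rule cache is intended to achieve.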
An additional AND component 62 of a rule cache 61 implements a logical AND function for output signals of the MUX 46, the comparator 44 and the valid address register 45. An OR component 63 implements a logical OR function for all the output signals of all the AND components in the different rule caches. The content of the state register 12, the state mask register 15, and the table address register 16 is updated from the output signals of the OR component 63.
The processor cache 60 exploits the fact that a cache hit can occur in at most one cache line, in the following way: Each cache line (a rule cache 61) for which the “delayed evaluation” indicates that there was no cache hit resets its output (the output signals of the AND component 62) to zero. The same applies when the cache line does not contain a valid address and valid data (as indicated by the content of the valid address register 45). Consequently, only the cache line that detects a cache hit provides “valid” data at its output signals, so the outputs of all cache lines can be combined using a simple logical OR function. The combined output signals are provided by the OR component 63. The detection of whether there has been a cache hit (one cache line has a match for the search key) or a cache miss (no cache line has a match for the search key) is performed by the RCC 51, which initiates a read operation on the main memory (the transition rule memory 10) in case of a cache miss.
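The AND/OR combination of the cache line outputs can be modelled with bitwise operations. The sketch below is a simplification under stated assumptions: each cache line is represented by a hypothetical tuple `(mux_output, tag_equal, valid)`, and the hit condition is reduced to tag equality plus the valid flag.

```python
from typing import List, Tuple

# Hypothetical per-line signals: (MUX output word, comparator result, valid flag).
LineSignals = Tuple[int, bool, bool]


def gated_output(mux_output: int, tag_equal: bool, valid: bool) -> int:
    """AND stage: pass the data word only on a hit, otherwise force all zeros."""
    return mux_output if (tag_equal and valid) else 0


def combine_lines(lines: List[LineSignals]) -> Tuple[int, bool]:
    """OR stage: merge the gated outputs of all lines; also report hit or miss."""
    result = 0
    hit = False
    for mux_output, tag_equal, valid in lines:
        result |= gated_output(mux_output, tag_equal, valid)
        hit = hit or (tag_equal and valid)
    return result, hit
```

Because at most one line can hit, the OR of the gated outputs equals the data word of the hitting line, and no further selection logic is needed in this model. A miss yields an all-zero word and a false hit flag, which corresponds to the case in which a read of the main memory is initiated.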
Several experiments have shown that a significant gain can be achieved in this way for many applications that iterate the same transitions many times. This appears to happen frequently, for example, with applications that “execute” a given transition rule to perform the same processing of a string of input characters (e.g., write in local memory, compare with character string, etc.).
The present invention can also be used in cache hierarchies allowing further performance improvements for hash table lookup operations. For example, in
Another example for using the invention in cache hierarchies would be to use the processor cache 60 in
In particular, in a ZuXA controller the search result can be used to generate an instruction vector for the instruction handler, which provides processing results back to the BaRT-FSM as part of an input vector. The instructions contained in the instruction vector can be used for simple functions that are fast to implement and run under tight control of the BaRT-FSM. Examples are character and string processing functions, encoding, conversion, searching, filtering, and general output generating functions.
The invention is not restricted to the B-FSM technology, but is applicable to a wider range of hash table lookup operations. Nor is the invention restricted to an implementation entirely in hardware. A method in accordance with the present invention can also be implemented as software, i.e. as a sequence of instructions to be executed on one or more processors of a computer system. While a particular embodiment has been shown and described, various modifications of the present invention will be apparent to those skilled in the art.
Number | Date | Country | Kind
---|---|---|---
06113754.3 | May 2006 | DE | national