1. Field of the Invention
The present invention is directed to communication networks and, more particularly, to routing messages in communication networks.
2. Background of the Related Art
Since the 1990s, the Internet has grown substantially in traffic volume and in the number of IP routers and hosts on the network. One of the major functions of IP routers is packet forwarding, which essentially involves performing a routing table lookup on the IP destination field in the header of an incoming packet and identifying the next hop over which that packet should be sent.
Primarily, three approaches have been used for IP route lookup—pure software, pure hardware and a combination of software and hardware. In early-generation routers where line card interfaces were running at low speed, appropriately programmed general-purpose processors were typically used to perform packet forwarding. This is a pure software approach. Its main advantages are that it is flexible, easy to change and easy to upgrade. Its main disadvantages are its poor performance, low efficiency and difficulty in being scaled to high-speed interfaces.
In later-generation routers where speed and performance are critical, the pure hardware approach is taken. Here, customized application-specific integrated circuit (ASIC) hardware is developed to achieve very high performance and efficiency. The main disadvantages of this approach are that it is hard to change or upgrade to accommodate new features or protocols, it is too expensive to develop, and it has a long development cycle—typically, about 18 months.
In the latest generation of routers, a combined software and hardware approach is taken. This is the so-called “network processor” approach, which uses a special processor optimized for network applications instead of a general-purpose processor. The advantage of this approach is that the network processor is programmable, flexible, and can achieve performance comparable to that of a customized ASIC. It also shortens time to market, can be easily changed or upgraded to accommodate new features or protocols, and allows customers to change the product to a limited degree.
For the software approach, one study reports that two million lookups per second (MLPS) can be achieved using a Pentium II 233 MHz with 16 KB L1 data cache and 1 MB L2 cache. It requires 120 CPU cycles per lookup with a three level trie data structure (16/8/8). Further, software has been developed which compresses the routing table into a small forwarding table that can be fit into the cache memory of an ordinary PC. This arrangement requires about 100 instructions per lookup and is claimed to be capable of performing 4 MLPS using a Pentium 200 MHz processor.
The hardware approach has been taken by many IP router vendors. For example, Juniper Networks designed an ASIC called the “Internet Processor” which is a centralized forwarding engine using more than one million gates with a capacity of 40 MLPS. The Gigabit Switch Router (GSR) from Cisco Systems is capable of performing 2.5 MLPS per line card (OC48 interface) with distributed forwarding. The whole system can achieve 80 Gb/s switching capacity.
The network processor approach has recently become popular. For example, the XPIF-300 from MMC Networks supports 1.5 million packets processed per second (MPPS) with a 200 MHz processor optimized for packet processing; another product, the nP3400, supports 6.6 MPPS. The IXP1200 network processor from Intel uses one StrongARM microprocessor with six independent 32-bit RISC microengines. The six microengines can forward 3 MPPS. The Prism from Sitera/Vitesse uses four embedded custom RISC cores with modified instruction sets. The C-5 from C-Port/Motorola uses 16 RISC cores to support interfaces at communication speeds of up to 5 Gb/s. Rainier from IBM uses 16 RISC cores with embedded MAC & POS framers. Agere/Lucent also has developed a fast pattern processor to support speeds up to the OC-48 level.
Traditionally, the IPv4 address space is divided into classes A, B and C. Sites with these classes have 24, 16 and 8 bits of host addressing, respectively. This partition is inflexible and has caused waste of address space, especially with respect to class B. As a result, bundles of class C addresses were assigned instead of a single class B address, which caused substantial growth in the number of routing table entries. A scheme called classless inter-domain routing (CIDR) was introduced to reduce the number of routing table entries by allowing arbitrary aggregation of network addresses. Routing table lookup then requires longest prefix matching, which is a much harder problem than exact matching. The most popular data structure for longest prefix matching is the Patricia trie or level-compressed trie, which is basically a binary tree with compressed levels. A similar scheme called the reduced radix tree has been implemented in Berkeley UNIX 4.3. Content addressable memory (CAM) can be used for route lookup, but it only supports fixed-length patterns and small routing tables. A technique using expanded trie structures with controlled prefix expansion has been introduced for fast route lookup. Another technique uses a bitmap to compress the routing table so that it can fit into a small SRAM and help to achieve a fast lookup speed. In order to add a new route into the table, however, that update method requires sorting and preprocessing of all existing routes together with the new route, which is a very expensive computation. In other words, the method does not support incremental route update.
Upon receiving an IP data packet, IP routers need to perform a route lookup and find the next hop for the packet. The aforementioned applications give analyses of backbone routing table traces and also keen observations about the route distribution. This motivates the design of advanced data structures to store the routing information and to accelerate lookup/update while minimizing the memory requirement. For example, a large DRAM memory may be used in an architecture described in the previous applications to store two-level routing tables. The most significant 24 bits of the IP destination address are used as an index into the first level, while the remaining eight bits are used as an offset into the second-level table. This is the so-called 24/8 data structure. The data structure requires 32 MB of memory for the first-level table but much less memory for the second level.
The applications also discuss a compressed data structure called 24/8c that reduces the memory requirement to about 3 MB. The 24/8 and 24/8c data structures need a fixed number of entries (i.e., 2^8 = 256) for each second-level table segment.
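A minimal C sketch of a lookup against such a 24/8 structure is given below for illustration only; the table names, entry widths and marker convention are assumptions, not the exact layout described in the earlier applications.

    #include <stdint.h>

    /* Illustrative 24/8 tables (names and entry layout are assumptions).
     * A first-level entry either holds the next hop directly (marker = 0)
     * or the number of a 256-entry second-level segment (marker = 1).    */
    extern uint16_t T1_24[1 << 24];   /* indexed by the top 24 bits        */
    extern uint8_t  T2_8[];           /* 256-entry segments of next hops   */

    static uint8_t route_lookup_24_8(uint32_t dst_ip)
    {
        uint16_t e = T1_24[dst_ip >> 8];          /* top 24 bits             */
        if ((e & 0x8000) == 0)                    /* marker clear: next hop  */
            return (uint8_t)e;
        uint32_t seg = e & 0x7FFF;                /* marker set: segment no. */
        return T2_8[seg * 256 + (dst_ip & 0xFF)]; /* low 8 bits are offset   */
    }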
This application discloses an advanced data structure that allows lookup based upon the most significant 16 bits and the following K bits of the IP destination address (where K is chosen as discussed herein). This scheme, called 16/K routing, requires less than 2 MB of memory to store the whole routing tables of present-day backbone routers. It also leads to another version of the data structure, called 16/Kc routing, which utilizes bitmaps to compress the table to less than 0.5 MB. For the 16/K data structure, each route lookup requires at most 2 memory accesses, while the 16/Kc structure requires at most 3 memory accesses. For each individual scheme, the same data structure can be used for both route lookup and update. The data structures also support incremental route update. Lastly, the data structure defined herein can be extended to support multiple next hops, enabling congestion management and load balancing using equal-cost multi-paths.
Cycle-accurate simulation results are reported for a configurable processor implementation. By configuring the processor properly and developing a few customized instructions to accelerate route lookup, one can achieve 85 million lookups per second (MLPS) in a typical case with the processor running at 200 MHz. This performance is much better than 2 MLPS which can typically be achieved by using a general purpose CPU, and is comparable to that of custom ASIC hardware solutions.
The data structures and methods disclosed herein also can be implemented in pure hardware, in which case each route lookup can be designed to have as few as three memory accesses. The routing table can be stored in external SRAM with a typical 10 ns access time. Further, the lookup method can be implemented using pipelining techniques to perform three lookups for three incoming packets simultaneously. Using such techniques, 100 MLPS performance can be achieved.
These and other aspects of an embodiment of the present invention are better understood by reading the following detailed description of the preferred embodiment, taken in conjunction with the accompanying drawings, in which:
Each IPv4 packet has a layer 3 field containing a 32-bit destination IP address. In the following embodiments, the most significant 16 bits are grouped together and called a segment 605 and the remaining K bits are called an offset 610. K is variable ranging from 1 to 16 and is chosen in order to minimize redundancy in the table. The data structure will have two levels of tables in order to store the routing information base (RIB): namely, T1_RIB (first level) and T2_RIB (second level) tables.
It is necessary to store the prefix length 635 of each route entry 625 for route update. That is, a more specific route will overwrite a less specific route. Suppose the initial route table is empty and a new IP route 38.0.0.0/8/1 (the first field is the 32-bit IP address in dot format, the second field “8” indicates the prefix length 635 and the third field “1” is the next hop 630) arrives. This implies that the T1_RIB table 615 entries from 38.0 to 38.255 (a total of 2^8 = 256 entries) need to be updated to reflect this new route. Next, suppose a new IP route 38.170.0.0/16/2 arrives. The entry 625 indexed by 38.170 in the T1_RIB table 615 is overwritten with the new next hop and prefix length, 2 and 16, respectively. If the order of the two arriving routes were reversed, the routing tables would look the same because the less specific route (38.0.0.0/8/1) would not overwrite the more specific route (38.170.0.0/16/2) at the index 38.170 in the T1_RIB. More discussion on how to update the routing table will follow shortly. The format of each entry 625 in the T1_RIB and the T2_RIB tables 620 is shown in
For each T1_RIB entry 625, say T1_Entry[31:0], the bit fields are used as follows. T1_Entry[31] is the most significant bit (a marker bit 650) and represents whether this entry 625 stores next hop/prefix length information or a K value/pointer to a T2_RIB table 620. If T1_Entry[31] is 0, T1_Entry[30:16] is not used, T1_Entry[15:6] stores next hop information 630 and T1_Entry[5:0] stores the prefix length 635 associated with this route. Otherwise, T1_Entry[30:27] stores the value 645 of (K−1) (note these 4 bits can represent values from 0 to 15, thereby indicating a real K value from 1 to 16) and T1_Entry[26:0] stores a base pointer 640 to its T2_RIB. These 27 bits are far more than sufficient for indexing into the second level table T2_RIB since the size of the tables created will never require 128 MB (2^27 bytes) of memory space.
For each T2_RIB entry 625, the first 10 bits are used to store the next hop 630 while the remaining 6 bits are used to store the prefix length 635 associated with the entry 625.
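These bit fields translate directly into C. The following sketch merely mirrors the layout described above; the macro names are illustrative, and the placement of the T2_RIB next hop in the high 10 bits (mirroring the T1_RIB layout) is an assumption.

    #include <stdint.h>

    /* T1_RIB entry fields as described above (macro names are illustrative). */
    #define T1_MARKER(e)      (((e) >> 31) & 0x1)       /* bit 31              */
    #define T1_NEXT_HOP(e)    (((e) >> 6)  & 0x3FF)     /* bits 15:6, marker=0 */
    #define T1_PREFIX_LEN(e)  ((e) & 0x3F)              /* bits 5:0,  marker=0 */
    #define T1_K(e)           ((((e) >> 27) & 0xF) + 1) /* bits 30:27 hold K-1 */
    #define T1_T2_BASE(e)     ((e) & 0x07FFFFFF)        /* bits 26:0, marker=1 */

    /* 16-bit T2_RIB entry: 10 bits of next hop and 6 bits of prefix length;
     * the next hop is assumed to occupy the high 10 bits.                    */
    #define T2_NEXT_HOP(e)    (((e) >> 6) & 0x3FF)
    #define T2_PREFIX_LEN(e)  ((e) & 0x3F)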
For each entry 625 in the T1_RIB define an integer K that is in the range from 1 to 16. Consider an entry 625 indexed by i (representing a 16 bit IP address, say a.b) (in this application, the dot format a.b of the most significant 16 bits of the IP address is used interchangeably with its decimal value i=a*256+b to denote the index to the table T1_RIB) in T1_RIB. For example, for the first entry 625 in T1_RIB, its index i is 0 representing the 16 bit IP prefix 0.0. For the 32,772nd entry 625 in T1_RIB, its index i is 32,771 representing 128.3. The maximum prefix length 635, say Pl_Max[i], is found for all the routes in the routing table whose prefix begins with 16 bits a.b. If this maximum prefix length 635 is no more than 16, then K is not defined for this entry 625 indexed by i. Otherwise, K[i]=Pl_Max[i]−16. If K[i] is defined, the value of K[i]−1 will be stored at the 4 bits T1_Entry[30:27] at the entry 625 indexed by i. For example, suppose that the whole routing table contains only 2 entries with prefix beginning with 128.3: 128.3/16/1; 128.3.255/24/3. In this case the maximum prefix length 635 is Pl_Max[128.3]=24. So, the K value associated with the entry 625 indexed by 128.3 is K[128.3]=24−16=8. It should be noted that the K value may change dynamically as new routes are added into or deleted from the routing table. Suppose a new route, say 128.3.255.252/30/2, is added to the routing table. Then the maximum prefix length 635 Pl_max[128.3] becomes 30 and its associated K value K[128.3]=30−16=14. From analyzing exemplary routing table traces of backbone routers, the number of entries in the T1_RIB whose K value ranges from 1 to 16 is shown in TABLE 1.
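As a small worked sketch of this rule, the K value for a T1_RIB index can be computed from the routes sharing its 16-bit prefix; the route record and helper below are illustrative, with K = 0 used here to denote "undefined".

    #include <stdint.h>

    /* Illustrative route record: 32-bit prefix, prefix length, next hop. */
    struct route { uint32_t prefix; uint8_t prefix_len; uint16_t next_hop; };

    /* Return K for T1_RIB index i (the top 16 bits a.b), or 0 if K is
     * undefined, i.e. no route under a.b has a prefix longer than 16 bits.
     * E.g. with routes 128.3/16/1 and 128.3.255/24/3, Pl_Max = 24 and K = 8. */
    static int k_for_index(const struct route *rt, int n, uint16_t i)
    {
        int pl_max = 0;
        for (int r = 0; r < n; r++)
            if (rt[r].prefix_len > 16 && (rt[r].prefix >> 16) == i &&
                rt[r].prefix_len > pl_max)
                pl_max = rt[r].prefix_len;
        return pl_max > 16 ? pl_max - 16 : 0;    /* K[i] = Pl_Max[i] - 16 */
    }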
The total number of entries in the T1_RIB table 615 whose marker bit is set to 1 is also given in the table. It shows that the percentage of those entries with marker bit set to 1 is less than 6%. This implies that most of the entries in the T1_RIB table 615 store next hop and prefix length 635 instead of a base pointer to the second level table T2_RIB. From the table, also observe that more than 50% of those entries in the T1_RIB table 615 with marker bit set to 1 have a K value of 8, which means the maximum prefix length 635 is 24. This is in accordance with a prior art observation that more than 50% of the routes in backbone routers have a prefix length 635 of 24.
As noted earlier, for entry i in the T1_RIB table 615, let T1_Entry[31:0]=T1_RIB[i]. If the marker bit is set to 1, T1_Entry[30:27] stores the value of K[i]−1 and T1_Entry[26:0] stores a base address 640 pointing to the beginning address of a second level table T2_RIB. There are 2^K[i] entries in T2_RIB and each entry 625 is 2 bytes. Note that the size of this second level table may change dynamically as new routes are added or deleted, causing changes in the K value. The total size of the T2_RIB tables 620 is the sum of 2*2^K[i] bytes over all i for which K[i] is defined.
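A short illustrative helper makes the sizing formula concrete; the array name and the convention K[i] = 0 for "undefined" are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    /* Total 16/K second-level memory in bytes: 2 bytes per entry and
     * 2^K[i] entries per T2_RIB table, summed over indices with K defined. */
    static size_t t2_total_bytes_16k(const uint8_t k[65536])
    {
        size_t total = 0;
        for (int i = 0; i < 65536; i++)
            if (k[i] > 0)                          /* K[i] = 0 means undefined */
                total += 2u * ((size_t)1 << k[i]);
        return total;
    }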
Consequently, the total T2_RIB table 620 size varies depending on the route distribution. For all of the shown backbone routers, the total table size including the T1_RIB (with a fixed size of 256 KB) and T2_RIB is no more than 1.8 MB if the 16/K data structure is used to store these routing tables.
Note that the size of the second level table T2_RIB is much bigger than that of the T1_RIB table 615. The T2_RIB table 620 may store a lot of redundant information. For example, suppose there is only one route (128.3.0.0/16/1) with the 128.3 prefix existing in the routing table. If a new route (128.3.255.0/24/2) comes in, this requires the creation of a second level table with 2^(24−16) = 256 entries. From entry 1 to entry 255 in this second level table, each entry 625 will store the same information (next hop 630/prefix length 635 = 1/16) associated with the route (128.3.0.0/16/1). Only the last entry (entry 256) will store the information (next hop 630/prefix length 635 = 2/24) associated with the new route (128.3.255.0/24/2). So, one can compress the second level table by using the same technique as described in the aforementioned applications. This compressed data structure is called a 16/Kc table.
For the sake of analysis and to motivate the design of the 16/Kc scheme, imagine dividing the T2_RIBs of the 16/K scheme into 64-entry blocks (the block size can be chosen to be any value). If a T2_RIB table 620 has fewer than 64 entries, one block is used to represent it. Call the number of unique next hop/prefix length entries in a block its “dimension”, represented as dim(NHPL). TABLE 2 gives the number of blocks in the T2_RIB table 620 whose dim(NHPL) is equal to 1, 2, 3 and 4 for the aforementioned sampled routers. It shows that most of the blocks have a dim(NHPL) between 1 and 4. The maximum dim(NHPL), namely Dmax, is also reported in the table.
The table shows that Dmax can be quite large (from 33 to 44 for the five backbone routers). Note that Dmax represents the maximum dim(NHPL) among all the blocks in T2_RIB.
For the 16/Kc scheme shown in
T2_Entry[127:64] is a 64 bit bitmap 660 (
T2_Entry[63:0] stores one or two NHPLs 655 or a 32-bit address. If dim(NHPL)<=2, T2_Entry[63:0] stores NHPL information 655 (in the order of NHPL[1], NHPL[2]). Otherwise, T2_Entry[63:32] stores a 32-bit address which points to where the extended NHPL array 665 is stored, i.e., T2_Entry[63:32]=&NHPL[1].
The least significant 32 bits in the T2_RIB entry T2_Entry[31:0] are not used. They can be used to store NHPL[3] and NHPL[4] for fast lookup. Then, in this case, the extended T2_RIB table 620 will be needed if dim(NHPL) is more than 4, rather than 2.
For illustration purposes, one can generate the bitmap 660 and NHPL array for the 16/Kc scheme by scanning a 16/K T2_RIB table 620. For the 16/K T2_RIB table 620, scan through one block of 64 entries at a time. For each block create a 64-bit bitmap 660 with one bit representing each entry 625 in the block. For the first entry 625 in the block, always set the most significant bit in the bitmap 660 to 1 and store its associated NHPL content 655 into the first part of an NHPL array, say NHPL[1]. Then, check whether the second entry 625 shares the same NHPL information 655 as the first entry 625. If it does, set the second bit in the bitmap 660 to 0. Otherwise, set the second bit to 1 and add its NHPL content 655 to NHPL[2]. This process continues until all 64 entries in the block are finished. For example, suppose there are only two routes in the routing table, namely, 128.3.0.0/16/1 and 128.3.255/24/2. So, there are 2^(24−16) = 256 entries in the T2_RIB table 620. All entries in this T2_RIB table 620 except for the last store 1/16. The last entry 625 stores 2/24. In the 16/Kc scheme, divide this T2_RIB table 620 into 256/64=4 blocks. For the first 3 blocks, since all the entries in the blocks store the same NHPL information 655, a bitmap 0x8000000000000000 and an associated NHPL array with NHPL[1]=1/16 are used. For the last block, a bitmap 0x8000000000000001 is used (since only the last entry 625 will be different from all the other 63 entries) and an associated NHPL array with NHPL[1]=1/16, NHPL[2]=2/24. These bitmaps 660 and NHPL arrays will be stored in the 16/Kc T2_RIB table 620 in the order corresponding to the blocks in the original 16/K T2_RIB table 620. That is, the first entry 625 in the 16/Kc T2_RIB table 620 corresponds to the first block in the 16/K T2_RIB table 620, the second entry 625 in the 16/Kc T2_RIB table 620 corresponds to the second block in the 16/K T2_RIB table 620, and so forth.
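The scan just described can be written compactly; the sketch below compresses one 64-entry block of a 16/K T2_RIB into a bitmap and NHPL array. The type and function names are illustrative, and the caller is assumed to supply an NHPL array large enough for the block's dimension.

    #include <stdint.h>

    /* A next hop/prefix length pair packed as in a 16/K T2_RIB entry. */
    typedef uint16_t nhpl_t;

    /* Compress one block of 64 consecutive 16/K T2_RIB entries into a
     * 64-bit bitmap plus an NHPL array; returns dim(NHPL) for the block. */
    static int compress_block(const nhpl_t block[64], uint64_t *bitmap,
                              nhpl_t nhpl[64])
    {
        uint64_t bm = 0;
        int dim = 0;
        for (int j = 0; j < 64; j++) {
            /* Set a bit wherever the NHPL differs from the previous entry;
             * the first entry of the block always gets its bit set.        */
            if (j == 0 || block[j] != block[j - 1]) {
                bm |= (uint64_t)1 << (63 - j);   /* bit 63 = first entry    */
                nhpl[dim++] = block[j];
            }
        }
        *bitmap = bm;
        return dim;
    }

For the example above, the first three blocks compress to bitmap 0x8000000000000000 with dim(NHPL)=1 and the last block to 0x8000000000000001 with dim(NHPL)=2.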
Note that the total number of 1's in the 64-bit bitmap 660 denotes dim(NHPL). If the total number of 1's in the 64-bit bitmap 660 is 1 or 2, the NHPL array is stored in the field T2_Entry[63:32]. Otherwise, the NHPL array will be stored in an extended T2_RIB table 620.
In the 16/K scheme, for each entry 625 in the T1_RIB table 615 whose marker bit is set to 1, there is an associated T2_RIB table 620 with 2^K entries (each entry 625 is 2 bytes), where K is stored in the T1_RIB entry 625. For the compressed 16/Kc scheme, this associated T2_RIB table 620 is compressed to have 2^max(0, K−6) entries (each entry 625 is 16 bytes). If the K value is less than 6, then there is only one entry 625 in T2_RIB. Otherwise, there are 2^(K−6) entries in the T2_RIB table 620.
Note that each entry 625 in the extended T2_RIB is 2 bytes storing next hop and prefix length information. By analyzing the routing traces from the aforementioned backbone routers, one observes that the size of the extended T2_RIB tables 620 is no more than 40 Kbytes. Note that the size of a T2_RIB table 620 is compressed by a factor of 8 in the 16/Kc scheme compared to the 16/K scheme.
For the 16/Kc scheme, the total table size is less than 0.5 MB to store the backbone routing tables. If needed, one can furthermore compress the table to less than 256 KB by using fewer bits for each entry 625 in T1_RIB, using all the 128 bits in T2_RIB entry 625 for bitmaps 660, and storing all NHPL information 655 in the extended T2_RIB table 620. The next section presents a route update algorithm to create the bitmap 660 and NHPL array for the 16/Kc scheme without creating the T2_RIB of the 16/K scheme first.
The pseudo code of the complete 16/K route lookup algorithm is given in APPENDIX A. Note that each route lookup will need two memory accesses in the worst case. Typically, a lookup will require only one access to the T1_RIB table 615. An illustrative example will be given later as the route update algorithm is presented in the next sub-section.
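APPENDIX A contains the authoritative pseudo code; purely as a hedged illustration, the C sketch below follows the entry layout described earlier. The table names are assumptions, and the 27-bit base field is treated as a byte offset into a single second-level memory pool.

    #include <stdint.h>

    extern uint32_t T1_RIB[1 << 16];  /* first level, indexed by the top 16 bits */
    extern uint8_t  T2_POOL[];        /* pool addressed by the 27-bit base field */

    /* 16/K lookup: at most two memory accesses per destination address. */
    static uint16_t lookup_16k(uint32_t dst_ip)
    {
        uint32_t e = T1_RIB[dst_ip >> 16];               /* access 1: T1_RIB    */
        if (((e >> 31) & 0x1) == 0)                      /* marker bit clear    */
            return (uint16_t)((e >> 6) & 0x3FF);         /* next hop inline     */
        uint32_t k   = ((e >> 27) & 0xF) + 1;            /* bits 30:27 hold K-1 */
        uint32_t off = (dst_ip >> (16 - k)) & ((1u << k) - 1);  /* next K bits  */
        const uint16_t *t2 = (const uint16_t *)(T2_POOL + (e & 0x07FFFFFF));
        return (uint16_t)((t2[off] >> 6) & 0x3FF);       /* access 2: T2_RIB    */
    }

For the example given later (destination 128.3.254.2 with K=8), off evaluates to 254 and the T2_RIB entry at that offset supplies the next hop.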
Moving on to the 16/K route update algorithm, as shown in
For the case where the prefix length 635 is less than or equal to 16 (S755), i.e., Prefix_Length<=16, determine how many entries in the T1_RIB table 615 are matched by the new route (S760). Consider a new route update (128/8/1). This new route matches 2^(16−8) = 256 entries in T1_RIB, from 128.0 to 128.255. For each matched entry 625 in the T1_RIB, the marker bit needs to be examined (S760). If the marker bit is 0 (S770), then check whether the Prefix_Length is equal to or larger than the old prefix length 635 which is stored in the table (S775). If Prefix_Length>=Old_Prefix_Length (S775), then replace the old next hop and prefix length 635 information stored in the entry 625 with the new next hop and prefix length information (S780), since the new route is newer and at least as specific. If the marker bit is 1 (S770), retrieve the pointer stored in the T1_RIB table 615 entry 625 (S792) and scan through the 2^K entries in the T2_RIB table 620 (S794) to see whether the new route update is more specific than the old routes stored in the T2_RIB table 620. Again, if Prefix_Length>=Old_Prefix_Length (S796), update the entry 625 in T2_RIB with the new next hop and prefix length information (S798).
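A hedged C sketch of this branch (Prefix_Length <= 16) follows. The table names, the treatment of the 27-bit base field as a byte offset into a pool, and the packing of next hop and prefix length are assumptions consistent with the entry formats above.

    #include <stdint.h>

    extern uint32_t T1_RIB[1 << 16];
    extern uint8_t  T2_POOL[];        /* pool addressed by the 27-bit base field */

    /* 16/K route update for Prefix_Length <= 16. */
    static void update_16k_short(uint32_t prefix, uint32_t prefix_len,
                                 uint32_t next_hop)
    {
        uint32_t first = prefix >> 16;                   /* e.g. 128.0           */
        uint32_t count = 1u << (16 - prefix_len);        /* 2^(16-len) entries   */
        for (uint32_t i = first; i < first + count; i++) {
            uint32_t e = T1_RIB[i];
            if (((e >> 31) & 0x1) == 0) {                /* marker bit 0         */
                if (prefix_len >= (e & 0x3F))            /* at least as specific */
                    T1_RIB[i] = (next_hop << 6) | prefix_len;
            } else {                                     /* marker bit 1         */
                uint32_t  k  = ((e >> 27) & 0xF) + 1;
                uint16_t *t2 = (uint16_t *)(T2_POOL + (e & 0x07FFFFFF));
                for (uint32_t j = 0; j < (1u << k); j++) /* scan 2^K entries     */
                    if (prefix_len >= (t2[j] & 0x3F))
                        t2[j] = (uint16_t)((next_hop << 6) | prefix_len);
            }
        }
    }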
As an illustrative example, suppose the routing table is initially empty and a new route update of (128/8/1) arrives as shown in
For the second case, i.e., Prefix_Length>16, use the most significant 16 bits of the IP address to match one entry 625 in the T1_RIB. First, compute the New_K value given by Prefix_Length−16. If the marker bit is 0, the new route is more specific than the current route and one needs to build a new T2_RIB for this entry 625, turn on the T1_RIB entry's marker bit, set its K field, and update the T1_RIB entry 625 to point to the new T2_RIB table 620. Lastly, the new T2_RIB table 620 needs to be populated with data, one entry of which will correspond to this new route. To populate the new T2_RIB table 620, the remaining New_K bits of the prefix are used as an index and the next hop/prefix length information is loaded into this entry 625 in the T2_RIB table 620. All other entries are set to the next hop/prefix length values that were previously in the T1_RIB entry 625.
If the marker bit is 1, then depending on the current size of the T2_RIB, it is possible that numerous T2_RIB entries are matched or that the T2_RIB may have to be grown. If the New_K value is less than or equal to the Old_K value, there is no need to expand the T2_RIB table 620. In this case, there is only a need to update the matched entries in T2_RIB with the new next hop and prefix length information if Prefix_Length>=Old_Prefix_Length. The remaining unmatched entries in T2_RIB will be untouched. If the New_K value is greater than the Old_K value, change the K value in the T1_RIB to the New_K value, create a new T2_RIB table 620 with 2^New_K entries, expand the old T2_RIB contents into the new table (each old entry 625 being replicated 2^(New_K−Old_K) times), update the matched entry 625 with the new next hop and prefix length information if Prefix_Length>=Old_Prefix_Length, and update the T1_RIB entry 625 to point to the new T2_RIB table 620.
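The following C sketch covers both marker-bit cases of this Prefix_Length > 16 branch. It is illustrative only: second-level tables are allocated with a simple bump allocator over a pool addressed by the 27-bit base field, packing follows the entry formats above, and the old table is not reclaimed.

    #include <stdint.h>

    extern uint32_t T1_RIB[1 << 16];
    extern uint8_t  T2_POOL[];     /* pool addressed by the 27-bit base field */
    extern uint32_t t2_pool_top;   /* next free byte offset (bump allocator)  */

    /* 16/K route update for Prefix_Length > 16. */
    static void update_16k_long(uint32_t prefix, uint32_t prefix_len,
                                uint32_t next_hop)
    {
        uint32_t idx   = prefix >> 16;
        uint32_t new_k = prefix_len - 16;
        uint32_t e     = T1_RIB[idx];
        uint16_t nh_pl = (uint16_t)((next_hop << 6) | prefix_len);
        uint32_t off   = (prefix >> (16 - new_k)) & ((1u << new_k) - 1);

        if (((e >> 31) & 0x1) == 0) {                    /* no T2_RIB yet      */
            uint32_t  base = t2_pool_top;
            uint16_t *t2   = (uint16_t *)(T2_POOL + base);
            t2_pool_top   += 2u << new_k;                /* 2 bytes per entry  */
            for (uint32_t j = 0; j < (1u << new_k); j++)
                t2[j] = (uint16_t)(e & 0xFFFF);          /* inherit old NHPL   */
            t2[off] = nh_pl;                             /* the new route      */
            T1_RIB[idx] = (1u << 31) | ((new_k - 1) << 27) | base;
            return;
        }

        uint32_t  old_k = ((e >> 27) & 0xF) + 1;
        uint16_t *t2    = (uint16_t *)(T2_POOL + (e & 0x07FFFFFF));

        if (new_k <= old_k) {                            /* no expansion needed */
            uint32_t span = 1u << (old_k - new_k);       /* matched entries     */
            for (uint32_t j = off * span; j < (off + 1) * span; j++)
                if (prefix_len >= (t2[j] & 0x3F))
                    t2[j] = nh_pl;
        } else {                                         /* grow to 2^New_K     */
            uint32_t  base   = t2_pool_top;
            uint16_t *t2_new = (uint16_t *)(T2_POOL + base);
            t2_pool_top     += 2u << new_k;
            uint32_t  rep    = 1u << (new_k - old_k);
            for (uint32_t j = 0; j < (1u << old_k); j++) /* expand old entries  */
                for (uint32_t r = 0; r < rep; r++)
                    t2_new[j * rep + r] = t2[j];
            if (prefix_len >= (t2_new[off] & 0x3F))
                t2_new[off] = nh_pl;                     /* single matched entry */
            T1_RIB[idx] = (1u << 31) | ((new_k - 1) << 27) | base;
            /* A full implementation would return the old T2_RIB to the pool. */
        }
    }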
As a continuation of the example, suppose a third route update (128.3.128/20/3) arrives as shown in
If a fourth route update (128.3.255/24/4) arrives as shown in
To illustrate how the 16/K route lookup algorithm works, assume that the T1_RIB and T2_RIB tables 620 have been filled up as in
Another packet with a 128.3.254.2 destination address matches a T1_RIB[128.3] entry 625 whose marker bit is set to 1 and K value is 8. The following 8 bits (254d=11111110b) are used as an offset into the T2_RIB table 620 which gives next hop “2” (associated with the second route 128.3/16/2). Indeed the second route gives the longest prefix match.
Another packet with 128.3.128.4 destination address matches T1_RIB[128.3] with marker bit set. The following 8 bits (128d=10000000b) are used to index T2_RIB and find the next hop “3”, which is associated with the third route 128.3.128/20/3. The third route does give the longest prefix match.
Pseudocode for the 16/K route update algorithm is shown in APPENDIX B.
An example lookup algorithm for the 16/Kc data structure is described below in conjunction with the flowchart in
The pseudo code of the 16/Kc lookup algorithm is given in APPENDIX C. For the 16/Kc data structure, each lookup will need at most three memory accesses in the worst case.
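As with the 16/K case, APPENDIX C is authoritative; the C sketch below only illustrates how the bitmap 660 is used. Assumptions made here: the 27-bit base field is treated as an index into an array of 16-byte entries, NHPL[1] and NHPL[2] occupy bits [63:48] and [47:32] when stored inline, the extended-array field is treated as an index rather than a raw address, and the GCC/Clang popcount builtin stands in for a population-count instruction.

    #include <stdint.h>

    /* One 128-bit 16/Kc T2_RIB entry: [127:64] bitmap, [63:32] either
     * NHPL[1..2] inline or a reference to an extended NHPL array,
     * [31:0] unused. The struct layout is conceptual.                     */
    struct t2c_entry { uint64_t bitmap; uint32_t nhpl_or_ref; uint32_t unused; };

    extern uint32_t         T1_RIB[1 << 16];
    extern struct t2c_entry T2C_POOL[];   /* indexed via the 27-bit base field */
    extern uint16_t         T2_EXT[];     /* extended NHPL arrays              */

    static uint16_t lookup_16kc(uint32_t dst_ip)  /* at most 3 memory accesses */
    {
        uint32_t e = T1_RIB[dst_ip >> 16];                  /* access 1         */
        if (((e >> 31) & 0x1) == 0)
            return (uint16_t)((e >> 6) & 0x3FF);            /* next hop inline  */
        unsigned k   = ((e >> 27) & 0xF) + 1;               /* bits 30:27 = K-1 */
        uint32_t off = (dst_ip >> (16 - k)) & ((1u << k) - 1);
        unsigned blk = (k > 6) ? (off >> 6) : 0;            /* which T2C entry  */
        unsigned pos = (k > 6) ? (off & 0x3F) : off;        /* bit position     */
        struct t2c_entry t2 = T2C_POOL[(e & 0x07FFFFFF) + blk];  /* access 2    */
        /* The number of 1's in bits [63 .. 63-pos] is the 1-based NHPL index. */
        int idx = (int)__builtin_popcountll(t2.bitmap >> (63 - pos));
        int dim = (int)__builtin_popcountll(t2.bitmap);
        uint16_t nhpl;
        if (dim <= 2)                 /* NHPL[1] in [63:48], NHPL[2] in [47:32] */
            nhpl = (uint16_t)(t2.nhpl_or_ref >> (idx == 1 ? 16 : 0));
        else                          /* access 3: extended NHPL array          */
            nhpl = T2_EXT[t2.nhpl_or_ref + (unsigned)(idx - 1)];
        return (uint16_t)((nhpl >> 6) & 0x3FF);             /* 10-bit next hop  */
    }

Counting the 1's from the most significant bit down to the matched position is what keeps the compressed table equivalent to the uncompressed 16/K table.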
The following is a description of an example 16/Kc update algorithm with reference to
If Prefix_Length<=16 (S906), one or more T1_RIB entries need to be examined as candidates for being updated. 2^(16−Prefix_Length) entries in the T1_RIB table 615 are matched by the new route. Each matched entry 625 is examined and, if the new route is at least as specific as the stored information, the entry 625 itself or, where its marker bit is set, the bitmap 660 and NHPL data 655 of the associated T2_RIB table 620 are updated with the new next hop and prefix length information.
If Prefix_Length>16 (S906), this corresponds to a single T1_RIB entry 625 indexed by the 16-bit Ip_Addr[31:16]. If the T1_RIB entry 625 at this index has its marker bit off, no T2_RIB table 620 yet exists for this entry 625: a new T2_RIB table 620 is created, the marker bit is turned on, the K field is set from New_K, the base pointer 640 is made to point to the new table, and the new table's bitmap 660 and NHPL array 655 are initialized from the old T1_RIB contents and the new route.
For the rest of this section Old_K and New_K will be used to refer to the K value found in the T1_RIB entry 625 and the new K value New_K=Prefix_Length−16.
As shown in
If Old_K<=6 (S936) and New_K<=Old_K (S933), then the current T2_RIB has only one entry 625. This entry 625 contains a bitmap 660 with only 2^Old_K significant bits. The bits matched by the new route are examined, and the bitmap 660 and NHPL data 655 are updated with the new route's next hop information 630 according to the previously specified rules.
Otherwise, if Old_K>6 (S936), then multiple T2_RIB entries currently exist. Multiple entries may have been matched and may need to be updated. If New_K<=Old_K−6 (S960), then multiple entries may be matched. In this case, there is no need to change the bitmap 660; only the NHPL array need be updated, if necessary (S963). If, on the other hand, New_K>Old_K−6 (S960), then only one T2_RIB entry 625 has been matched. Furthermore, only a portion of that entry's bitmap 660 has been matched and needs to be examined. Once again, the bitmap 660 and NHPL data 655 are updated with the new route's next hop information 630 according to the previously specified rules (S966).
If New_K is greater than Old_K (S933), the new route is more specific than the stored route and more bits should be used (New_K bits from the IP address) to index into the T2_RIB table 620. The T2_RIB that already exists needs to be grown in size. If New_K<=6 (S945), then since Old_K<New_K<=6 (S933), there exists only one entry 625 in the T2_RIB table 620 and its bitmap 660 needs to be grown by a factor of 2^(New_K−Old_K); the grown bitmap 660 and its associated NHPL array 655 are then updated with the new route's next hop information 630 according to the previously specified rules.
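One way to perform that growth step is sketched below for the single-entry case (Old_K < New_K <= 6); the function name is illustrative. Each old entry expands into 2^(New_K−Old_K) entries that share its NHPL, so only the first bit of each expanded group can be 1 and the NHPL array itself is unchanged by the growth.

    #include <stdint.h>

    /* Grow a 16/Kc bitmap when Old_K < New_K <= 6. Bit 63 is the first
     * entry; old significant bit j moves to position j * factor and the
     * in-between bits stay 0 ("same NHPL as the previous entry").        */
    static uint64_t grow_bitmap(uint64_t old_bm, unsigned old_k, unsigned new_k)
    {
        unsigned factor = 1u << (new_k - old_k);
        uint64_t new_bm = 0;
        for (unsigned j = 0; j < (1u << old_k); j++) {
            uint64_t bit = (old_bm >> (63 - j)) & 1u;   /* old entry j        */
            new_bm |= bit << (63 - j * factor);         /* first of its group */
        }
        return new_bm;   /* the new route is then applied to the grown bitmap */
    }

For example, growing the two-entry bitmap 10b (Old_K = 1) to New_K = 3 yields 10000000b over eight significant bits.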
The following is an example illustrating the data structure created due to the update procedure. There are many cases that can occur but only a few of the most important ones are illustrated. The T1_RIB is completely initialized to all zeros, specifying no routes. First, in
Now assume a new route (128.3.240/20/3) arrives as in
Next, assume that the route (128.254.248/22/4) arrives as in
Finally, assume that the route (128.255.255/24/5) arrives as in
Pseudocode for the 16/Kc route update algorithm is shown in APPENDIX D.
To evaluate the performance of the 16/K and 16/Kc data structures and algorithms, they have been implemented in the C language. The evaluation software can run on any processor platform that supports C. In the simulation described below, the aforementioned processor called Xtensa, which is a high-performance and configurable 32-bit RISC-like microprocessor core, is used. Xtensa allows the designer to configure the processor with respect to bus width, cache size, cache line size, the number of interrupts, amount of on chip memory, etc. It also supports the Tensilica Instruction Extension (TIE) language (its syntax is similar to Verilog) which can be used to describe new instructions that complement the core. Using TIE to add customized instructions is quite useful for optimizing performance in many applications.
To accelerate the 16/K lookup process, the following 4 customized instructions for the Xtensa processor have been developed: t1_index, t1_lookup, t2_addr, and t2_lookup.
In order to facilitate fast lookups, the T1_RIB table 615, which is 256 KB in size, is placed into a special 256 KB 1-cycle latency on-chip memory. The remaining portions of the table are placed into off-chip SRAM memory for fast access.
The assembly coded procedure containing these special lookup instructions is shown in TABLE 3.
The instruction t2_lookup is essentially a conditional load that will either load from off-chip RAM or return the result previously loaded by t1_lookup. The disadvantage of this implementation is that the instructions t2_addr and t2_lookup will always be executed regardless of whether the next hop 630 is stored in T1_RIB or not. However, due to micro-architectural issues, branching around these instructions would not yield any better performance. Furthermore, this code is optimized for worst-case performance.
The Xtensa Instruction Set Simulator (ISS) was used to perform cycle-accurate simulation using a route table trace of Class A, B, Swamp and C addresses from the MAE-EAST database collected on Oct. 3, 2000. These addresses constituted a routing table database of 19,000 entries. A data packet trace with 250,000 entries (made available by the NLANR Network Analysis Infrastructure) is used for route lookups. In the simulation, the processor is configured with the following key parameters: a 128-bit processor interface (PIF) to memory, 32 registers, 2-way set-associative caches with a 16-byte line size, a cache size of 16 Kbytes, and a clock frequency of 200 MHz. The T1_RIB is a static 256 KB table and was thus placed into a fast on-chip memory that can be accessed in 1 cycle. Through simulation, the instruction-level profile data shown in TABLE 3 above was obtained for a trace of 250,000 lookups.
In total, there are 2,335,421 cycles for 250,000 route lookups, or about 9.34 cycles/lookup. Note that even though the instruction t1_lookup is shown as needing 1 cycle/lookup, it actually has a 2-cycle latency since it is a load instruction from on-chip memory. The extra cycle is counted in the next instruction, t2_addr, which depends on the results from t1_lookup. Since the size of the T2_RIB table 620 is significant, it is stored in external memory. The instruction t2_lookup will have a 2-cycle latency per lookup if the data loaded by t2_lookup is in the cache. Otherwise, there is a cache miss that causes the processor to stall, requiring 7 cycles plus the physical memory access time. These cache miss cycles are reflected in the cycle count for the t2_lookup instruction. Notice that there are 750,000 cycles for entry to the function rt_lookup. If macro or inline code is used, these cycles can be avoided; therefore, only about 6.34 cycles/lookup are actually needed. Without using customized instructions, about 40 cycles/lookup would be needed, so roughly a 7x performance improvement is achieved by adding the specialized instructions. Furthermore, if two lookups for two different packets are interleaved in the instruction sequence, 2 cycles of memory latency can be hidden, yielding 4.34 cycles/lookup. Finally, the two instructions t1_index and t2_addr can be embedded into t1_lookup and t2_lookup, respectively, which saves another 2 cycles. Thus 2.34 cycles/lookup, equivalent to 85 MLPS in the typical case for an Xtensa processor running at 200 MHz, can be achieved.
Consider the worst case where the t2_lookup load instruction is always a cache miss. Suppose the second level table T2_RIB is stored in external SRAM memory which typically has an access time of 10 ns (this is 2 cycles for a processor at 200 MHz). Based upon micro-architectural issues, the t2_lookup instruction will need 7 cycles plus 2 cycles of physical memory access time. In total, 13 cycles/lookup are needed, including 1 for t1_index, 2 for t1_lookup, 1 for t2_addr, and 9 for t2_lookup. Again, if two lookups are coded into each instruction sequence, 2 cycles of memory latency can be hidden.
For the 16/Kc scheme, the following 6 customized instructions were designed: t1_index, t1_lookup, t2_addr, t2_lookup, t3_addr, and t3_lookup.
TABLE 4 below gives the simulation results of 16/Kc scheme.
It shows that 3,447,378/250,000 ≅ 13.79 cycles/lookup are needed. Excluding the 3-cycle overhead of function entry and exit, 10.79 cycles/lookup are needed. Consider the worst case where there is a cache miss and the processor is stalled. Both the T1_RIB and extended T2_RIB tables 620 could be put in on-chip memory since they are quite small, while the T2_RIB table 620 could be placed in external SRAM. In total, 16 cycles/lookup are then needed, including 1 for the instruction t1_index, 2 for t1_lookup, 1 for t2_addr, 7 for t2_lookup plus 2 cycles for the physical memory access to external SRAM, 1 for t3_addr, and 2 for t3_lookup. Since the total table size for the 16/Kc data structure is less than 0.5 MB, it is feasible to put the whole table (T1_RIB, T2_RIB, and extended T2_RIB) in on-chip memory. In this case, there will be no processor stalls and 9 cycles/lookup can be obtained in the worst case. By performing route lookups for 3 packets at the same time, similar to the 16/K case mentioned previously, the 3 cycles of memory load latency can be hidden. Moreover, the instructions t1_index, t2_addr, and t3_addr can be embedded into t1_lookup, t2_lookup and t3_lookup, which saves another 3 cycles. Thus, in the worst case for the 16/Kc scheme, 9−3−3=3 cycles/lookup are needed, which translates to 66 MLPS for an Xtensa processor running at 200 MHz. In the future, the ability to do multiple loads per cycle would scale the performance linearly.
Hardware synthesis for this processor has been performed with the added instructions. It needs about 65K gates for the configured Xtensa core processor (excluding the memory) and an additional 6.5K gates for the added TIE instructions.
The preferred embodiments described above have been presented for purposes of explanation only, and the present invention should not be construed to be so limited. Variations on the present invention will become readily apparent to those skilled in the art after reading this description, and the present invention and appended claims are intended to encompass such variations as well.
This application is related to and claims priority to U.S. Provisional Patent Application No. 60/264,667 filed on Jan. 25, 2001, the contents of which are incorporated herein by reference.