The present application relates to the field of communications technologies, and in particular, to a data packet extraction method and apparatus.
In the field of communications technologies, data information is exchanged and transmitted between different network devices in basic units of data packets. When transmitting data information, a network device adds a packet header to the data information that needs to be transmitted, so as to encapsulate the data information into a data packet for transmission. When the data information that needs to be transmitted is being encapsulated, the added packet header carries quintuple information. The quintuple information includes a source Internet Protocol (IP) address, a destination IP address, a source port number, a destination port number, and a transport layer protocol number.
When a transmission status of data information in a network is being analyzed, sampling analysis is performed on data packets transmitted in the network. Generally, data packets in the network are sampled in basic sampling units of data streams. Quintuple information of multiple data packets that belong to a same data stream is the same, that is, source IP addresses are the same, destination IP addresses are the same, source port numbers are the same, destination port numbers are the same, and transport layer protocol numbers are the same.
A data packet collected in basic units of data streams may be used to analyze duration of a data stream in a network, a packet length of the data stream in the network, and information such as an IP address of the data stream in the network. However, if a data packet extracted based on a data stream is analyzed, a transmission status of only a part of data in a network can be obtained by means of analysis.
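A technical problem to be resolved by embodiments of the present application is to provide a data packet extraction method and apparatus, so as to resolve the technical problem that analyzing data packets extracted based on a data stream reveals a transmission status of only a part of the data in a network.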
A first aspect of the embodiments of the present application provides a data packet extraction method, where the method includes:
receiving a data packet;
parsing quintuple information of the data packet;
calculating a first hash value and a second hash value of the data packet according to the quintuple information by using a first hash function, where the first hash value is a hash value that is calculated by using the first hash function and by using the quintuple information arranged in a preset order as an input, and the second hash value is a hash value that is calculated by using the first hash function and by using, as an input, quintuple information obtained by interchanging, in the quintuple information arranged in the preset order, a source IP address and a destination IP address, and a source port number and a destination port number;
calculating a first remainder obtained by dividing the first hash value by a denominator of a preset session sampling ratio, and calculating a second remainder obtained by dividing the second hash value by the denominator of the preset session sampling ratio;
querying whether the first remainder or the second remainder is a preset sampling remainder, where a quantity of the preset sampling remainders is the same as a numerator value of the preset session sampling ratio; and
extracting the data packet when the first remainder or the second remainder is the preset sampling remainder.
In a first possible implementation manner of the first aspect of the embodiments of the present application, before the extracting the data packet, the method further includes:
extracting at least one preset feature field from the data packet, where the preset feature field is a character string of a preset offset length at a preset position in the data packet;
calculating a feature hash value of each preset feature field by using a second hash function and by using the preset feature field as an input;
querying whether the feature hash value of each preset feature field is the same as a preset hash value of the preset feature field; and
extracting the data packet when the feature hash value of each preset feature field is the same as the preset hash value of the preset feature field.
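The feature-field check described above may be sketched as follows. This is an illustrative, non-limiting sketch: the function name `feature_fields_match`, the rule layout `(offset, length, expected_hash)`, and the choice of CRC-32 as the second hash function are assumptions made for the example and are not specified in the present application.

```python
import zlib


def feature_fields_match(packet: bytes, rules) -> bool:
    """Return True when every preset feature field hashes to its preset value.

    `rules` is an iterable of (offset, length, expected_hash) tuples.
    Hypothetical layout: the application only requires a preset position,
    a preset offset length, and a preset hash value per feature field.
    """
    for offset, length, expected_hash in rules:
        field = packet[offset:offset + length]
        if len(field) < length:
            # The packet is too short to contain this feature field.
            return False
        # Assumption: CRC-32 stands in for the unspecified second hash function.
        if zlib.crc32(field) != expected_hash:
            return False
    return True
```

In this sketch a packet is a candidate for extraction only if all configured feature fields match, which mirrors the "each preset feature field" wording above.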
A second aspect of the embodiments of the present application provides a data packet extraction method, where the method includes:
receiving a data packet;
parsing quintuple information of the data packet;
determining whether another data packet belonging to a session to which the data packet belongs has been received; and
when another data packet belonging to the session to which the data packet belongs has not been received, determining that the session to which the data packet belongs is a newly received session, adding 1 to a session count value, and determining whether the session count value is equal to a preset threshold; and when the session count value is equal to the preset threshold, determining that the data packet belongs to a newly recognized to-be-sampled session, extracting the data packet, and updating a first mapping table by using the quintuple information of the data packet; or when another data packet belonging to the session to which the data packet belongs has been received, determining that the session to which the data packet belongs is a received session, and determining whether the quintuple information of the data packet matches the first mapping table; and extracting the data packet when the quintuple information of the data packet matches the first mapping table, where the first mapping table stores quintuple information of all to-be-sampled sessions that are recognized before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all to-be-sampled sessions that are recognized before the data packet is received.
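The count-based selection of to-be-sampled sessions may be sketched as follows, using the exact-quintuple form of the first mapping table. The class and method names are hypothetical. Resetting the session count value after it reaches the preset threshold is an assumption made here so that every threshold-th new session is sampled; the text above does not state how the count value is handled after a match.

```python
def swapped(quintuple):
    """Interchange source and destination address and port of a quintuple."""
    src_ip, src_port, dst_ip, dst_port, proto = quintuple
    return (dst_ip, dst_port, src_ip, src_port, proto)


class CountingSessionSampler:
    """Hypothetical sketch of the second-aspect method with an exact table."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.session_count = 0
        self.first_table = set()  # quintuples of recognized to-be-sampled sessions

    def should_extract(self, quintuple, is_new_session: bool) -> bool:
        if is_new_session:
            self.session_count += 1
            if self.session_count == self.threshold:
                # Newly recognized to-be-sampled session: record and extract.
                self.first_table.add(quintuple)
                self.session_count = 0  # assumption: restart counting
                return True
            return False
        # Received session: match in either direction of the session.
        return (quintuple in self.first_table
                or swapped(quintuple) in self.first_table)
```

How `is_new_session` is determined is left to the caller in this sketch, for example by one of the implementation manners described next.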
In a first possible implementation manner of the second aspect of the embodiments of the present application, the determining whether another data packet belonging to a session to which the data packet belongs has been received includes:
parsing a flag field carried in the data packet;
determining whether the flag field is an SYN flag field; and
when the flag field is the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has not been received; or when the flag field is not the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has been received.
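For TCP traffic, the SYN-based determination may be sketched as follows. The sketch assumes that the input is the raw TCP header, in which the flag bits occupy byte 13, SYN is bit 0x02, and ACK is bit 0x10. Treating a SYN-ACK reply as belonging to an already received session is an assumption of this example, because the reply is sent in response to the initial SYN.

```python
TCP_FLAGS_OFFSET = 13  # byte offset of the flag bits within the TCP header
TCP_FLAG_SYN = 0x02
TCP_FLAG_ACK = 0x10


def is_first_packet_of_session(tcp_header: bytes) -> bool:
    """Return True when the header carries SYN without ACK.

    Assumption: a pure SYN marks a session for which no other data packet
    has been received; SYN-ACK and all other flag combinations do not.
    """
    flags = tcp_header[TCP_FLAGS_OFFSET]
    return bool(flags & TCP_FLAG_SYN) and not (flags & TCP_FLAG_ACK)
```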
In a second possible implementation manner of the second aspect of the embodiments of the present application, the determining whether another data packet belonging to a session to which the data packet belongs has been received includes:
determining whether the quintuple information of the data packet matches a second mapping table, where the second mapping table stores quintuple information of all sessions that are received before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all sessions that are received before the data packet is received; and
when the quintuple information of the data packet does not match the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has not been received, and updating the second mapping table by using the quintuple information of the data packet; or when the quintuple information of the data packet matches the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has been received.
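The second mapping table in its exact-quintuple form may be sketched as follows. The class name `SeenSessions` is hypothetical, and checking both directions of the quintuple is an assumption that mirrors the matching described for the first mapping table, so that a reply packet is not counted as a new session.

```python
class SeenSessions:
    """Hypothetical sketch of the second mapping table (exact form)."""

    def __init__(self):
        self.second_table = set()

    @staticmethod
    def _swapped(quintuple):
        src_ip, src_port, dst_ip, dst_port, proto = quintuple
        return (dst_ip, dst_port, src_ip, src_port, proto)

    def is_new_session(self, quintuple) -> bool:
        """Return True and record the session if no packet of it was seen."""
        if (quintuple in self.second_table
                or self._swapped(quintuple) in self.second_table):
            return False
        # No match: update the second mapping table with this quintuple.
        self.second_table.add(quintuple)
        return True
```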
With reference to any one of the second aspect of the embodiments of the present application to the second possible implementation manner of the second aspect, in a third possible implementation manner, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received; and
the determining whether the quintuple information of the data packet matches the first mapping table includes:
using, as a first hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, where the preset hash function group is a hash function group used when the first mapping table is generated, and includes multiple preset hash functions;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the first hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, using, as a second hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, quintuple information obtained by interchanging, in the quintuple information that is of the data packet and arranged in the preset order, a position of a source IP address and a position of a destination IP address, and a position of a source port number and a position of a destination port number;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the second hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet does not match the first mapping table.
With reference to any one of the second aspect of the embodiments of the present application to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received; and
the updating a first mapping table by using the quintuple information of the data packet includes:
using, as a third hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in the preset order; and
setting values at positions in the first mapping table that are corresponding to all hash values in the third hash value group to 1.
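The Bloom Filter form of the first mapping table, covering both the match and the update described above, may be sketched as follows. The class name, the table size, the number of hash functions, and the use of SHA-256 with a one-byte prefix to derive the preset hash function group are all assumptions made for this example; the present application does not name the hash functions.

```python
import hashlib


class SessionBloomFilter:
    """Hypothetical Bloom Filter keyed by quintuple, matched in both directions."""

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes  # must not exceed 255 in this sketch
        self.table = bytearray(size)  # one byte per position, for clarity

    def _positions(self, quintuple):
        # Quintuple arranged in a preset order and serialized as a string.
        data = "|".join(str(part) for part in quintuple).encode()
        for i in range(self.num_hashes):
            # Assumption: prefixing a counter yields the i-th hash function.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    @staticmethod
    def _swapped(quintuple):
        src_ip, src_port, dst_ip, dst_port, proto = quintuple
        return (dst_ip, dst_port, src_ip, src_port, proto)

    def update(self, quintuple) -> None:
        """Set to 1 every position given by the third hash value group."""
        for pos in self._positions(quintuple):
            self.table[pos] = 1

    def matches(self, quintuple) -> bool:
        """Check the first hash value group, then the swapped (second) group."""
        if all(self.table[pos] for pos in self._positions(quintuple)):
            return True
        return all(self.table[pos]
                   for pos in self._positions(self._swapped(quintuple)))
```

As with any Bloom Filter, a match may occasionally be reported for a session that was never added, whereas a session that was added always matches in either direction.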
A third aspect of the embodiments of the present application provides a data packet extraction apparatus, where the apparatus includes a receiving unit and a processing unit connected to the receiving unit, where
the receiving unit is configured to receive a data packet and send the data packet to the processing unit; and
the processing unit is configured to:
In a first possible implementation manner of the third aspect of the embodiments of the present application, before extracting the data packet, the processing unit is further configured to:
extract at least one preset feature field from the data packet, where the preset feature field is a character string of a preset offset length at a preset position in the data packet;
calculate a feature hash value of each preset feature field by using a second hash function and by using the preset feature field as an input;
query whether the feature hash value of each preset feature field is the same as a preset hash value of the preset feature field; and
extract the data packet when the feature hash value of each preset feature field is the same as the preset hash value of the preset feature field.
A fourth aspect of the embodiments of the present application provides a data packet extraction apparatus, where the apparatus includes a receiving unit and a processing unit connected to the receiving unit, where
the receiving unit is configured to receive a data packet and send the data packet to the processing unit; and
the processing unit is configured to:
In a first possible implementation manner of the fourth aspect of the embodiments of the present application, that the processing unit is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received includes:
parsing a flag field carried in the data packet;
determining whether the flag field is an SYN flag field; and
when the flag field is the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has not been received; or when the flag field is not the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has been received.
In a second possible implementation manner of the fourth aspect of the embodiments of the present application, that the processing unit is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received includes:
determining whether the quintuple information of the data packet matches a second mapping table, where the second mapping table stores quintuple information of all sessions that are received before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all sessions that are received before the data packet is received; and
when the quintuple information of the data packet does not match the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has not been received, and updating the second mapping table by using the quintuple information of the data packet; or when the quintuple information of the data packet matches the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has been received.
With reference to any one of the fourth aspect of the embodiments of the present application to the second possible implementation manner of the fourth aspect, in a third possible implementation manner, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received; and
that the processing unit is configured to determine whether the quintuple information of the data packet matches the first mapping table includes:
using, as a first hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, where the preset hash function group is a hash function group used when the first mapping table is generated, and includes multiple preset hash functions;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the first hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, using, as a second hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, quintuple information obtained by interchanging, in the quintuple information that is of the data packet and arranged in the preset order, a position of a source IP address and a position of a destination IP address, and a position of a source port number and a position of a destination port number;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the second hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet does not match the first mapping table.
With reference to any one of the fourth aspect of the embodiments of the present application to the third possible implementation manner of the fourth aspect, in a fourth possible implementation manner, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received; and
that the processing unit is configured to update a first mapping table by using the quintuple information of the data packet includes:
using, as a third hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in the preset order; and
setting values at positions in the first mapping table that are corresponding to all hash values in the third hash value group to 1.
It can be learned from the foregoing technical solutions that the embodiments of the present application have the following beneficial effects:
According to the data packet extraction method and apparatus provided in the embodiments of the present application, a session may be established between a first network device and a second network device, so that multiple data packets are transmitted between the first network device and the second network device. Quintuple information of multiple data packets of a same session has the following characteristics: Source IP addresses in the multiple data packets of the same session are an IP address of the first network device or an IP address of the second network device, destination IP addresses in the multiple data packets of the same session are the IP address of the first network device or the IP address of the second network device, source port numbers in the multiple data packets of the same session are a port number of the first network device or a port number of the second network device, destination port numbers in the multiple data packets of the same session are the port number of the first network device or the port number of the second network device, and transport layer protocol numbers used for the multiple data packets of the same session are the same.
Therefore, two hash values calculated based on quintuple information of different data packets of a same session are the same, that is, two calculated remainders are also the same at a same sampling ratio. When one remainder of the two calculated remainders is a preset sampling remainder, all the data packets in a network that belong to the session are extracted, so as to implement data packet extraction based on a session.
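The reasoning above can be written out as follows. The symbols are introduced here only for illustration: q denotes the quintuple of a packet sent from the first network device to the second network device, q-bar denotes the quintuple of a packet sent in the opposite direction, h denotes the first hash function, H_1 and H_2 denote the first and second hash values, and N denotes the denominator of the preset session sampling ratio.

```latex
\begin{aligned}
q &= (a,\ p_a,\ b,\ p_b,\ t), & \bar{q} &= (b,\ p_b,\ a,\ p_a,\ t),\\
H_1(q) &= h(q), & H_2(q) &= h(\bar{q}),\\
H_1(\bar{q}) &= h(\bar{q}) = H_2(q), & H_2(\bar{q}) &= h(q) = H_1(q),\\
\{H_1(q) \bmod N,\ H_2(q) \bmod N\} &= \{H_1(\bar{q}) \bmod N,\ H_2(\bar{q}) \bmod N\}.
\end{aligned}
```

The two hash values therefore exchange roles between the two directions of a session, and the pair of remainders, taken as a set, is identical for every data packet of the session.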
A first mapping table stores quintuple information of all to-be-sampled sessions that are recognized before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all to-be-sampled sessions that are recognized before the data packet is received. Therefore, when the quintuple information of the different data packets of the same session matches the first mapping table, either all the data packets of the same session can match the first mapping table, or none of the data packets of the same session can match the first mapping table, so as to implement data packet extraction based on a session.
Embodiments of the present application provide a data packet extraction method and apparatus. To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the following clearly describes the technical solutions of the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
Data information is transmitted in a network in basic units of data packets. A network device that sends a data packet is a source device, and a device that receives the data packet is a destination device. A packet header of each data packet carries quintuple information. The quintuple information includes a source IP address and a source port number of a source device, a destination IP address and a destination port number of a destination device, and a transport layer protocol number used for transmitting the data packet between the source device and the destination device.
A session refers to communication interaction between two network devices within a particular continuous operation time. During a session, all data packets that are mutually transmitted between two network devices belong to the session. In quintuple information carried in a data packet sent by a first network device to a second network device, a source IP address is an IP address of the first network device, a source port number is a port number of the first network device, a destination IP address is an IP address of the second network device, and a destination port number is a port number of the second network device. In quintuple information carried in a data packet sent by the second network device to the first network device, a source IP address is the IP address of the second network device, a source port number is the port number of the second network device, a destination IP address is the IP address of the first network device, and a destination port number is the port number of the first network device. Transport layer protocol numbers used for mutually sending the data packets between the two network devices are the same.
Quintuple information of multiple data packets of a same session has the following characteristics: Source IP addresses in the multiple data packets of the same session are the IP address of the first network device or the IP address of the second network device, destination IP addresses in the multiple data packets of the same session are the IP address of the first network device or the IP address of the second network device, source port numbers in the multiple data packets of the same session are the port number of the first network device or the port number of the second network device, destination port numbers in the multiple data packets of the same session are the port number of the first network device or the port number of the second network device, and transport layer protocol numbers used for the multiple data packets of the same session are the same.
That is, the quintuple information of the data packet sent from the first network device to the second network device is (the IP address of the first network device, the port number of the first network device, the IP address of the second network device, the port number of the second network device, and the transport layer protocol number), that is, the source IP address in the data packet sent from the first network device to the second network device is the IP address of the first network device, the source port number in the data packet sent from the first network device to the second network device is the port number of the first network device, the destination IP address in the data packet sent from the first network device to the second network device is the IP address of the second network device, the destination port number in the data packet sent from the first network device to the second network device is the port number of the second network device, and the transport layer protocol number in the data packet sent from the first network device to the second network device is a number of the transport layer protocol used for transmitting these data packets between the first network device and the second network device. 
The quintuple information of the data packet sent from the second network device to the first network device is (the IP address of the second network device, the port number of the second network device, the IP address of the first network device, the port number of the first network device, and the transport layer protocol number), that is, the source IP address in the data packet sent from the second network device to the first network device is the IP address of the second network device, the source port number in the data packet sent from the second network device to the first network device is the port number of the second network device, the destination IP address in the data packet sent from the second network device to the first network device is the IP address of the first network device, the destination port number in the data packet sent from the second network device to the first network device is the port number of the first network device, and the transport layer protocol number in the data packet sent from the second network device to the first network device is a number of the transport layer protocol used for transmitting these data packets between the first network device and the second network device. The transport layer protocol number carried in the data packet sent from the first network device to the second network device is the same as that carried in the data packet sent from the second network device to the first network device.
Step 101: Receive a data packet.
Step 102: Parse quintuple information of the data packet.
Step 103: Calculate a first hash value of the data packet and a second hash value of the data packet according to the quintuple information by using a first hash function, where the first hash value is a hash value that is calculated by using the first hash function and by using the quintuple information arranged in a preset order as an input, and the second hash value is a hash value that is calculated by using the first hash function and by using, as an input, quintuple information obtained by interchanging, in the quintuple information arranged in the preset order, a source IP address and a destination IP address, and a source port number and a destination port number.
A network processor (NP) in a network system successively receives a large quantity of data packets transmitted in a network. Each time the network processor receives a data packet, the NP duplicates the data packet, parses quintuple information of the duplicated data packet, and forwards the original data packet according to a transmission path. Persons skilled in the art may understand that, according to the data packet extraction method provided in the present application, the duplicated data packet rather than the original data packet transmitted in the network is extracted. If the original data packet transmitted in the network is extracted, a destination device cannot receive the original data packet, which causes a service error or a service interruption.
A hash function is a function for compressing, by using a hash algorithm, an arbitrary-length input into a fixed-length hash value for output. The hash function is a compression mapping, that is, the space of hash values is generally much smaller than the space of inputs. In specific implementation, the first hash function in this embodiment of the present application may be a 16-bit cyclic redundancy check (CRC-16) hash function. Certainly, the first hash function may be a hash function of another type, which is specifically set according to an actual requirement and is not limited herein.
After the quintuple information of the data packet is parsed, the first hash value and the second hash value of the data packet are calculated by using the first hash function. The first hash value is the hash value that is calculated by using the first hash function and by using the quintuple information arranged in the preset order as the input. The second hash value is the hash value that is calculated by using the first hash function and by using, as the input, the quintuple information obtained by interchanging, in the quintuple information arranged in the preset order, the source IP address and the destination IP address, and the source port number and the destination port number.
For example, the quintuple information of the data packet is arranged in the order listed in Table 1 to obtain a character string, and the hash value calculated by using the first hash function with this character string as an input is used as the first hash value. The character string arranged in the order listed in Table 2 is obtained by interchanging, in the character string arranged in the order listed in Table 1, the source IP address and the destination IP address, and the source port number and the destination port number. The hash value calculated by using the first hash function with the character string arranged in the order listed in Table 2 as another input is used as the second hash value.
It should be noted that, when the quintuple information of the data packet is arranged in the preset order and then used as an input for calculating the first hash value and the second hash value, the arrangement order is not limited to the arrangement orders listed in Table 1 and Table 2, provided that the character string used as the input for calculating the second hash value is obtained by interchanging, in the character string used as the input for calculating the first hash value, the position of the source IP address and the position of the destination IP address, and the position of the source port number and the position of the destination port number.
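The calculation of the first hash value and the second hash value may be sketched as follows for IPv4 quintuples. The function names and the byte layout are hypothetical. Python's `binascii.crc_hqx` computes the CRC-CCITT variant of a 16-bit CRC; choosing this particular variant, and the initial value 0, are assumptions of the example.

```python
import binascii
import ipaddress
import struct


def pack_quintuple(src_ip, src_port, dst_ip, dst_port, proto) -> bytes:
    """Arrange the quintuple in one possible preset order as 13 bytes."""
    return struct.pack(
        "!4sH4sHB",
        ipaddress.IPv4Address(src_ip).packed, src_port,
        ipaddress.IPv4Address(dst_ip).packed, dst_port,
        proto,
    )


def session_hashes(src_ip, src_port, dst_ip, dst_port, proto):
    """Return (first_hash, second_hash) as 16-bit integers."""
    first_input = pack_quintuple(src_ip, src_port, dst_ip, dst_port, proto)
    # Second input: source and destination address and port interchanged.
    second_input = pack_quintuple(dst_ip, dst_port, src_ip, src_port, proto)
    first_hash = binascii.crc_hqx(first_input, 0)
    second_hash = binascii.crc_hqx(second_input, 0)
    return first_hash, second_hash
```

Calling `session_hashes` on a packet and on a reply packet of the same session yields the same two values in exchanged positions, which is the property the subsequent remainder check relies on.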
For different regions, distribution of the quintuple information of the data packet is quite uneven. To further optimize evenness of extracted data packets, several-bit data may be separately selected from the quintuple information of the data packet, and the several-bit data is arranged in a preset order and then used as an input of the hash function. For example, for different regions, low 8-bit data in the source IP address is evenly distributed, and low 14-bit data in the source port number is evenly distributed. Low 8 bits of the source IP address and those of the destination IP address, low 14 bits of the source port number and those of the destination port number, and all bits of the transport layer protocol number may be selected and arranged in a preset order, to obtain a character string as an input of the first hash function. Certainly, a position and a bit quantity of a character string selected for each of the source IP address, the destination IP address, the source port number, the destination port number, and the transport layer protocol number may be separately set according to an actual requirement. However, it is required to ensure that a bit quantity and a position selected for the source IP address are the same as those selected for the destination IP address, and a bit quantity and a position selected for the source port number are the same as those selected for the destination port number.
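The bit selection in the example above (low 8 bits of each IP address, low 14 bits of each port number, and all 8 bits of the transport layer protocol number) may be sketched as follows. The function name `compact_key`, the field order, and the 7-byte big-endian output are assumptions made for illustration.

```python
import ipaddress


def compact_key(src_ip, src_port, dst_ip, dst_port, proto) -> bytes:
    """Build a 52-bit hash input from selected bits of an IPv4 quintuple."""
    src_ip_low = int(ipaddress.IPv4Address(src_ip)) & 0xFF    # low 8 bits
    dst_ip_low = int(ipaddress.IPv4Address(dst_ip)) & 0xFF    # low 8 bits
    src_port_low = src_port & 0x3FFF                          # low 14 bits
    dst_port_low = dst_port & 0x3FFF                          # low 14 bits

    # Same bit quantity and position for source and destination fields,
    # so that interchanging them yields the input for the second hash value.
    value = src_ip_low
    value = (value << 14) | src_port_low
    value = (value << 8) | dst_ip_low
    value = (value << 14) | dst_port_low
    value = (value << 8) | (proto & 0xFF)
    return value.to_bytes(7, "big")  # 52 bits fit in 7 bytes
```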
Step 104: Calculate a first remainder obtained by dividing the first hash value by a denominator of a preset session sampling ratio, and calculate a second remainder obtained by dividing the second hash value by the denominator of the preset session sampling ratio.
Step 105: Query whether the first remainder or the second remainder is a preset sampling remainder, where a quantity of the preset sampling remainders is the same as a numerator value of the preset session sampling ratio.
Step 106: Extract the data packet when the first remainder or the second remainder is the preset sampling remainder.
In this embodiment of the present application, the data packet is extracted in basic units of sessions. The preset session sampling ratio is the proportion of sessions whose data packets are extracted to the large quantity of sessions whose data packets are transmitted in a network. The first remainder is calculated by dividing the first hash value by the denominator of the preset session sampling ratio, and the second remainder is calculated by dividing the second hash value by the denominator of the preset session sampling ratio. The first remainder and the second remainder are integers that are greater than or equal to 0 and less than or equal to the integer obtained by subtracting 1 from the denominator of the preset session sampling ratio.
For example, it is assumed that the preset session sampling ratio is M/N. When the data packets transmitted in the network belong to t×N sessions, all data packets of t×M of those sessions are extracted, where t is an integer greater than 0. The first remainder and the second remainder are each an integer greater than or equal to 0 and less than or equal to N−1. M integers are selected as the preset sampling remainders from the integers greater than or equal to 0 and less than or equal to N−1.
Whether the first remainder or the second remainder belongs to the preset sampling remainders is queried. The data packet is extracted when the first remainder or the second remainder belongs to the preset sampling remainders. The data packet is not extracted when neither the first remainder nor the second remainder belongs to the preset sampling remainders. The method then returns to step 101 to receive a next data packet, and step 102 to step 105 are performed again.
Each data packet transmitted in the network is received, the foregoing operations are performed on each data packet, and the data packet is extracted, in basic units of sessions, from a large quantity of data packets transmitted in the network, so as to implement data packet sampling based on a session.
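Step 103 to step 106 can be sketched as follows. This is a minimal sketch under stated assumptions: SHA-256 truncated to 64 bits stands in for the first hash function, the `|`-separated string stands in for the character string arranged in the preset order, and the names `hash_value`, `make_key`, and `should_extract` are hypothetical.

```python
import hashlib


def hash_value(key: bytes) -> int:
    # Stand-in for the first hash function (assumption).
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def make_key(src_ip, dst_ip, src_port, dst_port, proto) -> bytes:
    # Assumed preset order of the quintuple information.
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()


def should_extract(quintuple, denominator, sampling_remainders) -> bool:
    src_ip, dst_ip, src_port, dst_port, proto = quintuple
    # First hash value: quintuple information in the preset order.
    first = hash_value(make_key(src_ip, dst_ip, src_port, dst_port, proto))
    # Second hash value: source and destination fields interchanged.
    second = hash_value(make_key(dst_ip, src_ip, dst_port, src_port, proto))
    first_remainder = first % denominator
    second_remainder = second % denominator
    return (first_remainder in sampling_remainders
            or second_remainder in sampling_remainders)
```

For a sampling ratio M/N, `denominator` is N and `sampling_remainders` is a set of M integers chosen from 0 to N−1. Because the two keys are simply swapped for a data packet travelling in the opposite direction, both directions of a session yield the same pair of remainders and therefore the same decision.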
It may be understood that when a first hash function is selected and the preset session sampling ratio is determined, each integer in the preset sampling remainders represents the quintuple information of all data packets in one type of to-be-sampled session. It is assumed that any integer in the preset sampling remainders is X. A first hash value and a second hash value are calculated by using the first hash function and based on quintuple information of any data packet in the type of to-be-sampled session represented by X, a remainder obtained by dividing the first hash value by the denominator of the preset session sampling ratio is used as a first remainder, and a remainder obtained by dividing the second hash value by the denominator of the preset session sampling ratio is used as a second remainder. One of the first remainder and the second remainder is X.
Multiple data packets sent from a first network device to a second network device and multiple data packets sent from the second network device to the first network device are put into one group. Multiple data packets in each group belong to a same session. Each session refers to communication between the two network devices. Therefore, for different data packets in a same session, the two hash values calculated based on quintuple information by using the first hash function are the same, and the two remainders obtained by dividing the two hash values by the denominator of the preset session sampling ratio are also the same. If a data packet that belongs to a session is extracted, at least one of the two remainders calculated based on that data packet belongs to the preset sampling remainders. The two remainders calculated based on quintuple information of any other data packet in the session are the same as the two remainders calculated based on quintuple information of the extracted data packet. That is, at least one of the two remainders calculated based on the quintuple information of the other data packet in the session also belongs to the preset sampling remainders. In this case, it is ensured that the other received data packets in the session are also extracted, so as to implement data packet extraction in basic units of sessions.
For example, if a session C between a network device A and a network device B is established, in a data packet that is in the session C and sent from the network device A to the network device B, a source IP address is an IP address of the network device A, a destination IP address is an IP address of the network device B, a source port number is a port number of the network device A, and a destination port number is a port number of the network device B. In a data packet that is in the session C and sent from the network device B to the network device A, a source IP address is the IP address of the network device B, a destination IP address is the IP address of the network device A, a source port number is the port number of the network device B, and a destination port number is the port number of the network device A.
As listed in Table 3, quintuple information of the data packet that is in the session C and that is sent from the network device A to the network device B is arranged in a preset order, and a first hash value that is calculated by using the first hash function and by using, as an input, a character string shown in Table 3 is D. As listed in Table 4, in the quintuple information arranged in the preset order, the source IP address and the destination IP address are interchanged and the source port number and the destination port number are interchanged, and a second hash value that is calculated by using the first hash function and by using a character string constituted in Table 4 as an input is E.
As listed in Table 5, quintuple information of the data packet that is in the session C and that is sent from the network device B to the network device A is arranged in a preset order. A character string constituted in Table 5 is used as an input, and the character string listed in Table 5 is the same as the character string listed in Table 4; therefore, a first hash value calculated by using the first hash function is E. As listed in Table 6, in the quintuple information arranged in the preset order, the source IP address and the destination IP address are interchanged and the source port number and the destination port number are interchanged. A character string constituted in Table 6 is used as an input, and the character string listed in Table 6 is the same as the character string listed in Table 3; therefore, a second hash value calculated by using the first hash function is D.
In this case, the hash values that are calculated based on the quintuple information of all the data packets in the session C are D and E. Two remainders that are respectively calculated by dividing the two hash values D and E by the denominator of the preset session sampling ratio are F and G. When either of F and G belongs to the preset sampling remainders, all the data packets that belong to the session C are extracted.
In another embodiment, before the extracting the data packet, the data packet extraction method described in this embodiment of the present application further includes:
extracting at least one preset feature field from the data packet, where the preset feature field is a character string of a preset offset length at a preset position in the data packet;
calculating a feature hash value of each preset feature field by using the second hash function and by using the preset feature field as an input;
querying whether the feature hash value of each preset feature field is the same as a preset hash value of the preset feature field; and
extracting the data packet when the feature hash value of each preset feature field is the same as the preset hash value of the preset feature field.
The preset feature field is a character string that is of the preset offset length and that is extracted at the preset position in the data packet. The second hash function to be used is set in advance, and the preset hash value of each preset feature field, calculated by using the second hash function, is also set in advance. A position and an offset length of each preset feature field may be specifically set according to an actual requirement. After the data packet is received, each preset feature field is extracted. A hash value of each extracted preset feature field is calculated by using the second hash function and by using the preset feature field as an input. The data packet is extracted when the hash value of each preset feature field is equal to the preset hash value of the preset feature field.
For example, as shown in
In actual application, the preset feature field may be specifically set according to an actual case. For example, the preset feature field may be set according to a sample of a data packet received when a session attack occurs, so as to effectively recognize the session attack. Optionally, a source IP address and a destination IP address may be selected as preset feature fields to extract a data packet of a session between two particular network devices.
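The feature field check described above can be sketched as follows. The sketch assumes that CRC-32 stands in for the second hash function and that each preset feature field is described by a hypothetical `(position, length, preset_hash)` tuple; neither choice comes from the present application.

```python
import zlib


def feature_hash(field: bytes) -> int:
    # Stand-in for the second hash function (assumption).
    return zlib.crc32(field)


def matches_features(packet: bytes, rules) -> bool:
    """Return True only if every preset feature field hashes to its preset value.

    rules: iterable of (position, length, preset_hash) tuples.
    """
    for position, length, preset_hash in rules:
        field = packet[position:position + length]
        # A packet too short to contain the field cannot match.
        if len(field) < length:
            return False
        if feature_hash(field) != preset_hash:
            return False
    return True
```

A rule such as `(2, 3, zlib.crc32(b"GET"))` would then select data packets whose bytes 2 to 4 are `GET`. Storing only the preset hash value, rather than the field itself, follows the comparison described in this embodiment.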
The data packet extraction method provided in this embodiment of the present application may further be implemented in another manner: receiving a data packet; parsing quintuple information of the data packet; calculating a fourth hash value of the data packet by using a first hash function and by using, as an input, quintuple information that is of the data packet and arranged in descending order; calculating a third remainder obtained by dividing the fourth hash value of the data packet by a denominator of a preset session sampling ratio; querying whether the third remainder is a preset sampling remainder; and extracting the data packet when the third remainder is the preset sampling remainder.
When the foregoing implementation manner is used, each time a data packet is received, a hash value needs to be calculated only once by using, as an input, quintuple information that is of the data packet and arranged in descending order. Input character strings that are obtained by arranging quintuple information of different data packets in a same session in descending order are the same, fourth hash values calculated by using the first hash function are the same, and third remainders obtained by dividing the fourth hash values by the denominator of the preset session sampling ratio also are the same. Therefore, all the data packets in the same session can be extracted. Certainly, in specific implementation, the quintuple information may be arranged in ascending order, and an implementation manner is similar.
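The single-hash variant can be sketched as follows. It is assumed here that "arranged in descending order" means sorting the five fields by numeric value, that IPv4 addresses are used, and that SHA-256 truncated to 64 bits stands in for the first hash function; the names `canonical_key` and `should_extract_sorted` are hypothetical.

```python
import hashlib
import ipaddress


def canonical_key(src_ip, dst_ip, src_port, dst_port, proto) -> bytes:
    values = [
        int(ipaddress.IPv4Address(src_ip)),
        int(ipaddress.IPv4Address(dst_ip)),
        src_port,
        dst_port,
        proto,
    ]
    # Descending order removes the dependence on packet direction.
    values.sort(reverse=True)
    return b"".join(v.to_bytes(4, "big") for v in values)


def should_extract_sorted(quintuple, denominator, sampling_remainders) -> bool:
    digest = hashlib.sha256(canonical_key(*quintuple)).digest()
    fourth_hash = int.from_bytes(digest[:8], "big")
    third_remainder = fourth_hash % denominator
    return third_remainder in sampling_remainders
```

Because both directions of a session produce the same sorted input, only one hash value and one remainder are computed per data packet.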
It can be learned from the foregoing content that the present application further has the following beneficial effects:
At least one preset feature field is extracted from the data packet, and a data packet in which a hash value of each preset feature field is the same as a preset hash value of the preset feature field is extracted, so as to intentionally extract a data packet in a session of interest, pertinently recognize a session attack in a network, analyze a particular session in a network, or the like.
Step 301: Receive a data packet.
Step 302: Parse quintuple information of the data packet.
A network processor (NP) in a network system successively receives a large quantity of data packets transmitted in a network. Each time the network processor receives a data packet, the NP duplicates the data packet, parses quintuple information of the duplicated data packet, and forwards the original data packet according to a transmission path. Persons skilled in the art may understand that, according to the data packet extraction method provided in the present application, the duplicated data packet rather than the original data packet transmitted in the network is extracted. If the original data packet transmitted in the network is extracted, a destination device cannot receive the original data packet, which causes a service error or a service interruption.
Data information is transmitted in a network in basic units of data packets. A network device that sends a data packet is a source device, and a device that receives the data packet is a destination device. A packet header of each data packet carries quintuple information. The quintuple information includes a source IP address and a source port number of a source device, a destination IP address and a destination port number of a destination device, and a transport layer protocol number used for transmitting the data packet between the source device and the destination device.
Step 303: Determine whether another data packet belonging to a session to which the data packet belongs has been received; if another data packet belonging to the session to which the data packet belongs has not been received, perform step 304; or if another data packet belonging to the session to which the data packet belongs has been received, perform step 306.
In this embodiment of the present application, when another data packet belonging to the session to which the data packet belongs has been received, the session to which the data packet belongs is a received session. When another data packet belonging to the session to which the data packet belongs has not been received, the data packet is the first received data packet in the session, and the session to which the data packet belongs is a newly received session.
It should be noted herein that a newly received session is a relative concept. For a currently received data packet, when another data packet belonging to a session to which the data packet belongs has not been received, the session to which the data packet belongs is a newly received session. For a next received data packet, because a received data packet exists in the newly received session, the newly received session is a received session relative to the next received data packet.
Step 303 has at least two possible implementation manners:
In a first possible implementation manner, the determining whether another data packet belonging to a session to which the data packet belongs has been received, includes:
parsing a flag field carried in the data packet;
determining whether the flag field is an SYN flag field; and
when the flag field is the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has not been received; or when the flag field is not the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has been received.
A data packet carrying an SYN flag field is a handshake data packet sent when two network devices establish a TCP session, that is, the first data packet sent when the TCP session is established. When the data packet carries the SYN flag field, another data packet belonging to the session to which the data packet belongs has not been received, and the session is a newly received session. When a flag field carried in the data packet is not an SYN flag field, at least one data packet belonging to the session to which the data packet belongs has been received and the at least one received data packet carries the SYN flag field, and the session is a received session.
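The first implementation manner can be sketched as follows. The flag masks and the flags byte at offset 13 follow the standard TCP header layout. Treating a segment that carries both SYN and ACK as belonging to an already received session is an assumption of this sketch, as is the function name `is_first_packet_of_session`.

```python
TCP_FLAG_SYN = 0x02
TCP_FLAG_ACK = 0x10


def is_first_packet_of_session(tcp_header: bytes) -> bool:
    """Return True when the TCP header carries SYN without ACK."""
    flags = tcp_header[13]  # control bits of the TCP header
    is_syn = bool(flags & TCP_FLAG_SYN)
    is_ack = bool(flags & TCP_FLAG_ACK)
    # Assumption: a SYN+ACK reply belongs to a session whose initial
    # SYN has already been received, so it is not counted as new.
    return is_syn and not is_ack
```

This check needs no stored state, but it applies only to TCP sessions.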
In a second possible implementation manner, the determining whether another data packet belonging to a session to which the data packet belongs has been received, includes:
determining whether the quintuple information of the data packet matches a second mapping table; and
when the quintuple information of the data packet does not match the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has not been received, and updating the second mapping table by using the quintuple information of the data packet; or when the quintuple information of the data packet matches the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has been received.
The second mapping table stores quintuple information of all sessions that are received before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all sessions that are received before the data packet is received.
It may be understood that the second mapping table is updated continuously as data packets are received. When the first data packet is received, there is no received session, and no information is stored in the second mapping table. As more data packets, and therefore more sessions, are received, the second mapping table stores more pieces of quintuple information of received sessions or more Bloom Filter mapping elements.
When the second mapping table stores the quintuple information of all the sessions that are received before the data packet is received, the second mapping table stores quintuple information of the first received data packet of each received session. The second mapping table is traversed to query whether the quintuple information of the data packet is the same as a piece of quintuple information stored in the second mapping table. If the quintuple information of the data packet is the same as a piece of quintuple information stored in the second mapping table, the quintuple information of the data packet matches the second mapping table. If it is not, the source IP address and the destination IP address in the quintuple information of the data packet are interchanged, and the source port number and the destination port number are interchanged, to obtain quintuple information of a data packet that belongs to the same session as the data packet. Whether the interchanged quintuple information is the same as a piece of quintuple information stored in the second mapping table is then queried. If the interchanged quintuple information is the same as a piece of quintuple information stored in the second mapping table, the quintuple information of the data packet matches the second mapping table. If it is not, the quintuple information of the data packet does not match the second mapping table, and the session to which the data packet belongs is a newly received session.
The second mapping table stores only quintuple information of the first received data packets of all the received sessions. When another data packet of the received session is further received, a source IP address in the data packet is the same as a source IP address in the first received data packet of the received session, a destination IP address in the data packet is the same as a destination IP address in the first received data packet of the received session, a source port number in the data packet is the same as a source port number in the first received data packet of the received session, and a destination port number in the data packet is the same as a destination port number in the first received data packet of the received session; or a source IP address in the data packet is the same as a destination IP address in the first received data packet of the received session, a destination IP address in the data packet is the same as a source IP address in the first received data packet of the received session, a source port number in the data packet is the same as a destination port number in the first received data packet of the received session, and a destination port number in the data packet is the same as a source port number in the first received data packet of the received session.
Therefore, when it is determined whether the quintuple information of the data packet matches the second mapping table, two pieces of quintuple information are checked: the quintuple information of the data packet, and the quintuple information obtained by interchanging the source IP address with the destination IP address and the source port number with the destination port number in the quintuple information of the data packet. If either of the two pieces of quintuple information is the same as a piece of quintuple information stored in the second mapping table, the quintuple information of the data packet matches the second mapping table, and the data packet belongs to a received session. If neither of the two pieces of quintuple information is the same as quintuple information stored in the second mapping table, the quintuple information of the data packet does not match the second mapping table, and the data packet belongs to a newly received session.
When the quintuple information of the data packet matches the second mapping table, another data packet belonging to the session to which the data packet belongs has been received, and the data packet belongs to a received session. When the quintuple information of the data packet does not match the second mapping table, another data packet belonging to the session to which the data packet belongs has not been received, the data packet belongs to a newly received session, and the quintuple information of the data packet is stored in the second mapping table to update the second mapping table.
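The match-and-update logic for a second mapping table that stores quintuple information directly can be sketched as follows. A Python `set` is used in place of traversing a table, which is a substitution made for brevity; the function name `seen_before` is hypothetical.

```python
def seen_before(second_mapping_table: set, quintuple) -> bool:
    """Return True if the packet belongs to a received session.

    When the session is new, the quintuple information of this first
    data packet is stored to update the table.
    """
    src_ip, dst_ip, src_port, dst_port, proto = quintuple
    # Quintuple information of a packet of the same session travelling
    # in the opposite direction.
    interchanged = (dst_ip, src_ip, dst_port, src_port, proto)
    if quintuple in second_mapping_table or interchanged in second_mapping_table:
        return True
    second_mapping_table.add(quintuple)
    return False
```

Only the quintuple information of the first received data packet of each session is stored, so each session occupies one entry regardless of direction.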
When the second mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the sessions that are received before the data packet is received, the second mapping table is a Bloom Filter table. Multiple hash values are calculated by using multiple preset hash functions and by using the quintuple information of the first received data packet of each received session as an input, and values at positions in the Bloom Filter table that are corresponding to all the hash values are set to 1, to obtain the second mapping table.
A Bloom Filter table is a space-efficient probabilistic data structure, and concisely indicates a set by using a bit array. In an initial state, a Bloom Filter is a bit array including m bits. As shown in
To express a set of n elements S = {x1, x2, ..., xn}, the Bloom Filter uses k mutually independent hash functions to respectively map each element in the set to the m-bit bit array {1, ..., m} in the Bloom Filter table. For any element x therein, a bit at the position at which a hash value hj(x) that is calculated by using x as an input and by using the jth hash function is mapped to the Bloom Filter table is set to 1 (1 ≤ j ≤ k). It should be noted herein that if a value at a position in the Bloom Filter table is set to 1 multiple times, only the first setting has an effect, and the subsequent settings have no effect.
For example, if the Bloom Filter uses three mutually independent hash functions, that is, k=3, when the elements x1 and x2 in S are mapped to the Bloom Filter table, values at positions at which h1(x1), h2(x1), and h3(x1) are mapped to the Bloom Filter table are set to 1, and values at positions at which h1(x2), h2(x2), and h3(x2) are mapped to the Bloom Filter table are set to 1, as shown in
It should be noted herein that a quantity and type of hash functions used by the Bloom Filter may be set according to an actual requirement, which is not specifically limited herein.
When it is determined whether the quintuple information of the data packet matches the second mapping table, k hash values are respectively calculated by using k mutually independent hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, and whether the values at the positions in the second mapping table that are corresponding to the k hash values are all set to 1 is queried. If all the positions in the second mapping table that are corresponding to the k hash values are set to 1, the quintuple information of the data packet matches the second mapping table. If not all the positions are set to 1, the source IP address and the destination IP address are interchanged and the source port number and the destination port number are interchanged in the quintuple information that is of the data packet and arranged in the preset order, another k hash values are respectively calculated by using the k mutually independent hash functions and by using the interchanged quintuple information as an input, and whether the values at the positions in the second mapping table that are corresponding to these k hash values are all set to 1 is queried. If all the positions in the second mapping table that are corresponding to these k hash values are set to 1, the quintuple information of the data packet matches the second mapping table. If not all the positions are set to 1, the quintuple information of the data packet does not match the second mapping table.
The second mapping table stores only a Bloom Filter element that uses quintuple information of the first received data packets of all the received sessions as an input. When another data packet of the received session is further received, a source IP address in the data packet is the same as a source IP address in the first received data packet of the received session, a destination IP address in the data packet is the same as a destination IP address in the first received data packet of the received session, a source port number in the data packet is the same as a source port number in the first received data packet of the received session, and a destination port number in the data packet is the same as a destination port number in the first received data packet of the received session; or a source IP address in the data packet is the same as a destination IP address in the first received data packet of the received session, a destination IP address in the data packet is the same as a source IP address in the first received data packet of the received session, a source port number in the data packet is the same as a destination port number in the first received data packet of the received session, and a destination port number in the data packet is the same as a source port number in the first received data packet of the received session.
Two groups of k hash values are involved: the first group is calculated by using, as the input, the quintuple information that is of the data packet and arranged in the preset order, and the second group is calculated by using, as the input, the quintuple information obtained by interchanging the source IP address with the destination IP address and the source port number with the destination port number in the quintuple information arranged in the preset order. When at least one value at the positions to which the first group of k hash values is mapped in the second mapping table is 0, and at least one value at the positions to which the second group of k hash values is mapped in the second mapping table is 0, the quintuple information of the data packet does not match the second mapping table, and the data packet belongs to a newly received session. In this case, the first group of k hash values is mapped to the second mapping table, that is, the values at the positions in the second mapping table that are corresponding to the first group of k hash values are set to 1, to update the second mapping table.
In another embodiment, the second mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of each session that is received before the data packet is received. Multiple hash values are calculated by using multiple preset hash functions and by using, as an input, quintuple information that is of each received session and arranged in descending order, and values at positions in the Bloom Filter table that are corresponding to all the hash values are set to 1, to obtain the second mapping table.
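The Bloom Filter based matching and updating described above can be sketched as follows. The sketch derives the k hash functions from SHA-256 with a one-byte index prefix and stores one byte per bit for readability; both choices, and the names `BloomFilter`, `make_key`, and `session_seen`, are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                 # number of positions in the table
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # all positions start at 0

    def _positions(self, item: bytes):
        for j in range(self.k):
            digest = hashlib.sha256(bytes([j]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1     # setting an already set position has no effect

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


def make_key(src_ip, dst_ip, src_port, dst_port, proto) -> bytes:
    # Assumed preset order of the quintuple information.
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()


def session_seen(table: BloomFilter, quintuple) -> bool:
    src_ip, dst_ip, src_port, dst_port, proto = quintuple
    key = make_key(src_ip, dst_ip, src_port, dst_port, proto)
    interchanged = make_key(dst_ip, src_ip, dst_port, src_port, proto)
    if key in table or interchanged in table:
        return True
    table.add(key)  # newly received session: update the table
    return False
```

As with any Bloom Filter, a lookup may occasionally report a match for a session that was never added, so a new session can be mistaken for a received one; a session that was added is never reported as missing.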
When whether the quintuple information of the data packet matches the second mapping table is being determined, k hash values are respectively calculated by using k mutually independent hash functions and by using, as an input, quintuple information that is of the data packet and arranged in descending order, and whether values at positions in the second mapping table that are corresponding to the k hash values are set to 1 is queried. If the positions in the second mapping table that are corresponding to the k hash values are set to 1, the quintuple information of the data packet matches the second mapping table. If not all the positions in the second mapping table that are corresponding to the k hash values are set to 1, the quintuple information of the data packet does not match the second mapping table. In this embodiment, k hash values are calculated by using, as an input, the quintuple information that is of the received session and arranged in descending order and are mapped to the Bloom Filter table, to generate the second mapping table. Because character strings that are obtained by arranging quintuple information of different data packets in a same session in descending order are the same, when whether the data packet matches the second mapping table is being determined, k hash values need to be calculated only once by using k hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in descending order.
When the data packet does not match the second mapping table, another data packet belonging to the session to which the data packet belongs has not been received, the session is a newly received session, k hash values are calculated by using the k hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in descending order, and positions in the second mapping table that are corresponding to the k hash values are set to 1, to update the second mapping table.
It should be noted herein that when the quintuple information of the data packet is being sorted, the quintuple information may alternatively be arranged in ascending order.
Step 304: Determine that the session to which the data packet belongs is a newly received session, add 1 to a session count value, and determine whether the session count value is equal to a preset threshold. If the session count value is equal to the preset threshold, perform step 305; or if the session count value is not equal to the preset threshold, return to step 301.
Whether another data packet belonging to a session to which the data packet belongs has been received is determined according to step 303. When another data packet belonging to the session to which the data packet belongs has not been received, the session to which the data packet belongs is a newly received session. In this case, 1 is added to the session count value, which indicates that the received session is increased by 1.
The preset threshold is used to control the proportion of extracted sessions, and may be set according to an actual case. When the session count value is equal to the preset threshold, the session to which the data packet belongs is a to-be-sampled session. For example, when the preset threshold is set to 100, one session is extracted out of every 100 sessions. Each time the session count value is equal to the preset threshold, the session count value is reset to 0 and counting starts again. When the session count value is not equal to the preset threshold, the session to which the data packet belongs is not a to-be-sampled session, and the method returns to step 301 to receive a next data packet.
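The counting in step 304 can be sketched as follows; the class name `SessionCounter` and its method name are hypothetical.

```python
class SessionCounter:
    """Counts newly received sessions and flags every threshold-th one."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.count = 0

    def on_new_session(self) -> bool:
        """Call once per newly received session.

        Returns True when the session is a to-be-sampled session.
        """
        self.count += 1
        if self.count == self.threshold:
            self.count = 0  # reset and start counting again
            return True
        return False
```

With a threshold of 100, the 100th, 200th, 300th, and subsequent such newly received sessions are selected for sampling.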
Step 305: Determine that the data packet belongs to a newly recognized to-be-sampled session, extract the data packet, and update a first mapping table by using the quintuple information of the data packet, where the first mapping table stores quintuple information of all to-be-sampled sessions that are recognized before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all to-be-sampled sessions that are recognized before the data packet is received.
When another data packet belonging to the session to which the data packet belongs has not been received, and the session count value is equal to the preset threshold, the data packet belongs to a newly recognized to-be-sampled session. The data packet is extracted, and the first mapping table is updated by using the quintuple information of the data packet.
When the first mapping table stores quintuple information of a recognized to-be-sampled session, the quintuple information of the data packet is stored in the first mapping table to update the first mapping table.
When the first mapping table stores the Bloom Filter mapping element that uses the quintuple information of all the recognized to-be-sampled sessions as an input, the updating a first mapping table by using the quintuple information of the data packet includes:
using, as a third hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in the preset order; and
setting values at positions in the first mapping table that are corresponding to all hash values in the third hash value group to 1.
It should be noted herein that the updating a first mapping table by using the quintuple information of the data packet is similar to the updating the second mapping table by using the quintuple information of the data packet described in step 303. The hash function group includes k hash functions, k hash values are calculated by using the k hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, and values at positions in the first mapping table that are corresponding to the k hash values are set to 1. For details, refer to step 303; the details are not described herein again.
In another embodiment, the first mapping table stores a Bloom Filter mapping element that uses, as an input, quintuple information of each to-be-sampled session that is recognized before the data packet is received. Multiple hash values are calculated by using multiple preset hash functions and by using, as an input, quintuple information that is of each recognized to-be-sampled session and arranged in descending order, and values at positions in the Bloom Filter table that are corresponding to all the hash values are set to 1, to obtain the first mapping table.
When whether the quintuple information of the data packet matches the first mapping table is being determined, k hash values are respectively calculated by using k mutually independent hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in descending order, and whether values at positions in the first mapping table that are corresponding to the k hash values are set to 1 is queried. If all the positions in the first mapping table that are corresponding to the k hash values are set to 1, the quintuple information of the data packet matches the first mapping table. If not all the positions in the first mapping table that are corresponding to the k hash values are set to 1, the quintuple information of the data packet does not match the first mapping table.
In this embodiment, k hash values are calculated by using, as an input, the quintuple information that is of the received session and arranged in descending order and are mapped to the Bloom Filter table, to generate the first mapping table. Because character strings that are obtained by arranging quintuple information of different data packets in a same session in descending order are the same, when whether the data packet matches the first mapping table is being determined, k hash values need to be calculated only once by using k hash functions and by using, as an input, the quintuple information that is of the data packet and arranged in descending order.
It should be noted herein that when the quintuple information of the data packet is being sorted, the quintuple information may alternatively be arranged in ascending order.
Step 306: Determine that the session to which the data packet belongs is a received session, and determine whether the quintuple information of the data packet matches the first mapping table. If the quintuple information of the data packet matches the first mapping table, perform step 307; or if the quintuple information of the data packet does not match the first mapping table, return to step 301.
The determining whether the quintuple information of the data packet matches the first mapping table is similar to the determining whether the quintuple information of the data packet matches the second mapping table in step 303.
When the first mapping table stores quintuple information of a recognized to-be-sampled session, to determine whether the quintuple information of the data packet matches the first mapping table, whether the quintuple information of the data packet is the same as a piece of quintuple information stored in the first mapping table is queried. If the quintuple information of the data packet is the same as a piece of quintuple information stored in the first mapping table, the quintuple information of the data packet matches the first mapping table. If the quintuple information of the data packet is not the same as any piece of quintuple information stored in the first mapping table, the source IP address and the destination IP address in the quintuple information of the data packet are interchanged, and the source port number and the destination port number are interchanged, to obtain another piece of quintuple information, and whether the other piece of quintuple information is the same as a piece of quintuple information stored in the first mapping table is queried. If the other piece of quintuple information is the same as a piece of quintuple information stored in the first mapping table, the quintuple information of the data packet matches the first mapping table. If the other piece of quintuple information is not the same as any piece of quintuple information stored in the first mapping table, the data packet does not match the first mapping table.
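The exact-match lookup above, in which the quintuple information is tried first as received and then with source and destination interchanged, can be sketched as follows. The type `Quintuple`, the helper names, and the use of a Python set as the first mapping table are assumptions made for illustration.

```python
from typing import NamedTuple


class Quintuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


def interchanged(q):
    """Swap source/destination addresses and source/destination ports."""
    return Quintuple(q.dst_ip, q.src_ip, q.dst_port, q.src_port, q.proto)


def matches_first_table(first_table, q):
    # First query the quintuple as received, then the interchanged one.
    return q in first_table or interchanged(q) in first_table


first_table = {Quintuple("10.0.0.1", "10.0.0.2", 40000, 80, 6)}
reply = Quintuple("10.0.0.2", "10.0.0.1", 80, 40000, 6)
other = Quintuple("10.0.0.9", "10.0.0.2", 40000, 80, 6)
```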
When the first mapping table stores the Bloom Filter mapping element that uses the quintuple information of all the recognized to-be-sampled sessions as an input, the determining whether the quintuple information of the data packet matches the first mapping table includes:
using, as a first hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, where the preset hash function group is a hash function group used when the first mapping table is generated, and includes multiple preset hash functions;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the first hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, using, as a second hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, quintuple information obtained after in the quintuple information that is of the data packet and arranged in the preset order, a position of a source IP address and a position of a destination IP address are interchanged and a position of a source port number and a position of a destination port number are interchanged;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the second hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet does not match the first mapping table.
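The two-stage Bloom Filter query above (a first hash value group, then a second hash value group computed from the interchanged quintuple information) and the update with a third hash value group can be sketched as follows. The constants, the function names, the SHA-256 based hash function group, and the one-byte-per-bit table are assumptions chosen to keep the example short.

```python
import hashlib

M_BITS = 4096  # assumed table size in bits
K = 7          # assumed number of hash functions in the group


def hash_group(fields, k=K, m=M_BITS):
    """k table positions for the quintuple arranged in the preset order."""
    data = "|".join(str(f) for f in fields).encode()
    return [int.from_bytes(hashlib.sha256(bytes([i]) + data).digest()[:8], "big") % m
            for i in range(k)]


def interchange(fields):
    src_ip, dst_ip, src_port, dst_port, proto = fields
    return (dst_ip, src_ip, dst_port, src_port, proto)


def matches(table, fields):
    first_group = hash_group(fields)
    if all(table[p] for p in first_group):
        return True
    second_group = hash_group(interchange(fields))
    return all(table[p] for p in second_group)


def update(table, fields):
    for p in hash_group(fields):  # third hash value group
        table[p] = 1


first_table = bytearray(M_BITS)  # one byte per bit, for readability only
request = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
matched_before = matches(first_table, request)
update(first_table, request)
```

After the update, both `request` and `interchange(request)` match, because the second hash value group of the reply equals the hash value group stored for the request.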
The k hash functions included in the hash function group used in step 306 are the same as the k hash functions used in step 303. In addition, the determining whether the quintuple information of the data packet matches the first mapping table is similar to step 303. For details, refer to the description in step 303; the details are not described herein again.
When another data packet belonging to the session to which the data packet belongs has been received, and the quintuple information of the data packet matches the first mapping table, the data packet belongs to a recognized to-be-sampled session, and the data packet is extracted. When the data packet does not match the first mapping table, the data packet does not belong to a recognized to-be-sampled session, and the procedure returns to step 301 to receive a next data packet.
Step 307: Extract the data packet.
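Steps 301 to 307 can be summarized in the following sketch. It is an illustration under assumptions, not the implementation of the embodiments: the class name `SessionSampler` is hypothetical, both mapping tables are modeled as Python sets that store sorted quintuple information rather than Bloom Filter tables, and the fields are assumed to be sorted as strings.

```python
class SessionSampler:
    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0
        self.received = set()  # plays the role of the second mapping table
        self.sampled = set()   # plays the role of the first mapping table

    @staticmethod
    def _key(quintuple):
        # Descending order, so both directions of a session share one key.
        return tuple(sorted((str(f) for f in quintuple), reverse=True))

    def handle(self, quintuple):
        """Return True when the data packet is to be extracted."""
        key = self._key(quintuple)
        if key not in self.received:          # step 303: newly received session
            self.received.add(key)
            self.count += 1                   # step 304
            if self.count == self.threshold:
                self.count = 0
                self.sampled.add(key)         # step 305
                return True
            return False
        return key in self.sampled            # steps 306 and 307


sampler = SessionSampler(threshold=2)
a = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
b = ("10.0.0.3", "10.0.0.4", 40001, 443, 6)
b_reply = ("10.0.0.4", "10.0.0.3", 443, 40001, 6)
decisions = [sampler.handle(p) for p in (a, b, b_reply, a)]
```

With a threshold of 2, the second new session is sampled, so every later data packet of that session in either direction is extracted, while data packets of the first session are not.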
In the data packet extraction method provided in this embodiment of the present application, when the first mapping table and the second mapping table are Bloom Filter tables, a large amount of storage space may be saved compared with a case in which the first mapping table and the second mapping table store quintuple information. The following describes several points about technical implementation when the first mapping table and the second mapping table are Bloom Filter tables.
First, when the first mapping table and the second mapping table are the Bloom Filter tables, selection of k hash functions in a used hash function group is as follows:
It is relatively complex to select k different hash functions. A simple method is selecting one hash function and then setting k different inputs. For example, a manner such as setting k different arrangement orders for quintuple information arranged in a preset order or adding several bits at k different positions is used.
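One way to obtain k hash values from a single hash function, by feeding it k different arrangement orders of the same quintuple information, is sketched below. The function name `hash_positions` and the use of SHA-256 are assumptions. Five fields allow at most 120 distinct orders, and two orders coincide when two fields have the same value.

```python
import hashlib
from itertools import islice, permutations


def hash_positions(fields, k, m):
    """Return k bit positions in [0, m) from one hash function fed with
    k different arrangement orders of the same five fields."""
    positions = []
    for order in islice(permutations(fields), k):
        data = "|".join(str(f) for f in order).encode()
        digest = hashlib.sha256(data).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


quintuple = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
positions = hash_positions(quintuple, k=7, m=1 << 20)
```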
Second, selection of values of m, n, and k is as follows.
Because a Bloom Filter algorithm is used to compress a width of a flow table, some errors caused by hash calculation conflicts are eliminated to reduce consumption of NP resources. A Bloom Filter is a space-efficient probabilistic data structure that concisely represents a set by using a bit array and can determine whether an element belongs to the set. However, when whether an element belongs to a set is being determined, an element that does not belong to the set may be mistaken for belonging to the set (false positive). Therefore, the Bloom Filter is inapplicable to application scenarios that require zero errors. However, in an application scenario in which a low error rate can be tolerated, the Bloom Filter makes great savings in storage space with extremely few errors.
It is assumed that kn<m and all hash functions are completely random. When all elements in a set S={x1, x2, . . . , xn} are mapped to a bit array of m bits by using the k hash functions, a probability that a bit in the bit array is still 0 is:
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$
A false positive probability is:
$$P = (1 - p)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$
When k=ln 2×m/n, a minimum false positive probability is P=(1/2)^k.
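The standard Bloom Filter approximations, a false positive probability of (1 − e^(−kn/m))^k and an optimal k of ln 2×m/n, can be checked numerically with the short sketch below. The function names are hypothetical, and the value n is an arbitrary example.

```python
import math


def false_positive(m, n, k):
    """Approximate false positive probability for m bits, n elements, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Number of hash functions that minimizes the false positive probability."""
    return math.log(2) * m / n


def bits_for(n, k):
    """Table size m for which k is the optimal number of hash functions."""
    return k * n / math.log(2)


n = 1_000_000
k = 7
m = bits_for(n, k)
p = false_positive(m, n, k)  # close to (1/2) ** 7
```

With k=7 the minimum false positive probability is (1/2)^7, which is about 0.78% and therefore below 1%.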
It is assumed that when network bandwidth is 400 G, concurrent traffic is n=10 M (which may reach 50 M in an extreme case) in a normal case. To meet a condition that a statistical deviation is lower than 1%, a quantity k of hash functions is set to 7. The value of m is calculated as m=k×n/(ln 2)≈110 Mbit=13.75 MB; that is, the first mapping table needs to occupy a memory of 68.75 MB, which reduces storage space by 10 times compared with directly storing quintuple information of a data packet.
When a preset threshold is 1000, a session sampling ratio is 1:1000, and a quantity of concurrent sessions that need to be sampled is 50K, that is, n=50K in the Bloom Filter. According to the foregoing calculation, the required m bits are: m=k×n/(ln 2)=7×50K/ln 2≈550 Kbit≈70 KB.
To delay time at which the Bloom Filter table overflows, a scale of the Bloom Filter table needs to be multiplied. Herein because the scale does not need to be quite precise, the scale may be increased by 10 times, and an NP memory of 700 KB is needed. Therefore, a memory required by the second mapping table is 1.4 MByte, which reduces storage space by 500 times compared with directly storing the quintuple information of the data packet.
Third, a storage manner of the first mapping table and the second mapping table is as follows.
The first mapping table or the second mapping table consists of V subtables, and a size of each subtable is W bits. When a load capacity of each subtable (where the load capacity is defined as a quantity of bits in the table that are 1) is α, a quantity of sessions that can be represented by each subtable is:
where k is a quantity of hash functions. By using a head pointer, the V subtables form a ring for cycle use, as shown in
a receiving unit 601 and a processing unit 602 connected to the receiving unit 601.
The receiving unit 601 is configured to receive a data packet and send the data packet to the processing unit 602.
The processing unit 602 is configured to: parse quintuple information of the data packet; calculate a first hash value and a second hash value of the data packet according to the quintuple information by using a first hash function, where the first hash value is a hash value that is calculated by using the first hash function and by using the quintuple information arranged in a preset order as an input, and the second hash value is a hash value that is calculated by using the first hash function and by using, as an input, quintuple information obtained after in the quintuple information arranged in the preset order, a source IP address and a destination IP address are interchanged and a source port number and a destination port number are interchanged; calculate a first remainder obtained by dividing the first hash value by a denominator of a preset session sampling ratio, and calculate a second remainder obtained by dividing the second hash value by the denominator of the preset session sampling ratio; query whether the first remainder or the second remainder is a preset sampling remainder, where a quantity of the preset sampling remainders is the same as a numerator value of the preset session sampling ratio; and extract the data packet when the first remainder or the second remainder is the preset sampling remainder.
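The remainder-based sampling decision described above can be sketched as follows. CRC-32 stands in for the first hash function, and the function names and the chosen sampling remainder are assumptions made for illustration.

```python
import zlib


def first_hash(fields):
    """Stand-in for the first hash function (assumed: CRC-32 of the joined fields)."""
    return zlib.crc32("|".join(str(f) for f in fields).encode())


def swap_direction(fields):
    src_ip, dst_ip, src_port, dst_port, proto = fields
    return (dst_ip, src_ip, dst_port, src_port, proto)


def should_extract(fields, denominator, sampling_remainders):
    # First remainder: quintuple in the preset order.
    first_remainder = first_hash(fields) % denominator
    # Second remainder: source and destination interchanged.
    second_remainder = first_hash(swap_direction(fields)) % denominator
    return (first_remainder in sampling_remainders
            or second_remainder in sampling_remainders)


# Sampling ratio 1:1000 -> denominator 1000 and one preset sampling remainder.
request = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
reply = swap_direction(request)
same_decision = should_extract(request, 1000, {0}) == should_extract(reply, 1000, {0})
```

Because both remainders are checked for every data packet, the request and the reply of one session always receive the same decision, and no mapping table is needed.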
In an embodiment provided in this embodiment of the present application, before extracting the data packet, the processing unit 602 is further configured to:
extract at least one preset feature field from the data packet, where the preset feature field is a character string of a preset offset length at a preset position in the data packet; calculate a feature hash value of each preset feature field by using the second hash function and by using the preset feature field as an input; query whether the feature hash value of each preset feature field is the same as a preset hash value of the preset feature field; and extract the data packet when the feature hash value of each preset feature field is the same as the preset hash value of the preset feature field.
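The feature-field check can be sketched as follows. CRC-32 stands in for the second hash function, and the function name, the (position, length, preset hash value) tuple layout, and the sample payloads are assumptions.

```python
import zlib


def features_match(packet, preset_features):
    """preset_features: iterable of (position, length, preset_hash) tuples.

    Returns True only if every preset feature field hashes to its preset value."""
    for position, length, preset_hash in preset_features:
        field = packet[position:position + length]
        if len(field) < length:  # packet too short to contain the field
            return False
        if zlib.crc32(field) != preset_hash:
            return False
    return True


# Hypothetical feature: the first three bytes of the payload equal b"GET".
preset = [(0, 3, zlib.crc32(b"GET"))]
http_get = b"GET /index.html HTTP/1.1"
http_post = b"POST /form HTTP/1.1"
```

Comparing hash values instead of the raw character strings keeps the preset configuration at a fixed size per feature field, at the cost of possible hash collisions for longer fields.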
The data packet extraction apparatus shown in
a receiving unit 701 and a processing unit 702 connected to the receiving unit 701.
The receiving unit 701 is configured to receive a data packet and send the data packet to the processing unit 702.
The processing unit 702 is configured to: parse quintuple information of the data packet; determine whether another data packet belonging to a session to which the data packet belongs has been received; and
when another data packet belonging to the session to which the data packet belongs has not been received, determine that the session to which the data packet belongs is a newly received session, add 1 to a session count value, and determine whether the session count value is equal to a preset threshold; and when the session count value is equal to the preset threshold, determine that the data packet belongs to a newly recognized to-be-sampled session, extract the data packet, and update a first mapping table by using the quintuple information of the data packet; or when another data packet belonging to the session to which the data packet belongs has been received, determine that the session to which the data packet belongs is a received session, and determine whether the quintuple information of the data packet matches the first mapping table; and extract the data packet when the quintuple information of the data packet matches the first mapping table, where the first mapping table stores quintuple information of all to-be-sampled sessions that are recognized before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all to-be-sampled sessions that are recognized before the data packet is received.
In an embodiment provided in this embodiment of the present application, that the processing unit 702 is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received, includes:
parsing a flag field carried in the data packet; determining whether the flag field is an SYN flag field; and when the flag field is the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has not been received; or when the flag field is not the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has been received.
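For TCP, the flag-field check can be sketched as follows. The flag byte is at offset 13 of the TCP header, SYN is bit 0x02, and ACK is bit 0x10. Treating only a SYN without ACK as the first data packet of a session is an assumption of this sketch, because the text refers only to the SYN flag field. The helper `build_tcp_header` is hypothetical and exists only to produce test input.

```python
import struct

TCP_FLAG_SYN = 0x02
TCP_FLAG_ACK = 0x10


def is_first_packet(tcp_header):
    """True when the header carries SYN without ACK (assumed session start)."""
    flags = tcp_header[13]
    return bool(flags & TCP_FLAG_SYN) and not (flags & TCP_FLAG_ACK)


def build_tcp_header(flags):
    """Minimal 20-byte TCP header with the given flag byte (illustration only)."""
    return struct.pack("!HHIIBBHHH",
                       40000, 80,   # source port, destination port
                       0, 0,        # sequence number, acknowledgment number
                       5 << 4,      # data offset of 5 words, no options
                       flags,
                       65535, 0, 0) # window, checksum, urgent pointer


syn = build_tcp_header(TCP_FLAG_SYN)
syn_ack = build_tcp_header(TCP_FLAG_SYN | TCP_FLAG_ACK)
ack = build_tcp_header(TCP_FLAG_ACK)
```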
In another embodiment provided in this embodiment of the present application, that the processing unit 702 is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received, includes:
determining whether the quintuple information of the data packet matches a second mapping table, where the second mapping table stores quintuple information of all sessions that are received before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all sessions that are received before the data packet is received; and when the quintuple information of the data packet does not match the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has not been received, and updating the second mapping table by using the quintuple information of the data packet; or when the quintuple information of the data packet matches the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has been received.
In another embodiment provided in this embodiment of the present application, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received.
That the processing unit 702 is configured to determine whether the quintuple information of the data packet matches the first mapping table includes:
using, as a first hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, where the preset hash function group is a hash function group used when the first mapping table is generated, and includes multiple preset hash functions;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the first hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, using, as a second hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, quintuple information obtained after in the quintuple information that is of the data packet and arranged in the preset order, a position of a source IP address and a position of a destination IP address are interchanged and a position of a source port number and a position of a destination port number are interchanged;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the second hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet does not match the first mapping table.
In another embodiment provided in this embodiment of the present application, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received.
That the processing unit 702 is configured to update a first mapping table by using the quintuple information of the data packet includes:
using, as a third hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in the preset order; and
setting values at positions in the first mapping table that are corresponding to all hash values in the third hash value group to 1.
The data packet extraction apparatus shown in
triggering the receiver 802 to receive a data packet and send the data packet to the processor 803; and
triggering the processor 803 to: parse quintuple information of the data packet; calculate a first hash value and a second hash value of the data packet according to the quintuple information by using a first hash function, where the first hash value is a hash value that is calculated by using the first hash function and by using the quintuple information arranged in a preset order as an input, and the second hash value is a hash value that is calculated by using the first hash function and by using, as an input, quintuple information obtained after in the quintuple information arranged in the preset order, a source IP address and a destination IP address are interchanged and a source port number and a destination port number are interchanged; calculate a first remainder obtained by dividing the first hash value by a denominator of a preset session sampling ratio, and calculate a second remainder obtained by dividing the second hash value by the denominator of the preset session sampling ratio; query whether the first remainder or the second remainder is a preset sampling remainder, where a quantity of the preset sampling remainders is the same as a numerator value of the preset session sampling ratio; and extract the data packet when the first remainder or the second remainder is the preset sampling remainder.
In an embodiment provided in this embodiment of the present application, before extracting the data packet, the processor 803 is further configured to:
extract at least one preset feature field from the data packet, where the preset feature field is a character string of a preset offset length at a preset position in the data packet; calculate a feature hash value of each preset feature field by using the second hash function and by using the preset feature field as an input; query whether the feature hash value of each preset feature field is the same as a preset hash value of the preset feature field; and extract the data packet when the feature hash value of each preset feature field is the same as the preset hash value of the preset feature field.
The data packet extraction apparatus shown in
triggering the receiver 902 to receive a data packet and send the data packet to the processor 903; and
triggering the processor 903 to: parse quintuple information of the data packet; determine whether another data packet belonging to a session to which the data packet belongs has been received; and when another data packet belonging to the session to which the data packet belongs has not been received, determine that the session to which the data packet belongs is a newly received session, add 1 to a session count value, and determine whether the session count value is equal to a preset threshold; and when the session count value is equal to the preset threshold, determine that the data packet belongs to a newly recognized to-be-sampled session, extract the data packet, and update a first mapping table by using the quintuple information of the data packet; or when another data packet belonging to the session to which the data packet belongs has been received, determine that the session to which the data packet belongs is a received session, and determine whether the quintuple information of the data packet matches the first mapping table; and extract the data packet when the quintuple information of the data packet matches the first mapping table, where the first mapping table stores quintuple information of all to-be-sampled sessions that are recognized before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all to-be-sampled sessions that are recognized before the data packet is received.
In an embodiment provided in this embodiment of the present application, that the processor 903 is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received includes:
parsing a flag field carried in the data packet; determining whether the flag field is an SYN flag field; and when the flag field is the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has not been received; or when the flag field is not the SYN flag field, determining that another data packet belonging to the session to which the data packet belongs has been received.
In another embodiment provided in this embodiment of the present application, that the processor 903 is configured to determine whether another data packet belonging to a session to which the data packet belongs has been received includes:
determining whether the quintuple information of the data packet matches a second mapping table, where the second mapping table stores quintuple information of all sessions that are received before the data packet is received or a Bloom Filter mapping element that uses, as an input, quintuple information of all sessions that are received before the data packet is received; and when the quintuple information of the data packet does not match the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has not been received, and updating the second mapping table by using the quintuple information of the data packet; or when the quintuple information of the data packet matches the second mapping table, determining that another data packet belonging to the session to which the data packet belongs has been received.
In another embodiment provided in this embodiment of the present application, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received.
That the processor 903 is configured to determine whether the quintuple information of the data packet matches the first mapping table includes:
using, as a first hash value group, multiple hash values that are calculated by using a preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in a preset order, where the preset hash function group is a hash function group used when the first mapping table is generated, and includes multiple preset hash functions;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the first hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the first hash value group are 1, using, as a second hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, quintuple information obtained after in the quintuple information that is of the data packet and arranged in the preset order, a position of a source IP address and a position of a destination IP address are interchanged and a position of a source port number and a position of a destination port number are interchanged;
querying whether values at positions in the first mapping table that are corresponding to all hash values in the second hash value group are 1; and
when the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet matches the first mapping table; or when not all the values at the positions in the first mapping table that are corresponding to all the hash values in the second hash value group are 1, determining that the quintuple information of the data packet does not match the first mapping table.
In another embodiment provided in this embodiment of the present application, the first mapping table stores the Bloom Filter mapping element that uses, as an input, the quintuple information of all the to-be-sampled sessions that are recognized before the data packet is received.
That the processor 903 is configured to update a first mapping table by using the quintuple information of the data packet includes:
using, as a third hash value group, multiple hash values that are calculated by using the preset hash function group and by using, as an input, the quintuple information that is of the data packet and arranged in the preset order; and
setting values at positions in the first mapping table that are corresponding to all hash values in the third hash value group to 1.
The data packet extraction apparatus shown in
Optionally, the processor may be a central processing unit (CPU), the memory may be an internal memory of a random access memory (RAM) type, the receiver may include a common physical interface, and the physical interface may be an Ethernet interface or an asynchronous transfer mode (ATM) interface. The processor, the receiver, and the memory may be integrated into one or more independent circuits or one or more pieces of hardware, for example, an application-specific integrated circuit (ASIC).
Persons of ordinary skill in the art may understand that all or some of the steps in the method embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium. When the program runs, the steps included in the method embodiments are performed. The storage medium may be at least one of the following media that can store program code: a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc.
It should be noted that the embodiments in this specification are all described in a progressive manner. For same or similar parts in the embodiments, reference may be made to these embodiments, and each embodiment focuses on a difference from other embodiments. Especially, device and system embodiments are basically similar to method embodiments, and therefore are described briefly. For related parts, reference may be made to partial descriptions in the method embodiments. The described device and system embodiments are merely examples. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Persons of ordinary skill in the art may understand and implement the embodiments of the present application without creative efforts.
The foregoing descriptions are merely optional implementation manners of the present application, but are not intended to limit the protection scope of the present application. It should be noted that persons of ordinary skill in the art may make improvements and refinements without departing from the principle of the present application, and such improvements and refinements shall fall within the protection scope of the present application.
This application is a continuation of International Application No. PCT/CN2014/095639, filed on Dec. 30, 2014. The disclosure of the aforementioned application is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2014/095639 | Dec 2014 | US |
Child | 15639180 | | US |