System log files may be used to diagnose and resolve system failures and performance bottlenecks in computer systems. Such log files may be generated by the software modules included in the system. Software developers may insert source code in these modules to create log messages at different points of the program. These messages may allow support engineers to determine the status of a system's components when a failure or bottleneck occurred.
As noted above, software modules of a system may be encoded with instructions to produce log messages at different points in the program. These log messages may assist a system engineer in diagnosing system failures or performance bottlenecks. Unfortunately, textual log formats are not sufficiently standardized. There are thousands of log formats in use today, some of which are unique to a certain system. Without knowing the log format in advance, it is difficult to parse the log into separate records (e.g., log messages). Log-analysis software may not operate correctly unless the rules for parsing the records are re-programmed for each format.
In view of the increasing volumes and variability of log files handled by massive log-analysis systems, various examples disclosed herein provide a system, non-transitory computer readable medium, and method for automatic discovery of records in data and the rules to partition them. In one example, substrings may be detected in the input data. Each substring may comprise at least one character. In one example, rules for parsing records in the input data may be formulated based at least partially on the patterns of semantic tokens. The aspects, features and advantages of the present disclosure will be appreciated when considered with reference to the following description of examples and accompanying figures. The following description does not limit the application; rather, the scope of the disclosure is defined by the appended claims and equivalents.
The computer apparatus 100 may also contain a processor 110 and memory 112. Memory 112 may store instructions that are retrievable and executable by processor 110. In one example, memory 112 may be a random access memory (“RAM”) device. In a further example, memory 112 may be divided into multiple memory segments organized as dual in-line memory modules (DIMMs). Alternatively, memory 112 may comprise other types of devices, such as memory provided on floppy disk drives, tapes, and hard disk drives, or other storage devices that may be coupled to computer apparatus 100 directly or indirectly. The memory may also include any combination of one or more of the foregoing and/or other devices as well. The processor 110 may be any number of well known processors, such as processors from Intel® Corporation. In another example, the processor may be a dedicated controller for executing operations, such as an application specific integrated circuit (“ASIC”). Although all the components of computer apparatus 100 are functionally illustrated in
The instructions residing in memory 112 may comprise any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by processor 110. In that regard, the terms “instructions,” “scripts,” “applications,” and “programs” may be used interchangeably herein. The computer executable instructions may be stored in any computer language or format, such as in object code or modules of source code. Furthermore, it is understood that the instructions may be implemented in the form of hardware, software, or a combination of hardware and software and that the examples herein are merely illustrative.
Parsing rules generator module 115 may implement the techniques described in the present disclosure. In that regard, parsing rules generator module 115 may be realized in any non-transitory computer-readable media for use by or in connection with an instruction execution system such as computer apparatus 100, an ASIC or other system that can fetch or obtain the logic from non-transitory computer-readable media and execute the instructions contained therein. “Non-transitory computer-readable media” may be any media that can contain, store, or maintain programs and data for use by or in connection with the instruction execution system. Non-transitory computer readable media may comprise any one of many physical media such as, for example, electronic, magnetic, optical, electromagnetic, or semiconductor media. More specific examples of suitable non-transitory computer-readable media include, but are not limited to, a portable magnetic computer diskette such as floppy diskettes or hard drives, a read-only memory (“ROM”), an erasable programmable read-only memory, or a portable compact disc.
As will be explained below, parsing rules generator module 115 may configure processor 110 to read input data, such as input data 120, and formulate parsing rules even when the format of the data is unknown. While the examples herein make reference to log files, it is understood that the techniques herein may be used to parse any type of data that does not adhere to a known standard or format.
One working example of the system, method, and non-transitory computer-readable medium is shown in
As shown in
SPACE TAB \ / | ! @ $ % ^ , . : ; & = ~ _ -
Some substrings in the input data may be predetermined substrings associated with a predetermined type. In one example, a predetermined substring may be a substring that is presumed to appear in the input data. Such presumption may be based on advanced knowledge of the input data. For example, in the context of log files, “new line” characters, also termed line-feed (LF) characters in the ASCII standard, may be presumed, since they improve visibility and readability. It may also be presumed that the majority of lines contain at least one delimiter. Based on these assumptions, the plausibility that each candidate is the delimiter increases as the percentage of lines in which each candidate appears approaches 100%. However, this criterion may not be sufficient, since there may be other candidates that appear in the majority of lines at least once. Thus, in one example, the frequency of appearances of each candidate in the entire input data may also be considered. Each of the candidates listed above may have a plausibility score associated therewith that measures the plausibility that each candidate is a delimiter. In one example, the delimiter plausibility score may account for both considerations noted above and may be defined as the following: N×−log(P+R×(1−P)), where N is the frequency of a candidate's appearances in the input data, P is the percentage of lines in the input data that did not contain the candidate delimiter, and R is a regularization constant to avoid divergence of the logarithm. In one example, R is approximately 0.01. The chosen delimiter may be the delimiter with the highest plausibility score. During this first pass, it may be assumed that each line is delimited by the new line character.
SPACE “,” “/” “]” “[” “:”
The SPACE occurs 12 times in the input data; the “,”, “/”, and “:” each occur 6 times; and the “[” and “]” each occur 4 times. Each candidate appears in all three lines. Thus, the percentage of lines in which each candidate does not appear is 0. Inserting these numbers into the example plausibility score formula above results in the following:
SPACE = 12 × −log(0 + 0.01 × (1 − 0)) = 12 × −log(0.01) = 24
“,” = 6 × −log(0 + 0.01 × (1 − 0)) = 6 × −log(0.01) = 12
“/” = 6 × −log(0 + 0.01 × (1 − 0)) = 6 × −log(0.01) = 12
“:” = 6 × −log(0 + 0.01 × (1 − 0)) = 6 × −log(0.01) = 12
“[” = 4 × −log(0 + 0.01 × (1 − 0)) = 4 × −log(0.01) = 8
“]” = 4 × −log(0 + 0.01 × (1 − 0)) = 4 × −log(0.01) = 8
Thus, in this example, the SPACE has the highest plausibility score and may be deemed the delimiter that separates the substrings in input data 120.
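By way of a non-limiting illustration only, the delimiter scoring described above may be sketched in Python as follows. The function name score_delimiters, the candidate set, and the use of a base-10 logarithm (consistent with the hand calculation above, where −log(0.01) = 2) are assumptions made for this sketch and are not part of the disclosure; the percentage P is interpreted as a fraction between 0 and 1.

import math

# Illustrative candidate set only; the disclosure lists SPACE, TAB, and
# various punctuation characters as possible delimiter candidates.
CANDIDATES = [" ", "\t", ",", "/", ":", ";", "[", "]", "!", "|", "=", "_", "-"]

def score_delimiters(text, candidates=CANDIDATES, r=0.01):
    # Plausibility score: N x -log(P + R x (1 - P)), where N is the candidate's
    # total frequency in the input data and P is the fraction of lines that do
    # not contain the candidate.
    # During this first pass, each line is assumed to be delimited by "\n".
    lines = text.split("\n")
    scores = {}
    for c in candidates:
        n = text.count(c)
        if n == 0:
            continue
        p = sum(1 for ln in lines if c not in ln) / len(lines)
        scores[c] = n * -math.log10(p + r * (1 - p))
    return scores

# For the three-line example above, SPACE appears 12 times and in every line,
# so its score is 12 x -log10(0.01) = 24, matching the hand calculation.

The chosen delimiter would then simply be the highest-scoring candidate, for example via max(scores, key=scores.get).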
As noted above, appearances of some substrings may be presumed in the input data. In addition to new line characters, timestamps and dates may be presumed to appear in the context of log files, since this information may assist in diagnosing problems arising in a computer system. In the example input data 120, the substring “12/12/20” may be a predetermined substring categorized as a date substring. The substrings “08:01:27,233,” “08:01:28,098,” and “08:01:28,632” may be predetermined substrings categorized as timestamp substrings. An end of data character (not shown) that indicates the end of the input data may also be a predetermined substring presumed to be in the input data.
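For illustration only, the predetermined date and timestamp substrings of the example above may be detected with regular expressions. The patterns below match only the specific formats shown in input data 120 (e.g., “12/12/20” and “08:01:27,233”) and are assumptions made for this sketch rather than a general solution.

import re

# Illustrative patterns matching only the formats shown in input data 120;
# real log files may use many other date and timestamp formats.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")              # e.g., 12/12/20
TIMESTAMP_RE = re.compile(r"\b\d{2}:\d{2}:\d{2},\d{3}\b")   # e.g., 08:01:27,233

def find_predetermined(line):
    # Return (start, end, type) spans for date and timestamp substrings.
    spans = [(m.start(), m.end(), "date") for m in DATE_RE.finditer(line)]
    spans += [(m.start(), m.end(), "timestamp") for m in TIMESTAMP_RE.finditer(line)]
    return sorted(spans)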
Referring back to
In the example of
The intermediate semantic token string 310 may then be further abstracted by determining whether any of the unique semantic tokens should be switched to a generic semantic token. In one example, this determination may include an evaluation of whether each unique semantic token is associated with a recurring substring. In a further example, a recurring substring may be defined as a substring that appears at least once between each pair of predetermined substrings. Each recurring substring may also be associated with its own plausibility score that measures the plausibility that a significant pattern of the substring exists in the input data such that the recurring substring merits its own unique semantic token. In one example, the number of times a recurring substring appears between each pair of predetermined substrings may be determined. The number of appearances that is most frequent (i.e., the mode of the number of appearances) may be detected. Thus, in one example, the plausibility score for the recurring substring may be defined as Mn/Ps, where Mn is the number of predetermined substring pairs in which the number of appearances of the recurring substring equals the mode and Ps is the total number of predetermined substring pairs. If the plausibility score for the recurring substring exceeds a predetermined threshold, the recurring substring may be associated with its own unique semantic token. Otherwise, if the plausibility score falls below the predetermined threshold, the recurring substring may be associated with the generic semantic token, such as the “G” semantic token illustrated earlier. In one example, the predetermined threshold is 0.6. Furthermore, substrings that do not appear at least once between each pair of predetermined substrings may also be associated with a generic semantic token.
Referring to the intermediate semantic token 310 in
“]” = [1, 2, 1]
“[” = [1, 2, 1]
“,” = [2, 1, 1]
“N” = [0, 1, 0]
“!” = [0, 0, 1]
As shown above, the “]” substring appears once between the first pair of predetermined substrings, twice between the second pair, and once between the third pair. The mode, which is 1, appears between two of the three pairs. Thus, applying the example formula Mn/Ps to the “]” substring results in 2/3 ≈ 0.66. Assuming a threshold of 0.6, the “]” substring may be deemed worthy of its own unique semantic token. The plausibility score for the “[” substring is likewise 2/3 ≈ 0.66, so it may also be deemed worthy of its own unique semantic token in view of the example threshold of 0.6. Similarly, the “,” substring appears once between two of the three pairs, which also results in 2/3 ≈ 0.66; thus, the “,” substring also exceeds the example threshold of 0.6 and may be deemed worthy of its own unique semantic token. The substring “20,” represented by the semantic token “N,” and the substring “!” do not appear between each pair of predetermined substrings. As such, the semantic token “N” and the “!” substring may be switched to the example “G” generic semantic token. Referring back to
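A minimal sketch of the recurring-substring plausibility computation described above follows. The helper name recurring_score and the module-level threshold constant are illustrative assumptions; the counts in the assertions reproduce the hand-worked example above.

from collections import Counter

THRESHOLD = 0.6  # example threshold from the disclosure

def recurring_score(counts_between_pairs):
    # counts_between_pairs[i] is how often the candidate substring appears
    # between the i-th pair of predetermined substrings. A recurring substring
    # must appear at least once between every pair; its score is Mn / Ps, where
    # Mn is the number of pairs whose count equals the mode of all the counts
    # and Ps is the total number of pairs.
    if min(counts_between_pairs) == 0:
        return 0.0  # not recurring; falls back to the generic semantic token
    mode, mn = Counter(counts_between_pairs).most_common(1)[0]
    return mn / len(counts_between_pairs)

# Hand-worked values from the example above:
assert recurring_score([1, 2, 1]) > THRESHOLD   # "]" keeps its own unique token (2/3)
assert recurring_score([0, 1, 0]) < THRESHOLD   # "N" is switched to the generic token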
Referring back to
The rest of suffix tree data structure 400 may be arranged similarly. Each leaf node may contain the starting position of its corresponding branch, and each intermediate node may contain the frequency of the substrings associated with the branches that precede them. Leaf node 418 may contain the number 10, since the suffix string “L [D, T] [G] GL” of its corresponding branch begins at position 10 in semantic token string 320. The intermediate nodes 408, 410, and 412 may represent the suffixes “[D, T] G, G” and “[D, T] G$,” the former beginning at position 2, as indicated by leaf node 420, and the latter beginning at position 21, as indicated by leaf node 422. The branch beginning at root node 402 and ending at leaf node 424 may represent the “[D, T] [G] GL” suffix, which begins at position 11 as indicated in leaf node 424. The branch beginning at root node 402 and ending at leaf node 426 may represent the “[G] GL [D,” suffix, which begins at position 16 as indicated by leaf node 426. The branch beginning at root node 402 and ending at leaf node 428 may simply represent the “$” semantic token, which is located at position 27 as indicated by leaf node 428. Different combinations of suffix strings may be stored in this manner in suffix tree data structure 400. Once the suffix string combinations have been exhausted and arranged in suffix tree data structure 400, a cycle discovery algorithm may be executed to derive the parsing rules.
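For illustration only, the bookkeeping described above may be sketched with a simplified, uncompressed suffix trie. A compressed suffix tree such as suffix tree data structure 400 would merge single-child chains, but leaf nodes would still record suffix start positions and intermediate nodes would still record frequencies. The class and function names and the short token sequence below are assumptions made for this sketch, not the contents of semantic token string 320.

class Node:
    def __init__(self):
        self.children = {}   # semantic token -> child Node
        self.count = 0       # number of suffixes passing through this node
        self.starts = []     # start positions of suffixes ending at this node

def build_suffix_trie(tokens):
    # Insert every suffix of the semantic token sequence, terminated by "$",
    # into a trie; leaf bookkeeping records where each suffix begins.
    root = Node()
    tokens = list(tokens) + ["$"]
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node.children.setdefault(tok, Node())
            node.count += 1
        node.starts.append(start)
    return root

# Illustrative token sequence only; the actual semantic token string 320 is
# shown in the referenced figure.
trie = build_suffix_trie(["L", "[", "D", ",", "T", "]", "[", "G", "]", "G"])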
Referring back to
Referring back to
Advantageously, the above-described computer apparatus, non-transitory computer readable medium, and method derive parsing rules for data that does not adhere to any known format. In this regard, data that is not readily interpretable by a user may be parsed even when the boundaries between the records and fields are not known in advance. In turn, users can rest assured that the data will remain readable regardless of changes made to the format of the data.
Although the disclosure herein has been described with reference to particular examples, it is to be understood that these examples are merely illustrative of the principles of the disclosure. It is therefore to be understood that numerous modifications may be made to the examples and that other arrangements may be devised without departing from the spirit and scope of the disclosure as defined by the appended claims. Furthermore, while particular processes are shown in a specific order in the appended drawings, such processes are not limited to any particular order unless such order is expressly set forth herein. Rather, processes may be performed in a different order or concurrently and steps may be added or omitted.