This application claims priority from Chinese Patent Application Number CN201510589748.4, filed on Sep. 16, 2015 at the State Intellectual Property Office, China, titled “METHOD AND APPARATUS FOR LOG STORAGE OPTIMIZATION,” the contents of which are herein incorporated by reference in their entirety.
Embodiments of the present disclosure generally relate to data storage technologies.
Computer systems are constantly improving in terms of speed, reliability, and processing capability. As is known in the art, computer systems which process and store large amounts of data typically include one or more processors in communication with a shared data storage system in which the data is stored. The data storage system may include one or more storage devices, usually of a fairly robust nature and useful for storage spanning various temporal requirements, e.g., disk drives. The one or more processors perform their respective operations using the storage system. Mass storage systems (MSS) typically include an array of a plurality of disks with on-board intelligent and communications electronics and software for making the data on the disks available.
Companies that sell data storage systems are very concerned with providing customers with an efficient data storage solution that minimizes cost while meeting customer data storage needs. It would be beneficial for such companies to have a way for reducing the complexity of implementing data storage.
In view of the above, embodiments of the present disclosure provide a method and apparatus for log storage optimization, which can reduce storage space of the log and improve log analysis efficiency. According to an embodiment of the present disclosure, an apparatus and a method for log storage optimization include receiving log data; converting the log data into structured data using a parsing rule; and encoding the structured data to reduce storage space of the log.
Features, advantages and other aspects of embodiments of the present disclosure will be made more apparent in combination with figures and with reference to the following detailed description. Several embodiments of the present disclosure are illustrated here in an exemplary and unrestrictive manner. In the figures,
Exemplary embodiments of the present disclosure will be described in detail with reference to figures. The flowcharts and block diagrams in the figures illustrate system architecture, functions and operations executable by a method and system according to the embodiments of the present disclosure. It should be appreciated that each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, which contains one or more executable instructions for performing specified logic functions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown consecutively may be performed in parallel substantially or in an inverse order, depending on involved functions. It should also be noted that each block in the block diagrams and/or flow charts and a combination of blocks in block diagrams and/or flow charts may be implemented by a dedicated hardware-based system for executing a prescribed function or operation or may be implemented by a combination of dedicated hardware and computer instructions.
The terms “comprising”, “including” and their variants used herein should be understood as open terms, i.e., “comprising/including, but not limited to”. The term “based on” means “at least partly based on”. The term “an embodiment” represents “at least one embodiment”; the terms “another embodiment” and “a further embodiment” represent “at least one additional embodiment”. Relevant definitions of other terms will be given in the description below.
Generally, logs may be records of a transaction or an operation occurring at a system (such as software or an application) or an apparatus (such as a server or terminal equipment). Log data may contain a definitive record of all activities and behaviors of a system or an apparatus, and may be generally semi-structured data, such as a single-line log and a complex multi-line log. Technicians usually search, correlate, visualize, analyze and record log data in order to identify and resolve operation and security issues of the system or apparatus.
Modern Software-Defined Data Center (SDDC) infrastructure may be constantly generating log data at a rate faster than technicians can handle. Since the amount of activities and data increases exponentially, the number of generated logs may also increase rapidly. For example, some storage servers may generate log data of up to several TBs each day. The modern SDDC infrastructure may have an automated and dynamic deployment capability for multi-tier applications, and thus it may necessitate real-time log analytics. Effective analysis of a log may be a key guarantee for complex troubleshooting, dynamic high performance and superior security.
Generally, methods of performing search and analysis for a log may be very inefficient. Additionally, although existing processing approaches use compression or deduplication processing, entropy of a log may not be reduced, and hence processing current massive logs and improving log analysis efficiency remains a potential issue.
It should be appreciated that these exemplary embodiments are presented here to enable those skilled in the art to better understand and thereby implement embodiments of the present disclosure, not to limit the scope of the present disclosure in any manner.
According to one embodiment, there is disclosed a method for log storage optimization. A further embodiment may include receiving log data. A further embodiment may include converting log data into structured data using a parsing rule. A further embodiment may include encoding structured data to reduce storage space of the log. A further embodiment may include traversing a log profile repository after receiving log data. A further embodiment may include determining whether a log profile repository includes a structured log profile corresponding to a log data to generate a parsing rule, wherein a structured log profile repository may be used to store converted structured data.
In a further embodiment, determining whether a log profile repository includes a structured log profile corresponding to a log data to generate the parsing rule may include if a log profile repository includes a structured log profile corresponding to a log data, generating a corresponding parsing rule according to a corresponding structured log profile.
In a further embodiment, determining whether a log profile repository includes a structured log profile corresponding to a log data to generate the parsing rule may include: if a structured log profile corresponding to a log data is missing from a log profile repository, obtaining a structured log profile and parsing rule corresponding to a log data through an adaptive learning process. A further embodiment may include receiving a structured log profile and parsing rule corresponding to a log data from a user.
A further embodiment may include prior to traversing a log profile repository, generating a structured log profile and a corresponding parsing rule according to a log configuration accessible to a device receiving a log data. In a further embodiment, a structured log profile may at least include a timestamp or content data of the log. In a further embodiment, a parsing rule may be a regular expression or string template.
In a further embodiment, converting a log data into structured data using a parsing rule further may include setting a base time after converting a log data into a structured data using a parsing rule. A further embodiment may include determining a time difference between a timestamp of each log and a base time. A further embodiment may include replacing a timestamp data in a structured data with a time difference. In a further embodiment, a base time may be a timestamp of a first log or periodicity-based time.
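By way of a non-limiting illustration, the timestamp replacement described above may be sketched as follows. The function name `encode_timestamps`, the dictionary layout of a structured record and the timestamp format are assumptions made for this sketch only; the base time is taken to be the timestamp of the first log, which is one of the two options mentioned above.

```python
from datetime import datetime


def encode_timestamps(records, time_format="%Y-%m-%d %H:%M:%S"):
    """Replace each record's timestamp with its offset (in seconds) from a base time.

    Assumption for this sketch: the base time is the timestamp of the first record.
    Returns the base time and a new list of records; the input is not modified.
    """
    if not records:
        return None, []
    base = datetime.strptime(records[0]["timestamp"], time_format)
    encoded = []
    for record in records:
        current = datetime.strptime(record["timestamp"], time_format)
        offset = int((current - base).total_seconds())
        # Copy the record and overwrite only the timestamp field.
        encoded.append({**record, "timestamp": offset})
    return base, encoded


# Hypothetical structured records, used only to exercise the sketch.
records = [
    {"timestamp": "2015-09-16 10:00:00", "content": "start"},
    {"timestamp": "2015-09-16 10:00:05", "content": "request"},
    {"timestamp": "2015-09-16 10:01:00", "content": "stop"},
]
base_time, encoded_records = encode_timestamps(records)
```

In this sketch the three timestamps become the offsets 0, 5 and 60, so only the base time needs to be stored in full.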
In a further embodiment, encoding structured data may include: for various types of value in the structured data, determining an occurrence frequency of each value in the same type of values to generate an encoding rule. In a further embodiment, generating encoding rules may include: encoding a value having a larger occurrence frequency as a number having a shorter length, wherein an occurrence frequency may be proportional to occurrence times. In a further embodiment, encoding a value having a larger occurrence frequency as a number having a shorter length may include: encoding a value having a maximum occurrence frequency as a number “1”.
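As a minimal sketch of the frequency-based encoding rule described above, the values of one column (one type of value) may be ranked by occurrence count and numbered from 1, so that the most frequent value receives the number “1”. The helper names `build_encoding_rule` and `encode_column` and the sample column are hypothetical, and ties are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter


def build_encoding_rule(values):
    """Map each distinct value to an integer code; more frequent values get smaller codes."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken by value (an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {value: code for code, (value, _) in enumerate(ranked, start=1)}


def encode_column(values, rule):
    """Replace every value in a column with its code from the encoding rule."""
    return [rule[value] for value in values]


# Hypothetical column of http_method values from structured log records.
http_methods = ["GET", "GET", "POST", "GET", "PUT", "POST"]
rule = build_encoding_rule(http_methods)
encoded = encode_column(http_methods, rule)
```

Here “GET” occurs most often and is encoded as 1, “POST” as 2 and “PUT” as 3, so the column is stored as `[1, 1, 2, 1, 3, 2]` together with the rule.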
In a further embodiment, generating encoding rules may include: generating automatically an encoding rule according to an adaptive learning process of the encoding rule. In a further embodiment, an encoding rule may be implemented by Huffman encoding. A further embodiment may include storing an encoded structured data in a form of a log vector after encoding a structured data using an encoding rule.
In one embodiment an apparatus for log storage optimization may include a receiving unit that may be configured to receive log data; a converting unit that may be configured to convert log data into structured data using a parsing rule; and an encoding unit that may be configured to encode structured data to reduce storage space of a log.
Exemplary embodiments of the present disclosure may bring about at least one of the following technical effects: since structured conversion may be performed for a log data, converted data may be encoded in a column-based manner, and/or a timestamp of a log may be encoded, and thus an analysis efficiency of a log may be significantly improved and an entropy of a log may be effectively reduced, thereby achieving an effect of reducing storage space of logs and thereby improving archiving efficiency of logs.
Referring to
After step 102 illustrated in
In a further embodiment, a regular expression may be a matching tool for operating on and checking string data; it may be a string of special characters and may perform operations such as matching for a text. In a further embodiment, reference for the matching grammar of a regular expression may be made to the web site http://www.regular-expressions.info/. In an example embodiment, a regular expression for matching may be built for a log record in
P1=[“Processing\s+(\w+)#(\w+)\s\(for\s+((\d+\.){3}\d+)\s+at\s+(\d+-\d+-\d+\s\d+:\d+:\d+)\)\s+\[(\w+)\]\n+(Parameters:\.+)”, controller, method, client_IP, timestamp, http_method, content].
In a further embodiment, data matched by a regular expression may be considered as data at respective field positions in a log profile. In a further embodiment, in the above example of a regular expression, a value of a controller field may correspond to “\w+”, a value of method field may correspond to “\w+”, a value of client_IP field may correspond to “(\d+\.){3}\d+”, a value of timestamp field may correspond to “\d+-\d+-\d+\s\d+:\d+:\d+”, a value of http_method field may correspond to “\w+”, and a value of content field may correspond to “Parameters:\.+”. In a further embodiment, by using matching rules of a regular expression, matching may be performed for a log record in the example of
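The field extraction described for P1 may be illustrated with the following sketch. It is an adaptation rather than a verbatim copy of P1: named groups are used so that each match lands directly in its profile field, the date separators are assumed to be ASCII hyphens, and the content field is assumed to be matched with `.+` (any characters after “Parameters:”). The sample log record is hypothetical, since the record from the referenced figure is not reproduced here.

```python
import re

# Adapted from P1; group names follow the field list of the log profile.
LOG_PATTERN = re.compile(
    r"Processing\s+(?P<controller>\w+)#(?P<method>\w+)\s"
    r"\(for\s+(?P<client_IP>(?:\d+\.){3}\d+)\s+"
    r"at\s+(?P<timestamp>\d+-\d+-\d+\s\d+:\d+:\d+)\)\s+"
    r"\[(?P<http_method>\w+)\]\n+"
    r"(?P<content>Parameters:.+)"
)


def parse_record(raw):
    """Return a dict of profile fields for a matching record, or None if it does not match."""
    match = LOG_PATTERN.match(raw)
    return match.groupdict() if match else None


# Hypothetical two-line log record in the shape the pattern expects.
sample = (
    "Processing UsersController#index (for 10.0.0.1 at 2015-09-16 10:00:00) [GET]\n"
    'Parameters: {"id"=>"1"}'
)
fields = parse_record(sample)
```

A record that does not match yields `None`, which a caller could treat as a signal that a different log profile and parsing rule are needed.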
In one embodiment, both regular expression and string template matching rules may be received from a user, or may be automatically obtained by executing an adaptive learning process according to historical log records and log profiles. In an example embodiment, an original log record and a structured log profile may be compared in terms of text in order to obtain positions of changing data in an original log record, that is, to obtain variables in a log record. In a further embodiment, a learning process may be repeated to generate a respective parsing rule, such as a regular expression and a string template.
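One possible and deliberately simplified way to locate the positions of changing data is sketched below. As assumptions of this sketch only, two historical log records of the same kind are compared instead of a record and a profile, tokens are separated by whitespace, and both records contain the same number of tokens. The function name `learn_template` is hypothetical and the sketch is not the adaptive learning process of the disclosure itself.

```python
import re


def learn_template(record_a, record_b):
    """Derive a regular expression from two records of the same kind.

    Tokens that are identical in both records are kept as constants; tokens that
    differ are treated as variables and replaced by a capturing group.
    Returns None when the records have a different number of tokens.
    """
    tokens_a, tokens_b = record_a.split(), record_b.split()
    if len(tokens_a) != len(tokens_b):
        return None
    parts = []
    for token_a, token_b in zip(tokens_a, tokens_b):
        parts.append(re.escape(token_a) if token_a == token_b else r"(\S+)")
    return re.compile(r"\s+".join(parts))


# Hypothetical historical records used as training input.
learned = learn_template(
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
)
variables = learned.fullmatch("user carol logged in from 192.168.1.7").groups()
```

The learned rule treats the user name and the address as variables, so the new record yields `("carol", "192.168.1.7")`. Repeating the comparison over more records would refine which positions are truly variable.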
According to an embodiment, determining whether a log profile repository includes a structured log profile corresponding to a log data to generate a parsing rule may include: if a log profile repository includes a structured log profile corresponding to a log data, generating a corresponding parsing rule according to a corresponding structured log profile. In an example embodiment, if a log profile repository already includes a corresponding log profile, a corresponding parsing rule may be generated according to this log profile.
According to another embodiment, determining whether a log profile repository includes a structured log profile corresponding to a log data to generate a parsing rule may include: if a structured log profile corresponding to a log data is missing from a log profile repository, obtaining a structured log profile and parsing rule corresponding to a log data through an adaptive learning process, or receiving a structured log profile and parsing rule corresponding to a log data from a user. In an example embodiment, if a log profile repository does not include a corresponding log profile, a parsing rule may be received from a user. In an alternative embodiment, a corresponding parsing rule may be generated automatically through an adaptive learning process of a parsing rule.
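The two branches described above (profile found, profile missing) may be sketched as follows. The representation of the repository as a list of `(fields, pattern)` pairs, the rule that a profile “corresponds” to a record when its pattern matches, and the names `find_parsing_rule`, `parse` and `obtain_rule` are all assumptions of this sketch. The `obtain_rule` callback stands in for either the adaptive learning process or input from a user.

```python
import re


def find_parsing_rule(repository, raw_record, obtain_rule=None):
    """Traverse the repository and return the first rule that matches the record.

    If no stored rule matches, fall back to obtain_rule (learned or user-supplied)
    and store the new rule so that later records can reuse it.
    """
    for fields, pattern in repository:
        if pattern.match(raw_record):
            return fields, pattern
    if obtain_rule is None:
        return None
    rule = obtain_rule(raw_record)
    if rule is not None:
        repository.append(rule)
    return rule


def parse(raw_record, rule):
    """Convert a raw record into structured data using a (fields, pattern) rule."""
    fields, pattern = rule
    return dict(zip(fields, pattern.match(raw_record).groups()))


# Hypothetical repository with one known profile.
repository = [(("level", "message"), re.compile(r"(\w+): (.+)"))]

known = parse("ERROR: disk full", find_parsing_rule(repository, "ERROR: disk full"))


def user_rule(raw_record):
    """Stand-in for a rule received from a user for an unknown log format."""
    return (("pid", "message"), re.compile(r"\[(\d+)\] (.+)"))


new_rule = find_parsing_rule(repository, "[42] started", obtain_rule=user_rule)
unknown = parse("[42] started", new_rule)
```

After the second call the repository holds two profiles, so a later record of the same format is resolved by traversal alone.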
According to an embodiment, method 100 further includes: prior to traversing a log profile repository, generating a structured log profile and a corresponding parsing rule according to a log configuration accessible to a device receiving a log data. In one embodiment, during software development, the Log4j tool may be used to assist in generating a log. In a further embodiment, if a configuration file for Log4j can be obtained, a log generating rule may be obtained. In a further embodiment, Log4j (http://logging.apache.org/log4j) is powerful logging software that uses a grammar description layout. In an example embodiment, “%-5p [%t]: %m%n” may generate log severity in 5 characters + [thread name]: message + line break. In a further embodiment, during a process of log conversion, if a log configuration (such as a Log4j configuration) used for generating a log data can be obtained, the Log4j configuration may be used to generate a log profile and a corresponding parsing rule. In a further embodiment, if the Log4j configuration cannot be obtained, a log profile repository continues to be traversed to obtain a structured log profile and a parsing rule.
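As a hedged sketch of how a log configuration might be turned into a log profile and parsing rule, the following code converts the layout string from the example above, written in Log4j conversion-pattern syntax as `%-5p [%t]: %m%n`, into a field list and a regular expression. Only the conversion characters `p`, `t`, `m` and `n` with an optional width modifier are handled; the function name `layout_to_regex`, the field names and the per-field expressions are assumptions of this sketch, not part of Log4j.

```python
import re

# Hypothetical mapping from conversion character to (field name, expression).
CONVERSIONS = {
    "p": ("level", r"[A-Z]+"),
    "t": ("thread", r"[^\]]+"),
    "m": ("message", r".*"),
}

# A conversion specifier: '%', an optional width modifier, one conversion character.
TOKEN = re.compile(r"%(-?\d+)?([a-zA-Z])")


def layout_to_regex(layout):
    """Build (fields, compiled pattern) from a simple pattern layout string."""
    fields, parts, pos = [], [], 0
    for token in TOKEN.finditer(layout):
        # Literal text between specifiers is matched verbatim.
        parts.append(re.escape(layout[pos:token.start()]))
        modifier, char = token.groups()
        if char == "n":
            parts.append(r"\n?")
        elif char in CONVERSIONS:
            name, expr = CONVERSIONS[char]
            fields.append(name)
            group = f"(?P<{name}>{expr})"
            if modifier and modifier.startswith("-"):
                group += r" *"          # left-justified: padding follows the value
            elif modifier:
                group = r" *" + group   # right-justified: padding precedes the value
            parts.append(group)
        else:
            raise ValueError(f"unsupported conversion character: {char}")
        pos = token.end()
    parts.append(re.escape(layout[pos:]))
    return fields, re.compile("".join(parts))


fields, pattern = layout_to_regex("%-5p [%t]: %m%n")
parsed = pattern.match("INFO  [main]: Server started\n").groupdict()
```

The derived profile has the fields `level`, `thread` and `message`, and the sample line (hypothetical) is parsed without consulting the log profile repository.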
Further referring to
Further referring to
According to an embodiment, an encoding rule may be Huffman encoding. In one embodiment, Huffman encoding constructs, according to the occurrence frequency of characters, prefix-free code words with a shortest average length, and may be a typical lossless compression encoding. In a further embodiment, Huffman encoding may be used to encode converted structured data in order to further reduce information entropy of a log.
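For illustration only, a textbook Huffman construction over the values of one column may be sketched as follows. Applying the codes to whole field values rather than to single characters is an assumption of this sketch, as are the function name `huffman_codes` and the sample column.

```python
import heapq
from collections import Counter


def huffman_codes(values):
    """Build a prefix-free binary code in which frequent values get shorter code words."""
    counts = Counter(values)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct value still needs a one-bit code word.
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {value: code built so far}).
    heap = [(freq, index, {value: ""})
            for index, (value, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {value: "0" + code for value, code in codes_a.items()}
        merged.update({value: "1" + code for value, code in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical column of http_method values.
column = ["GET"] * 5 + ["POST"] * 2 + ["PUT"]
codes = huffman_codes(column)
total_bits = sum(len(codes[value]) for value in column)
```

In this example the most frequent value receives a one-bit code word and the two rarer values receive two-bit code words, so the eight values occupy 11 bits in total.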
In a further embodiment, for a timestamp of a log, a Timestamp Formalization Module may determine a time offset based on a base time, and an offset may be, for example, a time difference between a current log record and a previous log record. In a further embodiment, for each segment (column) of a log record, a segment encoding module uses a corresponding encoding rule to generate an encoded formalized log. In a further embodiment, an encoded log may be a series of encoded log vectors. In an example embodiment, encoding rules may be generated by a training process as disclosed in
In one embodiment, a method for log storage optimization according to principles of the present disclosure improves log storage efficiency (information entropy is reduced) and log analysis efficiency (a log is converted into the form of a structured vector) through a context-aware matching approach and a segment (column)-based encoding approach. In a further embodiment, a method may be adopted to perform compression and deduplication for logs again. A further embodiment may include generating encoded structured data, and structured data may be easily compared. In an example embodiment, an encoded structured data may be further analyzed by using analysis technologies such as an association rule. In a further embodiment, a method according to embodiments of the present disclosure may fit log analysis very well.
According to one embodiment, apparatus 1400 may further include: a traversing unit that may be configured to traverse a log profile repository after receiving log data, and a determining unit that may be configured to determine whether a log profile repository includes a structured log profile corresponding to a log data to generate a parsing rule, wherein the structured log profile repository may be used to store a converted structured data.
According to another embodiment, the determining unit may be further configured to: if a log profile repository includes a structured log profile corresponding to a log data, generate a corresponding parsing rule according to a corresponding structured log profile. According to another embodiment, the determining unit may be further configured to: if a structured log profile corresponding to a log data is missing from a log profile repository, obtain a structured log profile and parsing rule corresponding to the log data through an adaptive learning process, or receive a structured log profile and parsing rule corresponding to the log data from a user.
According to one embodiment, apparatus 1400 may further include: a log configuration detecting unit that may be configured to, prior to traversing a log profile repository, generate a structured log profile and a corresponding parsing rule according to a log configuration accessible to a device receiving the log data. According to another embodiment, a structured log profile may at least include a timestamp or content data of a log. According to a further embodiment, a parsing rule may be a regular expression or a string template.
According to one embodiment, apparatus 1400 may further include: a timestamp encoding unit that may be configured to set a base time after a parsing rule is used to convert a log data into a structured data, to determine a time difference between a timestamp of each log and a base time, and to replace a timestamp data in a structured data with a time difference. According to another embodiment, a base time may be a timestamp of a first log or periodicity-based time.
According to one embodiment, encoding unit 1406 may be further configured to: for each type of value in a structured data, determine an occurrence frequency of each value in a same type of value to generate an encoding rule. According to another embodiment, encoding unit 1406 may be further configured to encode a value having a larger occurrence frequency as a number having a shorter length, wherein an occurrence frequency may be proportional to occurrence times. According to a further embodiment, encoding unit 1406 may be further configured to encode a value having a maximum occurrence frequency as a number “1”. According to one embodiment, encoding unit 1406 may be further configured to: generate automatically an encoding rule according to an adaptive learning process of the encoding rule. According to another embodiment, an encoding rule may be Huffman encoding. According to one embodiment, optionally, apparatus 1400 may further include: a storage unit 1408 that may be configured to store an encoded structured data in the form of a log vector after encoding a structured data using an encoding rule.
It should be appreciated that apparatus 1400 may be implemented in various manners. For example, in some embodiments, apparatus 1400 may be implemented in software, hardware or a combination thereof. The hardware part can be implemented by special logic; the software part can be stored in a memory and executed by a proper instruction execution system such as a microprocessor or specially designed hardware. Those skilled in the art may understand that the above method and system may be implemented with computer-executable instructions and/or in processor control code, for example, such code is provided on a carrier medium such as a magnetic disk, CD, or DVD-ROM, or a programmable memory such as a read-only memory or a data carrier such as an optical or electronic signal carrier. The apparatus and its units in the embodiments of the present disclosure may be implemented by hardware circuitry of a programmable hardware device, such as a very large scale integrated circuit or gate array, a semiconductor such as a logic chip or transistor, or a field-programmable gate array, or a programmable logic device, or implemented by software executed by various kinds of processors, or implemented by a combination of the above hardware circuitry and software.
It should be noted that although a plurality of units or sub-units of the apparatus have been mentioned in the above detailed description, such partitioning is merely exemplary and non-compulsory. Actually, according to embodiments of the present invention, features and functions of the above described two or more units may be embodied in one unit. In turn, features and functions of the above described one unit may be further embodied in many units.
The computer device as shown in
What are described above are only optional embodiments of the present disclosure and not intended to limit embodiments of the present disclosure. For those skilled in the art, embodiments of the present disclosure may have various modifications and variations. Any modifications, equivalent substitutions and improvements made within the spirit and principle of embodiments of the present disclosure all should be included in the protection scope of embodiments of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
CN201510589748.4 | Sep 2015 | CN | national |