This application claims the benefit of Korean Patent Application No. 10-2019-0140901 filed on Nov. 6, 2019, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
The present inventive concept relates to a method and apparatus for generating a summary of a URL. More specifically, it relates to a method and apparatus for generating a summary of a URL for URL clustering.
As the threat of cyber attack or hacking increases in recent years, many organizations and companies are making great efforts to detect cyber attacks or hacking attempts in advance by analyzing URL (Uniform Resource Locator) logs accessed from the outside through networks. This is done in a manner such that normal logs and malicious logs are classified among the collected URL logs, and when a malicious log is detected, a warning is issued or a corresponding action is taken.
In order to effectively detect malicious logs among URL logs that are collected in volumes of billions or more per day, a technology that can automatically cluster similar URLs is essential. Various methods have been tried in the related art for URL clustering. For example, the following methods were commonly used: clustering URLs having similar texts through natural language processing algorithms, clustering similar URLs using an algorithm for calculating a distance between strings such as the Euclidean distance calculation formula, or clustering similar URLs using machine learning algorithms such as K-means clustering.
However, these conventional methods divide characters included in a URL into word units or process and analyze texts based on a morpheme of a natural language. Therefore, structural characteristics of the URL were not properly reflected in the process of preprocessing the text. Moreover, they were somewhat unsuitable for the security log field where it is necessary to analyze URL logs in character units.
In addition, the conventional methods mainly determined a degree of similarity based on a vector distance of texts included in the URL. However, URLs are generally characterized by the type, shape, or length of the text rather than by the vector distance (or semantic similarity) of the text. Therefore, it was difficult to properly cluster URLs in the conventional way.
In particular, among the conventional clustering methods, machine learning-based methods have the problem that training takes a long time. Moreover, when a new URL is collected, the model had to be retrained on the entire set of URLs, including the existing URLs, in order to reflect the new URL. Therefore, such methods were not suitable for the security log field, which requires real-time clustering.
Aspects of the inventive concept provide a method and apparatus for generating a summary of a URL for URL clustering by reflecting structural characteristics of the URL.
Aspects of the inventive concept also provide a method and apparatus for analyzing a URL log in character units and generating a summary of a URL based on the type, shape, or length of a URL text.
Aspects of the inventive concept also provide a method and apparatus for generating a summary of a URL that may contribute to real-time clustering because an operation time is short and new data may be immediately reflected.
However, aspects of the inventive concept are not restricted to those set forth herein. The above and other aspects of the inventive concept will become more apparent to one of ordinary skill in the art to which the inventive concept pertains by referencing the detailed description of the inventive concept given below.
According to aspects of the inventive concept, a method for generating a summary of a URL (Uniform Resource Locator) is performed by a computer device and comprises obtaining a URL, parsing the URL to extract a plurality of fields from the URL, generating attribute information indicating characteristics of each field for the plurality of fields, and generating a summary of the URL using the attribute information.
According to aspects of the inventive concept, an apparatus for generating a summary of a URL comprises a processor, a memory for loading a computer program executed by the processor, and a storage for storing the computer program, wherein the computer program comprises instructions for performing operations to obtain a URL, parse the URL to extract a plurality of fields from the URL, generate attribute information indicating characteristics of each field for the plurality of fields, and generate a summary of the URL using the attribute information.
According to aspects of the inventive concept, a computer program is stored on a computer-readable recording medium, the computer program is combined with a computing device to execute a method for generating a summary of a URL, and the computer program executes the method to obtain a URL, parse the URL to extract a plurality of fields from the URL, generate attribute information indicating characteristics of each field for the plurality of fields, and generate a summary of the URL using the attribute information.
According to various embodiments of the inventive concept described above, a summary of a URL for URL clustering may be generated by reflecting structural characteristics of the URL.
In addition, a URL log may be analyzed in character units, and a summary of a URL may be generated based on the type, shape, or length of a URL text and provided for URL clustering.
Also, since an operation time required to generate a URL summary is short and new data may be immediately reflected, it is possible to contribute to real-time URL clustering.
The benefits of the inventive concept are not limited to the benefits mentioned above, and other benefits not mentioned may be clearly understood by those skilled in the art from embodiments of the inventive concept.
These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings in which:
Hereinafter, preferred embodiments of the present disclosure will be described with reference to the attached drawings. Advantages and features of the present disclosure and methods of accomplishing the same may be understood more readily by reference to the following detailed description of preferred embodiments and the accompanying drawings. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the disclosure to those skilled in the art, and the present disclosure will only be defined by the appended claims.
In adding reference numerals to the components of each drawing, it should be noted that the same reference numerals are assigned to the same components as much as possible even though they are shown in different drawings. In addition, in describing the present invention, when it is determined that the detailed description of the related well-known configuration or function may obscure the gist of the present invention, the detailed description thereof will be omitted.
Unless otherwise defined, all terms used in the present specification (including technical and scientific terms) may be used in a sense that can be commonly understood by those skilled in the art. In addition, the terms defined in the commonly used dictionaries are not ideally or excessively interpreted unless they are specifically defined clearly. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. In this specification, the singular also includes the plural unless specifically stated otherwise in the phrase.
In addition, in describing the component of this invention, terms, such as first, second, A, B, (a), (b), can be used. These terms are only for distinguishing the components from other components, and the nature or order of the components is not limited by the terms. If a component is described as being “connected,” “coupled” or “contacted” to another component, that component may be directly connected to or contacted with that other component, but it should be understood that another component also may be “connected,” “coupled” or “contacted” between each component.
Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The preprocessing unit 110 collects a URL from the outside, parses the URL, and preprocesses it into a form suitable for generating a URL summary. To this end, the preprocessing unit 110 first collects a URL through various external paths, for example, an IDS/IPS log 10, a web access log 20, a firewall log 30, or an APT log 40. Then, by parsing the URL, a plurality of predefined fields (e.g., a domain field, a path field, a file and extension field, or a parameter field) are extracted from the text constituting the URL. The fields extracted by the preprocessing unit 110 are provided to the URL summary generation unit 120.
The URL summary generation unit 120 generates attribute information indicating characteristics of each field for the plurality of fields provided by the preprocessing unit 110. Here, the URL summary generation unit 120 generates the attribute information so as to abbreviate the type, form, or length of the text included in each field without giving much weight to a linguistic meaning indicated by the text included in each field. A detailed method for generating attribute information for each field by the URL summary generation unit 120 will be described later in detail with reference to
The URL summary generation unit 120 generates a summary of the URL based on the generated attribute information. Here, the URL summary generation unit 120 may generate the summary of the URL by combining the generated attribute information of each of the fields. The URL summary generation unit 120 provides the generated URL summary to the clustering unit 130.
The clustering unit 130 clusters the URL based on the provided summary of the URL. Here, the clustering unit 130 clusters URLs such that URLs having the same summary belong to the same cluster. As an embodiment, when a URL summary is provided from the URL summary generation unit 120, the clustering unit 130 checks whether the provided URL summary is the same as the summary of an existing URL, and clusters the provided URL into the cluster of the existing URL when the summaries are the same. On the other hand, when the provided URL summary is different from that of the existing URL, the clustering unit 130 clusters the provided URL into a new cluster different from the cluster of the existing URL. The URL for which the clustering unit 130 has completed clustering and the clustering information of the URL may be stored in the URL storage 140.
According to the configurations of the embodiment described above, a summary of a URL is generated by reflecting structural characteristics of the URL, and the summary is provided for URL clustering. Therefore, URL clustering in which the structural characteristics of the URL are fully reflected becomes possible.
In addition, when generating the URL summary, the type, shape, or length of a URL text is analyzed in character units. Therefore, this approach may overcome the problems of the existing clustering methods, which were not suitable for the security log field.
Furthermore, unlike existing machine learning-based clustering, a URL summary is generated based on rules and applied to URL clustering. Therefore, an operation time required for URL summarization or clustering is short, and it is possible to immediately reflect new data.
In
In step S110, the apparatus 100 for generating the URL summary obtains a URL through various paths. For example, the apparatus 100 for generating the URL summary may obtain a plurality of URLs from an IDS/IPS log 10, a web access log 20, a firewall log 30, or an APT log 40.
In step S120, the apparatus 100 for generating the URL summary parses the obtained URL to extract a plurality of predefined fields from text constituting the URL. Here, the plurality of fields to be extracted may include a domain field, a path field, a file and extension field, or a parameter field.
For a more detailed description of this, the related description will be continued with reference to
First, referring to
For example, as shown in
With reference to this point, the preprocessing unit 110 parses the URL 50 and analyzes which fields correspond to text portions 51, 52, 53, 54, and 55 of the URL 50. Then, based on the result of the analysis, the fields to be extracted are extracted from the URL 50 according to a predetermined criterion. For example, the preamble field 51 does not contribute to characterizing the URL, so it does not need to be extracted and is excluded from the extraction target. On the other hand, the domain field 52, the path field 53, the file and extension field 54, and the parameter field 55 are included in the extraction target because they characterize the corresponding URL 50. In this way, the preprocessing unit 110 extracts the predetermined fields 52, 53, 54, and 55 from the original text of the URL 50.
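The field extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name extract_fields, the dictionary keys, the use of Python's urllib.parse.urlsplit, and the sample URL are all hypothetical, and the sketch assumes that each URL carries a preamble such as "http://" so that the domain can be located.

```python
from urllib.parse import urlsplit


def extract_fields(url):
    """Split a URL into the four extraction-target fields.

    The preamble (scheme) is parsed but deliberately not returned,
    mirroring the idea that it does not characterize the URL.
    """
    parts = urlsplit(url)
    # Everything up to and including the last "/" is treated as the path
    # field; whatever follows it is the file and extension field.
    directory, slash, filename = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "path": directory + slash,
        "file": filename,
        # Keep the leading "?" so the parameter field retains its delimiter.
        "params": "?" + parts.query if parts.query else "",
    }


fields = extract_fields("http://www.example.com/board/2019/view.php?id=abc")
# fields["domain"] -> "www.example.com"
# fields["path"]   -> "/board/2019/"
# fields["file"]   -> "view.php"
# fields["params"] -> "?id=abc"
```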
In step S130, the apparatus 100 for generating the URL summary generates attribute information representing characteristics of each field for the extracted fields 52, 53, 54, and 55.
First, referring to
Then, in step S132, the URL summary generation unit 120 generates attribute information indicating characteristics of each field based on the type or length of characters included in the extracted fields 52, 53, 54, and 55. Here, the URL summary generation unit 120 may generate the attribute information so as to abbreviate the type, form, or length of the text included in each field without giving much weight to a linguistic meaning indicated by the text included in each field. However, exceptionally, since the domain field 52 itself represents a unique characteristic, it is assumed that text included in the domain field 52 is maintained as it is when generating attribute information.
When generating attribute information for the remaining fields 53, 54, and 55, the attribute information may be generated by applying a different rule to each of the fields 53, 54, and 55 so as to reflect unique characteristics of each of the fields 53, 54, and 55.
As an embodiment, when generating attribute information for the path field 53, the URL summary generation unit 120 may refer to one or more characters consecutively positioned in the path field 53, and construct the attribute information with an identification character representing the type and a number representing a length. Here, the identification character may be a character indicating whether the type of characters included in the path field 53 is an alphabetic character, a numeric character, or a special character. For example, if the characters are alphabetic, the identification character is “A,” the initial letter of “alphabet”; if the characters are numeric, the identification character is “N,” the initial letter of “number”; and if the characters are special characters, the identification character is “S,” the initial letter of “special character.”
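The path rule above amounts to a run-length encoding over character types, which can be sketched as below. The function names char_type and encode_path and the sample path are assumptions for illustration; in particular, treating the "/" separator as an ordinary special character (so that it becomes "S1") and restricting "A" and "N" to ASCII letters and digits are choices of this sketch rather than details confirmed by the description.

```python
from itertools import groupby


def char_type(ch):
    """Map one character to its identification character."""
    if ch.isascii() and ch.isalpha():
        return "A"  # alphabetic character
    if ch.isascii() and ch.isdigit():
        return "N"  # numeric character
    return "S"      # anything else is treated as a special character


def encode_path(path):
    """Encode each run of same-type characters as type + run length."""
    return "".join(
        f"{kind}{len(list(run))}" for kind, run in groupby(path, key=char_type)
    )


encode_path("/board/2019/")  # -> "S1A5S1N4S1"
```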
For example, referring to
As an embodiment, when generating attribute information for the file and extension field 54, the URL summary generation unit 120 may configure the attribute information such that the file name of the file and extension field 54 is represented by an identification character indicating the type of characters constituting the file name and a number indicating the length of those characters, while the original text of the extension is maintained, since the extension itself carries a characteristic meaning.
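A minimal sketch of this file and extension rule is shown below. The name encode_file and the sample file names are hypothetical, and two details are assumptions of the sketch: the extension is taken to be the text after the last ".", and the "." itself is kept together with the extension.

```python
from itertools import groupby


def _kind(ch):
    """Identification character for a single character (ASCII-based)."""
    if ch.isascii() and ch.isalpha():
        return "A"
    if ch.isascii() and ch.isdigit():
        return "N"
    return "S"


def encode_file(file_field):
    """Abbreviate the file name as type + length; keep the extension as-is."""
    name, dot, ext = file_field.rpartition(".")
    if not dot:
        # No "." present: the whole field is a file name without extension.
        name, ext = file_field, ""
    encoded = "".join(
        f"{kind}{len(list(run))}" for kind, run in groupby(name, key=_kind)
    )
    return encoded + dot + ext


encode_file("view.php")     # -> "A4.php"
encode_file("index2.html")  # -> "A5N1.html"
```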
For example, referring to
As an embodiment, when generating attribute information for the parameter field 55, the URL summary generation unit 120 may include, in the attribute information, only identification characters indicating the type of consecutive characters of the same type in the parameter field 55, and not the length of the characters. This is because the length of a key or value included in the parameter field 55 is generally not important when analyzing a URL for attacks. Meanwhile, since special characters included in the parameter field 55 may have meaning in URL classification, it is assumed that their original text is maintained.
For example, referring to
Alphabetical character “A” corresponding to the value 55b appearing next is also represented by the identification character “A” indicating the type. Also, the rest of the special characters remain the same. Accordingly, in the attribute information 65 of the parameter field 55, the special character “?,” the identification character “A” for the key, the special character “=,” and the identification character “A” for the value are combined in order as “?A=A.” A series of processes for generating attribute information described above may be performed by an attribute information generation unit 122 of
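The parameter rule can be sketched with a single regular-expression pass, as below. The function name encode_params and the sample query strings are assumptions for illustration; the sketch collapses each run of ASCII letters to "A" and each run of ASCII digits to "N", drops the lengths, and leaves every other character untouched.

```python
import re

# One alternation so letters and digits are handled in a single pass;
# two separate substitutions could re-match already inserted "A"/"N".
_RUN = re.compile(r"[A-Za-z]+|[0-9]+")


def encode_params(params):
    """Replace letter runs with "A" and digit runs with "N"; keep specials."""
    return _RUN.sub(
        lambda m: "A" if m.group()[0].isalpha() else "N", params
    )


encode_params("?id=abc")           # -> "?A=A"
encode_params("?id=123&name=bob")  # -> "?A=N&A=A"
```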
Returning to
Referring to
Such a URL summary may itself function as a URL cluster. For example, if the summaries of two URLs are identical, the two URLs have exactly the same text structure in the domain name as well as in the file path, the file name, and the query string. Given the nature of URL syntax, two such URLs are likely to be closely related to each other. Therefore, it is possible to manage URLs having the same URL summary in the same cluster.
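To make the idea of identical summaries concrete, the following end-to-end sketch derives a summary from a URL and compares two URLs. It is an illustration under assumptions, not the disclosed implementation: the name summarize, the sample URLs, the ASCII-based character typing, and above all the choice to join the attribute information of the fields by plain concatenation are hypothetical.

```python
import re
from itertools import groupby
from urllib.parse import urlsplit


def _kind(ch):
    if ch.isascii() and ch.isalpha():
        return "A"
    if ch.isascii() and ch.isdigit():
        return "N"
    return "S"


def _runs(text):
    """Type + length for each run of same-type characters."""
    return "".join(f"{k}{len(list(g))}" for k, g in groupby(text, key=_kind))


def summarize(url):
    parts = urlsplit(url)  # the preamble (scheme) is not used in the summary
    directory, slash, filename = parts.path.rpartition("/")
    name, dot, ext = filename.rpartition(".")
    if not dot:
        name, ext = filename, ""
    params = "?" + parts.query if parts.query else ""
    return "".join([
        parts.netloc,                # domain: kept as-is
        _runs(directory + slash),    # path: type + length
        _runs(name) + dot + ext,     # file: type + length, extension kept
        re.sub(r"[A-Za-z]+|[0-9]+",  # parameters: type only, specials kept
               lambda m: "A" if m.group()[0].isalpha() else "N", params),
    ])


a = summarize("http://www.example.com/board/2019/view.php?id=abc")
b = summarize("http://www.example.com/board/2020/list.php?no=xyz")
# a == b == "www.example.comS1A5S1N4S1A4.php?A=A"
```

Under these assumptions the two sample URLs differ in their literal text but share one summary, so they would be managed in the same cluster.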
Hereinafter, a method for managing a cluster of URLs based on a URL summary will be described.
In step S151, the apparatus 100 for generating the URL summary clusters URLs so that URLs having the same URL summary are grouped (or clustered). As described above, identical URL summaries mean that the characteristics of the URL syntax are the same, so such URLs may be classified and managed in the same cluster. According to this method, once a summary of a URL is generated, no separate operation process for clustering is required. Therefore, the URL is in effect clustered automatically, and URLs may be quickly clustered in real time. In addition, even when a new URL is collected, clustering may be performed immediately by just generating its URL summary. Step S151 may be performed by a cluster management unit 132 of
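Because the summary itself serves as the cluster key, the grouping in step S151 can be sketched as a dictionary keyed by summary. The function name cluster_by_summary and the placeholder URLs and summaries below are hypothetical; the sketch assumes each URL arrives already paired with its summary.

```python
from collections import defaultdict


def cluster_by_summary(pairs):
    """Group URLs so that URLs with the same summary share one cluster.

    `pairs` is an iterable of (url, summary) tuples.
    """
    clusters = defaultdict(list)
    for url, summary in pairs:
        clusters[summary].append(url)  # the summary is the cluster key
    return dict(clusters)


clusters = cluster_by_summary([
    ("url-1", "summary-x"),
    ("url-2", "summary-y"),
    ("url-3", "summary-x"),
])
# clusters -> {"summary-x": ["url-1", "url-3"], "summary-y": ["url-2"]}
```

No distance computation or training step appears in the sketch, which is the point of the paragraph above: one dictionary lookup per URL is enough.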
In step S152, the apparatus 100 for generating the URL summary may store the URL and its summary in the URL storage 140 as a result of clustering. Referring to
In step S153, the apparatus 100 for generating the URL summary labels the URL according to the result of clustering the URL. In some cases, separate labeling may be required for URLs clustered by a URL summary. For example, when it is determined that URLs with a specific summary are related to malicious logs (or when a summary of URLs determined to be related to malicious logs is identified), one may manage potential cyber attacks or threats by labeling the URLs having that summary as “malicious log.” In this case, since URLs are clustered and labeled based on a URL summary, the same labeling is made for all URLs having the same URL summary. Step S153 may be performed by a labeling unit 133 of
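Labeling by summary can be sketched as a lookup from summary to label that is applied to every URL in the corresponding cluster. The function name label_urls, the default label "unlabeled", and the placeholder data are assumptions of this sketch.

```python
def label_urls(clusters, summary_labels, default="unlabeled"):
    """Give every URL the label assigned to its cluster's summary.

    `clusters` maps summary -> list of URLs;
    `summary_labels` maps summary -> label for the summaries that have one.
    """
    labels = {}
    for summary, urls in clusters.items():
        label = summary_labels.get(summary, default)
        for url in urls:
            labels[url] = label  # same label for all URLs in the cluster
    return labels


labels = label_urls(
    {"summary-x": ["url-1", "url-3"], "summary-y": ["url-2"]},
    {"summary-x": "malicious log"},
)
# labels -> {"url-1": "malicious log", "url-3": "malicious log",
#            "url-2": "unlabeled"}
```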
In step S210, the apparatus 100 for generating the URL summary generates a summary of a new URL. Since a method for generating a new URL summary is the same as the method for generating the URL summary described in
In step S220, the apparatus 100 for generating the URL summary compares the summary of the new URL with the summary of an existing URL to determine whether they are identical. If they are the same, the present embodiment proceeds to step S230. Otherwise, the present embodiment proceeds to step S240.
In step S230, the apparatus 100 for generating the URL summary clusters the new URL into the same cluster as the existing URL. In embodiments of the present disclosure, clustering is performed based on a summary of a URL. Therefore, when a summary of a new URL is the same as a summary of an existing URL, it is automatically clustered into the same cluster (i.e., a cluster grouped by one summary as shown in
Meanwhile, when the process proceeds to step S240, the apparatus 100 for generating the URL summary clusters the new URL into a new cluster grouped by its URL summary. In this case, since there is no existing URL summary identical to that of the new URL, a new cluster is naturally created with the summary of the new URL.
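The branch between steps S230 and S240 can be sketched as a small in-memory store that either appends a new URL to an existing cluster or opens a new one. The class name ClusterStore, its add method, the boolean return value, and the placeholder data are hypothetical; persistent storage is left out of the sketch.

```python
class ClusterStore:
    """Keeps clusters keyed by URL summary and reflects new URLs at once."""

    def __init__(self):
        self.clusters = {}  # summary -> list of URLs

    def add(self, url, summary):
        """Cluster one new URL; return True if a new cluster was created."""
        is_new_cluster = summary not in self.clusters  # comparison (S220)
        # Existing summary: join that cluster (S230).
        # Unseen summary: a new cluster is created for it (S240).
        self.clusters.setdefault(summary, []).append(url)
        return is_new_cluster


store = ClusterStore()
store.add("url-1", "summary-x")  # -> True  (new cluster)
store.add("url-2", "summary-x")  # -> False (joins existing cluster)
store.add("url-3", "summary-y")  # -> True  (new cluster)
```

Returning whether a new cluster was created is a design choice of the sketch; it makes it easy for a caller to keep track of how often new clusters appear.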
In step S250, the apparatus 100 for generating the URL summary detects whether there is abnormal access or cyber attack from the outside based on an occurrence trend of new clusters including the new cluster. For description of step S250, refer to
Referring to
Hereinafter, an exemplary computing device 2000 that can implement an apparatus and a system according to various embodiments of the present disclosure will be described with reference to
As shown in
The processor 2100 controls overall operations of each component of the computing device 2000. The processor 2100 may be configured to include at least one of a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Micro Controller Unit (MCU), a Graphics Processing Unit (GPU), or any type of processor well known in the art. Further, the processor 2100 may perform calculations on at least one application or program for executing a method/operation according to various embodiments of the present disclosure. The computing device 2000 may have one or more processors.
The memory 2200 stores various data, instructions and/or information. The memory 2200 may load one or more programs 2210 from the storage 2300 to execute methods/operations according to various embodiments of the present disclosure. An example of the memory 2200 may be a RAM, but is not limited thereto.
The bus 2500 provides communication between components of the computing device 2000. The bus 2500 may be implemented as various types of bus such as an address bus, a data bus and a control bus.
The communication interface 2400 supports wired and wireless internet communication of the computing device 2000. The communication interface 2400 may support various communication methods other than internet communication. To this end, the communication interface 2400 may be configured to comprise a communication module well known in the art of the present disclosure.
The storage 2300 can non-transitorily store one or more computer programs 2310. The storage 2300 may be configured to comprise a non-volatile memory, such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a flash memory, a hard disk, a removable disk, or any type of computer readable recording medium well known in the art.
The computer program 2210 may include one or more instructions, on which the methods/operations according to various embodiments of the present disclosure are implemented.
As an example, the computer program 2210 may comprise instructions for performing operations to obtain a URL, parse the URL to extract a plurality of fields from the URL, generate attribute information indicating characteristics of each field for the plurality of fields, generate a summary of the URL using the attribute information, and cluster the URL based on the summary of the URL.
As another example, the computer program 2210 may comprise instructions for performing operations to generate a summary of a new URL, compare the summary of the new URL with a summary of an existing URL, cluster the new URL into the same cluster as the existing URL when the summary of the new URL is the same as the summary of the existing URL, cluster the new URL into a new cluster when the summary of the new URL is different from the summary of the existing URL, and detect abnormal access or cyber attack from the outside based on a trend of occurrence of clusters including the new cluster.
When the computer program 2210 is loaded on the memory 2200, the processor 2100 may perform the methods/operations in accordance with various embodiments of the present disclosure by executing the one or more instructions.
The technical features of the present disclosure described so far may be embodied as computer readable codes on a computer readable medium. The computer readable medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, computer equipped hard disk). The computer program recorded on the computer readable medium may be transmitted to another computing device via a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.
Although the operations are shown in a specific order in the drawings, those skilled in the art will appreciate that many variations and modifications can be made to the preferred embodiments without substantially departing from the principles of the present invention. Therefore, the disclosed preferred embodiments of the invention are used in a generic and descriptive sense only and not for purposes of limitation. The scope of protection of the present invention should be interpreted by the following claims, and all technical ideas within the scope equivalent thereto should be construed as being included in the scope of the technical idea defined by the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2019-0140901 | Nov 2019 | KR | national |