Data from data sources may be maintained in tables in a predetermined format. In many instances, the format of the data, such as names of columns of tables, may vary based on the data sources.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
FIG. 1 depicts a block diagram of an apparatus that may extract a characteristic of a column based on tabular data of a data source and determine, through application of modeling, a recommended column type for the column from a predefined table format, in accordance with an embodiment of the present disclosure;
FIG. 3 shows a block diagram of example tables, which may be implemented in the system depicted in
For simplicity and illustrative purposes, the principles of the present disclosure are described by referring mainly to embodiments and examples thereof. In the following description, numerous specific details are set forth in order to provide an understanding of the embodiments and examples. It will be apparent, however, to one of ordinary skill in the art, that the embodiments and examples may be practiced without limitation to these specific details. In some instances, well known methods and/or structures have not been described in detail so as not to unnecessarily obscure the description of the embodiments and examples. Furthermore, the embodiments and examples may be used together in various combinations.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
A computing device may receive and store data from a data source or from multiple data sources to provide services based on analysis of the received data. In some examples, the computing device may provide security services, for instance, next-generation security information and event management solutions (NG-SIEM), which may provide real-time analysis of security alerts generated by applications and network hardware of the data sources.
A concern associated with maintaining and analyzing data from data sources may be “understanding the data.” For instance, different data sources may include various types of data in various formats, and the formats may differ from one data source to another. In such instances, providing out-of-the-box (OOTB) functionality for services, such as NG-SIEM, may be difficult because of the variety of formats used by the different data sources.
By way of particular example to illustrate such issues, a computing device may receive data from a virtual private network (VPN) data source to provide advanced analytics to the VPN data source. In this example, the computing device may not be aware of a format of the received data because, for instance, each data source may have one of a number of data formats. For instance, a log received from a first data source may have a predefined data format, such as a predefined set of columns and/or fields, which may be different than that of a second data source. In this example, the first data source and the second data source may both have a column for user names, but the first data source may have a predefined data format in which the user name column is called “username,” whereas the second data source may call the column “user.name.” In these instances, if the computing device is not aware of the data format for the particular data source, the computing device may not be able to provide all of the services that it may be able to provide. As such, it may be difficult to scale support for a large number of different data sources, which may have different data formats.
To address such issues, the computing device may normalize the data from the different data sources to a predefined format that can be understood by the computing device. However, normalization may be a difficult process because, in many cases, an administrator may need to perform the normalization process manually. For instance, a user at a particular data source that wishes to gain all of the OOTB value provided by an NG-SIEM solution at the computing device may parse different fields in their data based on a normalized schema of the computing device. As used herein, a normalized schema may refer to a predefined format for the data that the service on the computing device expects, such as a predefined format for table columns and fields, and/or the like. In some examples, the data from different data sources may be normalized by parsing the data, for instance, by rearranging or reformatting the data to be in the same format as the normalized schema.
However, in these instances, it may be inefficient to normalize the incoming data, particularly in cases where an administrator may need to manually identify the incoming data. A technical issue with normalizing incoming data may be that conventional techniques for normalizing the incoming data may be time and/or computing resource intensive in instances in which the incoming data includes a large volume of data for which the format may be different and/or unknown.
Disclosed herein are apparatuses, systems, methods, and computer-readable media that may enable efficient normalization of data from a data source to a predefined format of a normalized schema. As discussed herein, a processor may receive tabular data of a data source and extract a characteristic of a column based on the received tabular data. Based on the extracted characteristic of the column, the processor may determine, through application of modeling, a recommended column type from a predefined table format of the normalized schema. In some examples, the processor may output multiple recommended column types for a particular column in order to enable a user to more quickly and easily decide how to normalize the tabular data. The recommended column type may have at least a predetermined level of match to the extracted characteristic of the column. The processor may assign the recommended column type as a type of the column of the received tabular data to normalize the received tabular data to the predefined table format.
Through implementation of the features of the present disclosure, a processor may enable improved normalization of data from data sources, which may reduce latency and consumption of processing resources by leveraging machine learning models to automate the normalization process rather than performing manual normalization, which in turn may improve efficiency in on-boarding new data sources. A technical improvement afforded through implementation of the features of the present disclosure may be that the speed and accuracy with which managed services, such as security information and event management (SIEM) services, may be provided may be improved, which may also reduce energy and resource consumption in the normalization of the data to the predefined table format.
Reference is made to
The apparatus 100 may include a processor 102 and a memory 110. The apparatus 100 may be a computing device, including a server, a node in a network (such as a data center or a cloud computing resource), a desktop computer, a laptop computer, a tablet computer, a smartphone, an electronic device such as an Internet of Things (IoT) device, and/or the like. The processor 102 may include a semiconductor-based microprocessor, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), and/or other hardware device. In some examples, the apparatus 100 may include multiple processors and/or cores without departing from a scope of the apparatus. In this regard, references to a single processor as well as to a single memory may be understood to additionally or alternatively pertain to multiple processors and multiple memories.
The memory 110 may be an electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. The memory 110 may be, for example, Read Only Memory (ROM), flash memory, a solid state drive, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, or the like. The memory 110 may be a non-transitory computer-readable medium. The term “non-transitory” does not encompass transitory propagating signals.
As shown in
The apparatus 100 may be connected via a network 202, which may be the Internet, a local area network, and/or the like, to a server 204. In addition, a data store 206 may be connected to the server 204. In some examples, the server 204 may maintain a data source, such as the data sources 402 depicted in
The processor 102 may fetch, decode, and execute the instructions 112 to receive a table 208 of a data source. As used herein, the table 208 may be the same as “tabular data” that may form the table 208, and these terms may be used interchangeably. As used herein, tabular data may include content of columns/fields of tables as well as information about the columns/fields, such as information stored in metadata for the columns/fields. In some examples, the table 208 may be a log maintained at the data source. In some examples, the table 208 may include security and event information.
The processor 102 may fetch, decode, and execute the instructions 114 to extract a characteristic 214 of the column 210 in the received table 208 (or tabular data). In some examples, the column 210 in the table 208 may be arranged in a certain format, which may vary based on the data source. For instance, the column 210 may have an assigned column type 212, and in this case, the data in the table 208 may be arranged based on the column type 212. By way of particular examples, the column type 212 may include a name of the column, a data type of the column, and/or the like, such as a date of an operation, a type of an operation, an outcome of an operation, a user name, a user type, a device name, a domain, an address, and/or the like. In this regard, the format of the table 208, including the column type 212, may be unknown to the processor 102 when the table 208 is received and/or may be different than column types 218 of a predefined table format 216.
The column 210 may also include the characteristic 214 of the column 210. The characteristic 214 of the column 210 may include various types of properties of the column 210. In some instances, the characteristic 214 may be based on the column type 212. By way of particular examples and for purposes of illustration, the characteristic 214 may include properties of the column 210 such as a name of the column 210, a type of data in the column 210 such as numbers or text, a cardinality of distinct values of a field of the column 210, a property based on a regex such as how well a content of the column 210 fits a search pattern, a column data type, and/or the like. In some examples, the characteristic 214 may be extracted from metadata for the table 208.
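The kinds of properties listed above can be illustrated with a minimal Python sketch. The function name `extract_characteristics`, the dictionary keys, and the IPv4-style search pattern are assumptions made for illustration only; the disclosure does not prescribe a particular implementation.

```python
import re

def extract_characteristics(name, values, pattern=r"\d{1,3}(\.\d{1,3}){3}"):
    """Return simple properties of one column (hypothetical helper).

    name    -- column name as received from the data source
    values  -- list of cell values for that column
    pattern -- search pattern used for the regex-based property
               (an IPv4-like pattern is assumed here)
    """
    # Ignore empty cells so they do not distort the ratios.
    texts = [str(v) for v in values if v not in (None, "")]
    total = len(texts) or 1  # avoid division by zero for empty columns

    numeric = sum(1 for t in texts if re.fullmatch(r"-?\d+(\.\d+)?", t))
    matches = sum(1 for t in texts if re.fullmatch(pattern, t))
    distinct = len(set(texts))

    return {
        "name": name,
        # "number" only if every non-empty value looks numeric
        "data_kind": "number" if texts and numeric == len(texts) else "text",
        "distinct_count": distinct,          # cardinality of distinct values
        "distinct_ratio": distinct / total,  # cardinality relative to row count
        "regex_fit": matches / total,        # share of values fitting the pattern
    }

chars = extract_characteristics(
    "src_ip", ["10.0.0.1", "10.0.0.2", "10.0.0.1", "n/a"]
)
```

In this sketch, three of the four sample values fit the assumed pattern, so `regex_fit` expresses how well the content of the column fits the search pattern as a fraction between 0 and 1.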
The processor 102 may fetch, decode, and execute the instructions 116 to, based on the extracted characteristic 214, determine, through application of modeling, a recommended column type 220 from a predefined table format 216. As depicted in
By way of particular example and as depicted in
In some examples, the model 222 may be trained using a sample log from a data source. For instance, the received table 208 of the data source may be a sample log from the data source that includes a subset of the data from the data source, including the column 210, column type 212, and characteristics 214 of the column 210. The processor 102 may extract the characteristic 214 of the column 210, for instance, through metadata for the column 210 in the table 208, and may use the extracted characteristic 214 to train the model 222. The model 222 may be trained using any suitable machine learning technique, such as linear regression, logistic regression, decision tree, naive Bayes, kNN, and/or the like.
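As one hedged illustration of the kNN option mentioned above, the following Python sketch classifies a column from a small labeled sample. The three-element feature layout, the training values, and the column type `SOURCE_IP` are hypothetical; for kNN, "training" amounts to storing labeled examples, which may differ from how other listed techniques would be trained.

```python
import math
from collections import Counter

def knn_predict(training, vector, k=3):
    """Predict a column type by majority vote among the k nearest examples.

    training -- list of (feature_vector, column_type) pairs
    vector   -- feature vector of the column to classify
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Assumed feature layout: [distinct_ratio, is_numeric, ip_regex_fit]
training = [
    ([0.02, 1.0, 0.0], "EVENT_CODE"),
    ([0.05, 1.0, 0.0], "EVENT_CODE"),
    ([1.00, 1.0, 0.0], "EVENT_ID"),
    ([0.98, 1.0, 0.0], "EVENT_ID"),
    ([0.40, 0.0, 1.0], "SOURCE_IP"),  # hypothetical column type
    ([0.35, 0.0, 0.9], "SOURCE_IP"),
]

# A numeric column with very few distinct values.
predicted = knn_predict(training, [0.03, 1.0, 0.0])
```

Here the two closest stored examples are both labeled `EVENT_CODE`, so the vote selects that type for the low-cardinality numeric column.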
In some examples, the processor 102 may create a feature set 224 based on the extracted characteristic 214 to train the model 222. The feature set 224 may include the features correlated to the extracted characteristic 214 of the column 210. In some examples, the features of the feature set 224 of the column 210 may include a field type, a data type, a value of content in the column 210, a number of distinct values of data in the column 210, a regular expression (regex) of content in the column 210, and/or the like.
The processor 102 may generate a feature vector 226 for the column 210 based on the feature set 224. In some examples, the feature vector 226 may represent characteristics of the column 210 based on the features of the feature set 224. The feature vector 226 may be unique to the column 210. In some examples, the processor 102 may generate a unique feature vector 226 for each of the columns in the table 208 of the data source.
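A feature vector of this kind might be built as in the following Python sketch, which produces one vector per column of a table. The keyword list, the regex patterns, and the ordering of vector elements are assumptions chosen for illustration rather than details taken from the disclosure.

```python
import re

# Hypothetical patterns and name keywords used as features.
PATTERNS = {
    "ipv4": r"\d{1,3}(\.\d{1,3}){3}",
    "integer": r"-?\d+",
}
NAME_KEYWORDS = ("id", "user", "time")

def feature_vector(name, values):
    """Encode one column as a list of floats.

    Layout: one flag per name keyword, then the ratio of distinct
    values, then one regex-fit ratio per pattern in PATTERNS.
    """
    texts = [str(v) for v in values]
    total = len(texts) or 1
    lowered = name.lower()

    vector = [1.0 if kw in lowered else 0.0 for kw in NAME_KEYWORDS]
    vector.append(len(set(texts)) / total)
    for pattern in PATTERNS.values():
        fit = sum(1 for t in texts if re.fullmatch(pattern, t)) / total
        vector.append(fit)
    return vector

table = {
    "EVENT_ID": [4624, 4624, 4625, 4624],
    "user.name": ["alice", "bob", "alice", "carol"],
}
vectors = {name: feature_vector(name, vals) for name, vals in table.items()}
```

Each column receives its own vector, so columns that differ in name, cardinality, or content pattern map to different points in the feature space.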
The characteristic 214 of the column 210, which may be a feature of the feature set 224, may include various types of characteristics. For instance, the features in the feature set 224 may include column names correlated to the column, cardinality of values of data in the column, patterns of characters in the column, column data type correlated to the column, column content correlated to the column, and/or the like. By way of particular examples and for purposes of illustration,
Referring first to
Referring to
In some examples, multiple characteristics 214 or features may be applied to determine the recommended column type 220. Continuing with the previous example in which the Data Source 1 includes two columns that include the keyword “ID,” the processor 102 may apply the characteristic 214, for instance, based on the cardinality of values to further narrow the match to a column type 218 in the predefined table format 216. For instance, “EVENT_ID” in the received column 210 may have a limited number of possible values, while “EVENT_RECORD_ID” may have a different value for each event. In this case, based on the cardinality of values for “EVENT_ID”, the processor 102 may determine that “EVENT_ID” in the received column 210 has a relatively higher level of match with “EVENT_CODE” 410 in the predefined table format 216, rather than “EVENT_ID” 408, for example.
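The cardinality-based narrowing described above can be sketched in Python as follows. The 0.9 threshold, the sample values, and the rule that a near-unique column maps to “EVENT_ID” are assumptions for illustration; in practice the decision may be made by the model 222 rather than by a fixed rule.

```python
def distinct_ratio(values):
    """Share of distinct values among all values in a column."""
    return len(set(values)) / len(values) if values else 0.0

def disambiguate_id_column(values, threshold=0.9):
    """Hypothetical rule for columns whose names contain "ID".

    A column whose values are nearly all different is treated as a
    per-event identifier; a column with few repeated values is treated
    as a code drawn from a limited set.
    """
    return "EVENT_ID" if distinct_ratio(values) >= threshold else "EVENT_CODE"

# Sample values are invented for illustration.
source = {
    "EVENT_ID": [4624, 4625, 4624, 4624, 4634, 4624],
    "EVENT_RECORD_ID": [1001, 1002, 1003, 1004, 1005, 1006],
}
result = {name: disambiguate_id_column(vals) for name, vals in source.items()}
```

With these sample values, the source column “EVENT_ID” has only three distinct values in six rows and is matched to “EVENT_CODE”, whereas “EVENT_RECORD_ID” is unique per row.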
Referring to
In some examples, based on the extracted characteristic 214 of the column 210, the processor 102 may determine a ranking of column types 218 of the predefined table format 216 based on a respective level of match to the extracted characteristic 214 of the column 210. The processor 102 may select, as the recommended column type 220, one or more than one column type 218 of the predefined table format 216 having at least the predetermined level of match to the extracted characteristic 214 of the column 210. In some examples, the processor 102 may apply multiple features or characteristics 214 to rank the column types 218 for a match against the column 210.
In some examples, the processor 102 may output the recommended column type 220. In some examples, the processor 102 may output the recommended column type 220 for selection or confirmation by a user. For instance, the recommended column type 220 may be one among a plurality of recommended column types 220 that are output to the user. In some examples, the processor 102 may output the recommended column types 220 to a display device at the apparatus 100, at the server 204 for the data source, and/or the like. The recommended column types 220 that are output may be a predetermined number of top ranked column types 218 of the predefined table format 216 as determined via the model 222. In cases in which multiple recommended column types 220 are output to the user, the user may select one of the output recommended column types 220 to confirm the recommended column type 220 that matches the column 210.
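A ranking with a minimum level of match might look like the following Python sketch. Name similarity computed with the standard library `difflib.SequenceMatcher` is used here only as a stand-in for the score produced by the model 222; the schema contents, the 0.5 threshold, and the `top_n` limit are likewise assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical column types of a predefined table format.
SCHEMA = ["USER_NAME", "EVENT_CODE", "EVENT_ID", "SOURCE_IP"]

def rank_column_types(column_name, schema=SCHEMA, min_match=0.5, top_n=3):
    """Rank schema column types by level of match to a source column.

    Returns up to top_n (column_type, score) pairs whose score is at
    least min_match, ordered from best to worst match.
    """
    scored = [
        (SequenceMatcher(None, column_name.upper(), col_type).ratio(), col_type)
        for col_type in schema
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [(col_type, round(score, 2))
            for score, col_type in scored if score >= min_match]
    return kept[:top_n]

recommendations = rank_column_types("user.name")
```

Column types scoring below the threshold are dropped, so a source column with no sufficiently close counterpart yields an empty list rather than a weak recommendation.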
Based on a selection of one of the plurality of recommended column types 220, the processor 102 may update or retrain the model 222 to account for the selection, in order to improve the accuracy of the model 222. In this regard, the processor 102 may generate subsequent recommendations for column types based on the updated model 222.
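The feedback loop described above can be approximated with the following Python sketch, in which confirmed selections are counted and used to reorder later recommendations. The class `FeedbackModel` and its methods are hypothetical, and a simple counter stands in for retraining a machine learning model.

```python
from collections import Counter, defaultdict

class FeedbackModel:
    """Hypothetical stand-in for a model updated from user selections."""

    def __init__(self):
        # Maps a lowercased source column name to counts of confirmed types.
        self._confirmed = defaultdict(Counter)

    def record_selection(self, column_name, selected_type):
        """Account for a user confirming selected_type for column_name."""
        self._confirmed[column_name.lower()][selected_type] += 1

    def recommend(self, column_name, candidates):
        """Order candidates so previously confirmed types come first.

        sorted() is stable, so candidates with equal counts keep the
        order in which they were supplied.
        """
        history = self._confirmed[column_name.lower()]
        return sorted(candidates, key=lambda t: history[t], reverse=True)

model = FeedbackModel()
candidates = ["EVENT_ID", "EVENT_CODE"]

before = model.recommend("EVENT_ID", candidates)
model.record_selection("EVENT_ID", "EVENT_CODE")  # user picks EVENT_CODE
after = model.recommend("EVENT_ID", candidates)
```

After the selection is recorded, the confirmed type is listed first for the same source column, which mirrors how subsequent recommendations may reflect the updated model.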
The processor 102 may fetch, decode, and execute the instructions 118 to assign the recommended column type 220 as the column type 212 of the received table 208 to normalize the received table 208 to the predefined table format 216. In some examples, the processor 102 may replace the existing column type 212 with the recommended column type 220 in order to normalize the received table 208 according to the predefined table format 216. In some examples, the processor 102 may assign the recommended column type 220 to an appropriate column 210 in the table 208, without user intervention, to automate normalization of the table 208 to the predefined table format 216.
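The assignment step can be sketched in Python as a renaming of source columns to the recommended column types. The column-oriented dictionary representation of the table, the function name `normalize_table`, and the choice to leave unmatched columns unchanged are assumptions made for this illustration.

```python
def normalize_table(table, recommended):
    """Assign recommended column types to the columns of a table.

    table       -- dict mapping source column name to a list of values
    recommended -- dict mapping source column name to the column type
                   of the predefined table format
    Columns without a recommendation keep their original names.
    """
    normalized = {}
    for name, values in table.items():
        target = recommended.get(name, name)
        if target in normalized:
            # Two source columns must not collapse into one target column.
            raise ValueError(f"two columns map to {target!r}")
        normalized[target] = values
    return normalized

received = {"user.name": ["alice", "bob"], "extra": [1, 2]}
normalized = normalize_table(received, {"user.name": "USER_NAME"})
```

The sketch returns a new table and leaves the received one untouched, which keeps the original column names available if a recommendation is later revised.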
Various manners in which a processor implemented on the apparatus 100 may operate are discussed in greater detail with respect to the method 500 depicted in
At block 502, the processor 102 may receive a table 208 of a data source. The table 208 may be tabular data for a plurality of columns and characteristics of the columns. In some examples, the characteristics of the columns may be stored in metadata.
At block 504, the processor 102 may extract a characteristic 214 of the column 210 based on the tabular data. In some examples, the extracted characteristic 214 may include a plurality of characteristics of the column 210. The extracted characteristic 214 may be a feature of a feature set 224 for a machine learning model, such as the model 222 in
At block 506, the processor 102 may determine the feature set 224 based on the extracted characteristic 214 to train the model 222. The processor 102 may run the model 222 to match the extracted characteristic 214 of the column 210 to a column type 218 from a predefined table format 216.
At block 508, the processor 102 may determine, through application of the model 222, a recommended column type 220 from the predefined table format 216. The recommended column type 220 may have at least a predetermined level of match to the extracted characteristic 214 of the column 210. In some examples, the processor 102 may identify multiple recommended column types 220 among the column types 218 in the predefined table format 216 as a match for the column 210.
At block 510, the processor 102 may assign the recommended column type 220 to a type of the column of the received table, such as the column type 212 depicted in
In some examples, features of the determined feature set 224 may include a field type, a data type, a value of content in the column, a number of distinct values of data in the column, a regular expression (regex) of content in the column, and/or the like.
The processor 102 may generate a feature vector 226 correlated to the column 210 based on the feature set 224. The feature vector 226 may represent the column 210 based on the features of the feature set 224. In some examples, the features in the feature set 224 may include column names correlated to the column 210, cardinality of values of data in the column 210, patterns of characters in the column 210, column data type correlated to the column 210, column content correlated to the column 210, and/or the like.
The processor 102 may determine a ranking of column types 218 of the predefined table format 216 based on a respective level of match to the extracted characteristic 214 of the column 210. The processor 102 may select, as the recommended column type 220, one or more than one column type 218 of the predefined table format 216 having at least the predetermined level of match to the extracted characteristic 214 of the column 210.
The processor 102 may output the recommended column type 220. In some examples, the recommended column type 220 may include a plurality of recommended column types 220. Based on a selection of one of the plurality of recommended column types 220, the processor 102 may assign the selected column type as the column type 212 to normalize the column 210 in the received table 208.
The processor 102 may update the model 222 based on the selected column type. In some examples, the processor 102 may retrain the model 222 based on the selected column type. The processor 102 may generate subsequent recommendations for column types based on the updated model 222.
Some or all of the operations set forth in the method 500 may be included as utilities, programs, or subprograms, in any desired computer accessible medium. In addition, the method 500 may be embodied by computer programs, which may exist in a variety of forms both active and inactive. For example, they may exist as machine-readable instructions, including source code, object code, executable code or other formats. Any of the above may be embodied on a non-transitory computer-readable storage medium.
Examples of non-transitory computer-readable storage media include computer system RAM, ROM, EPROM, EEPROM, and magnetic or optical disks or tapes. It is therefore to be understood that any electronic device capable of executing the above-described functions may perform those functions enumerated above.
Turning now to
The computer-readable medium 600 may have stored thereon machine-readable instructions 602-610 that a processor disposed in an apparatus 100 may execute. The computer-readable medium 600 may be an electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. The computer-readable medium 600 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and the like.
The processor may fetch, decode, and execute the instructions 602 to receive a table 208 of a data source. The received table 208 may include tabular data, which may include content of the table 208. The tabular data may also include metadata for the table 208. The data source may be one of a plurality of data sources 402.
The processor may fetch, decode, and execute the instructions 604 to extract a feature set 224 of a column 210 based on the received tabular data. The feature set 224 may include a plurality of features of the column 210. The plurality of features of the column 210 may correlate to characteristics of the column 210.
The processor may fetch, decode, and execute the instructions 606 to generate, based on the extracted feature set 224, a feature vector 226 for the column 210. The feature vector 226 may represent characteristics of the column 210, such as the characteristic 214, based on the extracted feature set 224.
The processor may fetch, decode, and execute the instructions 608 to determine, through application of modeling using the feature vector 226, a recommended column type 220 from a predefined table format 216. The recommended column type 220 may have at least a predetermined level of match to the characteristics of the column 210.
The processor may fetch, decode, and execute the instructions 610 to assign the recommended column type 220 to a column type 212 in the received table 208 to normalize the received table 208 to the predefined table format 216.
In some examples, the feature set 224 may include column names correlated to the column, cardinality of values of data in the column, patterns of characters in the column, column data type correlated to the column, column content correlated to the column, and/or the like.
In some examples, the processor may determine a ranking of column types 218 of the predefined table format 216 based on a respective level of match to the characteristics of the column 210. The processor may select, as the recommended column type 220, one or more than one column type 218 of the predefined table format 216 having at least the predetermined level of match to the characteristics of the column.
In some examples, the processor may output the recommended column type 220. The recommended column type 220 may include a plurality of recommended column types 220 among the column types 218 for the predefined table format 216. Based on a selection of one of the plurality of recommended column types 220, the processor may assign the selected column type as the column type 212 of the column 210 to normalize the column 210 in the received table 208.
In some examples, the processor may update a machine learning model, such as the model 222 depicted in
Although described specifically throughout the entirety of the instant disclosure, representative examples of the present disclosure have utility over a wide range of applications, and the above discussion is not intended and should not be construed to be limiting, but is offered as an illustrative discussion of aspects of the disclosure.
What has been described and illustrated herein is an example of the disclosure along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration and are not meant as limitations. Many variations are possible within the scope of the disclosure, which is intended to be defined by the following claims and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.