The information technology (IT) infrastructure of an organization may vary in scale and scope based on the organization's size and requirements. For example, the number of software applications deployed in an organization may range from a few basic applications (for example, email) to a large number of applications.
For a better understanding of the solution, examples will now be described, purely by way of example, with reference to the accompanying drawings, in which:
The IT environment of an enterprise may comprise anywhere from a handful of software applications to hundreds of applications. In some cases, complex license models combined with easily installable software may make the management of software assets uncontrollable, causing failed audits and unexpected spending.
Accurate and fast software recognition may provide a number of benefits to an enterprise. For example, it may help prevent software overspend, avoid new purchases, respond quickly to external and internal software audits, and reduce manual effort involved with Software Asset Management (SAM) activities. However, identifying software applications installed in an enterprise environment and the ability to know what and where software is being used may pose technical challenges.
To address these technical challenges, the present disclosure describes various examples for classifying software (machine-executable instructions). In an example, a determination may be made whether a software installation directory includes a file to run software. In response to a determination that the software installation directory includes a file to run the software, information may be extracted from text data associated with the software installation directory using a named entity recognition technique. Further, respective relevance scores of the files in the software installation directory may be determined, wherein the respective relevance scores may represent the respective relevance of the files to the extracted information. The files may be classified as one of a primary file, a secondary file, or a tertiary file based on their respective relevance scores.
Computing device 102 may represent any type of system capable of reading machine-executable instructions. Examples of the computing device 102 may include a server, a desktop computer, a notebook computer, a tablet computer, a thin client, a mobile device, a personal digital assistant (PDA), and the like.
In an example, computing device 102 may include a determination engine 152, an extraction engine 154, a relevance engine 156, and a classification engine 158.
Engines 152, 154, 156, and 158 may be any combination of hardware and programming to implement the functionalities of the engines described herein. In examples described herein, such combinations of hardware and programming may be implemented in a number of different ways. For example, the programming for the engines may be processor executable instructions stored on at least one non-transitory machine-readable storage medium and the hardware for the engines may include at least one processing resource to execute those instructions. In some examples, the hardware may also include other electronic circuitry to at least partially implement at least one engine of the computing device 102. In some examples, the at least one machine-readable storage medium may store instructions that, when executed by the at least one processing resource, at least partially implement some or all engines of the computing device. In such examples, the computing device 102 may include the at least one machine-readable storage medium storing the instructions and the at least one processing resource to execute the instructions.
In an example, determination engine 152 may determine whether a software installation directory on computing device 102 includes a file(s) to run software. Such files may include a file without which the software may not run, for example, an executable file (e.g., a .exe file).
As used herein, a software installation directory may refer to a directory that stores the program files of software (or computer application). In some examples, the software installation directory may be referred to as application installation directory, program installation directory, or program files folder.
In an example, software may be installed across multiple directories on computing device 102. However, a file(s) to run the software (e.g., an executable file) may be present in one directory. In an example, determination engine 152 may identify a software installation directory that includes such a file(s).
Determination engine 152 may use a machine learning model to determine whether a software installation directory includes a file(s) to run the software. In an example, the machine learning model may be based on a gradient boosted decision trees technique. The gradient boosted decision trees technique provides a method for generating models for regression and classification tasks. It may produce a prediction model in the form of an ensemble of weak prediction models. Gradient boosting may be used to build the model in a stage-wise fashion, and it generalizes other boosting methods by allowing optimization of an arbitrary differentiable loss function.
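The stage-wise construction described above can be illustrated with a minimal sketch. It is not the model of the disclosure: it assumes a logistic loss, uses one-split regression stumps as the weak learners, and the two input features (number of executable files, number of files) are hypothetical. Each stage fits a stump to the negative gradient of the loss (label minus predicted probability) and adds it to the ensemble.

```python
import math

def fit_stump(rows, residuals):
    """Fit a one-split regression stump minimizing squared error on residuals."""
    best = None  # (error, feature index, threshold, left value, right value)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lv, rv)
    if best is None:  # no usable split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, row):
    j, threshold, lv, rv = stump
    return lv if row[j] <= threshold else rv

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_gbdt(rows, labels, n_stages=50, learning_rate=0.3):
    """Build an additive ensemble of stumps, one stage at a time."""
    pos = min(max(sum(labels) / len(labels), 1e-6), 1 - 1e-6)
    base = math.log(pos / (1 - pos))  # initial score: log-odds of the positive class
    scores = [base] * len(rows)
    stumps = []
    for _ in range(n_stages):
        # Negative gradient of the logistic loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, rows)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    base, learning_rate, stumps = model
    return sigmoid(base + learning_rate * sum(stump_predict(s, row) for s in stumps))

# Hypothetical features per directory: [number of .exe files, number of files].
rows = [[1, 20], [2, 35], [0, 12], [0, 40], [1, 8], [0, 3]]
labels = [1, 1, 0, 0, 1, 0]  # 1 = directory includes a file to run the software
model = fit_gbdt(rows, labels)
print(predict_proba(model, [1, 10]), predict_proba(model, [0, 10]))
```

In practice an established gradient boosting library would likely be used instead; the sketch only shows how each stage corrects the errors of the stages before it.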
In an example, the input data for the machine learning model may include a scan file(s). A scan file may include a document that includes the file structure of all the directories on a computing device (for example, 102) along with information related to the respective directories and the respective files present in those directories. Each directory together with its first level sub-files may be treated as a single training record for the machine learning model.
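As an illustration of treating each directory together with its first-level files as a single record, the following sketch groups scanned file paths by their parent directory. The scan file format is not specified here, so the input is assumed to be a plain list of Windows-style paths, and the feature names (`file_count`, `exe_count`, `dll_count`) are hypothetical placeholders rather than the features of Table 1.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def build_records(scanned_paths):
    """Return one record per directory, built from its first-level files."""
    files_by_dir = defaultdict(list)
    for raw in scanned_paths:
        path = PureWindowsPath(raw)
        # A file belongs only to its immediate parent directory.
        files_by_dir[str(path.parent)].append(path.name)

    records = []
    for directory, names in sorted(files_by_dir.items()):
        suffixes = [PureWindowsPath(name).suffix.lower() for name in names]
        records.append({
            "directory": directory,
            "file_count": len(names),             # hypothetical feature
            "exe_count": suffixes.count(".exe"),  # hypothetical feature
            "dll_count": suffixes.count(".dll"),  # hypothetical feature
        })
    return records

# Hypothetical scan contents.
paths = [
    r"C:\Program Files\Acme\Editor\editor.exe",
    r"C:\Program Files\Acme\Editor\core.dll",
    r"C:\Program Files\Acme\Editor\docs\readme.txt",
]
for record in build_records(paths):
    print(record)
```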
In an example, before scan files are used as input data for the machine learning model, irrelevant, redundant, or highly correlated features may be eliminated from the original dataset to create a minimal set of features. In an example, the features shown in Table 1 below may be used in the machine learning model.
In response to a determination by determination engine 152 that the software installation directory may include a file(s) to run the software, extraction engine 154 may extract information from text data associated with the software installation directory.
In an example, the information (or named entities) extracted by extraction engine may include a publisher of software in the software installation directory, a name of the software, and a version of the software. Referring to
In an example, extraction engine 154 may first extract the publisher of software from the text data associated with the software installation directory. In an example, DBpedia ontology may be used to identify the publisher of software. DBpedia ontology refers to a shallow, cross-domain ontology that has been manually created based on the most commonly used infoboxes in Wikipedia. DBpedia may allow users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets. DBpedia may extract factual information from Wikipedia pages, and allow users to find answers to questions where the information is spread across many different Wikipedia articles. Data in DBpedia may be accessed using an SQL-like query language (for example, SPARQL). Once the publisher of software has been identified, extraction engine 154 may determine the name of the software and the version of the software from the text data.
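The publisher-first extraction order described above may be sketched as follows. This is a simplified stand-in for named entity recognition, not the technique of the disclosure: the publisher is found by lookup in a small hard-coded set (`KNOWN_PUBLISHERS`, a hypothetical substitute for a DBpedia query), the version by a regular expression, and the remaining text is taken as the software name. The sample text is invented for illustration.

```python
import re

# Hypothetical stand-in for publishers that would be resolved via DBpedia.
KNOWN_PUBLISHERS = {"Acme Corporation", "Example Soft"}

# Matches dotted version strings such as 2.1 or 2.1.0.
VERSION_PATTERN = re.compile(r"\b\d+(?:\.\d+)+\b")

def extract_entities(text, publishers=KNOWN_PUBLISHERS):
    """Extract publisher first, then version and name from the remaining text."""
    publisher = None
    # Prefer the longest known publisher that occurs in the text.
    for candidate in sorted(publishers, key=len, reverse=True):
        if candidate.lower() in text.lower():
            publisher = candidate
            break

    version_match = VERSION_PATTERN.search(text)
    version = version_match.group(0) if version_match else None

    remainder = text
    if publisher:
        remainder = re.sub(re.escape(publisher), " ", remainder, flags=re.IGNORECASE)
    if version:
        remainder = remainder.replace(version, " ")
    name = " ".join(remainder.split()) or None

    return {"publisher": publisher, "name": name, "version": version}

print(extract_entities("Acme Corporation Photo Editor 2.1.0"))
```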
After the information from the text data associated with the software installation directory is extracted, classification engine 158 may classify files in the software installation directory as one of a main file, an associated file, or a third party file based on respective relevance scores of the files. As used herein, a “main file” may refer to a file without which software may not run; an “associated file” may refer to an ancillary file written by the publisher of the software without which the software may run; and a “third party file” may refer to a file written by a publisher other than the publisher of the software.
In some examples, a different nomenclature may be used for referring to a main file, an associated file, and a third party file. For example, a main file, an associated file, and a third party file may be referred to as a “primary file”, a “secondary file”, and a “tertiary file” respectively.
The relevance score of a file may represent the relevance of the file to software installed in the software installation directory. Relevance engine 156 may determine the relevance score of a file. In an example, relevance engine 156 may convert each FileEntry of the files in the software installation directory into a text “query”, and may treat the information (or named entities) extracted from the text data as “documents”. As used herein, a “FileEntry” may be an object that represents a file on a file system. In the context of example text data illustrated in
Relevance engine 156 may determine the relevance between a query and the documents for each FileEntry. In an example, relevance engine 156 may first remove stop words from “queries” and “documents”. As used herein, stop words may refer to words which may be filtered out before or after processing of natural language data. Stop words may refer to the most common words in a language. Some examples of stop words may include “the”, “is”, “at”, “which”, “on”, etc. Any group of words may be chosen as stop words for a given purpose. In the context of the present disclosure, relevance engine 156 may remove stop words such as “program files”, “bin”, “lib”, and other words that are likely to occur frequently in queries and documents. The aforementioned are just some examples of the stop words that may be removed by relevance engine 156.
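A minimal sketch of turning a file entry into query terms and removing stop words is given below. It assumes, for illustration only, that a FileEntry can be represented by its path string; the function name `to_query_terms` is hypothetical, and the stop-word lists contain only the examples named above.

```python
import re

STOP_PHRASES = ("program files",)  # multi-word stop words named in the text
STOP_WORDS = {"bin", "lib"}        # single-word stop words named in the text

def to_query_terms(file_path):
    """Convert a file path into lower-case query terms without stop words."""
    text = file_path.lower()
    # Remove multi-word stop phrases before splitting into tokens.
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z0-9]+", text)
    return [token for token in tokens if token not in STOP_WORDS]

print(to_query_terms(r"C:\Program Files\Acme\Editor\bin\editor.exe"))
# → ['c', 'acme', 'editor', 'editor', 'exe']
```

The same function could be applied to the extracted entities so that queries and documents share one vocabulary. Further stop words (for example, drive letters) might be added for a given environment.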
Relevance engine 156 may determine the name of software and the publisher of the software installed in the software installation directory from all possible candidates based on document frequency. Relevance engine 156 may use a ranking function for this purpose. In an example, the ranking function may be based on Okapi BM25. BM25 is a ranking function which may be used to rank matching documents according to their relevance to a given search query. An example ranking function that may be used by relevance engine 156 is given below.
where:
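The ranking function and its symbol definitions are not reproduced in this text. As a point of reference only, the following sketch implements a commonly published form of Okapi BM25, which may differ from the function of the disclosure. The saturation parameter `k`, the length-normalization parameter `b`, their default values, and the non-negative inverse document frequency variant are assumptions. In keeping with the description above, the query is a list of terms derived from a file entry and each document is a list of terms from one extracted entity.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one document against a query with a standard form of Okapi BM25.

    corpus is the list of all documents (each a list of terms); it supplies
    the document frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = (sum(len(doc) for doc in corpus) / n_docs) or 1.0
    term_freq = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):  # each distinct query term is counted once
        n_containing = sum(1 for doc in corpus if term in doc)
        # Non-negative IDF variant (assumption): ln((N - n + 0.5) / (n + 0.5) + 1).
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
        tf = term_freq[term]
        denom = tf + k * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k + 1) / denom
    return score

# Hypothetical documents: publisher, software name, and version tokens.
corpus = [["acme", "corporation"], ["photo", "editor"], ["2", "1", "0"]]
query = ["acme", "editor", "exe"]
for doc in corpus:
    print(doc, bm25_score(query, doc, corpus))
```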
In an example, after each file in the software installation directory has been ranked, a final score function for each file may be determined by relevance engine 156 based on the equation given below.
$$\mathrm{score}(q) = k_1 f(q, d_1) + k_2 f(q, d_2) + k_3 \max\big(f(q, d_3), f(q, d_4)\big) + k_4 I(q)$$
where $k_1, \ldots, k_4$ are the weights that may need to be tuned, and $I(q)$ may be an indicator function:
In an example, the highest ranking file whose score is above a threshold α may be classified as the main file by classification engine 158. The files whose scores are below a threshold β may be classified as third party files by classification engine 158. The remaining files may be classified as associated files by classification engine 158.
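The score combination and the threshold-based classification described above may be sketched as follows. The weights, thresholds, file names, and scores are invented for illustration. Because the definition of the indicator function is not reproduced in this text, its value is simply passed in as 0 or 1.

```python
def final_score(f_values, indicator, weights):
    """Combine relevance values f(q, d1)..f(q, d4) and the indicator I(q)."""
    f1, f2, f3, f4 = f_values
    k1, k2, k3, k4 = weights
    return k1 * f1 + k2 * f2 + k3 * max(f3, f4) + k4 * indicator

def classify(scores, alpha, beta):
    """Label each file from its final score using thresholds alpha and beta."""
    labels = {}
    top = max(scores, key=scores.get) if scores else None
    for name, score in scores.items():
        if name == top and score > alpha:
            labels[name] = "main"          # highest-ranking file above alpha
        elif score < beta:
            labels[name] = "third party"   # score below beta
        else:
            labels[name] = "associated"    # everything else
    return labels

# Hypothetical values.
print(final_score((1.0, 2.0, 0.5, 1.5), 1, (1.0, 0.5, 2.0, 0.25)))  # → 5.25
scores = {"editor.exe": 3.0, "helper.dll": 1.2, "zlib.dll": 0.1}
print(classify(scores, alpha=2.0, beta=0.5))
```

Note that if no file scores above α, this sketch assigns no main file; how that case is handled is a design choice not addressed here.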
In the context of example text data illustrated in
In an example, system 300 may represent any type of computing device capable of reading machine-executable instructions. Examples of computing device may include, without limitation, a server, a desktop computer, a notebook computer, a tablet computer, a thin client, a mobile device, a personal digital assistant (PDA), and the like.
In an example, system 300 may include a determination engine 152, an extraction engine 154, a relevance engine 156, and a classification engine 158.
In an example, determination engine 152 may determine whether a software installation directory includes a file to run software. In response to a determination that the software installation directory includes the file to run the software, extraction engine 154 may extract information from text data associated with the software installation directory using a named entity recognition technique. In an example, the information may include a publisher of software in the software installation directory, a name of the software, and a version of the software. Relevance engine 156 may determine respective relevance scores of the files in the software installation directory. The respective relevance scores of the files may represent the respective relevance of the files to the extracted information. Classification engine 158 may classify the files in the software installation directory as one of a main file, an associated file, or a third party file based on the respective relevance scores of the files. Once the files are classified, classification engine 158 may display the classified files on a display device (for example, a computer monitor). In an example, the display may be in the form of a report.
For the purpose of simplicity of explanation, the example method of
It should be noted that the above-described examples of the present solution are for the purpose of illustration. Although the solution has been described in conjunction with a specific example thereof, numerous modifications may be possible without materially departing from the teachings of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2016/108992 | 12/8/2016 | WO | 00 |