DETERMINING SIMILARITY SAMPLES BASED ON USER SELECTED FEATURE GROUP

Information

  • Patent Application
  • Publication Number
    20250117670
  • Date Filed
    October 06, 2023
  • Date Published
    April 10, 2025
Abstract
Systems, methods, and software can be used to determine similarity samples. In some aspects, a method includes: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.
Description
TECHNICAL FIELD

The present disclosure relates to determining similarity samples based on a user-selected feature group.


BACKGROUND

Similarity analysis refers to the process of searching for samples that are similar to a given input sample. In one example, similarity analysis can be used to determine whether a software code is potentially malicious by comparing the software code with a set of samples. The set of samples can include software code with known security risks. In another example, similarity analysis can be used in an imaging process to select images that are close to an input image. In another example, a similarity analysis can be used to select files and assign the same label to them. This enables the creation of a set of data for supervised learning.





DESCRIPTION OF DRAWINGS


FIG. 1 is a schematic diagram showing an example system that determines similarity samples, according to an implementation.



FIG. 2 is a flowchart showing an example operation for determining similarity samples, according to an implementation.



FIG. 3 is a flowchart showing an example method that determines similarity samples, according to an implementation.



FIG. 4 illustrates a high-level architecture block diagram of a computer according to an implementation.





Like reference numbers and designations in the various drawings indicate like elements.


DETAILED DESCRIPTION

In some implementations, the similarity analysis can be performed by converting the input sample to a feature vector and searching for neighboring vectors in prior-stored training samples to determine a set of similarity samples. The training samples can also be referred to as base samples or base training samples. However, such an analysis can become highly expensive or computationally intractable as the dimensions of the feature vector increase.


In some cases, dimensionality reduction techniques can be used to reduce the number of features and thus reduce the computational complexity of the operation. However, such techniques often lead to errors in prediction outcomes. In particular, when the dimensionality is reduced significantly, the prediction error can also be significant.


In some operations, the similarity analysis can be performed by separating the features into feature groups, and performing similarity analysis based on user-selected feature groups. Feature groups can be constructed based on contiguous feature indices in the case where the full feature set is a fixed-dimensionality numerical vector, or the feature groups can be constructed by including sets of features that are logically related. Feature groups can be disjoint, where no two feature groups share the same feature. Alternatively, feature groups can be overlapping, where one or more feature groups share at least one common feature. In some cases, the analysis can be further performed based on user-specified weights of the selected feature groups. Such an operation can improve the performance of the similarity analysis by reducing the computational complexity, while improving the accuracy of the results. It also provides flexibility by using different feature groups that may be suitable for different types of input samples.
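For illustration only, the following is a minimal sketch of how feature groups might be represented over a fixed-dimensionality numerical feature vector, with one disjoint grouping and one overlapping grouping; the group names and feature indices are hypothetical and not part of this disclosure.

```python
import numpy as np

# Illustrative feature groups over a 10-dimensional feature vector,
# expressed as named lists of feature indices (names are hypothetical).
DISJOINT_GROUPS = {
    "general": [0, 1, 2],      # e.g., file size, file format, file type
    "imports": [3, 4, 5, 6],   # e.g., counts of imported symbols
    "strings": [7, 8, 9],      # e.g., counts of preconfigured strings
}

# Overlapping groups: "imports" and "obfuscation" share feature index 6.
OVERLAPPING_GROUPS = {
    "imports": [3, 4, 5, 6],
    "obfuscation": [6, 7, 8],
}

def group_vector(full_vector, group_indices):
    """Slice the per-group feature vector out of the full feature vector."""
    return np.asarray(full_vector)[group_indices]

full = np.arange(10, dtype=float)                       # stand-in full feature vector
print(group_vector(full, DISJOINT_GROUPS["imports"]))   # [3. 4. 5. 6.]
```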



FIGS. 1-4 and associated descriptions provide additional details of these implementations.



FIG. 1 is a schematic diagram showing an example system 100 that determines similarity samples, according to an implementation. At a high level, the example system 100 includes a software service platform 106 that is communicatively coupled with a client device 102 over a network 110.


The client device 102 represents an electronic device that provides the sample to be analyzed. In some cases, the client device 102 can send the sample to the software service platform 106 for similarity analysis. In some cases, the software service platform 106 can send the output of the similarity analysis to the client device 102.


The software service platform 106 represents an application, a set of applications, software, software modules, hardware, or any combination thereof, that determines similarity samples. The software service platform 106 can be an application server, a service provider, or any other network entity. The software service platform 106 can be implemented using one or more computers, computer servers, or a cloud-computing platform. The software service platform 106 can be used to train machine learning models that are used in the similarity sample determination operation. The software service platform 106 includes a similarity analyzer 104. The similarity analyzer 104 represents an application, a set of applications, software, software modules, hardware, or any combination thereof, that determines similarity samples. In some implementations, the similarity analyzer 104 can generate index sets of base samples, where each index set includes feature vectors corresponding to one feature group of a list of feature groups. The similarity analyzer 104 can receive an input sample, receive user-selected feature groups, and generate feature vectors of the input sample for each feature group in the user-selected feature groups. For each feature group in the user-selected feature groups, the similarity analyzer 104 can then compare the corresponding feature vector of the sample with the index set of the same feature group to select a set of feature similarity samples for the feature group, and combine the multiple sets of feature similarity samples to generate the output set of similarity samples. In some cases, some of these operations can be performed by another device. For example, the software service platform 106 can output a user interface and receive user input to select the user-selected feature groups discussed previously. Alternatively or additionally, the client device 102 can output a user interface and receive user input to select the user-selected feature groups and send the user-selected feature groups to the software service platform 106. In some cases, the client device 102 can further generate the one or more feature vectors of the sample based on the user-selected feature groups and send the one or more feature vectors to the software service platform 106 for similarity analysis. In another example, the user interface can provide a list of selectable features which differs according to the nature of the samples. In a variant, the user can also select, via the user interface, a distance metric or function to be used for performing the similarity analysis. FIGS. 2-4 and associated descriptions provide additional details of these implementations.


The similarity analysis can be performed for different purposes. For example, the similarity analysis can be performed for software code analysis. In this example, the input sample can be software code. The software code can be source code or binary code. In a software development process, source code can be created by programmers using a text editor or a visual programming tool prior to compilation. The source code can be developed with a human-readable programming language and may be saved in a text file. The source code can be transformed by an assembler or a compiler into binary software code that can be executed by the computer. In some cases, the source code can be generated by automated tools, e.g., artificial intelligence powered by large language models.


The binary software code can include a stream of bytes that are generated by compiling the source code. Thus, the binary software code may not be in a human-readable format and may not be easily parsed or analyzed by a human.


The binary software code can be in a configuration of object code, executable code, or bytecode. An object code is the product of compiler output of a sequence of statements or instructions in a computer language. The source code can be logically divided into multiple source files. Each source file is compiled independently into a corresponding object file that includes an object code. The object codes in the object files are binary machine codes, but they may not be ready to be executed. The object files can include incomplete references to subroutines outside themselves and placeholder addresses. During the linking process, these object files can be linked together to form one executable file that includes executable code that can be executed on a computing device. During the linking process, the linker can read the object files, resolve references between them, perform the final code layout in the memory that determines the addresses for the blocks of code and data, fix up the placeholder addresses with real addresses, and write out the executable file that contains the executable code.


A bytecode, also referred to as portable code or p-code, is a form of instruction set designed for efficient execution by a software interpreter. Bytecodes include compact numeric codes, constants, and references (normally numeric addresses) that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects. The bytecode includes instruction sets that have one-byte opcodes followed by optional parameters. Intermediate representations such as the bytecode may be output by programming language implementations to ease interpretation or may be used to reduce hardware and operating system dependence by allowing the same code to run cross-platform, on different devices. The bytecode may often be either directly executed on a virtual machine (a p-code machine, i.e., an interpreter), or it may be further compiled into machine code for better performance. In some cases, binary software code that is coded using platform-independent languages such as JAVA can be stored in the bytecode format.


In this case, the client device 102 can be a device that receives or generates the software code for analysis. The software service platform 106 can be a software service platform that performs the similarity analysis on the software code to compare it with a set of pre-stored software code examples.


Applications for software similarity analysis include but are not limited to malicious software (malware) classification, malware family identification, identifying different versions of the same piece of software, identifying software authors, identifying malware as being part of a specific attack campaign, identifying software samples for whitelisting or blacklisting, and grouping files based on software complexity. The similarity analysis can be performed for other purposes, e.g., image analysis, video or audio recognition, natural language processing, etc.


Turning to a general description, the client device 102 may include, without limitation, any of the following: endpoint, computing device, mobile device, mobile electronic device, user device, mobile station, subscriber station, portable electronic device, mobile communications device, wireless modem, wireless terminal, or another electronic device. Examples of an endpoint may include a mobile device, IoT (Internet of Things) device, EoT (Enterprise of Things) device, cellular phone, personal data assistant (PDA), smart phone, laptop, tablet, personal computer (PC), pager, portable computer, portable gaming device, wearable electronic device, health/medical/fitness device, camera, vehicle, or other mobile communications devices having components for communicating voice or data via a wireless communication network. A vehicle can include a motor vehicle (e.g., automobile, car, truck, bus, motorcycle, etc.), aircraft (e.g., airplane, unmanned aerial vehicle, unmanned aircraft system, drone, helicopter, etc.), spacecraft (e.g., spaceplane, space shuttle, space capsule, space station, satellite, etc.), watercraft (e.g., ship, boat, hovercraft, submarine, etc.), railed vehicle (e.g., train, tram, etc.), and other types of vehicles including any combinations of any of the foregoing, whether currently existing or after arising. The wireless communication network may include a wireless link over at least one of a licensed spectrum and an unlicensed spectrum. The term “mobile device” can also refer to any hardware or software component that can terminate a communication session for a user. In addition, the terms “user equipment,” “UE,” “user equipment device,” “user agent,” “UA,” “user device,” and “mobile device” can be used interchangeably herein.


The example system 100 includes the network 110. The network 110 represents an application, set of applications, software, software modules, hardware, or combination thereof, that can be configured to transmit data messages between the entities in the example system 100. The network 110 can include a wireless network, a wireline network, the Internet, or a combination thereof. For example, the network 110 can include one or a plurality of radio access networks (RANs), core networks (CNs), and the Internet. The RANs may comprise one or more radio access technologies. In some implementations, the radio access technologies may be Global System for Mobile Communications (GSM), Interim Standard 95 (IS-95), Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Evolved Universal Mobile Telecommunications System (E-UMTS), Long Term Evolution (LTE), LTE-Advanced, the fifth generation (5G), or any other radio access technologies. In some instances, the core networks may be evolved packet cores (EPCs).


A RAN is part of a wireless telecommunication system which implements a radio access technology, such as UMTS, CDMA2000, 3GPP LTE, 3GPP LTE-A, and 5G. In many applications, a RAN includes at least one base station. A base station may be a radio base station that may control all or at least some radio-related functions in a fixed part of the system. The base station may provide a radio interface within its coverage area, or cell, for a mobile device to communicate. Base stations may be distributed throughout the cellular network to provide a wide area of coverage. The base station directly communicates with one or a plurality of mobile devices, other base stations, and one or more core network nodes.


While elements of FIG. 1 are shown as including various component parts, portions, or modules that implement the various features and functionality, nevertheless, these elements may instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Furthermore, the features and functionality of various components can be combined into fewer components, as appropriate.



FIG. 2 is a flowchart showing an example operation 200 for generating similarity samples, according to an implementation. The example operation 200 can be implemented by a software service platform, e.g., the software service platform 106 shown in FIG. 1. The example operation 200 shown in FIG. 2 can be implemented using additional, fewer, or different operations, which can be performed in the order shown or in a different order.


At 210, a vectorization process is performed on a sample 202 to generate one or more feature vectors based on user-selected feature groups 204. Each of the one or more feature vectors corresponds to one user-selected feature group in the user-selected feature groups 204.


The sample 202 can be a software code, e.g., a document file, an executable code, or a script. The sample 202 can also be other forms of a data sample to be evaluated in a similarity analysis.


The sample 202 can have one or more features. Each of the features can represent a portion of the sample 202, a characteristic of the sample 202, or both. Examples of the features include file size, file format, file type (e.g., whether the sample 202 is a binary code of a software code, or a source code of the software code), file content (part or all of the sample 202), associated operating system or programming environment in which the sample 202 is programmed or executed, the number of particular strings included in the sample 202, the structure of the sample 202 (e.g., the order of a set of instructions in a software code), signature, checksum, etc. If the sample 202 is an image, the features of the image can also include image size, resolution, color profile, and other information. The features of the sample 202 can be obtained from metadata associated with the sample 202, from a preprocessing operation that scans the sample 202 and extracts the feature information automatically from the code, or both. In some implementations, the list of features can be preconfigured. The preconfigured list of features can be stored at the software service platform.
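As a non-limiting sketch, the following code illustrates a preprocessing operation that extracts a few of the example features listed above (file size, file format, and counts of particular strings) from a file; the string list and feature names are assumptions made for illustration.

```python
import os

# Hypothetical list of "particular strings" whose occurrence counts are features.
WATCHED_STRINGS = [b"CreateRemoteThread", b"VirtualAlloc", b"http://"]

def extract_features(path):
    """Extract a few illustrative sample features named in the description:
    file size, file format (by extension), and per-string occurrence counts."""
    with open(path, "rb") as f:
        content = f.read()
    return {
        "file_size": os.path.getsize(path),
        "file_format": os.path.splitext(path)[1].lstrip(".").lower(),
        "string_counts": {s.decode(): content.count(s) for s in WATCHED_STRINGS},
    }
```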


In some implementations, the features can be grouped in a plurality of feature groups. Each feature group can include one or more features. In some cases, the features can be grouped based on how relatable these features are, how interpretable these features are, or both. For example, features related to imports can be grouped into an import feature group. In the context of a software code sample, features related to import include features that are related to parameters, sets of code or instructions that are imported to the software code sample. Other examples of feature groups include the export feature group, including features that are related to parameters, sets of code or instructions that are exported by the software code sample; the certificate feature group, including features that are related to sets of code or instructions in the software code sample that contain or handle certificates; the general observation feature group, including general characteristics of the sample such as file size, file format, and other information; the string feature group, including features that are related to a preconfigured list of strings in the software code sample; the machine learning feature group, including features that are extracted by applying machine learning operations to the code sample (example machine learning operations include autoencoders, transformers, convolutional neural networks, etc.); the obfuscation feature group, including features that are related to sets of code or instructions in the software code sample that implement obfuscation techniques; and other feature groups. In some implementations, the grouping of the feature groups can be pre-set. A list of feature groups and the features in these feature groups can be stored at the software service platform.
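A preconfigured list of feature groups such as the one described above might be stored as a simple mapping from group names to feature names, as in the following sketch; the individual feature names are hypothetical placeholders.

```python
# A preconfigured list of feature groups, keyed by group name. The individual
# feature names are hypothetical placeholders; a real deployment would map
# each name to an extraction routine.
FEATURE_GROUPS = {
    "import":              ["imported_dll_count", "imported_symbol_count"],
    "export":              ["exported_symbol_count"],
    "certificate":         ["has_signature", "certificate_chain_length"],
    "general_observation": ["file_size", "file_format", "file_type"],
    "string":              ["url_string_count", "registry_string_count"],
    "machine_learning":    ["autoencoder_embedding"],
    "obfuscation":         ["packed_section_count", "entropy"],
}
```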


In some implementations, a user input is received by the software service platform. The user input indicates the user-selected feature groups 204. In some cases, the software service platform can output the list of feature groups in a user interface. For example, the software service platform can display the list of feature groups on a screen of the software service platform. Each feature group can have a selectable user interface object, e.g., a button or a checkbox, associated with it. A user can select one or more feature groups that the user prefers to include in the similarity analysis by using the associated user interface object, e.g., by clicking the button or checking the checkbox. The software service platform can receive the feature groups that are selected by the user by receiving the user input.


In some cases, the user interface can also provide the explanation or the list of features in each feature group, or both, to assist the user selection process. For example, the explanation or the list of features in each feature group, or both, can be displayed on the user interface. Alternatively or additionally, the explanation or the list of features can be shown by receiving another user input, e.g., in a pop-up window or a new screen outputted in response to a click.


In some cases, the software service platform can provide a user interface for the user to select a customized feature group. For example, the user can select a set of features to form a customized group. The software service platform can receive the user input that indicates the features in the customized feature group.


In some cases, the user can select one feature group. Alternatively, the user can select more than one feature group. While FIG. 2 shows three feature vectors (222a, 222b, 222c) that correspond to three user-selected feature groups, other numbers of user-selected feature groups can be used.


In some cases, the user can further set a weight for each selected feature group. The weight can be one of the preconfigured weights that the user can select, e.g., an integer scale between 1 and 10, a percentage scale from 10% to 100% in 5% increments, or other numbers. The weight can also be an integer or a floating-point number that the user can input into the user interface.


In some implementations, instead of outputting the user interface objects for feature selection at the software service platform and receiving user input at the software service platform, the software service platform may receive the user input from another device, for example from the client device 102 shown in FIG. 1, or from a web server that provides a webpage for the selection. These devices may provide user interfaces for the feature group selection discussed previously, receive the user input representing the user-selected feature groups and the weight of each user-selected feature group, and send the user input to the software service platform.


The feature vectors, e.g., the feature vectors 222a, 222b, and 222c, can be generated based on the user-selected feature groups 204, where each feature vector corresponds to one user-selected feature group. In some cases, the vectorization process can be used to convert the sample 202 into a vector representation by using feature extraction algorithms. For each user-selected feature group, each feature in the group can be converted to a numerical value, in the form of an integer or a floating-point number. The numerical values can be concatenated into a feature vector for the user-selected feature group. Alternatively or additionally, the numerical values can be combined or transformed to generate the feature vector. Examples of the transformation or combination techniques include multiplication, addition, and passing through a non-linear transformation function (e.g., a Fourier transform, a rectifying function, etc.).
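The following is a minimal sketch of the per-group vectorization described above, assuming the sample's features have already been extracted into a dictionary of numeric values; the feature and group names are illustrative.

```python
import numpy as np

def vectorize_group(feature_values, group_features):
    """Build the feature vector for one user-selected feature group by
    converting each named feature to floats and concatenating the results.

    feature_values: dict mapping feature name -> scalar or sequence of scalars
    group_features: ordered list of feature names in the group
    """
    parts = []
    for name in group_features:
        value = feature_values.get(name, 0)
        parts.append(np.atleast_1d(np.asarray(value, dtype=float)))
    return np.concatenate(parts)

# Example: vectorize the sample for two selected groups (feature values are
# assumed to be already encoded numerically, e.g., file_format as a code).
sample_features = {"file_size": 73216.0, "file_format": 3.0,
                   "imported_dll_count": 12, "imported_symbol_count": 187}
selected = {"general_observation": ["file_size", "file_format"],
            "import": ["imported_dll_count", "imported_symbol_count"]}
vectors = {g: vectorize_group(sample_features, feats) for g, feats in selected.items()}
```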


At 230, feature similarity analysis is performed. As illustrated, similarity analyses 230a, 230b, and 230c are performed on the respective feature vectors, i.e., feature vectors 222a, 222b, and 222c, to generate the respective feature similarity sets 232a, 232b, and 232c.


In some cases, a k-nearest neighbors (k-NN) algorithm can be used to perform the similarity analysis on the feature vectors. In an example k-NN algorithm, the distances between the input data point and the indices of the training examples are calculated. Examples of the distance used for the calculation can be Euclidean distance, Hamming distance, or other metrics, including, e.g., cosine similarity. Then the k nearest neighbors to the input data point are identified. The k-NN algorithm used to perform the feature similarity analysis can be an exact k-NN algorithm or an approximate k-NN algorithm.
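For reference, a minimal exact k-NN query with selectable Euclidean or cosine distance could look like the following sketch; it simply ranks the base-sample index vectors by distance to the query and returns the k closest.

```python
import numpy as np

def knn(query, index_matrix, k, metric="euclidean"):
    """Exact k-NN: return the row indices and distances of the k base samples
    whose index vectors are closest to the query vector.

    query: 1-D array of shape (d,)
    index_matrix: 2-D array of shape (n_base_samples, d)
    """
    if metric == "euclidean":
        dists = np.linalg.norm(index_matrix - query, axis=1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity.
        num = index_matrix @ query
        den = np.linalg.norm(index_matrix, axis=1) * np.linalg.norm(query) + 1e-12
        dists = 1.0 - num / den
    else:
        raise ValueError(f"unsupported metric: {metric}")
    nearest = np.argsort(dists)[:k]
    return nearest, dists[nearest]
```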


In this operation, the base training samples are processed with the same feature extraction process discussed previously to generate different index sets for different feature groups. For each feature group in the list of feature groups discussed previously, indices of the base training samples for the corresponding feature group are generated by performing a vectorization process of the base training samples using the features in the feature group. Because the list of feature groups is pre-configured, these calculations can be performed before the operation 200, and the index sets corresponding to these feature groups can be stored at the software service platform. The index sets corresponding to the feature groups in the user-selected feature groups 204 are used in the feature similarity analysis 230. Alternatively or additionally, the index sets for the feature groups in the user-selected feature groups 204 are calculated during the feature similarity analysis 230, e.g., the user-selected feature groups 204 may include a customized feature group, and thus the index set corresponding to the customized feature group can be calculated after the user-selected feature groups 204 are received.
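The pre-computation of index sets could be sketched as follows, assuming a vectorize_group helper like the one shown earlier; each feature group gets one matrix whose rows are the per-group feature vectors of the base training samples.

```python
import numpy as np

def build_index_sets(base_sample_features, feature_groups, vectorize_group):
    """Precompute one index set (a matrix of per-group feature vectors) for
    each feature group, so queries only need a lookup at analysis time.

    base_sample_features: list of dicts, one per base training sample
    feature_groups: dict mapping group name -> ordered list of feature names
    vectorize_group: callable(feature_values, group_features) -> 1-D array
    """
    index_sets = {}
    for group_name, group_features in feature_groups.items():
        rows = [vectorize_group(f, group_features) for f in base_sample_features]
        index_sets[group_name] = np.vstack(rows)   # shape: (n_base, d_group)
    return index_sets
```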


For the similarity analysis 230a, the feature vector 222a is used as an input data point to the corresponding k-NN model. The index set of the corresponding feature group, i.e., the index set generated by processing the base training samples based on the same feature group as the feature vector 222a, is used as the index set for the corresponding k-NN model. The distance between the feature vector 222a and each index in the corresponding index set is calculated. The set of feature similarity samples 232a is determined based on the distances. In some cases, the samples in the set of feature similarity samples are the k base training samples whose indices have the shortest distance to the feature vector 222a. The same process is repeated for the other feature vectors and their corresponding k-NN models to generate the sets of feature similarity samples 232b and 232c. In another example, the similarity analysis can rely on the use of Locality Sensitive Hashing (LSH) before, or as an alternative to, the k-NN algorithm previously mentioned.
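As one possible (illustrative) form of the Locality Sensitive Hashing mentioned above, a random-hyperplane LSH for cosine similarity can bucket the base-sample vectors so that only same-bucket candidates are passed to an exact distance computation; the class name and parameters below are arbitrary examples.

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Minimal random-hyperplane LSH: vectors with the same sign pattern across
    the random hyperplanes land in the same bucket and become k-NN candidates."""

    def __init__(self, dim, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_planes, dim))
        self.buckets = defaultdict(list)

    def _key(self, vector):
        return tuple((self.planes @ vector > 0).astype(int))

    def add(self, sample_id, vector):
        self.buckets[self._key(vector)].append(sample_id)

    def candidates(self, vector):
        return self.buckets.get(self._key(vector), [])
```

Candidates returned by candidates() can then be ranked with an exact distance computation such as the k-NN sketch shown earlier.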


At 240, the sets of feature similarity samples 232a, 232b, and 232c are combined to generate the output set 242. Different combination techniques can be used. In one example, multi-occurrence may be used as a criterion to select the samples in the output set 242. The similarity samples in the sets 232a, 232b, and 232c are examined. Some base samples may appear in more than one set of feature similarity samples. A preset threshold number can be configured. For example, if the preset threshold number is configured to be 2, then all the base samples that appear in 2 or more sets of feature similarity samples are put in the output set 242.
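A minimal sketch of the multi-occurrence combination, using the threshold of 2 from the example above:

```python
from collections import Counter

def combine_by_multi_occurrence(feature_similarity_sets, threshold=2):
    """Keep base samples that appear in at least `threshold` of the per-group
    feature similarity sets.

    feature_similarity_sets: iterable of iterables of base-sample identifiers
    """
    counts = Counter()
    for sample_set in feature_similarity_sets:
        counts.update(set(sample_set))          # count each set at most once
    return {s for s, c in counts.items() if c >= threshold}

# Example with three per-group sets and a threshold of 2:
sets = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
print(combine_by_multi_occurrence(sets, threshold=2))   # {'b', 'c'}
```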


In another example, the distance may be used as a criterion. For each sample included in the sets 232a, 232b, and 232c, the distances between the sample and the corresponding input vector (i.e., the feature vector 222a, 222b, or 222c) are compared, and the top X samples having the shortest distances are selected to be the samples in the output set 242, where X is the number of samples in the output set 242. The distance can be the distance discussed previously in the feature similarity analysis 230. For example, the output set 242 can be formed by selecting the X samples with the shortest distances among the samples in the sets 232a, 232b, and 232c. X can be configured by a user, an administrator, or other operation personnel or algorithms. Alternatively or in combination, the output set 242 can be formed by selecting samples in the sets 232a, 232b, and 232c whose distances are below a configured threshold. In some cases, the distances for each feature similarity set can be normalized before the comparison and selection process discussed previously. For example, in the similarity analysis 230a, the shortest distance between the feature vector 222a and the corresponding indices is identified and used to divide the distances between the feature vector 222a and the indices in the corresponding index set, to generate normalized distances for that set. The normalized distances are used for the distance-based combination process. A similar normalization process can be performed for the other similarity analyses 230b and 230c.
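The distance-based combination with per-group normalization might be sketched as follows; each group's distances are divided by that group's shortest distance, and the X samples with the smallest normalized distances are kept (guarding against a zero shortest distance is an assumption of this sketch).

```python
def combine_by_distance(per_group_results, x):
    """Combine per-group k-NN results by normalized distance and keep the X
    base samples with the shortest normalized distances overall.

    per_group_results: dict mapping group name -> list of (sample_id, distance)
    """
    best_per_sample = {}
    for group, results in per_group_results.items():
        if not results:
            continue
        # Normalize by the group's shortest distance (fall back to 1.0 if it is 0).
        shortest = min(d for _, d in results) or 1.0
        for sample_id, d in results:
            nd = d / shortest
            if nd < best_per_sample.get(sample_id, float("inf")):
                best_per_sample[sample_id] = nd
    ranked = sorted(best_per_sample, key=best_per_sample.get)
    return ranked[:x]
```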


Additionally or alternatively, the normalization procedure can be performed if the dimensions of the feature vectors 222a, 222b, and 222c are different. For example, the distances obtained in each similarity analysis 230a, 230b, and 230c can be divided by the square root of the number of dimensions of the corresponding feature vector.


As discussed previously, in some cases, a user-specified weight for each feature group in the user-selected feature groups can be received. In these cases, the user-specified weights are used in the combination process. In one example, as discussed previously, the distances obtained in each similarity analysis 230a, 230b, and 230c can be normalized. The normalized distances for the base training samples in the feature similarity sets 232a, 232b, and 232c can be multiplied by the user-specified weight of the corresponding feature vector, and the top X samples having the shortest weighted normalized distances are selected to be the samples in the output set 242, where X is the number of samples in the output set 242.
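Extending the previous sketch with the user-specified weights, the normalized distances can be multiplied by the weight of the corresponding feature group before the top X samples are selected, following the multiplication described above:

```python
def combine_with_weights(per_group_results, group_weights, x):
    """Weight each group's normalized distances by the user-specified weight
    for that group, then keep the X samples with the shortest weighted
    normalized distances.

    per_group_results: dict mapping group name -> list of (sample_id, distance)
    group_weights: dict mapping group name -> user-specified weight
    """
    best_per_sample = {}
    for group, results in per_group_results.items():
        if not results:
            continue
        shortest = min(d for _, d in results) or 1.0
        weight = group_weights.get(group, 1.0)
        for sample_id, d in results:
            wd = (d / shortest) * weight
            if wd < best_per_sample.get(sample_id, float("inf")):
                best_per_sample[sample_id] = wd
    return sorted(best_per_sample, key=best_per_sample.get)[:x]
```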


In one example, the user may assign a weight of 1 to the import feature group, a weight of 3 to the general observation feature group, and a weight of 5 to the obfuscation feature group. The vectorization process is performed on the sample 202 to extract feature vector 1 based on the features in the import feature group, extract feature vector 2 based on the features in the general observation feature group, and extract feature vector 3 based on the features in the obfuscation feature group. The vectorization process has also been performed on the base training samples to generate different index sets 1, 2, and 3 based on the features in the import feature group, the general observation feature group, and the obfuscation feature group, respectively. The distances between feature vector 1 and the index set 1 are calculated, and the 10 base training samples in the index set 1 that have the shortest distances to feature vector 1 form the feature similarity set 1. The distances for each base training sample in the feature similarity set 1 are normalized and then multiplied by the user-selected weight for the import feature group, i.e., 1, to obtain the weighted normalized distances for these base training samples. Feature similarity sets 2 and 3 are formed for the general observation feature group and the obfuscation feature group, respectively. The distances for each base training sample in the feature similarity sets 2 and 3 are also normalized and then multiplied by the respective user-selected weight for these feature groups, i.e., 3 and 5, to obtain the weighted normalized distances for these base training samples. The weighted normalized distances of all base training samples in the feature similarity sets 1, 2, and 3 are compared, and the top 10 base training samples having the shortest weighted normalized distances are selected to form the output set.


Alternatively or additionally, multiple criteria can be used in combination to determine the output set 242. For example, each base sample in the feature similarity sets 232a, 232b, and 232c can be assigned a multi-occurrence factor based on the number of feature similarity sets in which the base sample appears. Each base sample in the feature similarity sets 232a, 232b, and 232c can also be assigned a distance factor based on its weighted normalized distance to the corresponding feature vector. Each factor can be multiplied with a corresponding factor weight and summed to calculate a total score for the base sample. The factor weights can be pre-configured, e.g., by the administrator of the software service platform, or can be selected by the user input. The base samples can be ordered based on their total scores, and the top X base samples can be selected to form the output set 242.
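A sketch of the multi-criteria combination follows. The text does not fix the sign conventions, so as an assumption of this sketch the distance factor is negated, making both a higher occurrence count and a shorter weighted normalized distance increase the total score.

```python
def combine_by_total_score(occurrence_counts, weighted_distances,
                           occurrence_factor_weight, distance_factor_weight, x):
    """Score each base sample with a weighted sum of two factors and keep the
    X highest-scoring samples.

    occurrence_counts: dict sample_id -> number of feature similarity sets containing it
    weighted_distances: dict sample_id -> shortest weighted normalized distance
    """
    scores = {}
    for sample_id, occ in occurrence_counts.items():
        dist = weighted_distances.get(sample_id, float("inf"))
        # Assumption: the distance factor is the negated distance, so shorter
        # distances raise the total score.
        scores[sample_id] = (occurrence_factor_weight * occ
                             - distance_factor_weight * dist)
    return sorted(scores, key=scores.get, reverse=True)[:x]
```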


In some cases, the output set 242 can be outputted at the user interface of the software service platform that performs the analysis. Other information of the selection process, e.g., the multi-occurrence information, the distance information, or a combination can also be provided. Alternatively or additionally, the output set 242 can be sent to a different device for outputting. Other information, e.g., the multi-occurrence information, the distance information, or a combination can also be sent to the different device.


In a classification operation, the class of the input data point can be determined based on a majority vote of the class labels of the k nearest neighbors. In a regression operation, the average or weighted average of the target values of the k nearest neighbors can be calculated to determine the target value of the input data point.


In some cases, the classification operation or regression operation of k-NN can be performed after the output set 242 is generated. For example, each base sample in the output set 242 has a classification label, and a classification of the sample 202 can be determined based on a majority vote of these base samples. In one implementation, the operation 200 can be used to determine the security status of the software code. In these cases, the input sample can be the software code, and the classification label can include “safe” or “potentially malicious”. Alternatively or additionally, the classification label can further indicate different risk levels of the malicious potential. In these cases, the determined classification label can also be outputted or sent to another device for outputting.
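A minimal sketch of the majority-vote classification over the output set, using labels such as "safe" and "potentially malicious" as in the example above:

```python
from collections import Counter

def classify_by_majority_vote(output_set, labels):
    """Assign the input sample the most common label among the base samples in
    the output set (e.g., "safe" or "potentially malicious").

    output_set: iterable of base-sample identifiers
    labels: dict mapping base-sample identifier -> classification label
    """
    votes = Counter(labels[s] for s in output_set if s in labels)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Example:
labels = {"a": "safe", "b": "potentially malicious", "c": "potentially malicious"}
print(classify_by_majority_vote(["a", "b", "c"], labels))   # potentially malicious
```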


In another implementation, the operation 200 can be used for image processing. In these cases, the input sample can be an example image, and the base samples in the output set 242 can be used to search for a target image that is close to the example image. In these cases, the retrieved target image can also be outputted or sent to another device for outputting.



FIG. 3 is a flowchart showing an example method 300 for generating similarity samples, according to an implementation. The example method 300 can be implemented by a server, e.g., the software service platform 106 shown in FIG. 1. The example method 300 shown in FIG. 3 can be implemented using additional, fewer, or different operations, which can be performed in the order shown or in a different order.


At 310, one or more feature vectors of a sample are obtained. Each of the one or more feature vectors corresponds to a user-selected feature group.


At 320, a set of similarity samples are selected based on the one or more feature vectors. For each of the one or more feature vectors, a corresponding set of feature similarity samples is selected based on the corresponding feature vector. The corresponding selected sets of feature similarity samples for each of the one or more feature vectors are combined to generate the set of similarity samples.



FIG. 4 illustrates a high-level architecture block diagram of a computer 400 according to an implementation. The computer 400 can be implemented as the software service platform 106 and the client device 102 of FIG. 1. The computer 400 can also be used to implement the operations discussed in FIGS. 2-3. The described illustration is only one possible implementation of the described subject matter and is not intended to limit the disclosure to the single described implementation. Those of ordinary skill in the art will appreciate the fact that the described components can be connected, combined, and/or used in alternative ways consistent with this disclosure.


In some cases, the processing algorithm of the similarity sample determination can be implemented in an executable computing code, e.g., C/C++ executable codes. In some cases, the computer 400 can include a standalone Linux system that runs batch applications. In some cases, the computer 400 can include mobile or personal computers.


The computer 400 may comprise a computer that includes an input device, such as a keypad, keyboard, touch screen, microphone, speech recognition device, other device that can accept user information, and/or an output device that conveys information associated with the operation of the computer, including digital data, visual and/or audio information, or a GUI.


The computer 400 can serve as a client, network component, a server, a database or other persistency, and/or any other components. In some implementations, one or more components of the computer 400 may be configured to operate within a cloud-computing-based environment.


At a high level, the computer 400 is an electronic computing device operable to receive, transmit, process, store, or manage data. According to some implementations, the computer 400 can also include, or be communicably coupled with, an application server, e-mail server, web server, caching server, streaming data server, business intelligence (BI) server, and/or other server.


The computer 400 can collect data of network events or mobile application usage events over network 110 from a web browser or a client application, e.g., an installed plugin. In addition, data can be collected by the computer 400 from internal users (e.g., from a command console or by another appropriate access method), external or third parties, other automated applications, as well as any other appropriate entities, individuals, systems, or computers.


Each of the components of the computer 400 can communicate using a system bus 412. In some implementations, any and/or all the components of the computer 400, both hardware and/or software, may interface with each other and/or the interface 402 over the system bus 412 using an Application Programming Interface (API) 408 and/or a service layer 410. The API 408 may include specifications for routines, data structures, and object classes. The API 408 may be either computer language-independent or -dependent and refer to a complete interface, a single function, or even a set of APIs. The service layer 410 provides software services to the computer 400. The functionality of the computer 400 may be accessible for all service consumers using this service layer. Software services, such as those provided by the service layer 410, provide reusable, defined business functionalities through a defined interface. For example, the interface may be software written in JAVA, C++, or other suitable languages providing data in Extensible Markup Language (XML) format or another suitable format. While illustrated as an integrated component of the computer 400, alternative implementations may illustrate the API 408 and/or the service layer 410 as stand-alone components in relation to other components of the computer 400. Moreover, any or all parts of the API 408 and/or the service layer 410 may be implemented as child or sub-modules of another software module, enterprise application, or hardware module without departing from the scope of this disclosure.


The computer 400 includes an interface 402. Although illustrated as a single interface 402 in FIG. 4, two or more interfaces 402 may be used according to particular needs, desires, or particular implementations of the computer 400. The interface 402 is used by the computer 400 for communicating with other systems in a distributed environment connected to a network (whether illustrated or not). Generally, the interface 402 comprises logic encoded in software and/or hardware in a suitable combination and operable to communicate with the network. More specifically, the interface 402 may comprise software supporting one or more communication protocols associated with communications such that the network or interface's hardware is operable to communicate physical signals within and outside of the computer 400.


The computer 400 includes at least one processor 404. Although illustrated as a single processor 404 in FIG. 4, two or more processors may be used according to particular needs, desires, or particular implementations of the computer. Generally, the processor 404 executes instructions and manipulates data to perform the operations of the computer 400. Specifically, the processor 404 executes the functionality disclosed in FIGS. 1-4.


The computer 400 also includes a memory 414 that holds data for the computer 400. Although illustrated as a single memory 414 in FIG. 4, two or more memories may be used according to particular needs, desires, or particular implementations of the computer 400. While memory 414 is illustrated as an integral component of the computer 400, in alternative implementations, memory 414 can be external to the computer 400.


The application 406 is an algorithmic software engine providing functionality according to particular needs, desires, or particular implementations of the computer 400, particularly with respect to functionality required for determining similarity samples. Although illustrated as a single application 406, the application 406 may be implemented as multiple applications 406 on the computer 400. In addition, although illustrated as integral to the computer 400, in alternative implementations, the application 406 can be external to the computer 400.


There may be any number of computers 400 associated with, or external to, and communicating over a network. Furthermore, this disclosure contemplates that many users may use one computer 400, or that one user may use multiple computers 400.


Described implementations of the subject matter can include one or more features, alone or in combination.


For example, in a first implementation, a method, comprising: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.


The foregoing and other described implementations can each, optionally, include one or more of the following features:


A first feature, combinable with any of the following features, further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.


A second feature, combinable with any of the previous or following features, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.


A third feature, combinable with any of the previous or following features, further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.


A fourth feature, combinable with any of the previous or following features, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.


A fifth feature, combinable with any of the previous or following features, further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.


A sixth feature, combinable with any of the previous features, wherein the sample is a software code.


In a second implementation, a computer-readable medium containing instructions which, when executed, cause an electronic device to perform operations comprising: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.


The foregoing and other described implementations can each, optionally, include one or more of the following features:


A first feature, combinable with any of the following features, the operations further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.


A second feature, combinable with any of the previous or following features, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.


A third feature, combinable with any of the previous or following features, the operations further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.


A fourth feature, combinable with any of the previous or following features, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.


A fifth feature, combinable with any of the previous or following features, the operations further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.


A sixth feature, combinable with any of the previous features, wherein the sample is a software code.


In a third implementation, a computer-implemented system, comprising: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform one or more operations comprising obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.


The foregoing and other described implementations can each, optionally, include one or more of the following features:


A first feature, combinable with any of the following features, the operations further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.


A second feature, combinable with any of the previous or following features, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.


A third feature, combinable with any of the previous or following features, the operations further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.


A fourth feature, combinable with any of the previous or following features, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.


A fifth feature, combinable with any of the previous or following features, the operations further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.


A sixth feature, combinable with any of the previous features, wherein the sample is a software code.


Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory computer-storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.


The terms “data processing apparatus,” “computer,” or “electronic computer device” (or equivalent as understood by one of ordinary skill in the art) refer to data processing hardware and encompass all kinds of apparatus, devices, and machines for processing data, including by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application specific integrated circuit). In some implementations, the data processing apparatus and/or special purpose logic circuitry may be hardware-based and/or software-based. The apparatus can optionally include code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. The present disclosure contemplates the use of data processing apparatus with or without conventional operating systems, for example LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS or any other suitable conventional operating system.


A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. While portions of the programs illustrated in the various figures are shown as individual modules that implement the various features and functionality through various objects, methods, or other processes, the programs may instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Conversely, the features and functionality of various components can be combined into single components, as appropriate.


The processes and logic flows described in this specification can be performed by one or more programmable computers, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a CPU, an FPGA, or an ASIC.


Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU. Generally, a CPU will receive instructions and data from a ROM or a RAM or both. The essential elements of a computer are a CPU for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a PDA, a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a USB flash drive, to name just a few.


Computer readable media (transitory or non-transitory, as appropriate) suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM, DVD+/−R, DVD-RAM, and DVD-ROM disks. The memory may store various objects or data, including caches, classes, frameworks, applications, backup data, jobs, web pages, web page templates, database tables, repositories storing business and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory may include any other appropriate data, such as logs, policies, security, or access data, reporting files, as well as others. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.


To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD, LED, or plasma monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse, trackball, or trackpad by which the user can provide input to the computer. Input may also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.


The term “graphical user interface,” or “GUI,” may be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI may represent any graphical user interface, including, but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the results to the user. In general, a GUI may include a plurality of user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons operable by the user. These and other UI elements may be related to or represent the functions of the web browser.


Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of wireline and/or wireless digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless LAN (WLAN) using, for example, 802.11 a/b/g/n/ac/ax/be and/or 802.20, all or a portion of the Internet, and/or any other communication system or systems at one or more locations. The network may communicate, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and/or other suitable information between network addresses.


The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.


In some implementations, any or all of the components of the computing system, both hardware and/or software, may interface with each other and/or with the interface using an API and/or a service layer. The API may include specifications for routines, data structures, and object classes. The API may be either computer language independent or dependent and refer to a complete interface, a single function, or even a set of APIs. The service layer provides software services to the computing system. The functionality of the various components of the computing system may be accessible for all service consumers via this service layer. Software services provide reusable, defined business functionalities through a defined interface. For example, the interface may be software written in JAVA, C++, or another suitable language providing data in XML format or other suitable formats. The API and/or service layer may be an integral and/or a stand-alone component in relation to other components of the computing system. Moreover, any or all parts of the service layer may be implemented as child or sub-modules of another software module, enterprise application, or hardware module without departing from the scope of this disclosure.
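Purely as a non-limiting illustration of how the similarity-sample functionality described in this disclosure might be exposed through such an API or service layer, the following Python sketch shows a thin request handler. The class name, the method names, the JSON message format, and the assumed backend select(sample, groups, weights) call are all hypothetical and are introduced here only for illustration.

# Hypothetical service-layer sketch; every name here and the JSON message
# format are assumptions for illustration only.
import json
from typing import List


class SimilarityService:
    """Exposes similarity-sample selection to service consumers."""

    def __init__(self, backend):
        # backend is assumed to provide a select(sample, groups, weights) call
        # that returns identifiers of the selected similarity samples.
        self._backend = backend

    def handle_request(self, request_body: str) -> str:
        """Parse a JSON request and return the selected sample IDs as JSON."""
        req = json.loads(request_body)
        ids: List[int] = self._backend.select(
            sample=req["sample"],
            groups=req["feature_groups"],
            weights=req.get("weights"),
        )
        return json.dumps({"similarity_samples": ids})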


While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.


Particular implementations of the subject matter have been described. Other implementations, alterations, and permutations of the described implementations are within the scope of the following claims as will be apparent to those skilled in the art. While operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed (some operations may be considered optional), to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.


Moreover, the separation and/or integration of various system modules and components in the implementations described above should not be understood as requiring such separation and/or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.


Accordingly, the above description of example implementations does not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.

Claims
  • 1. A method, comprising: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.
  • 2. The method of claim 1, further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.
  • 3. The method of claim 1, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.
  • 4. The method of claim 3, further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.
  • 5. The method of claim 1, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.
  • 6. The method of claim 5, further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.
  • 7. The method of claim 1, wherein the sample is a software code.
  • 8. A computer-readable medium containing instructions which, when executed, cause an electronic device to perform operations comprising: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.
  • 9. The computer-readable medium of claim 8, the operations further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.
  • 10. The computer-readable medium of claim 8, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.
  • 11. The computer-readable medium of claim 10, the operations further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.
  • 12. The computer-readable medium of claim 8, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.
  • 13. The computer-readable medium of claim 12, the operations further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.
  • 14. The computer-readable medium of claim 8, wherein the sample is a software code.
  • 15. A computer-implemented system, comprising: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform one or more operations comprising: obtaining one or more feature vectors of a sample, each of the one or more feature vectors corresponds to a user-selected feature group; and selecting a set of similarity samples based on the one or more feature vectors, wherein the selecting the set of similarity samples based on the one or more feature vectors comprises: for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector; and combining the corresponding selected set of feature similarity samples for each of the one or more feature vectors to generate the set of similarity samples.
  • 16. The computer-implemented system of claim 15, the operations further comprising: receiving a user-specified weight for each of the one or more feature vectors; and wherein the set of similarity samples is selected further based on the user-specified weight.
  • 17. The computer-implemented system of claim 15, wherein, for each of the one or more feature vectors, selecting a corresponding set of feature similarity samples based on the corresponding feature vector comprises: for each of the one or more feature vectors, calculating a distance between the feature vector and a corresponding feature index set of base samples; and selecting the corresponding set of feature similarity samples based on the calculated distances.
  • 18. The computer-implemented system of claim 17, the operations further comprising: normalizing the calculated distances, and wherein the set of similarity samples are generated based on the normalized distances.
  • 19. The computer-implemented system of claim 15, wherein the obtaining one or more feature vectors of the sample comprises: receiving a user input, wherein the user input indicates a selection of feature groups, and generating the one or more feature vectors of the sample for the feature groups.
  • 20. The computer-implemented system of claim 15, the operations further comprising: outputting, at a user interface, one or more user interface objects indicating a plurality of feature groups.
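
By way of illustration only, and not as part of the claims, the following is a minimal Python sketch of one possible way to carry out the claimed operations: a per-feature-group vector is compared against the corresponding feature index set of the base samples, the resulting distances are normalized, and the per-group selections are combined using user-specified weights. The feature-group index sets, the Euclidean distance metric, the min-max normalization, the scoring rule, and the names used here (for example, FEATURE_GROUPS and select_similarity_samples) are assumptions introduced for this example; other metrics, normalizations, and combination rules could equally be used.

# Illustrative sketch only. The feature-group index sets, the Euclidean
# distance, the min-max normalization, and the scoring rule below are
# assumptions made for this example, not the claimed implementation.
import numpy as np

# Hypothetical feature groups, each defined by a feature index set over the
# full feature vector; a user would select one or more of these groups.
FEATURE_GROUPS = {
    "header_features": [0, 1, 2, 3],
    "string_features": [4, 5, 6],
    "import_features": [7, 8, 9, 10, 11],
}

def select_similarity_samples(sample, base_samples, selected_groups,
                              weights=None, k=5):
    """Select similarity samples for the user-selected feature groups.

    sample           1-D array: full feature vector of the input sample.
    base_samples     2-D array: one row per base (training) sample.
    selected_groups  names of the user-selected feature groups.
    weights          optional user-specified weight per selected group.
    k                number of feature similarity samples kept per group.
    """
    weights = weights or {group: 1.0 for group in selected_groups}
    combined = {}  # base-sample index -> accumulated weighted score

    for group in selected_groups:
        idx = FEATURE_GROUPS[group]
        # Feature vector of the sample for this group, and the corresponding
        # feature index set of every base sample.
        vector = sample[idx]
        base = base_samples[:, idx]

        # Distance between the group feature vector and each base sample.
        dists = np.linalg.norm(base - vector, axis=1)

        # Normalize the distances so scores from different groups are comparable.
        span = dists.max() - dists.min()
        norm = (dists - dists.min()) / span if span > 0 else np.zeros_like(dists)

        # Corresponding set of feature similarity samples: the k nearest.
        nearest = np.argsort(norm)[:k]

        # Combine the per-group selections, weighted by the user-specified weight.
        for i in nearest:
            combined[i] = combined.get(i, 0.0) + weights[group] * (1.0 - norm[i])

    # The combined set of similarity samples, ordered by weighted score.
    return sorted(combined, key=combined.get, reverse=True)[:k]

For example, calling select_similarity_samples(query, base, ["header_features", "string_features"], weights={"header_features": 2.0, "string_features": 1.0}) would weight matches on the hypothetical header features twice as heavily as matches on the string features. Normalizing each group's distances before weighting keeps groups of different dimensionality or value range on a comparable scale, which is what makes the per-group selections meaningful to combine.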