The disclosure generally relates to CPC class G06F and subclass 21/50 and/or 21/56.
Typosquatting is a type of cyberattack that attempts to dupe users by employing Uniform Resource Locators (URLs), filenames, or other user-queried/requested string identifiers that resemble legitimate identifiers as identifiers of malicious content. The typosquatting string identifiers are designed to mimic trusted/known string identifiers such that the corresponding content is believed to be trustworthy. Typosquatting identifiers are advertised on falsified tutorial websites and other resources that facilitate user tasks, which adds to user trust and increases the likelihood of attack success.
Embodiments of the disclosure may be better understood by referencing the accompanying drawings.
The description that follows includes example systems, methods, techniques, and program flows to aid in understanding the disclosure and not to limit claim scope. Well-known instruction instances, protocols, structures, and techniques have not been shown in detail for conciseness.
Requests to pull software packages, including container images, for download, installation, and execution from public and private repositories pose a cybersecurity risk because attackers upload malicious software packages that resemble known/trustworthy versions. A common such cyberattack comprises changing identifiers such as filenames, executable names, etc. to closely resemble the trustworthy version (e.g., “nginx” versus “mginx”). Readily available metadata for repositories of software packages can be a strong indicator of whether user requests are directed at trustworthy or malicious versions. A typosquatting detection agent (“agent”) disclosed herein detects requests to hosts maintaining repositories and leverages this metadata to identify and mitigate requests to typosquatting software packages and containers.
The agent monitors user traffic, for example at an endpoint, to detect when a user request is made to a repository storing software packages. Upon detection, the agent queries a database of known software packages with an identifier for the requested package. The database searches for previously requested packages with same or similar identifiers and corresponding trustworthiness scores. If the requested identifier has not been previously requested, then the agent queries a host(s) of the repository for associated metadata such as a number of downloads, number of stars, source, etc. for feature generation. The agent then converts the metadata into numerical features and computes a weighted sum of the numerical features as a trustworthiness score for the requested software package (“package”). Each of the numerical features is engineered to positively correlate with trustworthiness and weights are chosen according to known importance.
For malicious package detection, the agent compares the score for the requested package to scores of the previously requested packages with same or similar identifiers. A first criterion for maliciousness is whether the difference between one of the previously requested scores and the requested package score exceeds a first threshold, which prompts termination of the pull request to the requested package. A second criterion for maliciousness is whether the difference between one of the previously requested scores and the requested package score exceeds a second (smaller) threshold, which prompts a user alert that the package may be malicious. The database maintains and updates scores by adding the score of the requested package in association with its identifier as well as an indication of whether typosquatting was suspected/detected. Utilizing the metadata for package requests and comparison thereof to similarly identified packages boosts typosquatting detection by incorporating numerical trustworthiness features and leveraging group trustworthiness according to multitudes of scores for packages with proximal identifiers.
Use of the phrase “at least one of” preceding a list with the conjunction “and” should not be treated as an exclusive list and should not be construed as a list of categories with one item from each category, unless specifically stated otherwise. A clause that recites “at least one of A, B, and C” can be infringed with only one of the listed items, multiple of the listed items, and one or more of the items in the list and another item not listed.
“Container” refers to code units that execute on an operating system and are isolated from processes running on the operating system external to the container. Commands to view root level directories, running processes, and available hardware are oblivious to directories, processes, and hardware outside of the container. In contrast to virtual machines, multiple containers can execute on a single operating system. Container isolation is orchestrated by software such as Docker® software that provides operating system level virtualization. Containers can be delivered as downloadable executables referred to as “container images”. Container images are executable files that include configuration files, libraries, library references, executable code, and all other necessary code units such that, once run, they initialize the execution and configuration of the corresponding container.
“Software package” and “package” refer to collections of files that include executable software files, dependencies, metadata for the software, configuration files, and other files pertinent to managing corresponding software. Software package management can be facilitated by a “package manager” which helps users install, upgrade, uninstall, and/or configure software. Software packages may require additional installations for external software dependencies not included. Container and software packages are maintained/distributed by repositories that are managed by hosts and/or services. The hosts and/or services can specify an application programming interface (API) or syntax for querying the hosts and/or services for packages, containers, and metadata thereof.
“Request” and “pull request” refer to requests to download/obtain packages/container images for local execution/installation on an endpoint device.
At stage A, the agent 105 detects the request 100 communicated from the endpoint device 101 to the host 103 for a package stored on the package repositories 102. The agent 105 continuously monitors traffic from the endpoint device 101 to the Internet 110 for malicious behavior and can be running inline on the endpoint device 101 or on an intermediary firewall in the cloud. The agent 105 can keep a log of existing sessions/flows at the endpoint device 101 (e.g., by logging and analyzing pcap files) and can associate each session/flow with a corresponding application. The agent 105 can further have signatures stored thereon that model normal behavior for each application and can compare session/flow traffic against the signatures for malicious behavior detection. Pull requests to repositories can be associated with certain applications running on the endpoint device 101 and the agent 105 can be configured to monitor for pull requests and abnormal behavior specifically within traffic sessions/flows associated with those applications.
Traffic communicated between the endpoint device 101 and the Internet 110 is depicted with a solid double-sided arrow while traffic communicated between the endpoint device 101 and the host 103 is depicted with a dotted double-sided arrow. This is for illustrative purposes. In embodiments where communication from endpoint device 101 to host 103 occurs via the Internet 110, the traffic to and from the host 103 depicted as dotted is a subset of the traffic to and from the Internet 110 depicted as solid.
At stage B, the agent 105 queries the database 104 with the identifier 118 corresponding to the request 100 to retrieve scores for packages with similar identifiers. The database 104 searches for package identifiers that are similar to the identifier 118. For instance, the database 104 can search for package identifiers with a Levenshtein distance to the identifier 118 below a threshold Levenshtein distance. The threshold Levenshtein distance can be a fixed distance determined based on analysis of previous typosquatting instances for pull requests to the package repositories 102. The database 104 can further limit the number of similar identifiers to a threshold number of identifiers. The threshold number of identifiers and threshold Levenshtein distance can be indicated in a message/query indicating the identifier 118 communicated by the agent 105 or can be parameters set at the database 104. Other quantifiable notions of similarity such as semantic similarity according to natural language processing embeddings of the identifiers can be used.
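The identifier lookup at stage B can be illustrated with a minimal sketch. It assumes an in-memory mapping of stored identifiers to scores; the function names, the example scores, and the default thresholds are hypothetical, and the cutoff is written as "at most `max_distance`", which is one possible reading of the threshold described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_identifiers(query, known_scores, max_distance=2, max_results=5):
    """Return (identifier, score, distance) tuples for stored identifiers
    within max_distance of query, closest first, capped at max_results."""
    hits = []
    for identifier, score in known_scores.items():
        distance = levenshtein(query, identifier)
        if distance <= max_distance:
            hits.append((identifier, score, distance))
    hits.sort(key=lambda hit: (hit[2], hit[0]))
    return hits[:max_results]


# Hypothetical stored scores, for illustration only.
stored = {"nginx": 0.9, "redis": 0.8, "ngimx": 0.2}
pairs = similar_identifiers("mginx", stored)
```

Here `pairs` contains the entries for “nginx” (distance 1) and “ngimx” (distance 2), while “redis” is excluded because it is too far from the query.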
The database 104 retrieves a list of similar identifiers and corresponding scores and returns them to the agent 105 as pairs 106. An example list 108 of identifier trustworthiness score pairs is the following:
At stage C, the agent 105 communicates the metadata query 114 to the host 103 to retrieve metadata for the package corresponding to the request 100. The metadata query 114 indicates the identifier 118 for the request 100 as well as desired metadata fields such as a package source, a number of stars, a number of tags, a number of downloads, a registration date of the publisher, a number of previous images of the publisher, and an average number of image stars for the publisher. The agent 105 communicates with the host 103 via channels and according to syntax dictated by the host 103, for instance an API to a service at the host 103 that includes functionality for querying for metadata of packages by identifier. Alternatively, the agent 105 can send a Hypertext Transfer Protocol (HTTP) request to a uniform resource locator (URL) for a website hosted by the host 103 that corresponds to the identifier 118. To exemplify, the website can have specific syntax such as “example.com/_/*” such that a query to the URL with a package identifier in the “*” placeholder will return an HTTP response including a webpage for the package (or an HTTP 404 response code if no such package exists), and the agent 105 can mine the HTTP response for metadata of the package. The agent 105 can, alternatively, send an HTTP request to a URL having search functionality such as “example.com/search?q=*” where “*” is the identifier of the package and can query an additional URL returned in an HTTP response as search results. The agent 105 can communicate a separate query in the metadata query 114 for publisher features according to any of the previous embodiments.
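The two URL patterns mentioned above can be sketched as follows. This is only an illustration: “example.com” is the placeholder host from the text, the `https` scheme and all function names are assumptions, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

# Placeholder host from the example above; the scheme is an assumption.
BASE = "https://example.com"


def package_page_url(identifier: str) -> str:
    """Fill the "*" placeholder of the "example.com/_/*" pattern."""
    return f"{BASE}/_/{quote(identifier, safe='')}"


def search_url(identifier: str) -> str:
    """Fill the "*" placeholder of the "example.com/search?q=*" pattern."""
    return f"{BASE}/search?{urlencode({'q': identifier})}"


def interpret_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse lookup outcome."""
    if status_code == 404:
        return "not_found"   # no such package at the host
    if 200 <= status_code < 300:
        return "found"       # response body can be mined for metadata
    return "error"
```

Percent-encoding the identifier keeps characters such as “/” or spaces in a requested name from altering the path or query string of the lookup.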
The host 103 returns pull request package metadata (“package metadata”) 112 to the agent 105. Example metadata 116 for a package comprises the following:
At stage D, the agent 105 generates a score for the package indicated in the request 100 according to the package metadata 112. The agent 105 generates numerical features for each of the fields indicated in the package metadata 112. The agent 105 can convert string features into numerical features and normalize the numerical features to be in a same range (e.g., [0,1]). To exemplify, for the example metadata 116, the “Source” metadata field can be converted to a 1 or 0 numerical feature based on trustworthiness of the indicated source (e.g., 1 for verified publishers, verified open-source projects, and official packages at the package repositories 102, and 0 for other sources). Number of stars and average image stars for the publisher can be scaled by the maximal number of stars, e.g., by dividing by 5. Number of tags, number of downloads, and number of previous images for the publisher can be scaled with a quantile transformation against other packages/publishers in the package repositories 102 (wherein the quantiles are scaled to lie in [0,1]). Registration date can be scaled by a time-based distribution that weights publishers with older registration dates higher than publishers with newer registration dates but does not unreasonably upscale registration dates beyond a fixed time horizon in the past. These examples are given for the purposes of illustration and any appropriate types of scaling and methods of standardization can be implemented.
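The scalings described above can be sketched as small helper functions. The source labels in `TRUSTED_SOURCES`, the three-year horizon, and the function names are assumptions made for illustration; the quantile feature uses a simple empirical quantile against a sorted reference sample.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical labels for sources treated as trustworthy.
TRUSTED_SOURCES = {"verified publisher", "official", "open source program"}


def source_feature(source: str) -> float:
    """1.0 for a trusted source label, 0.0 otherwise."""
    return 1.0 if source.strip().lower() in TRUSTED_SOURCES else 0.0


def stars_feature(stars: float, max_stars: float = 5.0) -> float:
    """Scale a star rating by the maximal rating, clipped to [0, 1]."""
    return min(max(stars / max_stars, 0.0), 1.0)


def quantile_feature(value: float, reference_sorted: list) -> float:
    """Fraction of reference values less than or equal to value."""
    if not reference_sorted:
        return 0.0
    return bisect_right(reference_sorted, value) / len(reference_sorted)


def age_feature(registered: date, today: date, horizon_days: int = 3 * 365) -> float:
    """Older registrations score higher, saturating at the time horizon."""
    age_days = (today - registered).days
    return min(max(age_days, 0), horizon_days) / horizon_days
```

Capping the publisher age at the horizon is one way to avoid upscaling very old registration dates, as the text requires; any monotone saturating function would serve the same purpose.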
After generating normalized numerical features for each metadata field indicated in the package metadata 112, the agent 105 generates a weighted sum of the normalized numerical features. When the normalized numerical feature values are denoted $\{f_i\}_i$ and the weights are denoted $\{w_i\}_i$, then the weighted sum is computed as $\sum_i f_i w_i$. The weights can be determined by a domain-level expert based on known importance of each feature for typosquatting detection. Alternatively, the weights can be determined using regression on the normalized numerical feature values for previous typosquatting/non-typosquatting instances with the feature values as the explanatory variables and the typosquatting outcome (i.e., 1 for typosquatting and 0 for non-typosquatting) as the response variable. This regression determines relative importance of each of the explanatory variables (features) for predicting the response variable (typosquatting outcome) as the weights. The agent 105 can query the database 104 for numerical features for previously requested typosquatting and non-typosquatting packages. Other machine-learning based methods can be used to determine feature weights.
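As a minimal sketch of the weighted sum, assuming features and weights are held in dictionaries keyed by feature name (the names and the numeric values below are hypothetical, not values from the disclosure):

```python
def trust_score(features: dict, weights: dict) -> float:
    """Weighted sum of normalized features: sum over i of f_i * w_i."""
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical normalized features and expert-chosen weights.
example_features = {"source": 1.0, "stars": 0.8, "downloads": 0.5}
example_weights = {"source": 0.5, "stars": 0.2, "downloads": 0.3}
example_score = trust_score(example_features, example_weights)
```

With these illustrative values the score is 0.5·1.0 + 0.2·0.8 + 0.3·0.5 = 0.81. If the weights sum to 1 and every feature lies in [0,1], the score also lies in [0,1], which keeps scores comparable across packages.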
At stage E, the agent 105 applies one or more score criteria to the score generated for the request 100 and scores returned in the pairs 106 for packages with similar identifiers. Suppose the score for the request 100 is denoted $s$ and the scores indicated in the pairs 106 are $\{s_i\}_{i=1}^{n}$. Then a first criterion that the request 100 is typosquatting is that for at least one score $s_i$:
$$s_i - s > \tau_1$$
i.e., that the difference between at least one of the scores $\{s_i\}_i$ and the score of the request 100 is above a threshold $\tau_1$. This first criterion indicates that the score of the request 100 is significantly below at least one score for packages of similar identifiers and therefore is not trustworthy. The parameter $\tau_1$ can be tuned based on previous typosquatting/non-typosquatting instances. If the above inequality is not satisfied, then a second criterion for determining that the request 100 is typosquatting is that, for at least one score $s_i$:
$$s_i - s > \tau_2$$
for a threshold $\tau_2 < \tau_1$, i.e., that the difference between at least one of the scores $\{s_i\}_i$ and the score of the request 100 is above a (smaller) threshold $\tau_2$. The second criterion indicates that the score of the request 100 is smaller but not significantly smaller than at least one score for packages of similar identifiers. Note that for each of these criteria, it is sufficient to check the inequality for the maximal of the scores $\{s_i\}_i$ because the difference $s_i - s$ is largest for the maximal $s_i$. If the score of the request 100 fails the first criterion and satisfies the second criterion, then the request 100 may not be trustworthy and lesser corrective action than for a score satisfying the first criterion may be warranted.
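The observation that only the maximal score needs to be checked can be illustrated with a short sketch. The function names and the threshold values are hypothetical; the second function is the shortcut, and the first is the literal "for at least one score" form of the criteria.

```python
def any_gap_exceeds(s: float, scores: list, tau: float) -> bool:
    """Literal form: some retrieved score exceeds s by more than tau."""
    return any(s_i - s > tau for s_i in scores)


def max_gap_exceeds(s: float, scores: list, tau: float) -> bool:
    """Shortcut form: only the maximal retrieved score is compared."""
    return bool(scores) and max(scores) - s > tau


def stage_e_outcome(s: float, scores: list, tau1: float, tau2: float) -> str:
    """Apply the first criterion, then the second, using the shortcut."""
    if not tau2 < tau1:
        raise ValueError("tau2 must be smaller than tau1")
    if max_gap_exceeds(s, scores, tau1):
        return "first_criterion"
    if max_gap_exceeds(s, scores, tau2):
        return "second_criterion"
    return "none"
```

For example, with retrieved scores 0.9 and 0.5 and thresholds 0.5 and 0.2 (illustrative values), a request score of 0.2 meets the first criterion, 0.6 meets only the second, and 0.85 meets neither.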
Additional and alternative criteria to the above can be applied to $s$ and $\{s_i\}_{i=1}^{n}$ for typosquatting detection. For instance, the criteria can comprise that one of the scores $s_i$ is significantly higher than the remaining scores (including the score of the request 100). In the above examples, typosquatting is detected if any of multiple criteria are satisfied. Alternatively, typosquatting can be detected if a threshold number of criteria are satisfied. Different trustworthiness scores can beget different criteria and different parameters tuned according to prior typosquatting for those different scores.
If the agent 105 determines that the scores $s$ and $\{s_i\}_{i=1}^{n}$ satisfy one of the above criteria (i.e., that the request 100 corresponds to a typosquatting attack), then the agent 105 performs corrective action for the endpoint device 101. If the first criterion is satisfied, the agent 105 terminates any established flows/sessions between the endpoint device 101 and the host 103 that correspond to the request 100 (e.g., flows/sessions associated with an application for the request 100). Additionally, the agent 105 can issue a warning to the endpoint device 101 that indicates to a user that typosquatting has been detected along with the identifier 118 and any similar package identifiers that are known to be trustworthy (for instance, a package identifier corresponding to a score that satisfies the second criterion above). The agent 105 can issue instructions to locally delete any data downloaded from the request 100 from memory at the endpoint device 101 and terminate any processes related to download, installation, execution, etc. of the requested package. If the first criterion is not satisfied but the second criterion is satisfied, the agent 105 issues a warning at the endpoint device 101 that indicates the requested package may be malicious and asks the user whether to proceed. The agent 105 communicates the identifier 118 associated with the generated score $s$ to the database 104 along with an indication of whether the package corresponding to the identifier 118 corresponds to a typosquatting attack for storage and future typosquatting detection. When one of the scores $\{s_i\}_i$ is significantly larger (e.g., according to a tunable threshold) than the remaining scores including the score $s$ of the request 100, the agent 105 can suggest the package corresponding to the significantly larger score as a trustworthy alternative for the requested package.
At block 302, the agent determines whether a software package pull request is detected. For an inline deployment, the agent can monitor processes by applications that are associated with typosquatting and can determine that one or more processes launched for those applications indicate a pull request. Alternatively, for both an inline and cloud deployment the agent can monitor packets in sessions/flows for applications and/or destination IP addresses associated with package repositories/registries to determine whether a pull request is indicated in logs of the packets and/or can extract application layer information to detect any pull requests. If the agent determines that a pull request is detected, flow proceeds to block 304. Otherwise, flow returns to block 300 for continued scanning of user traffic at the endpoint device.
At block 304, the agent retrieves trustworthiness scores for software packages having similar identifiers to the software package indicated in the pull request from a trustworthiness score database (“database”). The operations at block 304 are depicted in greater detail in reference to
At block 305, the agent determines whether the response from the database indicates that the requested package is previously requested and trusted. The agent can determine that a repository/registry associated with a previously requested package returned by the database is the same as the repository/registry corresponding to the requested package, that the previously requested package has a same identifier as the requested package, and that a trustworthiness score for the previously requested package is above a threshold score and/or that the previously requested package had a benign verdict. The agent can use further criteria in determining whether the requested package is trusted such as that the previously requested package was analyzed for typosquatting within a recent time period. If the agent determines that the requested package is previously requested and trusted, flow returns to block 302. Otherwise, flow proceeds to block 306.
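The check at block 305 can be sketched as follows. The record layout, the score threshold, and the 30-day freshness window are assumptions for illustration, and this sketch requires all conditions to hold at once, which is only one of the combinations the text permits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredRecord:
    identifier: str
    repository: str
    score: float
    benign: bool
    evaluated_at: datetime


def is_previously_trusted(identifier, repository, records, now,
                          min_score=0.7, max_age=timedelta(days=30)):
    """True if a stored record matches the request exactly and is trusted."""
    for record in records:
        if (record.identifier == identifier
                and record.repository == repository     # same repository/registry
                and record.benign                        # prior benign verdict
                and record.score >= min_score            # score above threshold
                and now - record.evaluated_at <= max_age):  # recently analyzed
            return True
    return False
```

A request that passes this check can skip the metadata query and scoring, which is why flow returns directly to monitoring in that case.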
At block 306, the agent queries the host of the repository/registry for metadata of the requested software package. The query indicates an identifier of the requested software package and can further include syntax that specifies desired metadata fields. In some embodiments, the query is to a URL hosted by the host that corresponds to the requested software package and in other embodiments, the query is via an API of a service run by the host that allows for pull requests and metadata queries to the repository/registry. Example metadata fields in the query to the host that indicate trustworthiness of the requested software package include a source, a number of stars, a number of tags, a number of downloads, a registration date of the publisher, a number of previous images of the publisher, an average number of image stars of the publisher, etc.
At block 308, the agent generates numerical features from metadata returned by the host and computes a trustworthiness score for the requested software package as a weighted sum of the features. The agent converts string metadata fields to numerical features and normalizes the numerical features (e.g., to lie in a same interval or to match a same distribution). The trustworthiness score is then determined as a sum of each normalized numerical feature multiplied by a respective weight. The weights are determined, for instance, by a domain-level expert based on known importance for typosquatting detection or, alternatively, using regression with numerical features of previously requested packages corresponding to known typosquatting or non-typosquatting instances. At blocks 306 and 308, the agent can additionally regenerate scores for one or more of the software packages having similar identifiers as those scores become outdated, following the same method of querying respective hosts and generating numerical features/scores from the returned metadata.
At block 310, the agent applies trustworthiness criteria to the retrieved scores and the requested package score and performs corrective action. The operations at block 310 are described in greater detail in reference to
At block 311, the agent communicates the identifier of the requested software package, the trustworthiness score, and indications of the verdict to the database for updates. When the identifier is already present in the database but corresponds to a package at a separate repository to the requested software package, the database can generate an additional entry with the same identifier for the package/container and additional information to distinguish repositories. For instance, the identifiers can be concatenated with repository names. The tuple of the identifier, trustworthiness score, and verdict can be stored as a data structure that facilitates efficient retrieval. The database can further store an identifier of the repository of the requested software package and a time stamp of when the typosquatting verdict occurred in association with this tuple to inform subsequent determinations of whether requested software packages have been previously requested. Flow returns to block 300.
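The database update at block 311 can be sketched with an in-memory store. The class and field names are hypothetical; keying rows by a (repository, identifier) pair plays the same role as concatenating the identifier with the repository name.

```python
from datetime import datetime, timezone


class ScoreStore:
    """Toy stand-in for the trustworthiness score database."""

    def __init__(self):
        self._rows = {}

    def upsert(self, repository, identifier, score, typosquatting, evaluated_at=None):
        """Insert or overwrite the tuple for one package at one repository."""
        self._rows[(repository, identifier)] = {
            "score": score,
            "typosquatting": typosquatting,
            "evaluated_at": evaluated_at or datetime.now(timezone.utc),
        }

    def get(self, repository, identifier):
        """Return the stored row, or None if the package was never requested."""
        return self._rows.get((repository, identifier))

    def by_identifier(self, identifier):
        """Return rows for the same identifier across all repositories."""
        return {repo: row for (repo, ident), row in self._rows.items()
                if ident == identifier}
```

Because the repository is part of the key, the same identifier hosted at two repositories yields two separate entries, each with its own score, verdict, and time stamp.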
At block 402, the database searches for an initial set of similar identifiers according to the similarity metric. For instance, the database can comprise a BK-tree data structure that facilitates efficient lookup for similar strings according to Levenshtein or other similarity distance. A simpler implementation is to compute a Levenshtein distance between the identifier of the requested software package and all of the identifiers stored in the database, sort the identifiers in the database by Levenshtein distance, and take the n-closest identifiers or identifiers within a threshold Levenshtein distance. Note that this approach can be applied for any similarity metric between identifiers. Embodiments can supplement a distance-based similarity with heuristics-based rules, for instance by treating an identifier with or without a trailing “s” as a similar identifier.
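A BK-tree lookup of the kind mentioned above can be sketched as follows. This is an illustrative implementation rather than the database's actual structure; the class and method names are assumptions, and a compact Levenshtein function is included so the block is self-contained.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        new_row = [i]
        for j, ch_b in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1,
                               row[j - 1] + (ch_a != ch_b)))
        row = new_row
    return row[-1]


class BKTree:
    """Each node is (word, children), with children keyed by edge distance."""

    def __init__(self, distance=edit_distance):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # identifier already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, radius: int) -> list:
        """Return sorted (distance, word) pairs within radius of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= radius:
                results.append((d, word))
            # Triangle inequality: only edges in [d - radius, d + radius] can match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)
```

The pruning step relies on the similarity measure being a true metric, which holds for Levenshtein distance; a similarity measure that violates the triangle inequality would need the brute-force scan instead.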
At block 404, the database determines whether search criteria are satisfied. The search criteria can be specified by the query from the agent to the database or can be hard coded rules. For instance, the search criteria can be that the number of identifiers is below an upper threshold number of identifiers, that each identifier is below a threshold similarity metric value, etc. In some embodiments where a threshold similarity metric value can lead to zero packages with identifiers sufficiently close, the search criteria can be that the number of identifiers is above a lower threshold number of identifiers. If the search criteria are satisfied, flow proceeds to block 408. Otherwise, flow proceeds to block 406.
At block 406, the database updates the set of similar identifiers according to one or more search criteria that are not satisfied. For instance, the database can cut off the number of identifiers to the n-closest identifiers according to a similarity metric where n is an upper threshold number of acceptable identifiers. The database can remove identifiers above a threshold similarity metric distance d from the identifier of the requested software package. If the number of similar identifiers is below a lower threshold number of identifiers, the database can increase the threshold similarity metric distance d to allow for additional similar identifiers. Flow returns to block 404.
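The loop between blocks 404 and 406 can be sketched as below. The function assumes distances to all stored identifiers have already been computed; the parameter names, the unit increment of the distance threshold, and the `max_radius` stopping bound are assumptions added so the loop always terminates.

```python
def refine_candidates(candidates, max_results, min_results, radius, max_radius):
    """candidates: list of (identifier, distance) pairs.
    Returns (selected pairs, final radius)."""
    while True:
        # Remove identifiers beyond the current distance threshold.
        current = sorted((c for c in candidates if c[1] <= radius),
                         key=lambda c: (c[1], c[0]))
        # Cut off to the n-closest identifiers (upper threshold).
        if len(current) > max_results:
            current = current[:max_results]
        # Search criteria satisfied, or the threshold cannot grow further.
        if len(current) >= min_results or radius >= max_radius:
            return current, radius
        # Too few identifiers: widen the distance threshold and retry.
        radius += 1
```

For example, with candidates at distances 1, 3, 4, and 4, a starting threshold of 1, and a lower bound of two identifiers, the threshold grows to 3 before two identifiers are returned.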
At block 408, the database returns the current set of similar identifiers to the identifier of the requested software package and their corresponding trustworthiness scores. The database can further return similarity metric values from each of the similar identifiers to the identifier of the requested software package. The database can retrieve the trustworthiness scores as it generates similarity metric values according to the aforementioned operations or it can retrieve the trustworthiness scores once the final (current) set of similar identifiers is determined. The trustworthiness scores can be associated with identifiers as a hash data structure in the database for efficient retrieval. In embodiments where one of the similar identifiers is identical to the identifier of the requested software package, the database can retrieve and return additional data such as a repository where the previously requested package was pulled from and when the previously requested package was evaluated for typosquatting.
At block 500, a typosquatting detection agent (“agent”) generates a first threshold score relative to the retrieved scores. The first threshold score can be chosen as any threshold less than the retrieved scores to determine whether the requested package score is significantly below the scores for packages with similar identifiers. For instance, denoting the retrieved scores as $\{s_i\}_{i=1}^{n}$ and
$$s_j = \max_{1 \le i \le n} s_i,$$
then the first threshold score can be $s_j - \tau_1$ for some tunable parameter $\tau_1$, i.e., the maximum of the scores minus the tunable parameter.
At block 502, the agent determines whether the requested package score is below the first threshold score. Denoting the requested package score $s$, the agent determines whether $s < s_j - \tau_1$. If the agent determines the requested package score is below the first threshold score, flow skips to block 510. Otherwise, flow proceeds to block 504.
At block 504, the agent generates a second threshold score relative to the retrieved scores and the requested package score. For instance, using the above notation the second threshold score can be $s_j - \tau_2$, i.e., the maximal of the retrieved scores minus a tunable parameter $\tau_2 < \tau_1$. The parameters $\tau_2$ and $\tau_1$ can be tuned according to scores of previous typosquatting pull requests or according to statistics of any of the package scores.
At block 506, the agent determines whether the requested package score is below the second threshold score. Using the above notation, the agent determines whether $s < s_j - \tau_2$. If the agent determines that the requested package score is below the second threshold score, flow proceeds to block 508. Otherwise, flow proceeds to block 514.
At block 508, the agent generates a user alert for the requested package. The user alert indicates an identifier of the requested package and a prompt asking the user whether to proceed with the pull request at an endpoint device. Additionally, the user alert can indicate additional metadata associated with the package such as an associated repository, a trustworthiness score, etc. If the user decides to proceed with the pull request, flow proceeds to block 514. Otherwise, flow proceeds to block 512.
At block 510, the agent terminates the pull request. The agent terminates or sends instructions to terminate any sessions, flows, or processes associated with downloading, installing, and/or executing content of the typosquatting package. The agent can further instruct an endpoint device that communicated the pull request to delete local data downloaded from the pull request and can alert a user of the endpoint device that the pull request was terminated due to typosquatting. Flow proceeds to block 512.
At block 512, the agent indicates typosquatting in a verdict for the requested package. The agent can further indicate a confidence of the typosquatting verdict. The confidence can be determined according to the difference $s_j - s$, i.e., how much smaller the trustworthiness score for the requested package is than the maximal trustworthiness score for packages with similar identifiers. The verdict can further identify a trustworthy package as the package having a similar identifier whose trustworthiness score is maximal among the retrieved scores and sufficiently larger than all of the other retrieved scores and the requested package score.
At block 514, the agent indicates a benign verdict for the requested package. Subsequently, the agent allows flows/sessions associated with the requested package and processes associated with downloading/installing/executing the requested package at an endpoint device. The agent continues to monitor user traffic/processes on the endpoint device related to the package for malicious behavior.
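The flow through blocks 500 to 514 can be summarized in a short sketch. The `Verdict` fields, the action labels, the `user_proceeds` callback standing in for the user prompt, and the handling of an empty set of retrieved scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Verdict:
    typosquatting: bool
    action: str                  # "terminate", "alert", or "allow"
    confidence: float            # gap s_j - s, floored at 0
    best_score: Optional[float]  # maximal retrieved score s_j, if any


def evaluate(s: float, retrieved: Sequence[float], tau1: float, tau2: float,
             user_proceeds: Callable[[], bool]) -> Verdict:
    if not retrieved:
        # Assumption: with no similar packages there is nothing to compare against.
        return Verdict(False, "allow", 0.0, None)
    s_j = max(retrieved)
    gap = max(s_j - s, 0.0)
    if s < s_j - tau1:                 # blocks 500/502, then 510 and 512
        return Verdict(True, "terminate", gap, s_j)
    if s < s_j - tau2:                 # blocks 504/506, then 508
        if user_proceeds():            # user accepts the risk: block 514
            return Verdict(False, "alert", gap, s_j)
        return Verdict(True, "alert", gap, s_j)   # user declines: block 512
    return Verdict(False, "allow", gap, s_j)      # block 514
```

Passing the user decision as a callback keeps the prompt lazy, so the user is asked only when the score falls between the two thresholds.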
In the above operations for
The above disclosure is described with reference to software packages. Any instance of the term “software package” or “package” can be replaced with a “container” or “container image” while equivalently maintaining the recited operations, methods, components, techniques, etc. with appropriate modifications such as adapting to APIs of container repository services versus software package services, appropriately interfacing with hosts of such services, terminating flows/sessions/processes as determined by the manner with which endpoint devices download/acquire software packages versus container images when typosquatting is detected, etc.
The flowcharts are provided to aid in understanding the illustrations and are not to be used to limit scope of the claims. The flowcharts depict example operations that can vary within the scope of the claims. Additional operations may be performed; fewer operations may be performed; the operations may be performed in parallel; and the operations may be performed in a different order. For example, the operations depicted in blocks 500/502 and 504/506 can be performed in parallel or concurrently. The operations at blocks 304 and 306 can occur in any order.
As will be appreciated, aspects of the disclosure may be embodied as a system, method or program code/instructions stored in one or more machine-readable media. Accordingly, aspects may take the form of hardware, software (including firmware, resident software, micro-code, etc.), or a combination of software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” The functionality presented as individual modules/units in the example illustrations can be organized differently in accordance with any one of platform (operating system and/or hardware), application ecosystem, interfaces, programmer preferences, programming language, administrator preferences, etc.
Any combination of one or more machine-readable medium(s) may be utilized. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable storage medium may be, for example, but not limited to, a system, apparatus, or device, that employs any one of or combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code. More specific examples (a non-exhaustive list) of the machine-readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine-readable storage medium is not a machine-readable signal medium.
A machine-readable signal medium may include a propagated data signal with machine-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine-readable signal medium may be any machine-readable medium that is not a machine-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a machine-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The program code/instructions may also be stored in a machine-readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.