Domain Name System network services are generally ubiquitous in IP-based networks. Typically, a client (e.g., a computing device) attempts to connect to one or more servers over the Internet by using web addresses (e.g., Uniform Resource Locators (URLs) including domain names or fully qualified domain names). Web addresses are translated into IP addresses. The Domain Name System (DNS) is responsible for performing this translation from web addresses into IP addresses. Specifically, requests including web addresses are sent to DNS servers that generally reply with the corresponding IP addresses, or with an error message if the domain has not been registered (i.e., a non-existent domain).
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
The Domain Name System (DNS) is a globally distributed database that provides core functionality for the operation of the Internet and local intranets. In particular, DNS provides the ability to locate Internet resource information, for example, IP addresses for domain names. The distributed nature of the DNS allows this resource information to be updated dynamically and controlled by the resource holders. To locate the current information, a client device, for example, a laptop, queries the DNS via a standard protocol. In practice, client devices do not perform the database lookup, referred to as resolution, themselves, but depend on other specialized servers to act on their behalf. These servers are called DNS recursive resolvers (e.g., a DNS recursor), and they are able to expedite the resolution of DNS records for a large number of clients through caching and optimized software. Recursive resolvers can also enact policies, for example, to limit client access to the Internet or specific resources.
DNS Tunneling (DNST) generally refers to a method of using the DNS protocol to carry data other than what the protocol was originally designed for. For example, this can include legitimate uses, such as anti-spam and antimalware tools that use DNS as a remote query service, as well as malicious uses, such as malware command and control (C2) and exfiltration services. Due to the potential for abuse (e.g., using DNST for cyber-attacks and/or other undesired activities), there exists a need to detect and mitigate such DNS Tunneling activities. Existing approaches for DNST detection have typically used signatures and/or machine learning (see, e.g., Yu et al. “Behavior Analysis based DNS Tunneling Detection and Classification with Big Data Technologies”, publicly available at https://pdfs.semanticscholar.org/b7bc/7d2eb9c0f18b5eOe5da3cc6903acfe7c29fe.pdf; and Farnham, Gregory. “Detecting DNS Tunneling”, publicly available at https://sansorg.egnyte.com/d1/r4ouqZy5dp).
DNS Tunneling encodes content data into the fully qualified domain name (FQDN) of the query to send data to the server, and the server can send data back by encoding information into the answers. The encoding can depend on the type of query. Several reference systems are publicly available (see, e.g., Ekman, Erik. “Iodine”, publicly available at https://github.com/yarrick/iodine; and “DNSCAT2”, publicly available at https://github.com/iagox86/dnscat2).
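By way of illustration only, the following Python sketch shows how content data could be carried in the labels of a query name, which is the behavior that DNST detection looks for. The base32 encoding, the function names, and the domain t.example.com are assumptions made for this example and are not taken from any of the reference systems cited above.

```python
import base64

MAX_LABEL = 63   # DNS limit on the length of a single label
MAX_NAME = 253   # common limit on a full name in dotted text form


def encode_payload(payload: bytes, tunnel_domain: str) -> str:
    """Encode payload bytes as base32 labels prepended to tunnel_domain."""
    # Base32 output uses only letters and the digits 2-7, all valid in names.
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    fqdn = ".".join(labels + [tunnel_domain])
    if len(fqdn) > MAX_NAME:
        raise ValueError("payload too large for a single query name")
    return fqdn


def decode_payload(fqdn: str, tunnel_domain: str) -> bytes:
    """Recover the payload bytes from a name built by encode_payload."""
    suffix = "." + tunnel_domain
    if not fqdn.endswith(suffix):
        raise ValueError("query name is not under the tunnel domain")
    text = fqdn[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)  # restore the stripped base32 padding
    return base64.b32decode(text)


# Hypothetical example: the payload appears as an unusual-looking subdomain.
query_name = encode_payload(b"secret data", "t.example.com")
```

The resulting query name has a long, high-entropy leading label, which is why content features of the query string are useful signals for detection.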
Thus, new and improved techniques for DNS security, and specifically, new and improved techniques for DNST detection, are needed.
Various techniques for applying natural language processing as features for DNS tunneling detection are disclosed. In some embodiments, a system/process/computer program product for applying natural language processing as features for DNS tunneling detection includes aggregating DNS traffic from one or more networks; automatically classifying the aggregated DNS traffic to detect DNS tunneling activity; and performing an action based on the detected DNS tunneling activity based on a policy.
In an example implementation, we focus on using content features to improve DNS tunneling (DNST) detection via DNS query-response logs. Specifically, content features can be applied by having positive and negative labeled datasets and using machine learning (ML) classifiers to determine a separation/distance based on features in either data set. An example content feature is to use counts of n-grams observed in the DNS query. An n-gram in this context is an n-char sequence of letters or numbers within the string, for example, the 3-grams of the string “hello” include “hel”, “ell”, “llo”. The count of n-grams is the number of times any given n-gram is observed, and this number can be compared with expected values in other strings.
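The n-gram counting described above can be sketched in Python as follows. This is a minimal illustration; the function names char_ngrams and ngram_counts are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous n-character substring of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count how many times each n-gram is observed in text."""
    return Counter(char_ngrams(text, n))


print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```

The counts for a query string can then be compared with the counts expected from other strings, for example, by passing them to an ML classifier as features.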
However, n-grams are not a particularly strong feature, as the content features in the new tunnels may not match the training data. Further, an actor (e.g., a nefarious actor) may attempt to add some common substrings to the content of their tunnel to confuse such a DNST detection algorithm (e.g., a well-known penetration testing (pen test) tool prepends strings such as “www,” “post,” and “api” to its encodings to try to evade such existing DNST detection techniques).
Thus, new and improved techniques for applying natural language processing as features for DNS tunneling detection are disclosed.
Example system embodiments for applying natural language processing (NLP) as features for DNS tunneling detection are further described below.
To avoid the above-described problems with existing DNST detection approaches, the disclosed techniques for DNST detection apply a natural language anomaly score as a feature for classifiers implemented using machine learning algorithms as further described below. Unlike typical classification algorithms where we would have labeled sets of data, the disclosed techniques define boundaries to separate the data. Specifically, in the disclosed anomaly detection classifiers, we only train on data considered “normal” and then determine a boundary that marks data outside of it as an outlier or an anomaly to facilitate DNST detection. The disclosed anomaly detection algorithms can use scores normalized between 0 and 1 to rank a datum as more normal (e.g., closer to 0) or anomalous (e.g., closer to 1).
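The idea of training only on “normal” data and scoring between 0 and 1 can be illustrated with the following Python sketch. It is a deliberately simple stand-in and not the isolation forest or autoencoder models discussed elsewhere in this description: it assumes the score is the fraction of a query's character 3-grams that were never seen in normal traffic. The class name and the sample names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return every contiguous n-character substring of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramAnomalyScorer:
    """Toy one-class scorer trained only on names considered normal."""

    def __init__(self, n=3):
        self.n = n
        self.known = set()

    def fit(self, normal_names):
        # Remember every n-gram observed in the normal training names.
        for name in normal_names:
            self.known.update(char_ngrams(name.lower(), self.n))
        return self

    def score(self, name):
        # 0.0 means every n-gram was seen in training; 1.0 means none were.
        grams = char_ngrams(name.lower(), self.n)
        if not grams:
            return 0.0
        unseen = sum(1 for gram in grams if gram not in self.known)
        return unseen / len(grams)


# Hypothetical training names standing in for normal network traffic.
normal = ["www.example.com", "mail.example.com", "api.example.org"]
scorer = NgramAnomalyScorer().fit(normal)
normal_score = scorer.score("www.example.org")
tunnel_score = scorer.score("mzxw6ytboi.t.example.net")
```

In this sketch, a boundary (for example, a score of 0.5) would separate names treated as normal from names treated as outliers.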
String anomaly scores have been used in DNS systems previously, for instance, to detect Domain Generation Algorithms (DGA) (see, e.g., Cruciani et al. “Semi-supervised detection of Algorithmically Generated Domains using Neural Network-based Autoencoders”, https://pure.ulster.ac.uk/ws/portalfiles/portal/93270315/DGA_DetectionUsingAutoEncoders_accepted.pdf). However, in previous work, the focus has been to use the anomaly scores to create immediate alerts instead of combining them with other features as input to train a machine learning model.
More specifically, instead of using content features, such as n-grams or embeddings, directly in our classifier/model implemented using a machine learning algorithm, we first train an anomaly detection algorithm on non-tunneled traffic (e.g., normal network traffic). Then, as part of our feature engineering pipeline, we apply the trained anomaly detection algorithm to both the positive and negative labeled data in our labeled datasets of tunnels and non-tunnels to generate anomaly scores, which can be smoothed using an average, median, and/or standard deviation. These scores are combined with other common metadata features to create a feature vector used for training and evaluation of the classifier implemented using machine learning algorithms. For example, an anomaly classifier can be implemented using embeddings and distance vectors, such as with an isolation forest or an autoencoder/Convolutional Neural Network (CNN) and loss function, and/or other ML techniques can similarly be implemented. The resulting classifier for DNST detection can, for example, be used in combination with one or more other classifiers to facilitate an effective and efficient DNST detection solution.
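As a minimal sketch of the feature engineering step described above, the following Python code smooths a set of per-query anomaly scores and appends metadata features to form a feature vector. The function names, the choice of population standard deviation, and the sample values are assumptions made for illustration.

```python
import statistics


def summarize_scores(scores):
    """Smooth per-query anomaly scores into mean, median, and std deviation."""
    if not scores:
        return [0.0, 0.0, 0.0]
    mean = statistics.fmean(scores)
    median = statistics.median(scores)
    # Population standard deviation is used so a single score yields 0.0.
    std = statistics.pstdev(scores)
    return [mean, median, std]


def build_feature_vector(anomaly_scores, metadata_features):
    """Concatenate smoothed anomaly scores with other metadata features."""
    return summarize_scores(anomaly_scores) + list(metadata_features)


# Hypothetical values: three per-query scores and three metadata features.
vector = build_feature_vector([0.2, 0.4, 0.9], [12, 31.5, 3])
```

The same function would be applied to both the positive and negative labeled data so that the downstream classifier sees feature vectors of identical shape.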
Based on our experiments, most DNS Tunneling (DNST) traffic appears different enough to facilitate an effective separation of the data. For example, two empirical trials that we performed used ML classifiers/models based on an isolation forest with n-grams (e.g., character 3-grams) and on a character-level convolutional neural network (CNN) autoencoder. Specifically, the disclosed stacked machine learning techniques for DNST detection effectively classify the DNS query traffic as normal or not normal (e.g., not normal being associated with likely DNST traffic activity based on the ML-based classification of extracted n-grams from the strings of the DNS query traffic, such as further described herein).
Referring to
Also, a labeled dataset is provided as input as shown at 106. In this example implementation, the labeled dataset includes a dataset of DNS queries that are labeled as DNST or not DNST (e.g., based on prior analysis and/or manual labeling of the dataset of DNS queries by a DNS security expert).
The NLP anomaly training algorithm and the labeled dataset are then provided as input to the DNST anomaly model as shown at 108. In an example implementation, the DNST anomaly model (e.g., an ML-based classifier that performs NLP anomaly classification based on a score of the DNS query traffic between 0 (normal/not likely DNST traffic) and 1 (not normal/likely DNST traffic)) can be implemented using a CNN autoencoder with a mean squared error (MSE) loss function. Other anomaly detection algorithms can similarly be used to generate DNS anomaly scores, such as long short-term memory (LSTM) autoencoders, sequence-to-sequence models, or one-class support vector machines.
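For an autoencoder-based anomaly model, the reconstruction error can serve as the raw anomaly measure. The following Python sketch computes an MSE between an encoded input and its reconstruction and maps it into the range 0 to 1. The min-max scaling against error bounds observed on normal training data is an assumption made for this example, as are the function names; the autoencoder itself is not shown.

```python
def mse(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def normalize_error(error, min_error, max_error):
    """Scale an error into [0, 1] using bounds observed on normal data."""
    if max_error <= min_error:
        return 0.0
    scaled = (error - min_error) / (max_error - min_error)
    # Clip so errors beyond the observed bounds still map into [0, 1].
    return min(1.0, max(0.0, scaled))


# Hypothetical encoded input and a poor reconstruction of it.
error = mse([1.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.5])
score = normalize_error(error, min_error=0.0, max_error=0.25)
```

A name that the autoencoder reconstructs well yields a score near 0, while a name it reconstructs poorly yields a score near 1.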
The DNST anomaly model processes the NLP anomaly training data to generate an NLP anomaly score as shown at 110.
Also, the labeled dataset (e.g., tunneled/not tunneled DNS queries) is used to extract a set of metadata features as shown at 112. Example metadata features include the number of unique queries in a given time period, the mean length of query names, the number of unique answers, and the DNS query time.
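The metadata features listed above can be computed from DNS query-response log records, for example, as in the following Python sketch. The record fields, the function name, and the interpretation of the DNS query time as a response time in milliseconds are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class DnsLogRecord:
    query_name: str       # FQDN that was queried
    answer: str           # answer data returned, as text
    query_time_ms: float  # assumed meaning: time taken to answer the query


def extract_metadata_features(records):
    """Summarize the log records observed in one time period."""
    if not records:
        return {"unique_queries": 0, "mean_query_length": 0.0,
                "unique_answers": 0, "mean_query_time_ms": 0.0}
    names = [record.query_name for record in records]
    return {
        "unique_queries": len(set(names)),
        "mean_query_length": fmean([len(name) for name in names]),
        "unique_answers": len({record.answer for record in records}),
        "mean_query_time_ms": fmean([r.query_time_ms for r in records]),
    }


# Hypothetical records for a single domain within one time period.
window = [
    DnsLogRecord("a1b2.t.example.com", "10.0.0.1", 12.0),
    DnsLogRecord("c3d4.t.example.com", "10.0.0.2", 18.0),
    DnsLogRecord("a1b2.t.example.com", "10.0.0.1", 15.0),
]
features = extract_metadata_features(window)
```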
The results of the NLP anomaly score (110) and the extracted set of metadata features (112) are provided as input into the DNS Tunneling (DNST) model that is generated at 114. In an example implementation, the DNST model can be implemented using a random forest. Other models, such as logistic regression, neural networks, or support vector machines, can be employed as well.
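The stacking of the NLP anomaly score with metadata features can be sketched in Python as follows. To keep the example self-contained, it uses a small logistic regression trained by gradient descent (one of the alternative models mentioned above) rather than a random forest. The training rows, the scaling of the features, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and a bias with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = predicted - y
            weights = [w - learning_rate * error * v
                       for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def dnst_score(model, x):
    """Return a score in (0, 1); higher means more likely DNS tunneling."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Each row: [NLP anomaly score, mean query length / 100] (illustrative only).
rows = [[0.05, 0.15], [0.10, 0.20], [0.15, 0.18],
        [0.80, 0.60], [0.90, 0.75], [0.70, 0.55]]
labels = [0, 0, 0, 1, 1, 1]  # 0 = not tunneled, 1 = tunneled
model = train_logistic(rows, labels)
```

In this arrangement, the anomaly model acts as the first stage and its output is simply one more input column for the second-stage DNST model.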
As such, the disclosed techniques for providing a DNS tunneling training pipeline with NLP anomaly scoring can be implemented using stacked machine learning as described above with respect to
Specifically,
Referring to
At 404, the DNST anomaly model processes the NLP anomaly training data to generate an NLP anomaly score as shown at 406.
Also, the labeled dataset (e.g., tunneled/not tunneled DNS queries) is used to extract a set of metadata features as shown at 408. For example, a feature can be a score that is based on a count of how many times a given n-gram extracted from the labeled dataset of tunneled/not tunneled DNS queries is associated with tunneled DNS queries or not tunneled DNS queries.
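One possible reading of the count-based feature described above is sketched below in Python. It assumes the score for a query is the share of its n-gram observations that came from tunneled queries in the labeled dataset; this formula, the neutral value of 0.5 for unseen n-grams, and the function names are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every contiguous n-character substring of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_by_label(queries, labels, n=3):
    """Count, per n-gram, how many tunneled and normal queries contain it."""
    tunneled, normal = Counter(), Counter()
    for query, is_tunnel in zip(queries, labels):
        target = tunneled if is_tunnel else normal
        target.update(set(char_ngrams(query, n)))
    return tunneled, normal


def association_score(query, tunneled, normal, n=3):
    """Share of the query's n-gram observations seen in tunneled queries."""
    grams = char_ngrams(query, n)
    tunnel_hits = sum(tunneled[gram] for gram in grams)
    normal_hits = sum(normal[gram] for gram in grams)
    if tunnel_hits + normal_hits == 0:
        return 0.5  # no evidence either way
    return tunnel_hits / (tunnel_hits + normal_hits)


# Hypothetical labeled dataset with one normal and one tunneled query.
queries = ["www.example.com", "mzxw6ytboi.t.example.net"]
labels = [False, True]
tunneled, normal = count_by_label(queries, labels)
```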
The results of the NLP anomaly score (406) and the extracted set of metadata features (408) are provided as input into the DNS Tunneling (DNST) model as shown at 410. In an example implementation, the DNST model can be implemented using a random forest. Examples of such metadata features can include the number of unique queries in a given time period, the mean and standard deviation of the length of query names, the number of unique answers, and the DNS query time.
The DNST model is used to generate a DNS Tunneling (DNST) Score 412 for a given DNS query input, which can be performed in a streaming and/or batch implementation, such as further described below.
In an example implementation, this pipeline can be placed within, for example, a DNS Behavioral Observation Streaming System (DBOSS), which is a DNS security framework in which one specifies the transforms used to create the features for making observations of particular DNS behaviors and an evaluation criterion. In this example implementation, the NLP anomaly score provides another transform, and the evaluation criterion is whether the DNS Tunneling (DNST) Score is above a predetermined threshold (e.g., a threshold value of 0.5; another threshold value can similarly be used for the DNST score). If the decision criterion for a domain is met, then an action can be performed based on a policy (e.g., a DNS security policy). For example, the domain can be automatically blocked and logged as having potential DNS tunneling behavior and/or flagged for secondary checks (e.g., and/or other actions can be performed, such as adding the domain to a block list and/or a threat feed, reporting the domain, quarantining the domain, automatically generating a new DNS signature for a domain, wherein the domain is associated with the detected DNS tunneling activity, etc.).
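The threshold evaluation and policy-based action described above can be sketched in Python as follows. The Action names and the block_on_detect policy flag are hypothetical and stand in for whatever a given DNS security policy specifies.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    FLAG = "flag_for_secondary_check"
    BLOCK = "block_and_log"


def evaluate_domain(dnst_score, threshold=0.5, block_on_detect=True):
    """Choose an action for a domain based on its DNST score and a policy."""
    # The criterion is met only when the score is strictly above the threshold.
    if dnst_score > threshold:
        return Action.BLOCK if block_on_detect else Action.FLAG
    return Action.ALLOW
```

For example, with the default threshold of 0.5, a domain with a DNST score of 0.9 would be blocked and logged, or flagged for secondary checks if the policy does not call for automatic blocking.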
Example process embodiments for applying natural language processing (NLP) as features for DNS tunneling detection will now be further described below.
At 502, aggregating DNS traffic from one or more networks is performed, such as similarly described above with respect to
At 504, automatically classifying the aggregated DNS traffic to detect DNS tunneling activity is performed, such as similarly described above with respect to
At 506, performing an action based on the detected DNS tunneling activity based on a policy is performed, such as similarly described above with respect to
At 602, labeled DNS traffic is input, such as similarly described above with respect to
At 604, the DNS traffic is automatically classified to detect DNS tunneling activity using a DNS tunneling (DNST) anomaly model based on natural language processing (NLP), such as similarly described above with respect to
At 606, an NLP anomaly score for the DNS traffic is generated and provided as input along with extracted metadata features to a DNS tunneling (DNST) model, such as similarly described above with respect to
At 608, the DNST model generates a DNS Tunneling (DNST) Score, such as similarly described above with respect to
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 63/538,591 entitled APPLYING NATURAL LANGUAGE PROCESSING ANOMALY MEASURES AS FEATURES FOR DNS TUNNELING DETECTION filed Sep. 15, 2023, which is incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
63538591 | Sep 2023 | US