Various examples are described herein that relate to intrusion deterrence for multiple node computing systems.
Data centers provide vast processing, storage, and networking resources to users. For example, client devices can leverage data centers to perform image processing, computation, data storage, and data retrieval. A client device can be, for example, a smart phone, an Internet-of-Things (IoT) compatible device, a smart home, a building appliance (e.g., refrigerator, light, camera, or lock), a wearable device (e.g., health monitor, smart watch, or smart glasses), a connected vehicle (e.g., self-driving car or flying vehicle), or a smart city sensor (e.g., traffic sensor, parking sensor, or energy use sensor). Data and platform security are needed to prevent intrusion into data centers and computing devices that could cause device failures, theft of personal information, unauthorized access to data, and other disruptive or illegal activities.
Generally speaking, there are two categories of intrusion detection, namely, signature-based and anomaly-based. Signature-based techniques attempt to secure computing systems against known patterns of attacks by recognizing attacks using pattern-matching algorithms and comparing network traffic with a library of attack signatures. However, signature-based intrusion detection techniques are not able to identify new and unknown attacks as soon as they occur.
Anomaly-based intrusion techniques can be used to identify intrusions to a system based on deviations from the system's normal behavior. Anomaly-based intrusion techniques build a model of normal behavior and automatically classify statistically significant deviations from normal behavior as being abnormal. Using this technique makes it possible to detect new attacks, but there can be a high rate of false positive alarms generated when the knowledge collected about normal behavior is inaccurate.
Resiliency, by definition, involves software and hardware components tolerating possible successful attacks, misconfigurations, failures, faults, and so on. To attempt to provide resiliency, several methods are available, namely, redundant operating stations with hardware or software result comparisons, distributed recovery block with an acceptance test, triple modular voting and redundant computing stations, as well as N-version programming where different versions are created and executed. Furthermore, Moving Target Defense (MTD), address space randomization, instruction set randomization, and data randomization techniques have been applied.
As High Performance Computing (HPC) moves to the cloud environment, cybersecurity against unauthorized intrusion is a challenge due to the integration of computer networks, virtualization, multi-tenant occupancy, remote storage, and so forth. According to some embodiments, computing environments (e.g., HPC fabrics) can provide for executing duplicate processes on different platforms such that the duplicate processes perform the same functions but using different programming languages and different platform software (e.g., different operating systems). Results (e.g., latency and computational results) provided from multiple duplicate processes can be compared against each other and against an expected result. For example, results can include one or more of: a computation result value or values, how much memory is used, central processing unit (CPU) or core utilization, input/output utilization, a secure shell (SSH) key from nodes (e.g., Partition Key (PKey)). Any anomalous result can be attributed to an intrusion, and the platform that produced it can be disabled. In some cases, the anomalous system can potentially be turned off or disconnected from the other nodes.
At a time interval or at pseudo-random intervals, the software platform is altered and changed to a different software platform. For example, a platform executing Linux operating system is changed to run Microsoft Windows Server, another platform executing Microsoft Windows Server can be changed to run UNIX, and so forth. In addition, the processes running on each platform are modified to execute binaries based on a different programming language. For example, a platform executing a Java-based process can instead execute a C++ based process. In some cases, the change in platform software and programming language can be selected pseudo-randomly so that any attempted change in platform software or programming language will not necessarily yield a change.
If an attacker gains any information about the vulnerability of one system, after a platform software or programming language change, the existing vulnerabilities may no longer exist. Additionally, with redundancy of processes (e.g., duplicate processes), even if a system fails or is compromised, the processes continue operating and can perform workload requests from clients, other devices, or processes.
A trusted area can be a region of memory or a processor or both that are not connected to the Internet or a network and not accessible by other processes except for FM 120. For example, a trusted area can be a secure enclave or an Intel Software Guard Extensions (SGX) allocated enclave. For example, the trusted area can store diversity level 132, redundancy level 134, and shuffling rate 136. Controller process 122 can access information in diversity level 132, redundancy level 134, and shuffling rate 136 to determine when and how to modify a software environment in any of nodes 130-0 to 130-n.
Diversity level 132, redundancy level 134, and shuffling rate 136 specify, respectively, differences in versions of the applications and operating platforms, a number of nodes to perform a same or similar workload, and a frequency at which each execution environment will be potentially modified. Using Application Resilient Editor (ARE), a user or administrator can define content of the diversity level 132, redundancy level 134, and shuffling rate 136. Based on the specified configurations, controller process 122 can configure the environment with the redundancy level of nodes and the parameter change frequency. Note that shuffling rate 136 need not be a consistent period and can instead use pseudo-randomly selected time intervals.
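As an illustration only, the three settings could be held in a small configuration record such as the following minimal sketch. The class name `ResilienceConfig`, its field names, the interpretation of the diversity level as a count of distinct software variants, and the uniform draw between a minimum and maximum interval are all assumptions made for this example and are not specified by the embodiments.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceConfig:
    """Hypothetical record for the settings kept in the trusted area."""

    diversity_level: int    # assumed: number of distinct software variants in use
    redundancy_level: int   # number of nodes that perform the same workload
    shuffle_min_s: float    # assumed: shortest interval between shuffles, seconds
    shuffle_max_s: float    # assumed: longest interval between shuffles, seconds

    def __post_init__(self) -> None:
        if self.diversity_level < 1 or self.redundancy_level < 1:
            raise ValueError("levels must be at least 1")
        if not 0 < self.shuffle_min_s <= self.shuffle_max_s:
            raise ValueError("shuffle interval bounds are invalid")

    def next_shuffle_delay(self, rng: random.Random) -> float:
        """Pick a pseudo-random delay so the shuffling period is not constant."""
        return rng.uniform(self.shuffle_min_s, self.shuffle_max_s)


config = ResilienceConfig(diversity_level=3, redundancy_level=4,
                          shuffle_min_s=30.0, shuffle_max_s=90.0)
delay = config.next_shuffle_delay(random.Random(7))
```

Setting `shuffle_min_s` equal to `shuffle_max_s` yields a fixed period, while a wider range yields the pseudo-random intervals mentioned above.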
Controller process 122 can access fabric manager data to distribute workloads and jobs to two or more nodes among nodes 130-0 to 130-n using a network, interconnect, or fabric. In one example, controller 122 causes at least two of nodes 130-0 to 130-n to execute different software platforms, accept different programming languages, or operate at different performance requirements. Examples of software platforms include operating systems (e.g., Windows, Linux, iOS, MacOS, any other operating system, including different version numbers of the same operating system), virtual machine, file system. Examples of programming languages include C, C++, Java, Python, JavaScript, and any other computing language. Examples of performance requirements include one or more of: CPU clock speed, GPU clock speed, memory allocation, storage allocation, or network interface transmit and receive rates.
For example, controller 122 can direct node 130-0 to execute a Windows Server Operating System and accept applications written in Java whereas controller 122 can direct node 130-1 to execute a Linux operating system and accept applications written in C. Controller 122 can dispatch the same workload in compiled format, with one version written in Java and the other written in C, to respective nodes 130-0 and 130-1. Controller 122 can communicate with other nodes (and vice versa) using a Partition Key (PKey) (e.g., Omni-Path PKey) that protects against undesirable communication between nodes and against spoofing of the controller.
After a workload is submitted by controller 122 to two or more of nodes 130-0 to 130-n, nodes will perform the workload and provide results that are accessible to controller 122. Controller 122 can collect the results and apply a voting mechanism technique to identify any anomalous node. For example, controller 122 can review workload results by a main node and redundant nodes, compare workload results, and if a majority of results are the same, then any different result is considered to be an anomaly. In some examples, a majority of results can occur when the most common result is shared by more nodes than any other result, even though fewer than half of the nodes provide that result. For example, if 10 nodes provide results, 4 of the nodes provide the same result, and the other 6 nodes provide different results, the results from the 4 nodes can be considered the majority. In other examples, a majority can occur when a majority of nodes provide the same result. For example, if 10 nodes provide results, a majority occurs when 6 of the nodes provide the same result. For example, results can include one or more of: a computation result value or values, how much memory is used, central processing unit (CPU) or core utilization, input/output utilization, a secure shell (SSH) key from nodes (e.g., partition key (PKey)). If there is no majority of results, then controller 122 can consider all nodes that performed the workload and provided the results to be compromised.
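The voting step described above can be sketched as follows. This is a minimal illustration, not the controller's actual implementation: the function name `vote`, the `strict_majority` flag, and the rule that a tie for the most common result counts as "no majority" are assumptions introduced here. Results are assumed to be hashable values (e.g., a number or a tuple of metrics).

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Set, Tuple


def vote(results: Dict[str, Hashable],
         strict_majority: bool = False) -> Tuple[Optional[Hashable], Set[str]]:
    """Return (winning_result, anomalous_node_ids) for one workload.

    strict_majority=False: the most common result wins if no other result
    is equally common (the "4 of 10" example above).
    strict_majority=True: the winner must come from more than half of the nodes.
    If there is no winner, every reporting node is treated as compromised.
    """
    if not results:
        return None, set()
    ranked = Counter(results.values()).most_common()
    top_value, top_count = ranked[0]
    if strict_majority:
        has_winner = top_count > len(results) / 2
    else:
        has_winner = len(ranked) == 1 or ranked[1][1] < top_count
    if not has_winner:
        return None, set(results)
    anomalous = {node for node, value in results.items() if value != top_value}
    return top_value, anomalous


# Ten nodes: four agree on 42, the other six each report a distinct value.
example = {f"node-{i}": 42 for i in range(4)}
example.update({f"node-{i}": 100 + i for i in range(4, 10)})
winner, suspects = vote(example)
strict_winner, strict_suspects = vote(example, strict_majority=True)
```

With the looser rule the four agreeing nodes win and the other six are flagged; with the strict rule there is no winner, so all ten nodes are flagged.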
In addition, times to complete a workload can be compared against one another to determine if any node took too long to complete a workload. If a result takes longer than expected to be received from a node, the behavior of the node can be considered abnormal. Results and latency of operation (e.g., time to complete a workload) from nodes can be stored and compared against most recently received results and latency to determine if any node exhibits abnormal behavior compared against one or more prior results or latency of operation. In some cases, if a result or latency is sufficiently different than one or more prior results or latency of operation for a same or similar prior workload, any node that provided sufficiently different workload results or exhibited sufficiently different latency can be considered compromised even if a majority of nodes demonstrated the same results or latency.
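One simple way to compare completion times across nodes is to flag any node whose latency exceeds a multiple of the median latency. The sketch below assumes that approach; the function name `slow_nodes` and the default factor of 1.5 are illustrative choices and are not taken from the embodiments.

```python
from statistics import median
from typing import Dict, Set


def slow_nodes(latencies_s: Dict[str, float], factor: float = 1.5) -> Set[str]:
    """Return nodes whose completion time exceeds factor * median of all nodes."""
    if not latencies_s:
        return set()
    threshold = factor * median(latencies_s.values())
    return {node for node, seconds in latencies_s.items() if seconds > threshold}


observed = {"node-0": 1.0, "node-1": 1.1, "node-2": 0.9, "node-3": 3.0}
flagged = slow_nodes(observed)
```

The median is used rather than the mean so that a single very slow node does not raise the threshold enough to hide itself.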
Controller 122 can disconnect and deactivate any associated node that provided the anomalous result or latency. Controller 122 can associate an anomalous result or latency with unauthorized intrusion. Deactivating the node can potentially prevent an unauthorized intruder from compromising the system as a whole or other nodes.
Anomaly detector 150 can define normal and abnormal behavior and update the ruleset as needed to modify intrusion detection ability.
The gathered historic behavior data (e.g., results or latency data) can be used to identify whether behavior is considered normal or abnormal. Feature extractions 204 extracts the monitored features and retrieves the closest case from database 210 (e.g., same job, same programming language, same software platform), and state metric block 206 compares the results and latency. If the analyzed data is within 1% of the closest case in terms of latency and is an exact match in result, the node is considered to behave normally. If the analyzed data is not within 1% of the closest case in terms of latency or is not an exact match in result, the data can be considered inconsistent with historical results and the node is considered to behave abnormally. Percentages other than 1% can be used. An analysis 208 is provided indicating abnormal behavior or normal behavior. The anomaly detector indicates to the controller whether a node is considered to behave normally or abnormally.
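The comparison against historical behavior can be sketched as below. This is a simplified illustration: the names `Observation`, `is_normal`, and `classify` are hypothetical, and the "closest case" lookup is reduced to an exact match on a (job, programming language, software platform) key, which is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Optional, Tuple

CaseKey = Tuple[str, str, str]  # assumed key: (job, programming language, platform)


@dataclass(frozen=True)
class Observation:
    result: Hashable
    latency_s: float


def is_normal(current: Observation, baseline: Observation,
              tolerance: float = 0.01) -> bool:
    """Normal means an exact result match and latency within the tolerance."""
    if current.result != baseline.result:
        return False
    allowed = tolerance * baseline.latency_s
    return abs(current.latency_s - baseline.latency_s) <= allowed


def classify(key: CaseKey, current: Observation,
             history: Dict[CaseKey, Observation],
             tolerance: float = 0.01) -> Optional[bool]:
    """Return True (normal), False (abnormal), or None if no prior case exists."""
    baseline = history.get(key)
    if baseline is None:
        return None
    return is_normal(current, baseline, tolerance)


history = {("job-a", "Java", "Linux"): Observation(result=42, latency_s=10.0)}
key = ("job-a", "Java", "Linux")
normal = classify(key, Observation(42, 10.05), history)      # 0.5% slower
too_slow = classify(key, Observation(42, 10.2), history)     # 2% slower
wrong_result = classify(key, Observation(41, 10.0), history)
```

Passing a different `tolerance` corresponds to using a percentage other than 1%.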
Referring back to
Controller 122 can execute a behavior obfuscation system to attempt to create confusion that increases the time an attacker would need to understand the operating parameters of a system of connected nodes. The behavior obfuscation system can dynamically change the execution environment of each node by modifying the operating system, applying different performance parameters of platforms, and applying different file systems (e.g., FAT, NTFS, ZFS, Ext, and so forth). The behavior obfuscation system can potentially confuse an attacker so that the attacker fails to generate the attack required for the existing vulnerabilities because the system behavior dynamically changes. Controller 122 can apply the pseudocode below to perform the behavior obfuscation system.
When the prescribed shuffling rate triggers a modification of the redundant nodes, the pseudocode provides for pseudo-randomly selecting a node among the redundant nodes and applying a version of an application coding language (e.g., Java, C, C++, Python, and so forth) and platform software (e.g., operating system and file system) as well as operating parameters (e.g., clock speeds, allocated memory, peak network interface bandwidth, and so forth). According to the pseudocode, the applied version can be changed for each available node at the shuffling rate. An available node can be a node that has not been disconnected for any reason such as anomaly detection from its workload processing. However, the applied version can be selected pseudo-randomly from available versions so that in some cases, a node can run the same version even after shuffling. In some examples, when controller 122 is running on the node that executes the FM 120, Omni-Path “sweep time” can be the shuffling rate that controller 122 uses to perform the behavior obfuscation system.
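A minimal sketch of one shuffling pass, under stated assumptions, is given here for illustration. The `Variant` record, the function name `shuffle_nodes`, and the representation of node assignments as a dictionary are hypothetical; the sketch only shows pseudo-random selection of a version for each available node, with disconnected nodes skipped.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Variant:
    """Hypothetical bundle of application language and platform software."""

    language: str
    operating_system: str
    file_system: str


def shuffle_nodes(assignments: Dict[str, Variant],
                  variants: List[Variant],
                  disconnected: Set[str],
                  rng: random.Random) -> Dict[str, Variant]:
    """Return new assignments after one shuffle.

    Each available node receives a pseudo-randomly chosen variant, so a node
    may end up with the same variant it already had. Disconnected nodes are
    left untouched.
    """
    updated = dict(assignments)
    for node in assignments:
        if node in disconnected:
            continue
        updated[node] = rng.choice(variants)
    return updated


variants = [
    Variant("Java", "Linux", "Ext"),
    Variant("C", "Windows Server", "NTFS"),
    Variant("Python", "UNIX", "ZFS"),
]
current = {"node-0": variants[0], "node-1": variants[1], "node-2": variants[2]}
shuffled = shuffle_nodes(current, variants, disconnected={"node-2"},
                         rng=random.Random(2024))
```

A controller would call such a function each time the shuffling interval elapses and then apply the returned assignments to the nodes.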
In some embodiments, shuffling of node characteristics can also cause a controller 122 to move to another node. For example, at sweep time, FM 120 can instantiate the controller 122 on another node so that the controller location can change and intrusion of the controller can be more difficult.
At 406, processes can be allocated on the redundant nodes. For example, a process can be a workload or transaction to be performed using any compute resource of a node. Redundant nodes can perform the same functions based on the processes but using different platform parameters. However, in some cases, at least some but not all of the redundant nodes can use the same platform parameters. At 408, results are received from the processes. The processes can be performed on the redundant nodes. At 410, the results are analyzed to determine whether an anomaly is detected. For example, if a node provides a result that differs from the result provided by a majority of other nodes, the node can be considered to provide an anomalous result. For example, a result can include one or more of: a computation result value or values, how much memory is used, central processing unit (CPU) or core utilization, input/output utilization, a secure shell (SSH) key from nodes (e.g., partition key (PKey)). For example, if a time to complete the process and provide a result is markedly different than times to complete the process by other nodes, then the result (and associated node) can be considered anomalous. In some cases, the result or its latency can be compared against prior results or latencies from performance of similar or the same process, and an anomaly can be identified from substantial differences regardless of whether a majority of the same or similar results or latency was found. If an anomaly is detected, then 412 can follow.
At 412, the controller causes any node with an anomalous result to be deactivated. Accordingly, the controller will not use those deactivated nodes to perform workloads or communicate with those deactivated nodes.
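The sequence of 406 through 412 can be sketched as a single round, as in the following illustration. The function `run_round` and the `run_on_node` callback are hypothetical stand-ins for dispatching a process to a node and receiving its result, and the sketch assumes the stricter rule in which more than half of the nodes must agree.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Set


def run_round(active: Set[str],
              run_on_node: Callable[[str], Hashable]) -> Set[str]:
    """Run one workload on all active nodes and return the nodes kept active."""
    # 406 and 408: allocate the process on each node and collect its result.
    results: Dict[str, Hashable] = {n: run_on_node(n) for n in sorted(active)}
    if not results:
        return set()
    # 410: look for a result shared by more than half of the nodes.
    top_value, top_count = Counter(results.values()).most_common(1)[0]
    if top_count * 2 <= len(results):
        anomalous = set(results)          # no majority: distrust every node
    else:
        anomalous = {n for n, r in results.items() if r != top_value}
    # 412: deactivated nodes are excluded from later rounds.
    return active - anomalous


def fake_node(node: str) -> int:
    """Stand-in workload: node-2 returns a tampered value."""
    return 9 if node == "node-2" else 7


still_active = run_round({"node-0", "node-1", "node-2"}, fake_node)
```

Feeding the returned set into the next call means a deactivated node receives no further workloads.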
Referring again to 410, if an anomaly is not detected, then 420 of
At 454, the platform software and process characteristics selected using 452 are applied to the nodes. In some cases, the controller can be migrated to a different node along with potential changes to software and process characteristics.
Various embodiments were tested against different attacks for their ability to continue to operate normally under successful attacks. Hydra (i.e., brute force password cracking software) and HPing3 (i.e., a networking tool for sending custom TCP/IP packets for security auditing) were used to attack the system. When embodiments are not applied, a successful attack (from either an insider or an outsider) causes the system to fail completely or to slow down greatly.
In one example, system 600 includes interface 612 coupled to processor 610, which can represent a higher speed interface or a high throughput interface for system components that need higher bandwidth connections, such as memory subsystem 620 or graphics interface components 640. Interface 612 represents an interface circuit, which can be a standalone component or integrated onto a processor die. Where present, graphics interface 640 interfaces to graphics components for providing a visual display to a user of system 600. In one example, graphics interface 640 can drive a high definition (HD) display that provides an output to a user. High definition can refer to a display having a pixel density of approximately 100 PPI (pixels per inch) or greater and can include formats such as full HD (e.g., 1080p), retina displays, 4K (ultra-high definition or UHD), or others. In one example, the display can include a touchscreen display. In one example, graphics interface 640 generates a display based on data stored in memory 630 or based on operations executed by processor 610 or both.
Memory subsystem 620 represents the main memory of system 600 and provides storage for code to be executed by processor 610, or data values to be used in executing a routine. Memory subsystem 620 can include one or more memory devices 630 such as read-only memory (ROM), flash memory, one or more varieties of random access memory (RAM) such as DRAM, or other memory devices, or a combination of such devices. Memory 630 stores and hosts, among other things, operating system (OS) 632 to provide a software platform for execution of instructions in system 600. Additionally, applications 634 can execute on the software platform of OS 632 from memory 630. Applications 634 represent programs that have their own operational logic to perform execution of one or more functions. Processes 636 represent agents or routines that provide auxiliary functions to OS 632 or one or more applications 634 or a combination. OS 632, applications 634, and processes 636 provide software logic to provide functions for system 600. In one example, memory subsystem 620 includes memory controller 622, which is a memory controller to generate and issue commands to memory 630. It will be understood that memory controller 622 could be a physical part of processor 610 or a physical part of interface 612. For example, memory controller 622 can be an integrated memory controller, integrated onto a circuit with processor 610.
While not specifically illustrated, it will be understood that system 600 can include one or more buses or bus systems between devices, such as a memory bus, a graphics bus, interface buses, or others. Buses or other signal lines can communicatively or electrically couple components together, or both communicatively and electrically couple the components. Buses can include physical communication lines, point-to-point connections, bridges, adapters, controllers, or other circuitry or a combination. Buses can include, for example, one or more of a system bus, a Peripheral Component Interconnect (PCI) bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus.
In one example, system 600 includes interface 614, which can be coupled to interface 612. In one example, interface 614 represents an interface circuit, which can include standalone components and integrated circuitry. In one example, multiple user interface components or peripheral components, or both, couple to interface 614. Network interface 650 provides system 600 the ability to communicate with remote devices (e.g., servers or other computing devices) over one or more networks. Network interface 650 can include an Ethernet adapter, wireless interconnection components, cellular network interconnection components, USB (universal serial bus), or other wired or wireless standards-based or proprietary interfaces. Network interface 650 can transmit data to a remote device, which can include sending data stored in memory. Network interface 650 can receive data from a remote device, which can include storing received data into memory.
In one example, system 600 includes one or more input/output (I/O) interface(s) 660. I/O interface 660 can include one or more interface components through which a user interacts with system 600 (e.g., audio, alphanumeric, tactile/touch, or other interfacing). Peripheral interface 670 can include any hardware interface not specifically mentioned above. Peripherals refer generally to devices that connect dependently to system 600. A dependent connection is one where system 600 provides the software platform or hardware platform or both on which operation executes, and with which a user interacts.
In one example, system 600 includes storage subsystem 680 to store data in a nonvolatile manner. In one example, in certain system implementations, at least certain components of storage 680 can overlap with components of memory subsystem 620. Storage subsystem 680 includes storage device(s) 684, which can be or include any conventional medium for storing large amounts of data in a nonvolatile manner, such as one or more magnetic, solid state, or optical based disks, or a combination. Storage 684 holds code or instructions and data 686 in a persistent state (i.e., the value is retained despite interruption of power to system 600). Storage 684 can be generically considered to be a “memory,” although memory 630 is typically the executing or operating memory to provide instructions to processor 610. Whereas storage 684 is nonvolatile, memory 630 can include volatile memory (i.e., the value or state of the data is indeterminate if power is interrupted to system 600). In one example, storage subsystem 680 includes controller 682 to interface with storage 684. In one example controller 682 is a physical part of interface 614 or processor 610 or can include circuits or logic in both processor 610 and interface 614.
A power source (not depicted) provides power to the components of system 600. More specifically, the power source typically interfaces to one or multiple power supplies in system 600 to provide power to the components of system 600. In one example, the power supply includes an AC to DC (alternating current to direct current) adapter to plug into a wall outlet. Such AC power can come from a renewable energy (e.g., solar power) power source. In one example, the power source includes a DC power source, such as an external AC to DC converter. In one example, the power source or power supply includes wireless charging hardware to charge via proximity to a charging field. In one example, the power source can include an internal battery, alternating current supply, motion-based power supply, solar power supply, or fuel cell source.
In an example, system 600 can be implemented using interconnected compute sleds of processors, memories, storages, network interfaces, and other components. High speed interconnects can be used such as PCIe, Ethernet, or optical interconnects (or a combination thereof).
Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, PLDs, DSPs, FPGAs, memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation. It is noted that hardware, firmware and/or software elements may be collectively or individually referred to herein as “module,” “logic,” “circuit,” or “circuitry.”
Some examples may be implemented using or as an article of manufacture or at least one computer-readable medium. A computer-readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, API, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a machine, computing device or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within the processor, which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
The appearances of the phrase “one example” or “an example” are not necessarily all referring to the same example or embodiment. Any aspect described herein can be combined with any other aspect or similar aspect described herein, regardless of whether the aspects are described with respect to the same figure or element. Division, omission or inclusion of block functions depicted in the accompanying figures does not imply that the hardware components, circuits, software and/or elements for implementing these functions would necessarily be divided, omitted, or included in embodiments.
Some examples may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The terms “first,” “second,” and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. The term “asserted” used herein with reference to a signal denotes a state of the signal, in which the signal is active, and which can be achieved by applying any logic level, either logic 0 or logic 1, to the signal. The terms “follow” or “after” can refer to immediately following or following after some other event or events. Other sequences of steps may also be performed according to alternative embodiments. Furthermore, additional steps may be added or removed depending on the particular applications. Any combination of changes can be used and one of ordinary skill in the art with the benefit of this disclosure would understand the many variations, modifications, and alternative embodiments thereof.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase “at least one of X, Y, and Z,” unless specifically stated otherwise, should also be understood to mean X, Y, Z, or any combination thereof, including “X, Y, and/or Z.”