Recognizing security incidents in large-scale software systems that comprise a multitude of “layered” software services—in other words, software services that invoke each other in ordered sequences of caller-callee communication patterns—is a difficult task. Traditional approaches to security incident (i.e., intrusion) detection in such systems employ mechanisms that attempt to secure and monitor (1) the network perimeter of the system, (2) the physical servers hosting service instances, and (3) the point-to-point communications between caller and callee service instances. Examples of such mechanisms include network-level access control lists, user authentication and authorization, service-level inbound and outbound call restrictions, and caller authentication/authorization at callee service instances.
While these existing mechanisms are functional for their intended purposes, there are still certain types of security incidents which these mechanisms can fail to detect, either entirely or in a timely manner. For instance, consider a scenario in which an insider (i.e., an authorized user) installs malware on a service instance S1 in an intermediary service layer of a financial payments software system, where the malware is configured to issue application programming interface (API) calls to a callee service instance S2 for the malicious purpose of collecting user credit card information from a secured card vault. Assume that these API calls from S1 to S2 are typically invoked as part of a longer, valid call flow in the system (e.g., a client-initiated call flow for retrieving the client's saved credit card details from the card vault), and thus S1 has the requisite network/service permissions to communicate with S2. In this scenario, since the insider is authorized to access the system's servers, this attack will not trigger any detection mechanisms that are designed to recognize external threats (e.g., network perimeter defenses, user authentication/authorization, etc.). Further, since service instance S1 is authorized to issue the API calls to service instance S2 as part of the system's normal operation, service instance S2 will generally be unable to recognize service instance S1 as being compromised via conventional point-to-point controls/restrictions on caller-callee communications.
Techniques for implementing call flow-based anomaly detection in a layered software system are provided. According to one set of embodiments, a service instance in the layered software system can receive an invocation message indicating invocation of an API exposed by the service instance. The service instance can further create a log entry including information pertaining to the invocation of the API and a call flow tag, where the call flow tag includes an identifier of a call flow to which the invocation of the API belongs and an ordered series of one or more sub-identifiers indicating a position of the invocation within the call flow. The service instance can then write the log entry to a log store of the layered software system.
A further understanding of the nature and advantages of the embodiments disclosed herein can be realized by reference to the remaining portions of the specification and the attached drawings.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details, or can be practiced with modifications or equivalents thereof.
Embodiments of the present disclosure provide techniques for detecting anomalies in a layered software system (i.e., a software system comprising layered software services) based on call flows that are observed in the system. As used herein, a “call flow” comprises an ordered sequence of API calls that are invoked by the system's service instances in order to execute a task, such as a service request received from a client. In a relatively simple layered software system (or in the case of a relatively simple task), a call flow may be linear in nature; for example, service instance S1 may call API “A1” of service instance S2, which in turn may call API “A2” of service instance S3, which in turn may call API “A3” of service instance S4. In more complex systems and/or tasks, a call flow may exhibit a tree-like structure where one service instance invokes multiple APIs of one or more other service instances, each of which invokes multiple APIs of one or more yet other service instances, and so on.
In various embodiments, the techniques of the present disclosure include collecting data regarding call flows that are executed within a layered software system and analyzing the collected call flow data to determine, among other things, whether the observed call flows are “allowed” flows—in other words, call flows that are recognized as being valid for the system. If the observed call flows are allowed flows, the layered software system can continue operating as normal. However, if any observed call flow is not an allowed flow, the layered software system can conclude that an anomaly (indicative of, e.g., a security incident or other issue) has been detected. The layered software system can then identify and take one or more actions for addressing the anomaly based on various criteria (e.g., the nature of the anomaly, the nature of the call flow, the nature of the system, etc.).
With these techniques, the layered software system can advantageously recognize and act upon certain types of security incidents—for example, attacks that are perpetrated by insiders and/or are difficult to detect via point-to-point caller-callee access control mechanisms—in a manner that is more robust and rapid than traditional intrusion detection approaches. Further, beyond security, these techniques can facilitate the detection of other types of issues that may be surfaced in call flow patterns, such as software bugs, service configuration errors, and regulatory compliance problems. The foregoing and other aspects of the present disclosure are described in further detail below.
Since software system 100 is a “layered” system, service instances 102(1)-(N) are generally configured to invoke each other according to ordered API call sequences (i.e., call flows) in order to carry out various tasks. For instance,
As shown in call flow pattern 200, service instance 102(1) can receive the service request from client 202, which includes an invocation of an API “A1” exposed by front-end service layer 204. In response, service instance 102(1) can execute API A1 and issue two downstream API calls to business logic layer 206: a first call of an API “A2” to service instance 102(2) and a second call of the same API A2 to service instance 102(3).
Upon receiving its API call, service instance 102(2) can execute API A2 and issue a downstream API call of an API “A3” to service instance 102(4) of data access service layer 208. Similarly, service instance 102(3) can execute API A2 and issue a downstream API call of an API “A4” to service instance 102(5) of data access service layer 208. Finally, service instances 102(4) and 102(5) can execute APIs A3 and A4 respectively without issuing any further downstream calls, thereby completing/fulfilling the service request. Although not shown in
As noted in the Background section, one challenge with managing a layered software system such as system 100 of
To address this issue and other similar issues, layered software system 100 of
Generally speaking, CF collectors 106(1)-(N) of service instances 102(1)-(N) and CF observers 110(1)-(M) can work in concert to detect anomalies in layered software system 100 (i.e., events indicating abnormal system activity/behavior, such as a security incident) based on the call flows of the system. A high-level flow of this call flow-based anomaly detection approach (flow 300) is depicted in
Concurrently with the operation of CF collector 106/service instance 102, each CF observer 110 can, on a continuous or periodic basis, retrieve a set of log entries from log store 108 that pertain to a particular call flow (block 306). Using these log entries, and in particular the call flow tags of the log entries, CF observer 110 can synthesize the structure of the call flow (i.e., the ordered sequence of API calls in the call flow) (block 308). For example, as part of block 308, CF observer 110 may create a call flow graph that is similar in appearance to call flow pattern 200 depicted in
Then, at block 310, CF observer 110 can perform an analysis to determine whether the synthesized call flow is an allowed flow (i.e., a call flow that is deemed to be valid for system 100). In one set of embodiments, the analysis at block 310 can involve comparing the synthesized call flow against a known group of allowed flows. In other embodiments, the analysis at block 310 can involve applying a set of manually-defined rules that codify the characteristics of an allowed flow. In yet other embodiments, the analysis at block 310 can involve providing the synthesized call flow as input to a machine learning model that has been trained (using, e.g., training data specific to system 100) to identify allowed flows and/or not-allowed flows.
Assuming that CF observer 110 determines the synthesized call flow is not an allowed flow, CF observer 110 can conclude that an anomaly has been detected and can identify one or more actions to take in response to the detected anomaly (block 312). These actions may vary depending on the type of the anomaly/call flow/system and can include, e.g., generating an alert for a service developer or system administrator, generating reporting data/statistics, modifying the behavior of one or more service instances 102(1)-(N), reversing transactions committed via the call flow, shutting down the entire system, and more. In cases where a developer or administrator reviews the call flow and determines that it is not in fact anomalous, this information can be fed back into the set of rules or machine learning model applied at the analysis step of block 310 in order to update that rule set or model.
Finally, at block 314, CF observer 110 can cause the identified action(s) to be enforced by communicating with the entities that are responsible for enforcement. CF observer 110 can thereafter return to block 306 in order to process additional log entries/call flows from log store 108.
With the high-level approach shown in
In this scenario, conventional intrusion detection solutions that rely on caller-callee controls/restrictions would not be able to detect this attack, since service instance 102(2) is authorized to invoke API A3 of service instance 102(4) as part of overall call flow pattern 200. However, since the shorter call flow of service instance 102(2) to 102(4) is not an allowed flow, the approach shown in
Second, in addition to detecting security incidents, the approach of
Additional details regarding the processing attributed to CF collectors 106(1)-(N) and CF observers 110(1)-(M) in
It should be appreciated that
Starting with block 402, CF collector 106 can receive a message indicating that API X has been called/invoked. In the case where API X is called by an upstream service instance or a client, this message can be a message or data packet that is transmitted by the upstream instance/client. In the case where API X is called by some piece of code that is resident on service instance 102, this message can be an inter-process or intra-process message.
At block 404, CF collector 106 can check whether the message received at block 402 includes a call flow tag for the invocation of API X. As mentioned previously, this call flow tag is a data structure that includes a call flow identifier indicating the call flow to which the API call belongs and an ordered series of sub-identifiers indicating the position of the API call within that call flow. In one set of embodiments, the call flow tag can exhibit the following format:
[call flow ID].[sub-ID 1].[sub-ID 2].[sub-ID 3]
In these embodiments, each “sub-ID” is an identifier of an API call that has been issued in the context of the call flow identified by “call flow ID” and is ordered in accordance with both its horizontal and vertical position in the call flow. The last “sub-ID” identifies the API call to which the overall call flow tag is associated. By way of example, consider the invocation of API A2 by service instance 102(1) to service instance 102(3) in call flow pattern 200 of
If CF collector 106 determines at block 404 that there is no call flow tag included in the message, CF collector 106 can conclude that the current invocation of API X is the first call in a new call flow and thus can generate a new call flow tag for this invocation (block 406). As part of this step, CF collector 106 can generate a new call flow identifier (e.g., a randomly generated number) and append a sub-identifier for the invocation of API X to the new call flow identifier.
Alternatively, if CF collector 106 determines at block 404 that there is a call flow tag included in the message, CF collector 106 can conclude that the current invocation of API X is part of an in-process call flow (as identified by the existing call flow tag). In this case, CF collector 106 can simply extract the existing call flow tag from the message (block 408).
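By way of illustration only, the decision made at blocks 404-408 can be sketched in Python as follows. The class name CallFlowTag, the function tag_for_invocation, the message key "call_flow_tag", the use of "1" as the first sub-identifier, and the 16-character hexadecimal call flow identifier are all assumptions made for this sketch; the disclosure does not prescribe any of them.

```python
import secrets
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CallFlowTag:
    flow_id: str               # identifier of the call flow
    sub_ids: Tuple[str, ...]   # ordered sub-identifiers; the last one names this call

    def __str__(self) -> str:
        # Serialize as [call flow ID].[sub-ID 1].[sub-ID 2]...
        return ".".join((self.flow_id, *self.sub_ids))

    @classmethod
    def parse(cls, text: str) -> "CallFlowTag":
        flow_id, *sub_ids = text.split(".")
        if not flow_id or not sub_ids or not all(sub_ids):
            raise ValueError(f"malformed call flow tag: {text!r}")
        return cls(flow_id, tuple(sub_ids))


def tag_for_invocation(message: dict) -> CallFlowTag:
    """Reuse the tag carried by the invocation message, or start a new call flow."""
    raw = message.get("call_flow_tag")  # hypothetical message field
    if raw is None:
        # No tag present: treat this invocation as the first call of a new flow.
        return CallFlowTag(flow_id=secrets.token_hex(8), sub_ids=("1",))
    # Tag present: the invocation belongs to an in-process call flow.
    return CallFlowTag.parse(raw)
```

In this sketch the tag is immutable, so a collector that later derives tags for downstream calls would create new tag values rather than modify the one it received.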
Upon either generating a new call flow tag or extracting an existing call flow tag, CF collector 106 can create a log entry for the invocation of API X based on the contents of the message received at block 402 (block 410). This log entry can include an identifier/name of service instance 102, an identifier/name of API X, the input parameters to API X specified by the caller entity, and the new/existing call flow tag. In certain embodiments, this log entry can also include other information, such as an identity/name of the caller entity, caller authentication information included in the invocation message, and so on.
CF collector 106 can then write the created log entry to log store 108 and allow service instance 102 to proceed with executing API X (blocks 412 and 414). In some embodiments, CF collector 106 may write the created log entry to a particular data structure in log store 108 that is associated with the call flow identifier included in the call flow tag (e.g., a call flow-specific log file, database table, directory, etc.). In this way, CF collector 106 can partition the log entries stored in log store 108 on a per-call flow basis.
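As a non-limiting example, the log entry of block 410 and the per-call flow partitioning of block 412 might be sketched in Python as shown below. The names LogEntry and InMemoryLogStore, the field names, and the use of JSON lines held in memory are assumptions made for illustration; an actual log store 108 could equally be a set of files, database tables, or directories.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class LogEntry:
    service_instance: str    # identifier/name of the service instance
    api: str                 # identifier/name of the invoked API
    params: Dict[str, Any]   # input parameters supplied by the caller
    call_flow_tag: str       # e.g. "abc.1.2"


class InMemoryLogStore:
    """Hypothetical stand-in for log store 108, partitioned by call flow identifier."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[str]] = defaultdict(list)

    def write(self, entry: LogEntry) -> None:
        # The call flow identifier is the portion of the tag before the first dot.
        flow_id = entry.call_flow_tag.split(".", 1)[0]
        self._partitions[flow_id].append(json.dumps(asdict(entry), sort_keys=True))

    def entries_for(self, flow_id: str) -> List[LogEntry]:
        # Return every entry recorded for one call flow, in write order.
        return [LogEntry(**json.loads(line)) for line in self._partitions.get(flow_id, [])]
```

Keying each partition on the call flow identifier allows an observer to retrieve all entries for one call flow with a single lookup instead of scanning the entire store.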
If the execution of API X by service instance 102 does not result in the issuance of any downstream API calls (block 416), workflow 400 can end. However, if the execution of API X does result in the issuance of at least one downstream API call, CF collector 106 can generate a revised version of either the new call flow tag generated at block 406 or the existing call flow tag extracted at block 408 that appends a new sub-identifier corresponding to the downstream call (block 418). Although not shown, if there are multiple downstream API calls, CF collector 106 can generate multiple revised call flow tags that build upon each other (e.g., revised version 1 is used as a basis for revised version 2, revised version 2 is used as a basis for revised version 3, etc.) in order to generate an appropriate tag for each call.
Finally, at block 420, CF collector 106 can include the revised call flow tag in a message for invoking the downstream API and can transmit/provide the message to the service instance that is the target of the invocation.
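The tag revision of block 418, including the case of multiple downstream calls whose tags build upon each other, can be sketched as follows. This Python sketch assumes, purely for illustration, that each sub-identifier is the 1-based index of a call among its sibling calls; the function name downstream_tags is likewise hypothetical.

```python
from typing import List


def downstream_tags(current_tag: str, num_calls: int) -> List[str]:
    """Derive one revised call flow tag per downstream API call."""
    tags: List[str] = []
    for _ in range(num_calls):
        if not tags:
            # First revision: append a new sub-identifier to the current tag.
            tags.append(f"{current_tag}.1")
        else:
            # Later revisions build on the previous revision by advancing
            # its last sub-identifier.
            prefix, last = tags[-1].rsplit(".", 1)
            tags.append(f"{prefix}.{int(last) + 1}")
    return tags
```

Under these assumptions, a service instance handling a call tagged "abc.1" that issues two downstream calls would attach "abc.1.1" to the first and "abc.1.2" to the second.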
At block 502, CF observer 110 can first retrieve, from log store 108, all of the log entries recorded by CF collectors 106(1)-(N) for call flow C. In embodiments where log store 108 comprises per-call flow data structures, this step can involve retrieving all of the log entries maintained in the data structure associated with the call flow identifier of C.
At block 504, CF observer 110 can extract the call flow tags included in each log entry. CF observer 110 can then synthesize, based on the extracted call flow tags, the structure of call flow C (i.e., the ordered sequence of API calls within the call flow) (block 506). For example, in one set of embodiments, CF observer 110 can generate a call flow graph that is similar in appearance to call flow pattern 200 of
Once the structure of call flow C has been synthesized, CF observer 110 can perform an analysis to determine whether C is an allowed flow or not (block 508). In one set of embodiments, CF observer 110 may have access to a predefined list of allowed flows. In these embodiments, CF observer 110 may execute the analysis of block 508 by comparing call flow C to each allowed flow in the predefined list and searching for a match. In other embodiments, CF observer 110 may have access to a predefined set of rules that have been manually created by service developers/system administrators and that codify the characteristics of allowed or not-allowed flows. In these embodiments, CF observer 110 may execute the analysis of block 508 by applying each of the predefined set of rules to call flow C. In yet other embodiments, CF observer 110 may have access to a machine learning model that has been trained to recognize allowed or not-allowed flows based on training data that is specific to layered software system 100 (or the type of system that system 100 embodies). In these embodiments, CF observer 110 may execute the analysis of block 508 by providing data regarding call flow C as inputs to the machine learning model and evaluating the model output. In yet other embodiments, CF observer 110 may combine any two or more of the foregoing analysis techniques.
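For purposes of illustration, the synthesis of block 506 and the list-based variant of the analysis of block 508 might be sketched in Python as follows. The sketch assumes that sub-identifiers are numeric sibling indices, that each log entry has been reduced to a (call flow tag, API name) pair, and that a call flow is represented as a nested (API name, children) tuple; the function names synthesize and is_allowed are hypothetical. Rule-based and machine learning-based analyses are not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple

Path = Tuple[int, ...]


def synthesize(entries: Iterable[Tuple[str, str]]) -> tuple:
    """Rebuild the call tree of one call flow from (call_flow_tag, api_name) pairs."""
    api_by_path: Dict[Path, str] = {}
    for tag, api in entries:
        _flow_id, *sub_ids = tag.split(".")
        api_by_path[tuple(int(s) for s in sub_ids)] = api

    children: Dict[Path, List[Path]] = {}
    roots: List[Path] = []
    for path in sorted(api_by_path):
        parent = path[:-1]
        if parent in api_by_path:
            children.setdefault(parent, []).append(path)
        else:
            roots.append(path)  # no logged parent: this call starts the observed flow
    if len(roots) != 1:
        raise ValueError("expected exactly one root call in the call flow")

    def build(path: Path) -> tuple:
        return (api_by_path[path], tuple(build(c) for c in children.get(path, [])))

    return build(roots[0])


def is_allowed(flow: tuple, allowed_flows: Set[tuple]) -> bool:
    """Exact-match comparison against a predefined list of allowed flows."""
    return flow in allowed_flows
```

With this representation, a flow consisting of a lone invocation of API A3 synthesizes to ("A3", ()), which would not match an allowed flow in which A3 is reached only via A1 and A2.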
If CF observer 110 determines via the analysis at block 508 that call flow C is an allowed flow (block 510), CF observer 110 can conclude that there is no anomaly with respect to C and workflow 500 can end.
However, if CF observer 110 determines that call flow C is not an allowed flow (block 510), CF observer 110 can conclude that an anomaly has been detected and can identify one or more actions to take in response to the detected anomaly (block 512). The specific actions that are identified at this step can vary significantly based on a number of different criteria, such as the nature of the anomaly, the nature of call flow C, the nature of layered software system 100, and others. The following is a non-exhaustive list of possible actions (other actions not on this list are believed to be within the scope of the present disclosure and will be evident to one of ordinary skill in the art):
At block 514, CF observer 110 can cause the action(s) identified at block 512 to be enforced via communication with, e.g., service instances 102(1)-(N), log store 108, and/or other entities/systems. Workflow 500 can then terminate.
In certain embodiments, one or more of CF observers 110(1)-(M) can perform their call flow analysis and action identification functions in a manner that is synchronous, or inline, with respect to the call flows executed by service instances 102(1)-(N). For instance, assume that a call flow C passes through a number of service instances and ends at a final (i.e., terminal) service instance T which is configured to perform a secured or sensitive task (e.g., retrieve credit card details, post a charge to a bank account, etc.). In this example, at the time call flow C reaches service instance T, instance T can invoke a CF observer 110 and request that CF observer 110 analyze and provide an answer on whether C is an allowed flow, prior to executing its portion (i.e., API call) of C. Upon receiving this answer, service instance T can proceed to execute the API call (if the flow is deemed to be allowed) or can reject the API call (if the flow is not deemed to be allowed). Thus, with this approach, service instances 102(1)-(N) can be proactive in preventing the execution of anomalous call flows.
Depending on the complexity of its analysis, it is possible that CF observer 110 in the example above may take an extended period of time to return an answer to service instance T. Thus, this approach may be best suited to service requests/tasks that do not require real-time or near real-time execution. Alternatively, in some embodiments, CF observer 110 may be configured to perform only a portion of its analysis in an inline manner (e.g., portions that can be executed quickly, such as the application of a few simple rules) and leave the remaining, more complex analysis portions for offline handling. In this way, CF observer 110 can still provide some level of anomaly detection inline without extended delays.
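One possible way to split the analysis between inline and offline handling is sketched below in Python. The sketch assumes that a call flow is summarized as the ordered list of API names leading to service instance T, that each quick rule is a function returning True when the flow passes, and that deferred flows are placed on an in-memory queue; the names Rule, inline_check, and offline_queue are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

Rule = Callable[[List[str]], bool]  # returns True when the flow passes the rule


def inline_check(flow_apis: List[str],
                 quick_rules: Iterable[Rule],
                 offline_queue: Deque[List[str]]) -> bool:
    """Apply only the inexpensive rules inline and defer the remaining analysis."""
    if not all(rule(flow_apis) for rule in quick_rules):
        return False  # reject before the sensitive API call is executed
    # The flow passed the quick rules; queue it for the more complex offline analysis.
    offline_queue.append(flow_apis)
    return True
```

In this arrangement, service instance T would execute its API call only when inline_check returns True, while a separate process drains the queue and applies the more expensive analysis after the fact.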
In addition to (or in lieu of) the “allowed flow” analysis described in
In another set of embodiments, each CF observer 110 can implement a call data integrity analysis in which it verifies whether the invocation message content passed between service instances in a given call flow is correct (i.e., has not been tampered with). In these embodiments, CF collector 106 of each service instance 102 can calculate a hash of (1) the invocation message content to be sent to a downstream service instance and (2) a previous message hash received from an upstream service instance (if it exists), and can include this calculated hash value in the invocation message. Upon receiving the invocation message at the downstream service instance, the CF collector of the downstream service instance can record the message hash value in the log entry written to log store 108.
Then, at the time a CF observer 110 evaluates a call flow, CF observer 110 can examine the chain of message hash values stored in log store 108 for the call flow and determine whether all of the hash values are correct in view of the corresponding message content. If so, CF observer 110 can conclude that the messages in the call flow have not been tampered with. If not, CF observer 110 can identify the modified messages and can take an appropriate action (e.g., raise an alert so that a human can investigate, shut down the affected service instances, etc.).
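The hash chaining and its verification can be sketched in Python as follows. The choice of SHA-256, the ordering of the previous hash before the message content, and the function names chain_hash and first_tampered are assumptions made for this sketch; the disclosure does not specify a hash function or input layout. The sketch verifies a single caller-to-callee path of a call flow and reports only the first mismatch.

```python
import hashlib
from typing import List, Optional, Tuple


def chain_hash(content: bytes, prev_hash: Optional[str]) -> str:
    """Hash the message content together with the upstream message hash, if any."""
    h = hashlib.sha256()
    h.update((prev_hash or "").encode("ascii"))
    h.update(content)
    return h.hexdigest()


def first_tampered(messages: List[Tuple[bytes, str]]) -> Optional[int]:
    """Check (content, recorded_hash) pairs listed in call order along one path.

    Returns the index of the first message whose recorded hash does not match
    its content and the preceding recorded hash, or None if the chain is intact.
    """
    prev: Optional[str] = None
    for index, (content, recorded) in enumerate(messages):
        if chain_hash(content, prev) != recorded:
            return index
        prev = recorded
    return None
```

Because each hash covers the previous hash, altering the content of one logged message causes that message's recomputed hash to differ from the recorded value, which the observer can then flag for the responsive action.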
Bus subsystem 604 can provide a mechanism for letting the various components and subsystems of computer system 600 communicate with each other as intended. Although bus subsystem 604 is shown schematically as a single bus, alternative embodiments of the bus subsystem can utilize multiple busses.
Network interface subsystem 616 can serve as an interface for communicating data between computer system 600 and other computer systems or networks. Embodiments of network interface subsystem 616 can include, e.g., an Ethernet card, a Wi-Fi and/or cellular adapter, a modem (telephone, satellite, cable, ISDN, etc.), digital subscriber line (DSL) units, and/or the like.
User interface input devices 612 can include a keyboard, pointing devices (e.g., mouse, trackball, touchpad, etc.), a touch-screen incorporated into a display, audio input devices (e.g., voice recognition systems, microphones, etc.) and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and mechanisms for inputting information into computer system 600.
User interface output devices 614 can include a display subsystem, a printer, or non-visual displays such as audio output devices, etc. The display subsystem can be, e.g., a flat-panel device such as a liquid crystal display (LCD) or organic light-emitting diode (OLED) display. In general, use of the term “output device” is intended to include all possible types of devices and mechanisms for outputting information from computer system 600.
Storage subsystem 606 includes a memory subsystem 608 and a file/disk storage subsystem 610. Subsystems 608 and 610 represent non-transitory computer-readable storage media that can store program code and/or data that provide the functionality of embodiments of the present disclosure.
Memory subsystem 608 includes a number of memories including a main random access memory (RAM) 618 for storage of instructions and data during program execution and a read-only memory (ROM) 620 in which fixed instructions are stored. File/disk storage subsystem 610 can provide persistent (i.e., non-volatile) storage for program and data files, and can include a magnetic or solid-state hard disk drive, an optical drive along with associated removable media (e.g., CD-ROM, DVD, Blu-Ray, etc.), a removable flash memory-based drive or card, and/or other types of storage media known in the art.
It should be appreciated that computer system 600 is illustrative and many other configurations having more or fewer components than system 600 are possible.
The above description illustrates various embodiments of the present disclosure along with examples of how aspects of these embodiments may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present disclosure as defined by the following claims. For example, although certain embodiments have been described with respect to particular process flows and steps, it should be apparent to those skilled in the art that the scope of the present disclosure is not strictly limited to the described flows and steps. Steps described as sequential may be executed in parallel, order of steps may be varied, and steps may be modified, combined, added, or omitted. As another example, although certain embodiments have been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are possible, and that specific operations described as being implemented in software can also be implemented in hardware and vice versa.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense. Other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the present disclosure as set forth in the following claims.