Malicious insider activity is difficult to detect, and an insider with access to files in a content repository may be able to copy, send, alter or delete large amounts of data without getting caught or prior to being caught. Proposed systems to address malicious insiders establish a threshold of a “reasonable number” of files to access in a given time span, and alert a security administrator if a user exceeds this reasonable number of files. While this detects malicious access of an excessive number of files, such a system may fail to detect a smaller number of malicious accesses. Conversely, setting the threshold too low will likely generate too many false positives.
It is within this context that the embodiments arise.
In some embodiments, a method performed by a processor to detect malicious or risky data accesses is provided. The method includes modeling user accesses to a content repository as to the probability of a user accessing data in the content repository, based on a history of user accesses to the content repository. The method includes scoring a singular user access to the content repository, based on the probability of access according to the modeling, and alerting in accordance with the scoring.
In some embodiments, a tangible, non-transitory, computer-readable medium is provided, having instructions thereupon which, when executed by a processor, cause the processor to perform a method. The method includes training a probabilistic model of data accesses, using a history of user accesses to a content repository, and monitoring user accesses to the content repository. The method includes scoring each user access of a plurality of user accesses to data in the content repository as to how probable the user access is according to the probabilistic model, and alerting in accordance with the scoring.
In some embodiments, a detection system for data accesses is provided. The system includes a server having a modeling module, a scoring module and an alerting module, and configured to receive information about user accesses to a content repository, for both history and ongoing monitoring. The modeling module is configured to produce a probabilistic model of user accesses to data in the content repository based on the history. The scoring module is configured to produce a score of a user access to the content repository, based on the ongoing monitoring and based on how probable the user access to the content repository is according to the probabilistic model. The alerting module is configured to issue an alert based on a result of the scoring module.
Other aspects and advantages of the embodiments will become apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the described embodiments.
The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.
A detection system that computes the likelihood of data accesses and detects malicious or risky behavior, e.g., malicious or risky data accesses by a user, is described below. The system generates a probabilistic model of user accesses of data (e.g., raw data, databases, files, etc.) in a content repository. On an ongoing basis, the system monitors user accesses of the content repository, by pulling access information from, or receiving access information that is pushed by, the content repository. User accesses are individually scored based on the likelihood (i.e., probability according to the model) of that user access of that data at that time or span of time. The system issues an alert if a user access and corresponding score meet conditions set in one or more rules. Thus, the system increases the granularity of identifying malicious or risky data accesses to an individual user access of specific data in a content repository, as compared to detection systems that look for aggregate data access behavior such as accessing greater than a specified number of files in a specified amount of time. In some embodiments, a probabilistic model is trained on the data accesses of users to score how usual or unusual a user data access is. A scoring system analyzes user data accesses in near-real time and provides a likelihood determination for each datum/user pair. A reporting system may alert administrators based on users, files or repositories that exhibit an excessive amount of anomalous activity. Various embodiments with various features are described below, and further embodiments can employ various combinations of these features.
Prior to the ongoing monitoring of user accesses, or alternatively during the ongoing monitoring of user accesses, the server 102 develops a probabilistic model of user accesses, which is written into a model database 114. In the embodiment shown in
Using recent activity for a specified content repository 104, the modeling module 108 trains a model or models 116, 118 of datum access for users 106. As an example, consider the context of a file server, where the data are files. Therefore, for each file, a model could compute how likely it is for a given user to access that specific file at this time. It should be appreciated that this example is not meant to be limiting as there are many ways to train this model. A simple model, which would be oblivious to the specific file and user, could be a function of the age of the file. Knowing that a file is this many days old, a probability density function could be modeled to estimate the likelihood of an access by any user at this time. For example, a relatively newer file is more likely to be accessed by people working on that file or making use of the contents of that file, and less likely to be accessed by people who are working in other areas. A relatively older file is less likely to be accessed in general, but might be more likely to be accessed by people who are working on a next-generation version of related subject matter. In various embodiments, the models can become progressively more nuanced. For example, given that the file has a specific extension, and is created by a specific person, it can be determined as to what is the likelihood that someone from line of business X would access that file. The challenge of finding the optimal model is orthogonal to the purposes of the present description, since the modeling is contingent on a variety of factors individual to each content repository 104.
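By way of a non-limiting illustration, the simple age-based model described above could be sketched as follows. The function name `train_age_model`, the 30-day bucket width, the observation format and the use of additive smoothing are all assumptions made for this sketch and are not taken from the embodiments; a real model would be chosen per content repository.

```python
from collections import defaultdict


def train_age_model(observations, bucket_days=30, smoothing=1.0):
    """Estimate P(access | file age) from history.

    observations: iterable of (file_age_days, was_accessed) pairs, a
    hypothetical history format assumed for this sketch.
    Returns a function mapping a file age in days to a probability.
    """
    accessed = defaultdict(float)
    total = defaultdict(float)
    for age_days, was_accessed in observations:
        bucket = int(age_days // bucket_days)
        total[bucket] += 1
        if was_accessed:
            accessed[bucket] += 1

    def probability(age_days):
        bucket = int(age_days // bucket_days)
        # Additive smoothing keeps sparse or unseen buckets away from 0 and 1.
        return (accessed[bucket] + smoothing) / (total[bucket] + 2 * smoothing)

    return probability
```

A model trained this way ignores the specific file and user, as in the simple example above; conditioning on file extension, creator or line of business would add further keys to the buckets.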
Continuing with reference to
The scoring module 110 scores anomalous activity for each user session with the content repository 104. Ideally, the score should give a sense of the severity of the incident. For each datum access, a model 116, 118 can be applied to provide the likelihood of that access by that user at that time. The scoring module 110 computes a score for each datum access (e.g., a file access) as a function of this probability, and sums these scores over some time interval. Many different scoring functions can be used, and one should be selected that best suits not only the activity, but also the bandwidth of the security administrator to process events (e.g., if the administrator has little time to process events, then a function can be employed that only penalizes extremely anomalous activity). In one implementation, a function can be devised so that if the probability is greater than or equal to some threshold alpha, the score is zero. In one embodiment, the score is the natural logarithm of the inverse probability. This gives a high score to an occurrence of an event that has a very low probability, which is a good candidate for an alert. Other bases for logarithmic functions, or other mathematical functions, could be used, as the examples provided are not meant to be limiting.
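As one possible sketch of such a scoring function, the following combines the two options mentioned above: accesses with probability at or above a threshold alpha score zero, and all others score the natural logarithm of the inverse probability. Combining the two in a single function is an assumption of this sketch, as are the names `access_score` and `interval_score` and the default value of alpha.

```python
import math


def access_score(probability, alpha=0.05):
    """Score one datum access from its modeled probability.

    alpha is a hypothetical threshold: accesses at least this likely
    are treated as unremarkable and contribute nothing.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    if probability >= alpha:
        return 0.0
    # Natural logarithm of the inverse probability: rarer means higher.
    return math.log(1.0 / probability)


def interval_score(probabilities, alpha=0.05):
    """Sum the per-access scores over one time interval."""
    return sum(access_score(p, alpha) for p in probabilities)
```

Raising alpha makes more accesses contribute to the total, while lowering it restricts the score to extremely anomalous activity, which may suit an administrator with little time to process events.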
Additionally, in further embodiments, a weighted score could be based on the sensitivity of a file being accessed. Sensitivity could be automatically determined (e.g., generated by the scoring module 110) based on factors such as metadata and content (e.g., finance and legal documents are more sensitive than HR (human resources) documents) and access pattern (e.g., a file accessed by a few users is more sensitive than a file that is accessed by a large number of users; access by one user to a file created by another user is more sensitive than access to one's own file, etc.). Alternatively, sensitivity of a file could be determined and communicated by a data loss prevention (DLP) service or module (e.g., credit card numbers or Social Security numbers have higher sensitivity than product pricing information or product reviews). The interval for activity may be fixed (e.g. each hour, each day) or could be dynamic and defined as a function (e.g. a session, where a session is terminated by an hour of inactivity). This aspect will periodically update the alerting part of this system, in some embodiments.
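The dynamic interval and the sensitivity weighting described above might be sketched as follows. The tuple layout `(timestamp, probability, sensitivity_weight)`, the function names, and the use of the weight as a simple multiplier on the logarithmic score are assumptions of this sketch; how sensitivity is actually derived (for example, by a DLP service) is outside its scope.

```python
import math
from datetime import datetime, timedelta


def split_sessions(accesses, gap=timedelta(hours=1)):
    """Group accesses into sessions ended by `gap` of inactivity.

    accesses: list of (timestamp, probability, sensitivity_weight)
    tuples, a hypothetical record format assumed for this sketch.
    """
    sessions = []
    for access in sorted(accesses, key=lambda a: a[0]):
        # Continue the current session only if the previous access in it
        # happened less than `gap` ago; otherwise start a new session.
        if sessions and access[0] - sessions[-1][-1][0] < gap:
            sessions[-1].append(access)
        else:
            sessions.append([access])
    return sessions


def weighted_session_score(session):
    """Sum ln(1/p) per access, scaled by the sensitivity weight."""
    return sum(weight * math.log(1.0 / p) for _, p, weight in session)
```

A fixed interval (e.g., each hour or each day) could be obtained instead by grouping on a truncated timestamp rather than on gaps between accesses.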
In the embodiment shown in
Rules for alerting could be customized for the activity score, the nature of the data, and the number of alerts that the administrator can reasonably handle, in various embodiments. For example, any user who has an anomalous access to financial data may be reported, but on non-financial data, a higher threshold will be required to bring the administrator's focus to that particular user. Some embodiments are not limited to reporting on users 106. Various embodiments could highlight a file server (i.e. a content repository 104) that has seen a significant amount of anomalous activity recently. The alerting module 112 could be selected or directed to alert regarding a user 106, regarding a file or other data, or regarding a specific content repository 104 (e.g., when monitoring multiple content repositories 104).
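The rule customization described above could, as one illustration, be expressed as per-category score thresholds. The `AlertRule` type, the `"*"` wildcard category and the threshold values in the usage note are hypothetical and chosen only to show a lower threshold for financial data than for other data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    data_category: str  # e.g. "financial"; "*" matches any category
    min_score: float    # alert when the activity score reaches this value


def should_alert(score, data_category, rules):
    """Return True if any rule matching the category is satisfied."""
    for rule in rules:
        matches = rule.data_category in (data_category, "*")
        if matches and score >= rule.min_score:
            return True
    return False
```

For instance, with `rules = [AlertRule("financial", 3.0), AlertRule("*", 10.0)]`, a score of 4.0 on financial data would raise an alert while the same score on other data would not. The same check could be applied to scores aggregated per user, per file or per content repository.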
For purposes of both model development and ongoing monitoring,
In an action 408, for each user access to the content repository, as monitored in the action 406, the user access is scored according to the probabilistic model generated in the action 404. Scoring could be performed by a scoring module, as described above with reference to
It should be appreciated that the methods described herein may be performed with a digital processing system, such as a conventional, general-purpose computer system. Special purpose computers, which are designed or programmed to perform only one function, may be used in the alternative.
Display 511 is in communication with CPU 501, memory 503, and mass storage device 507, through bus 505. Display 511 is configured to display any visualization tools or reports associated with the system described herein. Input/output device 509 is coupled to bus 505 in order to communicate information in command selections to CPU 501. It should be appreciated that data to and from external devices may be communicated through the input/output device 509. CPU 501 can be defined to execute the functionality described herein to enable the functionality described with reference to
Detailed illustrative embodiments are disclosed herein. However, specific functional details disclosed herein are merely representative for purposes of describing embodiments. Embodiments may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It should be understood that although the terms first, second, etc. may be used herein to describe various steps or calculations, these steps or calculations should not be limited by these terms. These terms are only used to distinguish one step or calculation from another. For example, a first calculation could be termed a second calculation, and, similarly, a second step could be termed a first step, without departing from the scope of this disclosure. As used herein, the term “and/or” and the “/” symbol include any and all combinations of one or more of the associated listed items.
As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes”, and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, functions/acts shown in succession in the figures may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
With the above embodiments in mind, it should be understood that the embodiments might employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing. Any of the operations described herein that form part of the embodiments are useful machine operations. The embodiments also relate to a device or an apparatus for performing these operations. The apparatus can be specially constructed for the required purpose, or the apparatus can be a general-purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general-purpose machines can be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
A module, an application, a layer, an agent or other method-operable entity could be implemented as hardware, firmware, or a processor executing software, or combinations thereof. It should be appreciated that, where a software-based embodiment is disclosed herein, the software can be embodied in a physical machine such as a controller. For example, a controller could include a first module and a second module. A controller could be configured to perform various actions, e.g., of a method, an application, a layer or an agent.
The embodiments can also be embodied as computer readable code on a tangible non-transitory computer readable medium. The computer readable medium is any data storage device that can store data, which can be thereafter read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion. Embodiments described herein may be practiced with various computer system configurations including hand-held devices, tablets, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wire-based or wireless network.
Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing.
In various embodiments, one or more portions of the methods and mechanisms described herein may form part of a cloud-computing environment. In such embodiments, resources may be provided over the Internet as services according to one or more various models. Such models may include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). In IaaS, computer infrastructure is delivered as a service. In such a case, the computing equipment is generally owned and operated by the service provider. In the PaaS model, software tools and underlying equipment used by developers to develop software solutions may be provided as a service and hosted by the service provider. SaaS typically includes a service provider licensing software as a service on demand. The service provider may host the software, or may deploy the software to a customer for a given period of time. Numerous combinations of the above models are possible and are contemplated.
Various units, circuits, or other components may be described or claimed as “configured to” perform a task or tasks. In such contexts, the phrase “configured to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” language include hardware (for example, circuits, memory storing program instructions executable to implement the operation, etc.). Reciting that a unit/circuit/component is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component. Additionally, “configured to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks.
The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.