This invention relates generally to data protection systems, and more specifically to incorporating organizational awareness for automating data protection policies.
Backup software is used by large organizations to store their data for recovery after system failures, routine maintenance, archiving, and so on. Backup sets are typically taken on a regular basis, such as hourly, daily, weekly, and so on, and can comprise vast amounts of information. Backup programs are often provided by vendors that provide backup infrastructure (software and/or hardware) to customers under service level agreements (SLA) that set out certain service level objectives (SLO) that dictate minimum standards for important operational criteria such as uptime and response time, etc. Within a large organization, dedicated IT personnel or departments are typically used to administer the backup operations and work with vendors to resolve issues and keep their infrastructure current.
Data within an organization is typically not considered to be monolithic as far as data protection policies are concerned. As enterprise systems grow and become more complex, the data for different assets within the organization, such as personnel, machines, data sources, and so on may be assigned different data protection policies so that storage costs and SLOs can be optimally tailored to the appropriate types of data.
In present systems, data assets are manually assigned to specific policies by system administrators in what is largely a manual process. Some advanced systems, such as VMware platforms, may allow assets to be automatically assigned to policies based on virtual center (vCenter) tags, but the mappings between policies and tags must still be manually configured by administrators. Other backup software products may custom protect certain types of data, such as e-mail systems (e.g., Microsoft Exchange) based on information from directory services like LDAP (Lightweight Directory Access Protocol) or Microsoft Active Directory for authentication and authorization. However, this software generally does not use the content of those systems to assign assets to protection policies and keep the assignments current. In a company with potentially tens of thousands of employees, employee devices, and the constant change involved with people being added, promoted, reassigned, or removed on an almost daily basis, administrators are forced to rely on either manual efforts or external, static automation workflows to update assignments. All of this adds significant administrative overhead, as well as gaps in data protection, and opportunities for data breaches.
What is needed, therefore is a data protection system that automatically incorporates organizational awareness to efficiently apply data protection policies or policy attributes to specific assets within an organization and thereby eliminate present manual or ad-hoc methods of tagging data to the policies.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain and Data Domain Restorer are trademarks of DellEMC Corporation.
In the following drawings like reference numerals designate like structural elements. Although the figures depict various examples, the one or more embodiments and implementations described herein are not limited to the examples depicted in the figures.
A detailed description of one or more embodiments is provided below along with accompanying figures that illustrate the principles of the described embodiments. While aspects are described in conjunction with such embodiment(s), it should be understood that it is not limited to any one embodiment. On the contrary, the scope is limited only by the claims and the described embodiments encompass numerous alternatives, modifications, and equivalents. For the purpose of example, numerous specific details are set forth in the following description in order to provide a thorough understanding of the described embodiments, which may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the embodiments has not been described in detail so that the described embodiments are not unnecessarily obscured.
It should be appreciated that the described embodiments can be implemented in numerous ways, including as a process, an apparatus, a system, a device, a method, or a computer-readable medium such as a computer-readable storage medium containing computer-readable instructions or computer program code, or as a computer program product, comprising a computer-usable medium having a computer-readable program code embodied therein. In the context of this disclosure, a computer-usable medium or computer-readable medium may be any physical medium that can contain or store the program for use by or in connection with the instruction execution system, apparatus or device. For example, the computer-readable storage medium or computer-usable medium may be, but is not limited to, a random-access memory (RAM), read-only memory (ROM), or a persistent store, such as a mass storage device, hard drives, CDROM, DVDROM, tape, erasable programmable read-only memory (EPROM or flash memory), or any magnetic, electromagnetic, optical, or electrical means or system, apparatus or device for storing information. Alternatively, or additionally, the computer-readable storage medium or computer-usable medium may be any combination of these devices or even paper or another suitable medium upon which the program code is printed, as the program code can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
Applications, software programs or computer-readable instructions may be referred to as components or modules. Applications may be hardwired or hard coded in hardware or take the form of software executing on a general-purpose computer or be hardwired or hard coded in hardware such that when the software is loaded into and/or executed by the computer, the computer becomes an apparatus for practicing the certain methods and processes described herein. Applications may also be downloaded, in whole or in part, through the use of a software development kit or toolkit that enables the creation and implementation of the described embodiments. In this specification, these implementations, or any other form that embodiments may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the embodiments.
Some embodiments involve data processing in a distributed system, such as a cloud based network system or very large-scale wide area network (WAN), and metropolitan area network (MAN), however, those skilled in the art will appreciate that embodiments are not limited thereto, and may include smaller-scale networks, such as LANs (local area networks). Thus, aspects of the one or more embodiments described herein may be implemented on one or more computers executing software instructions, and the computers may be networked in a client-server arrangement or similar distributed computer network.
The network server computers are coupled directly or indirectly to the network storage 114, target VMs 104, data center 108, and the data sources 106 and other resources 116/117 through network 110, which is typically a public cloud network (but may also be a private cloud, LAN, WAN or other similar network). Network 110 provides connectivity to the various systems, components, and resources of system 100, and may be implemented using protocols such as Transmission Control Protocol (TCP) and/or Internet Protocol (IP), well known in the relevant arts. In a cloud computing environment, network 110 represents a network in which applications, servers and data are maintained and provided through a centralized cloud computing
Backup software vendors typically provide service under a service level agreement (SLA) that establishes the terms and costs to use the network and transmit/store data specifies minimum resource allocations (e.g., storage space) and performance requirements (e.g., network bandwidth) provided by the provider. The backup software may be any suitable backup program such as EMC Data Domain, Avamar, and so on. In cloud networks, it may be provided by a cloud service provider server that may be maintained be a company such as Amazon, EMC, Apple, Cisco, Citrix, IBM, Google, Microsoft, Salesforce.com, and so on.
In most large-scale enterprises or entities that process large amounts of data, different types of data are routinely generated and must be backed up for data recovery purposes. This data comes from many different sources and is used for many different purposes. Some of the data may be routine, while others may be mission-critical, confidential, sensitive, and so on. As shown in the example of
As shown in
For the embodiment of
The process of
Inputs to the organization classifier 310 and backup software 306 is typically already integrated with directory services such as LDAP or Microsoft Active Directory, or similar. LDAP represents a type of application protocol for maintaining distributed directory information services over IP networks. Such directory services may provide an organized set of records in a hierarchical structure, such as a corporate e-mail directory. Although embodiments are described with respect to LDAP, any similar protocol can be used.
The organization classifier 310 can either share the configuration of one or more directory services 302 with the backup software 306, or the services 302 can be directly configured in the organization classifier itself. The backup software 306 may also be protecting the e-mail system 304 itself, and these this system may be using one of the directory services 302 to implement their Global Address Lists (GALs), or they may have their own internal corporate directories. The organization classifier 310 can either share the configuration of such systems with the backup software 306, or the services can be directly configured in the organization classifier itself. In a traditional organization, the GAL is considered sufficient to capture the full organization chart, but other embodiments of this component may integrate with other Enterprise Resource Planning (ERP) tools (e.g., Workday) to collect additional information about employees.
In an embodiment, the organization classifier 310 maintains an internal data structure represented as a graph. The graph is stored using a graph database, but other embodiments may use other data storage, such as a relational database, and the like. In a graph database, each node in the graph represents an object of a type including Domain, Group, User, Device, among others.
As shown in in
With regard to the relationships among the people, there is a many-to-many mapping of Users to Groups. A User can be part of one or more Groups, and a Group may have one or more Users. Both Users and Groups have a many-to-one mapping to a Domain. Each User and each Group can be part of only one Domain. Diagram 400 is provided for purposes of illustration only, and many other hierarchies, node structures, and configurations may be used.
In general, the structure and content of the internally generated graph 400 should match, at least loosely, the original LDAP information. However, certain distinctions or other information may inform the organization classifier's internally generated graph depending on the analysis procedure. For example, when also integrated with an ERP system, the information from the ERP system may create differences between the internal graph and the LDAP source. An important element of the organization classifier graph, such as shown in
The native types of each directory system are mapped to the types present within the organization classifier. For example, an Active Directory Organizational Unit (OU) maps to a Group. A set of key/value pairs are also associated with each node. These are used to cache data for the calculation of scores (as described below), such as number of emails received or sent.
With reference back to
Another connected system may be the company e-mail system 507. For each email system, if the email system is using one of the configured directory services for its user list, the classifier 502 scans through the mailboxes and extracts statistics, such as total number of emails, and adds those as key/value pairs to the node of the graph corresponding to the User who owns that mailbox. If an email system is not itself connected to a directory service, the classifier 502 will search its connected directory services for a matching email address to associate the Users. If no match is found, then the mailbox is ignored. Besides an e-mail system, other communication platforms may also be scanned, such as chatrooms, social network sites, electronic bulletin boards and so on. The e-mail system 507 data is used to cull information regarding user interactions that may help inform each individual's influence, impact, or importance in the company or a group. Such information may tend to indicate that the data used by that individual is more or less important than their simple LDAP hierarchy data may suggest. This data thus represents informal user interaction information that is used to supplement the formal data provided by the directory service 506. This informal information is not used to change a person's position in the generated graph, but rather to help modify the scoring of that person.
As shown in
Total OC Score=Base Score−Boost Value
The Base Score is assigned according to a user's position in the top-down corporate organizational chart, while the boost value is derived from the informal data (e.g., e-mails, communication patterns, and so on) along with certain organizational data. A lower total score indicates a higher importance within the company.
With respect to the base score 606, this score is calculated on the basis of a user's location at in the graph, where the graph position corresponds to a user's ‘importance’ in the company, therefore the value of his or her data. An inverse scale is used so that a lower number denotes higher importance. A person at the top of the chart who does not report to anyone else, such as the President or CEO, has a base score of 1. Their direct reporting personnel (e.g., VPs) each have a base score of 2, those users' direct reporting personnel each have a base score of 4, and so on, with the score doubling for each level. An inverse scoring scale is used so that the graph can extend to an arbitrary number of levels without affecting the scores at the higher levels of the graph. Other embodiments may implement different scoring mechanisms, such as linearly increasing by a fixed number of points per level of hierarchy, normalizing the score to a specified range, or using a method where higher scores indicate higher importance, and so on.
The boost value is a numerical value subtracted from the base score based on one or more rules that capture the impact of a user's communications, associations, impact on other user, as well as any contextual situations impacting their data, such as special projects, temporary assignments, and so on. Table 1 below illustrates some example components of the boost value, in an example embodiment.
The example of Table 1 lists only some possible boost value factors, but generally represents the most salient factors of a user's communication and association within a company that may impact the value of their data. Any number of such factors may be used, and weighted relative to one another to derive a boost value for the individual.
Using Table 1 as an example, the number of work related e-mail messages received by a person is used to indicate their involvement in the company and therefore, to some degree at least, their importance in the company. Just as important, however, may be the people to whom this user is communicating. So, if the user receives a high number of e-mail messages, and if the number of email messages received per week from a user's manager, or other equally or higher-level managers from other parts of the organization, and exceeds a configurable threshold (e.g., 20 per week), that user's boost value may be set accordingly, where a lower boost value helps lower the overall score. This kind of data is provided almost exclusively by the e-mail programs, as well as other similar communication platforms (chatrooms, etc.).
As shown in
These rules for determining the boost values are coded into the organization classifier 502, but other embodiments may allow for rules to be specified in an externalized resource file. The boost value can show that a person whose position in the organizational chart may be lower than another person's is effectively equally or more important than the other person based on their interactions with other important users or interaction with important data. Boost values can increase (negative boost value) or decrease (positive boost value) the user's overall score based on the factors considered.
With respect to determining an actual boost value for a user, in an embodiment, a threshold value is defined for each of the factors (such as those listed in Table 1). The organization classifier 502 derives a numeric value for each factor over the course of a scan 508 and compares the derived number to the defined threshold and assigns a zero, negative, or positive boost value for each measured factor. Alternatively, a system administrator can review the factor values received for a user and derive an appropriate boost factor for that user. For example, the system may be configured to allow only negative boost values to increase a user's importance, or it may also allow positive boost values to decrease a user's importance as well, and it may provide a manual override by an administrator.
This boost value is then combined with the base score 606 to derive the total score. The organization classifier 502 re-generates all scores at a fixed interval (e.g., daily), so the scores are dynamic in response to organizational changes such as promotions, reassignments, re-organizations, and so on.
With reference back to
As shown in
The appropriate total score range to assign to each policy may be defined by the system administrator, or it may be set automatically by the backup software based on certain objective data, such as number of total policies, number of distinct RPO/RTO values, number of copies specified, and so on. For the example table 800 of
Advanced options allow creating backup policies or rules based on specific properties of users or groups of users. For example, systems in a Group associated with Finance may have extended retention periods applied; or users directly or even remotely involved in legal proceedings may automatically have their data held under litigation hold rules, and so on.
The embodiments described herein optimize data backup operations by using information from directory service systems (e.g., LDAP, Active Directory), as well as communication programs (e.g., e-mail) to automatically apply data protection policies to users based on their individual status and data usage patterns.
Embodiments of the processes and techniques described above can be implemented on any appropriate backup system operating environment or file system, or network server system. Such embodiments may include other or alternative data structures or definitions as needed or appropriate.
The processes described herein may be implemented as computer programs executed in a computer or networked processing device and may be written in any appropriate language using any appropriate software routines. For purposes of illustration, certain programming examples are provided herein, but are not intended to limit any possible embodiments of their respective processes.
The network of
Arrows such as 1045 represent the system bus architecture of computer system 1000. However, these arrows are illustrative of any interconnection scheme serving to link the subsystems. For example, speaker 1040 could be connected to the other subsystems through a port or have an internal direct connection to central processor 1010. The processor may include multiple processors or a multicore processor, which may permit parallel processing of information. Computer system 1000 is just one example of a computer system suitable for use with the present system. Other configurations of subsystems suitable for use with the described embodiments will be readily apparent to one of ordinary skill in the art.
Computer software products may be written in any of various suitable programming languages. The computer software product may be an independent application with data input and data display modules. Alternatively, the computer software products may be classes that may be instantiated as distributed objects. The computer software products may also be component software.
An operating system for the system 1005 may be one of the Microsoft Windows®. family of systems (e.g., Windows Server), Linux, Mac OS X, IRIX32, or IRIX64. Other operating systems may be used. Microsoft Windows is a trademark of Microsoft Corporation.
The computer may be connected to a network and may interface to other computers using this network. The network may be an intranet, internet, or the Internet, among others. The network may be a wired network (e.g., using copper), telephone network, packet network, an optical network (e.g., using optical fiber), or a wireless network, or any combination of these. For example, data and other information may be passed between the computer and components (or steps) of the system using a wireless network using a protocol such as Wi-Fi (IEEE standards 802.11, 802.11a, 802.11b, 802.11e, 802.11g, 802.11i, 802.11n, 802.11ac, and 802.11ad, among other examples), near field communication (NFC), radio-frequency identification (RFID), mobile or cellular wireless. For example, signals from a computer may be transferred, at least in part, wirelessly to components or other computers.
In an embodiment, with a web browser executing on a computer workstation system, a user accesses a system on the World Wide Web (WWW) through a network such as the Internet. The web browser is used to download web pages or other content in various formats including HTML, XML, text, PDF, and postscript, and may be used to upload information to other parts of the system. The web browser may use uniform resource identifiers (URLs) to identify resources on the web and hypertext transfer protocol (HTTP) in transferring files on the web.
For the sake of clarity, the processes and methods herein have been illustrated with a specific flow, but it should be understood that other sequences may be possible and that some may be performed in parallel, without departing from the spirit of the described embodiments. Additionally, steps may be subdivided or combined. As disclosed herein, software written in accordance certain embodiments may be stored in some form of computer-readable medium, such as memory or CD-ROM, or transmitted over a network, and executed by a processor. More than one computer may be used, such as by using multiple computers in a parallel or load-sharing arrangement or distributing tasks across multiple computers such that, as a whole, they perform the functions of the components identified herein; i.e., they take the place of a single computer. Various functions described above may be performed by a single process or groups of processes, on a single computer or distributed over several computers. Processes may invoke other processes to handle certain tasks. A single storage device may be used, or several may be used to take the place of a single storage device.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
All references cited herein are intended to be incorporated by reference. While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.