A list of the Acronyms and Terminologies used is incorporated here. The Specification is structured as follows.
There are 28 Figures and 9 Tables described as follows.
Table 1. Summary of deficiencies in the two current industry practices for Memory protection. Table 2. Various Compliance and Legal Violations due to Data in Use breach. Table 3. Average volume of data in a Hard Drive at the time of loss. Table 4. OWASP Top 10: Between 2017 and 2021. Table 5. Pseudo Code for Just-in-Time Read. Table 6. Pseudo Code for Expedited Death. Table 7. Pseudo Code for Secure Wipe. Table 8. Pseudo Code for Code Reshuffle—Attack Surface Reduction. Table 9. Masking and Unmasking Algorithms.
Data in Use refers to the data in computer memory (heap, stack, or equivalents). If the data being used is sensitive, then its (albeit temporary) footprint in memory can be exploited for a security breach. This invention focuses on the Data in Use security risk issues.
We compare the current industry approaches to Data in Use security risk protection: (a) Hardware based encryption approaches (Sections 1.1, 1.2), and (b) CPU and Kernel based encryption approaches (Section 1.4). Compliance violations, legal and statutory deficiencies, and the inability to accurately implement Security Controls are presented in Sections 1.1, 1.2, and 1.3 respectively.
Table 1: Deleted. See Drawings file for Table 1.
Data in Use security protection ("Issue 1") has conventionally been dealt with through Data at Rest security measures ("Issue 2"), by mapping the former to the solution available for the latter. When an attacker attacks the Application, the Application is likely to abort and write its Heap onto the local Hard Drive (HD). The copy of the Heap at the HD may be encrypted there, providing a safeguard against data leakage from the HD.
While conceptually simple and effective, this is an ill-balanced and incorrect approach to solving a complex Engineering problem, because one issue ("Issue 1") is not being directly solved; instead, its resolution is deferred to the solution available for a downstream issue ("Issue 2"). This is also a violation of the Defense in Depth principle, which requires a defense to be provided at the earliest point of intrusion, instead of perpetuating the issue and compensating downstream.
This deficiency leads to a number of problems, both technical and legal.
Technical Deficiency: whose Key should be used to safeguard the Data ("in Use")? Who owns this data? Is it (Interpretation 1) the User who is running the Application, or (Interpretation 2) the Admin who owns the Application, or (Interpretation 3) the IT Security dept of the Organization (e.g., the CISO) which owns all Assets' data security issues? Or is it someone else?
In the current industry solution, the Data (in Use), once written from the Heap onto the HD, is not protected by any one of the above 3 Key owners. Instead, the HD encryption is per the Key of the HD asset owner, such as the User who owns the laptop or the server (or the DBA, if it is a DB storage). One may argue that the Key of the HD asset owner has a dotted-line mapping to the CISO, as all assets are eventually mapped to the IT Security dept (NB: this argument fails for Bring Your Own Device [BYOD] devices). Even with such an argument, neither the App Admin nor the User who is running the Application has any control over the Key that is used to protect the data. This ambiguity is shown in
Legal and Compliance Question: From the technical question stated above, the Legal and Compliance question is: in the event of a breach, who is accountable? The User who lost her/his data, or the Application whose usage was compromised, or the Asset owner, or the IT Dept? The disassociation of Accountability (for a Data Loss) from Authority (whose Data it is, and who should have the authority to take sufficient protections to safeguard the data) is a well known legal dispute between who suffered the loss and who becomes responsible.
The above stated Legal and Compliance question, in addition to being an apparently obvious common sense failure, maps to specific language of compliance violation and then Statutory violation, as shown below in Table 2 (see Appendix A for language recital and explanations).
A key point to observe is how close, if not exact, [a] the role of the HD owner becomes to that of a Property Thief (data being intangible property); [b] the role of the IT Dept (CISO) becomes to that of an Accessory to the Theft crime (if the CISO claims that the CISO delegated authority to the HD owner to hold or safekeep the User data, without an explicit written consent from the User who owns the data); and [c] how ineffective the sign-in time click consent provided by the User is, as the consent would be viewed as a good faith approval, not to be exploited for a specific data ownership re-assignment from the User owner to the HD owner.
The argument presented in this invention is that it is safer, cleaner and desirable to let the rightful owner of the Data, namely the User, control the safekeeping of his/her own Data, instead of the current practice of delegating the safekeeping authority to the HD owner and consequently facing all these legal and compliance disputes.
Table 2: Deleted. See Drawings file for Table 2.
All of these violations are rooted in a simple engineering mishap: the owner of the Data (the User or the Application) is not given the ability to protect his/her data. In the legal dispute, the Real Party in Interest (i.e., the User) would have a Claim against the Party who failed to protect the data, namely the HD Asset owner.
The problem is further compounded when the underlying platform is not On-Prem but Cloud hosted. With a Cloud hosted platform, the Cloud is storing the Logs. Whose Key should be used for protection: the Cloud's own operational or Admin key, or the Client master key, or a hybrid? Although the Service Agreement between the Cloud platform stakeholders and the Client may have several indemnities, the end question of who suffered the data loss versus who did not take sufficient measures to prevent the data loss remains.
The technology solution presented in this invention circumvents this problem by providing technology directly in the hands of the Application developer (aka, the User), who can protect his/her data as required.
The data owner thus becomes the data protector, eliminating the legal disputes and compliance violations.
The next problem arises from the Key reset policies. Key reset is a critical compliance directive. Different types of Keys (for different Identities) undergo different reset policies. User Identities undergo a fixed time period based (e.g., 60-day) Key reset. The App Admin may undergo a different Key reset policy, as the Admin account may be a Service Account. The HD Asset Key (such as the BitLocker Key) may never be reset on a time rotation basis; instead, the BitLocker Key may be reset only after a HD asset loss or compromise.
This disparity in Key reset policies creates another Compliance violation. The data owner (User Identity) would expect a Key reset on a predictable, preset time period. Whereas the data protector (the laptop owner, whose Key is being used by BitLocker) may not have reset its Key, as the laptop HD may not have been lost or stolen in several years. In the event of a breach, the dispute re: "why wasn't the Key reset periodically" (as the User Identity would expect) will lead to a Compliance violation.
While there is no set standard for BitLocker to periodically reset its key, an internet search led to the following result
The Application (and hence the User running the Application with the User's own data) may undergo repeated crashes, due to repeated security attacks. Each of these crashes will lead to a Heap dump onto the HD. One would expect that, post each breach, the Key of the HD encryption should change. But the HD encryption key may never change, as it is designed to change only after the HD is compromised. Unless compromised, the HD will keep on preserving potentially 100's or 1000's of Heap dumps, each providing valuable Ciphertext to the attacker. If the Keys were changed post each breach, the ciphertext available to the Attacker would only be for ONE security breach, and not a cascade of 100's or 1000's of breaches.
The next deficiency is how the HD, storing potentially a large number of Heap dumps, aids Cryptanalysis. The expectation that post each Breach the Key shall be reset is not maintained at the HD encryption key stage. Therefore, the HD (when it is stolen or compromised) is found with a large, and sometimes very large, number of Heap core dumps, instead of one Heap dump.
Finding a large or very large sample of Ciphertext makes the cryptanalysis much simpler than if the HD had only one Heap dump. Thus, the current industry practice helps the Attacker and harms the Data owner. With one Heap dump, cryptanalysis may not even be feasible, as one Heap instance may resemble the "one time pad" characteristics of cryptography. Table 3 shows summary stats of how much data a typical encrypted HD asset may contain. The larger this number is, the greater the volume of the Ciphertext, and the easier it would be to Cryptanalyze and break the encryption.
Table 3: Deleted. See Drawings file for Table 3.
With this invention's proposed technology, the Application is directly in charge of selecting and resetting the encryption Key. Therefore, post a Heap dump, the Application can and will change its Key. There shall be no Key level correlation between the Heap dump instance 1 and Heap dump instance 2. This makes the task of the Attacker (i.e., Cryptanalysis) harder, and the task of the Data owner (who designs and uses various encryption algorithms) easier.
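The per-breach key rotation described above can be sketched as follows. This is a minimal illustration only; the class and method names (HeapKeyManager, on_heap_dump) are hypothetical, not part of any specified API.

```python
import os

class HeapKeyManager:
    """Hypothetical sketch: the Application owns its encryption key and
    rotates it after every Heap dump, so no two dumps share a key."""

    def __init__(self):
        self.key = os.urandom(32)  # 256-bit Application-held key
        self.dump_count = 0

    def on_heap_dump(self):
        # A dump occurred: retire the old key immediately and mint a new one,
        # so Heap dump N and Heap dump N+1 have no Key-level correlation.
        old_key = self.key
        self.key = os.urandom(32)
        self.dump_count += 1
        return old_key

mgr = HeapKeyManager()
k1 = mgr.on_heap_dump()   # key that protected dump 1, now retired
k2 = mgr.on_heap_dump()   # key that protected dump 2
```

Because each dump is ciphered under an independent key, the attacker's ciphertext sample per key stays at one dump, rather than accumulating across hundreds of dumps.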
Securing the memory at the Kernel level has been addressed by AMD and Intel CPUs [10]. These works are complementary to our research, not competitive. They share the same goal, namely protection of the data. But they address a different time segment of the problem, namely when the data is resting (at the memory), whereas the problem we address is when the data is being used. By encrypting the computer memory, what is accomplished (by the Intel and AMD CPUs) is protection of the memory from the attacker's access. This protection essentially provides another form of Data at Rest protection, where the data is resting at the memory.
The problem addressed by CPU and Kernel level encryption is Data at Rest protection, and not Data in Use protection. Data in Use occurs as and when the said memory is required to participate in a computer program: when the program reads the sensitive data, operates upon the sensitive data, and then writes the sensitive data back into the memory, for a subsequent access from another part of the program. During these Data in Use steps, encrypting the memory at the Kernel level or by CPU hardware is not applicable, as the Data must be able to participate in the computer program's arithmetic and logical operations, and hence be available in Plaintext.
There are other key differences between the CPU & Kernel level encryption [ref1] and our work. These differences are as follows:
Data in Use security solutions are seldom noted from a software perspective. An analyst search regarding this matter reported no prior results.
While source code scanners are overwhelmingly common and popular, the category of security issues and exposures from Data in Use has not been addressed in the commercial source code scanners.
Likewise, the OWASP community [7], as well as MITRE CWE [8] and the NVD (National Vulnerability Database), have not addressed the Data in Use class of attack surface issues from a code level solutioning perspective. See for example the OWASP top-10 vulnerabilities for 2017 and 2021, shown below (Table 4), and the mapping between them. None of these vulnerabilities address Data in Use exposures and attack surfaces. One may argue that OWASP A04 Insecure Storage is a broad category for the same, but it is too broad and is not specific to Data in Use. Moreover, it is a security defect category; it does not specify what the source code should or should not do to reduce the Data in Use risk or attack surface.
Likewise, CWE 244 is a category for Heap storage security breaches; however, how to prevent the Heap from receiving the sensitive data in the first place from the source code (which is the focus of this invention) remains non-specific. The solution for Data in Use security breaches from a source code perspective remains unaddressed.
Table 4: Deleted. See Drawings file for Table 4.
Data in Use security issues are studied in the context of Heap Security, where the research community has recognized the Risks and Exposure from sensitive data residing in the Heaps. See [1, 2]. Several Operating System level protections of memory have been proposed; e.g., on Windows systems the VirtualLock function can lock a page of memory to ensure that it will remain present in memory and not be swapped to disk [5]. However, the Heap is not the only place where Data in Use breaches may occur. Log files as well as core dumps of stack traces are equally vulnerable spots for Data in Use breaches. Furthermore, the treatments (prevention or reduction) of Data in Use breaches are not focused at the source code level; instead, they focus on better management of the Heap.
Data Security has been extensively addressed, with the primary focus being Data in Motion and Data at Rest security. Data in Use is relatively less addressed. Breaches due to Data in Use are captured as part of the Data Loss Prevention (DLP) technologies, often embodied in Hard Drive Encryption (e.g., BitLocker [3]). One way to view the current invention is that it is an approach to a software based solution to the DLP problem. Another way to view the current invention is that it is an approach to prevent the Heap (or other memory, or a disk mapped version of memory) from retaining footprints of the sensitive data in the first place.
This invention presents a technology solution to the deficiencies presented in Section 1. The technology permits the Application to directly protect its sensitive data with the use of Encryption, Data Masking, or other approaches (instead of relying on a downstream artifact such as the Hard Drive with the HD's own encryption). The selection of the specific approach (to encrypt or to data mask) is a decision taken by the Application and the User of the Application.
The Heap carries the sensitive data, as is the fundamental computing model. However, depending on the time instant, the Heap either carries [State P] a Plaintext version of the sensitive data, or [State C] a Ciphertext version of the sensitive data.
One expects "State C" for the majority of the execution time span of the program, and "State P" for the minority. A security attack during the "State C" time span causes no Data loss, since the Heap has a Ciphertext version of the data. A security attack during the "State P" time span may disclose the Plaintext onto the Heap (which in turn would be protected by HD protection, as in the current industry practice, see Sections 1.1, 1.2), but the State P time span is much smaller compared to the State C time span.
The basic computing model (around a specific data item, which could be any data, but for our focus it is Sensitive Data) includes an Add, a NOT or complement, and a conditional Jump. (All other execution steps can be built on top of these 3 instructions; this is outside the scope of this invention, but is a basic result in Automata theory.) To execute these instructions, one cannot operate on the data in Ciphertext; instead, the data must be in Plaintext.1

1 This is an area of research in Cryptology. If encryption algorithms can be developed that permit correctness preserving Addition, Complement and Logical operations entirely in the Ciphertext space (without having to first convert the data to Plaintext to carry out the respective operations in the Plaintext space), then the Heap can always store Ciphertext, and the sensitive data can be 100% safe, as it would never have to be converted to Plaintext. This topic, an interesting research area in cryptology, is outside the scope of this invention.
To carry out the program's intended arithmetic and logical operations, the program must operate on Plaintext version of the variable, and cannot expect to operate on the Ciphertext version of the variable. Exposure of sensitive data in Plaintext during those specific FLOPs computation steps (when the sensitive data is being operated with arithmetic and logical operations) is inevitable. This is a theoretical limitation, and cannot be avoided.
However, when the program is not explicitly performing any arithmetic or logical operation on the sensitive data, there is no reason to keep the sensitive data lying around in the Heap in Plaintext format. The program can, immediately after a sensitive data item has been operated on with the required arithmetic and logical operations, convert the sensitive data from Plaintext to Ciphertext. Then, some time later in the code, when the same sensitive data is needed again for another arithmetic or logical operation, the code can decrypt and convert the data from Ciphertext back to Plaintext.
This process of toggling back and forth between Plaintext and Ciphertext is the opportunity for Attack Surface Reduction for Data in Use. This opportunity must be traded against the cost of exploiting it, which the below subsection further elaborates.
This is a key idea this invention proposes, with of course a large number of unique optimizations built in combination with this idea.
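The State C / State P toggling can be sketched as follows. This is a minimal illustration under stated assumptions: the XOR cipher is a placeholder standing in for whichever cipher the Application selects, and the charge_account function and ACC-12345 value are hypothetical examples.

```python
import os

KEY = os.urandom(16)

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Placeholder cipher for illustration only; a real deployment would
    # use whatever cipher or masking scheme the Application chooses.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

encrypt = decrypt = xor_bytes  # XOR is its own inverse

# State C: the sensitive value sits in memory only as Ciphertext.
account_no_ct = encrypt(b"ACC-12345", KEY)

def charge_account(amount: int) -> str:
    # State P begins: decrypt just before the value must be operated on...
    plaintext = decrypt(account_no_ct, KEY)
    receipt = f"charged {amount} to {plaintext.decode()}"
    # ...and State P ends when the operation completes: the plaintext
    # local goes out of scope and only the Ciphertext copy remains.
    return receipt
```

The Plaintext exposure is confined to the body of charge_account, so an attack at any other instant finds only Ciphertext in the Heap.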
Detection of Sensitive Data variables is purely a Lexical operation, i.e., a String search. Given one or more sensitive data names, our tool scans the input program to detect all Lexical occurrences of the sensitive data names. The output of the Detection phase consists of:
It is critical to distinguish between Declaration and Initialization, as a Declaration does not assign any value to the sensitive data, while an Initialization may do so. Plaintext to Ciphertext conversion is needed only after the sensitive data has obtained a value.
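The lexical Detection phase can be sketched as a plain string scan. This is an illustrative sketch, not the invention's Tool; the function name and the sample code lines are hypothetical.

```python
import re

def detect_occurrences(source: str, sensitive_names):
    """Sketch of the Detection phase: a purely lexical (string) scan that
    reports every (line_no, code_line) pair where a sensitive name occurs.
    Classifying Declaration vs. Initialization is left to later phases."""
    occurrences = {name: [] for name in sensitive_names}
    for line_no, line in enumerate(source.splitlines(), start=1):
        for name in sensitive_names:
            # Word-boundary match so e.g. "Account_No2" is not a hit.
            if re.search(rf"\b{re.escape(name)}\b", line):
                occurrences[name].append((line_no, line.strip()))
    return occurrences

code = """int Account_No;          // Declaration: no value yet
Account_No = read_input();          // Initialization: value assigned
int Temp_Account = Account_No;      // value propagates to Temp_Account
"""
hits = detect_occurrences(code, ["Account_No"])
```

In this toy input, Account_No is lexically matched on all three lines, and the Temp_Account assignment illustrates why value-receiving intermediate variables must also be tracked.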
While scanning for sensitive data, our Tool also searches for other variables that may receive a value of the sensitive data. For example, if Account_No is a sensitive data, and Temp_Account is an intermediate variable, and if there is an instruction Temp_Account=Account_No then
The Detection Tool reports every Lexical occurrence of a sensitive variable. Suppose SD1 is a sensitive data name. The Detection Tool output can be symbolically represented as a String of 2-tuples (Li, codei), where Li is the line number and codei is the code line at that line number where the sensitive data SD1 is lexically matched.
The simplest and most naïve Mitigation would be to insert, for every Li, two new lines of code: one at the immediately preceding line (to Li) with "Decrypt (SD1)", and another at the immediately succeeding line (to Li) with "Encrypt (SD1)". This would make the code represented as
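This naive insertion can be sketched as a line-list rewrite. The sketch assumes hypothetical Decrypt()/Encrypt() helper calls supplied by the Application; the sample code lines are illustrative only.

```python
def naive_mitigation(lines, sensitive, occurrence_line_nos):
    """Naive rewrite: wrap every lexical occurrence line of the sensitive
    variable with a Decrypt call before it and an Encrypt call after it."""
    out = []
    occ = set(occurrence_line_nos)
    for no, line in enumerate(lines, start=1):
        if no in occ:
            out.append(f"Decrypt({sensitive});")  # Plaintext window opens
            out.append(line)                      # the Use line itself
            out.append(f"Encrypt({sensitive});")  # Plaintext window closes
        else:
            out.append(line)
    return out

code = ["a = SD1 + 1;", "print(b);", "b = SD1 * 2;"]
rewritten = naive_mitigation(code, "SD1", [1, 3])
```

Each occurrence line becomes a three-line Decrypt / Use / Encrypt sandwich, which is exactly what the optimization discussion below seeks to avoid doing wastefully.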
However, this approach may not be optimal. If the code line distance (in FLOPs) between any pair of successive (i, i+1) Lexical occurrences is less than the FLOPs requirement for "Encryption + Decryption", then encrypting immediately after the previous occurrence and decrypting immediately before the next occurrence may be wasteful, failing to reduce any Data in Use attack surface exposure. The process of Encryption and Decryption may create new Data in Use exposure, cancelling any reduction of Data in Use exposure between the ith and (i+1)st Lexical occurrences of the sensitive data SD1.
If two or more lines (all lexically matching SD1) are close enough, it may be more secure not to encrypt and decrypt the sensitive data repeatedly between those close-by lines.
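The close-enough grouping can be sketched with a simple threshold rule. This is a sketch under an assumption: line distance stands in for the FLOPs distance, and the Encrypt+Decrypt cost is given as an equivalent line count.

```python
def merge_close_occurrences(occ_lines, cost_e_plus_d):
    """Group successive occurrence line numbers whose gap is below the
    Encrypt+Decrypt cost: within a group the data stays in Plaintext
    (toggling there would be wasteful); between groups it is re-encrypted."""
    groups = [[occ_lines[0]]]
    for prev, cur in zip(occ_lines, occ_lines[1:]):
        if cur - prev < cost_e_plus_d:
            groups[-1].append(cur)   # too close: same Plaintext window
        else:
            groups.append([cur])     # far enough: encrypt in between
    return groups

# Occurrences at lines 10, 12, 13, 40, 41, with an E+D cost of 5 lines
groups = merge_close_occurrences([10, 12, 13, 40, 41], 5)
```

Here the occurrences collapse into two windows, [10..13] and [40..41], so Encrypt/Decrypt pairs are placed only around each window rather than around all five lines.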
Therefore, a code structural analysis is required to optimally place the new Encryption and Decryption lines of code. See Subsection 2.5 below for 3 classes of algorithms to do so: Bin Packing, Bubble Sort and Dynamic Programming (Random Injection).
There is an opportunity (of Encrypting immediately post Use, and Decrypting immediately pre Use) to reduce the Data in Use attack surface, with the caution that the cost of Encryption and Decryption may introduce new Data in Use attack surfaces and may therefore outweigh the benefits. This tradeoff (between the "opportunity" and the "caution") is navigated by a number of Mitigation algorithms designed in our research. The classification tree is shown in
Front and Tail Trims: Mitigation can be at the Trims, either the front or the tail end of the code. These are marked as "1" and "2" in FIG. 5. These Mitigation techniques are the simplest, and yet may be the most beneficial. They exploit the observation that a large code body (N lines) may use only a small fragment of code lines (M lines) specific to sensitive data, with M<<N. Therefore, it is likely that a significant part of the code, completely unrelated to the sensitive data, exists either before the very first use (hence the name "front Trim") of the sensitive data, or after the very last use (hence the name "tail Trim") of the sensitive data. Removal of the sensitive data at the Trims may provide the easiest and most effective data attack surface reduction.
Disjoint Code Segments with respective Sensitive Data: This is marked as "6" in
Inside Bulk of the Code Body: This is the most complex and the largest class of Mitigation algorithms. It includes three methods—“Inside out”, “Outside in” and “Random interject”—which are marked as “3”, “4” and “5” respectively.
When a security Attack is encountered and the Heap dumps its contents to the HD, a single time instant snapshot of the Heap is written onto the HD. Whether a particular sensitive data item (SD1) has had one or 1000s of Lexical occurrences is not reflected in the Heap dump: there will be only one value of SD1 written onto the HD. However, which point in time's value of SD1 is written may vary. SD1 may be found in Plaintext at the time of attack, which is rare, but may happen if the Attack time matched exactly when SD1 was being operated on with arithmetic and logical operations and hence SD1 was in Plaintext in the Heap (unfortunate, but inevitable). Or, SD1 may be found in Ciphertext at the time of attack, depending on which Lexical line was being executed at the time of attack.
Therefore, SD1's exposure is either ONE Ciphertext instance, or the Plaintext itself. The Plaintext option is out of our scope, as that is protected by the HD encryption, which is the current industry practice. (We can also consider an extreme case, where all Lexical Occurrences undergo the (Decrypt, Use, Encrypt) option, in which case there will be no Plaintext in the Heap, except within the D+E FLOPs window itself.) But the Ciphertext option is more practical and likely to happen. It shows that the Cryptanalyst will only get ONE Ciphertext instance, which is often not enough to break the code.
The consequence of this finding is that, using our proposed technology of in-line encryption and decryption, the burden on the Attacker (i.e., the cryptanalyst) is much higher, and often impossible to meet. Indeed, if the Attacker can only get 1 Ciphertext instance, breaking in is nearly as impossible as with a "one time pad".
This finding may permit the usage of cheaper and simpler encryption algorithms, such as a Substitution Cipher or a Transpose Cipher, or even data masking solutions or Obfuscation. These are algorithmically less expensive to execute, making the D and E values lower, and also the Storage space requirements smaller.
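As one illustration of such a cheap scheme, a byte-level Substitution Cipher can be sketched as follows; the function names and seed are hypothetical, and this is not a recommendation of any specific cipher by the invention.

```python
import random

def make_substitution_tables(seed: int):
    """Sketch of a byte-level Substitution Cipher: a seeded random
    permutation of the 256 byte values, plus its inverse table.
    Cheap to apply, so the D and E costs stay low."""
    rng = random.Random(seed)
    forward = list(range(256))
    rng.shuffle(forward)               # forward[b] = substituted byte
    inverse = [0] * 256
    for plain, sub in enumerate(forward):
        inverse[sub] = plain           # inverse undoes the substitution
    return forward, inverse

def mask(data: bytes, table) -> bytes:
    return bytes(table[b] for b in data)

fwd, inv = make_substitution_tables(seed=2024)
masked = mask(b"SSN-123-45-6789", fwd)
```

Such a cipher would normally be weak against frequency analysis, but, as argued above, the attacker here typically holds only ONE short ciphertext instance per key, which leaves little material to analyze.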
See Section 8, and 8.7 in particular, for a detailed discussion.
This section presents the most basic form of Clustering algorithms (for Invasive Mitigation to reduce Data in Use Risk), where the Clustering is a local pairwise swap.
This type of clustering is not as complex as a global clustering algorithm or heuristic. It operates on a simple pairwise swap policy, which is a rudimentary form of local clustering. But this rudimentary form of local clustering is a well known and well practiced concept in Computer algorithms, namely in Bubble Sort.
We deploy the same bubble sort class of operation, except of course that the concept of "numerical ascent or descent" is absent in the current context. Instead, the goal is to bring the Use lines (of the sensitive data) closer together along the lines of code, to reduce the number of potentially repetitive encryptions and decryptions one may have to deploy, in the context of Heap encryption and Data in Use Risk reduction.
The Initial Use Pattern is at the bottom. The 3 Swap Steps are shown progressively higher along the Y-axis. Each leftward blue arrow represents a Swap. The Swap moves irrelevant code out of the Use windows. The process to determine which code lines are relevant to a specific sensitive data item is based on PDG analysis, and is discussed in Sections 6.1, 6.2.
In the most ideal situation, all the dots relating to a specific sensitive data item will be a continuous stream of lines, starting at some line Lbegin and ending at Lend, where the lines between Lbegin and Lend relate only to processing of the specific sensitive data and to no unrelated activities.
This is of course the ideal situation. The programmer may not write code in this format. The programmer may perform a variety of unrelated tasks between Lbegin and Lend, which may be perfectly legitimate software development activities, but which are harmful from a Data in Use Risk exposure perspective.
The pairwise swap approach presented in this Section removes code unrelated to the sensitive data from inside the [Lbegin:Lend] range and pushes those code lines to outside of the [Lbegin:Lend] range.
For a sensitive data item SDi, let us consider two adjacent successive Use lines: the jth Use and the (j+1)st Use. A Program Dependency Graph (PDG) is constructed,3 rooted at the jth Use line and terminated at the (j+1)st Use line. This is a local PDG, and not a global PDG. It only focuses on the lines of code (specific to the sensitive data) between the jth and the (j+1)st Use of SDi.

3 A PDG combines a Data Flow Graph (DFG) and a Control Flow Graph (CFG) [12]. We assume reader familiarity with how to construct a DFG and a CFG, and how to combine them into a PDG.
If a particular line (l) of code, between the jth and the (j+1)st Use of SDi, is not mappable to the local PDG, that demonstrates that this particular line (l) is not relevant to the sensitive data SDi, and hence should be swapped out of the code window between the jth and the (j+1)st Use of SDi. This process is repeated until all such lines (l) that have no PDG-marked relevance to SDi are moved out of the said window, and the window consists only of code lines that are directly relevant to SDi.
Pairwise swap is a doubly nested algorithm, with a Status Flag (Boolean) set to False at the beginning of a full iteration. The Status Flag is turned to True if a swap is successful, indicating that there was at least one line of code which was irrelevant to the sensitive data (per the PDG) and this line of code was swapped out of the window between the jth and the (j+1)st Use of SDi. If the Status Flag is True, then a new iteration is commenced, so as to accommodate the potential transitive effect whereby one swap may invite a follow-up swap.
The doubly nested algorithm starts from the last Use line (specific to the sensitive data) of the code, and then works progressively towards the front of the code. For example, if the code has 100 Use occurrences of SDi, then the algorithm would start from the 100th occurrence and swap towards the 99th occurrence, then from the 99th occurrence to the 98th occurrence, and continue until the (2nd to 1st) occurrence pair.
If, during this full (i.e., complete) pass from the 100th occurrence towards the 1st, even a single swap is successful, then the algorithm initiates another full pass. The algorithm stops only when a full pass across the entire code lines results in no swap.
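A minimal sketch of this rear-to-front pass follows. The PDG relevance test is abstracted into a predicate, and the sample lines are hypothetical; the local PDG analysis is what guarantees that a swapped line truly has no dependency on the sensitive data, so the swap preserves program semantics.

```python
def pairwise_swap_pass(lines, relevant):
    """Bubble-sort style pairwise swap: lines judged irrelevant to the
    sensitive data (per the `relevant` predicate, standing in for the
    PDG oracle) are bubbled out of the Use windows. Full passes repeat
    while the status flag records at least one successful swap."""
    lines = list(lines)
    swapped_any = True
    while swapped_any:                          # outer loop: full passes
        swapped_any = False                     # the Status Flag
        for i in range(len(lines) - 1, 0, -1):  # rear towards front
            if relevant(lines[i]) and not relevant(lines[i - 1]):
                # bubble the irrelevant line out of the window
                lines[i], lines[i - 1] = lines[i - 1], lines[i]
                swapped_any = True
    return lines

code = ["use SD1", "log stats", "use SD1", "tick", "use SD1"]
clustered = pairwise_swap_pass(code, relevant=lambda l: "SD1" in l)
```

After the passes converge, the three Use lines of SD1 sit contiguously and the unrelated lines have been pushed outside the Use window, so one Decrypt/Encrypt pair suffices for the whole cluster.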
This algorithm is an exact dual of the above. It starts from the 1st Use being moved towards the 2nd Use, then the 2nd to the 3rd, the 3rd to the 4th, . . . , up to the (max−1)st to the max-th; it then starts back at the 1st to the 2nd, and keeps doing so until a full iteration completes with no further swap possible. This is shown in
We propose the following combining mechanism when two or more sensitive data are involved. See
This Section presents a number of heuristics for clustering the Use patterns of two or more sensitive data items. Section 3 presented the (pairwise) Swap based local clustering algorithms for a single sensitive data item, also generalized for multiple sensitive data. However, those are all local clustering algorithms. They lack the global view. They operate on a bottom-up basis, not a top-down basis.
This Section presents top-down clustering algorithms for two or more sensitive data values.
The Front Pass algorithm starts from the very first line of the code, and detects the earliest occurrence of any one of the sensitive data items SDi. The objective of the algorithm is to identify all lines of occurrence of SDi across the code. In practical code, there may be intermediate variables which relate to SDi, and those intermediate variables may in turn relate to some other intermediate variables (or may even relate to other sensitive data SDj, SDk, SDm and so on).
In a flow graph (data and/or control) sense, this is transitive closure formation. Let T*(SDi) be the Set of all variables (sensitive or other) that have a control flow and/or data flow relationship to SDi. The algorithm computes T*(SDi) and then lexically detects all Use occurrences of any member of the T*(SDi) Set across the code lines. The output of this lexical scan is a List, whose pth element is denoted by OccSDi(p).
Note that the T*(SDi) computation does not require any PDG formation. The algorithm is invasive, to the extent that it swaps code lines in and out of their current places. But the algorithm does not entail a total redesign of the code. It moves existing code lines around, but does not redesign the code.
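The T*(SDi) computation can be sketched as a plain graph reachability walk. The edge map below is a hypothetical, precomputed adjacency of data/control-flow relationships; no full PDG is built.

```python
def transitive_closure(seed_var, edges):
    """Compute T*(SDi): every variable reachable from the sensitive
    variable through data-flow and/or control-flow edges, given as a
    precomputed adjacency map {var: [related vars]}."""
    closure, frontier = {seed_var}, [seed_var]
    while frontier:
        var = frontier.pop()
        for nxt in edges.get(var, ()):
            if nxt not in closure:     # visit each variable once
                closure.add(nxt)
                frontier.append(nxt)
    return closure

# Temp_Account receives SD1's value; Display_Str receives Temp_Account's
edges = {"SD1": ["Temp_Account"], "Temp_Account": ["Display_Str"]}
related = transitive_closure("SD1", edges)
```

Every member of the resulting set is then treated as sensitive for the lexical scan, so value propagation through intermediates does not leak past the Mitigation.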
Finally, the pth and (p+1)st Use occurrences in the Lexical scan List are collapsed together to become successive lines, instead of being far apart. Thus, OccSDi(p) and OccSDi(p+1) become adjacent code lines.
The Rear Pass algorithm is dual to the Front Pass algorithm. It starts from the tail end, the last line of the source code, and scans backwards towards the front. The differences between these two algorithms are shown in the below
Section 2.5 listed a number of Mitigation algorithms, of which the Front and Tail Trim algorithms and Code Partition in entirety algorithms are presented in this Section.
The notion of Birth, Use and Death of sensitive data is presented.
The life-span of a sensitive data item is the time lag between its Birth and Death. During this life-span the sensitive data is in memory, which can be compromised to gain unauthorized access. Likewise, if the code aborts, then the Heap contents may be dumped onto a HD, which is another avenue for unauthorized access to the sensitive data.
A reduction of the life-span proportionately reduces the amount of time the sensitive data stays in memory, and hence proportionately reduces the attacker's opportunity to gain access to the sensitive data while it is in use. This section presents such methods:
For every sensitive data item, its Birth point in the Lines of Code is detected by lexical analysis. If the Birth is too far ahead in the code timeline, compared to its very first Use, then the Birth is postponed in the code timeline to bring it immediately prior to the first Use.
This is shown in
Table 5: Deleted. See Drawings file for Table 5.
Depending on how the "Read" is constructed (e.g., from a Database, a Message Inbox, a command-line input, or a file read), a syntax validation may be necessary. This is a one-time manual operation, after which the updated code requires no further update or maintenance.
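Since the Table 5 pseudo code resides in the Drawings file, the Just-in-Time Read transform can be hedged here as a toy line-moving sketch (hypothetical helper names; indices are assumed to be supplied by the lexical analysis described above):

```python
# Hypothetical sketch of Just-in-Time Read: postpone the Birth line of a
# sensitive variable so it sits directly above its first Use.
def just_in_time_read(code_lines, birth_idx, first_use_idx):
    """Move the Birth at birth_idx to immediately before first_use_idx."""
    if first_use_idx - birth_idx <= 1:
        return list(code_lines)             # already adjacent: nothing to do
    lines = list(code_lines)
    birth = lines.pop(birth_idx)
    lines.insert(first_use_idx - 1, birth)  # index shifts left after the pop
    return lines
```

A real implementation would also verify that no line between the old and new positions depends on the moved Birth.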
For every sensitive data item, its Death is spotted by the lexical analyzer. If no specific Death of a sensitive variable is inserted in the code, then the Death is aligned to the last line of the code. Depending on the programming language syntax, the Death may be the release of a pointer, or the setting of a pointer to Null. If a Death is too far delayed in the code timeline, compared to the very last Use, then the Death is preponed in the code timeline to a point immediately after that last Use.
This is shown in
Table 6: Deleted. See Drawings file for Table 6.
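Since the Table 6 pseudo code resides in the Drawings file, the Expedited Death transform can be hedged here as a toy sketch (hypothetical helper names; indices are assumed to come from the lexical analyzer):

```python
# Hypothetical sketch of Expedited Death: an existing but late Death line is
# moved up to immediately after the last Use; if no explicit Death exists,
# one is inserted there.
def expedite_death(code_lines, last_use_idx, death_idx=None,
                   death_stmt="del sensitive  # Death"):
    lines = list(code_lines)
    if death_idx is not None:
        death_stmt = lines.pop(death_idx)   # reuse the existing Death line
    lines.insert(last_use_idx + 1, death_stmt)
    return lines
```

The exact form of the Death statement (pointer release, Null assignment, `del`) depends on the target language, as noted above.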
Every sensitive data item must be wiped securely; absent that, there is always a possibility that the sensitive data resides somewhere in memory and eventually reaches the wrong hands. Secure Wipe is an industry standard for disks, where three or more overwrites of the sectors are performed; see DoD 5220.22-M [9] for the defense-sector implementation, and NIST SP 800-88 for the commercial-sector counterpart. However, when the data is on a Cloud platform, a question arises as to how to securely wipe the data when the physical disk sectors are not within Client control.
This invention presents a secure overwrite, instead of a secure wipe, where the overwritten content is a random bit stream (i.e., white noise). As an example, if the Expedited Death operation is a pointer release or a pointer being set to Null, then the Secure Wipe first overwrites the memory the pointer references with a random bit stream, so that whichever physical location the underlying system may have been writing the data to, that very same physical location now gets overwritten with a random bit stream.
Note that without the Secure Wipe step, a mere release of the Pointer does not erase the sensitive data. The sensitive data stays in memory; it is only the Pointer that loses track of where the sensitive data might be. In such a case, an attacker with a full dump of the memory (or, likewise, the Log file after a core dump) may be able to access the sensitive data.
This invention presents an explicit overwrite, with random bit-stream, so that the sensitive data is permanently wiped.
Table 7: Deleted. See Drawings file for Table 7.
Depending on how the "Secure Wipe" is constructed (e.g., calling a library, or writing code within the Application itself), a syntax validation may be necessary. This is a one-time manual operation, after which the updated code requires no further update or maintenance.
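As the Table 7 pseudo code is in the Drawings file, a minimal hedged sketch of the random-bit-stream overwrite idea follows. It assumes the sensitive value is held in a mutable buffer (an immutable string cannot be wiped in place):

```python
import os

# Sketch of the Secure Wipe idea: before releasing a sensitive buffer,
# overwrite its bytes in place with a random bit stream (white noise), so the
# memory location no longer holds the secret.
def secure_wipe(buf: bytearray) -> None:
    """Overwrite the mutable buffer in place with random bytes."""
    noise = os.urandom(len(buf))
    for i in range(len(buf)):
        buf[i] = noise[i]

secret = bytearray(b"4111-1111-1111-1111")
secure_wipe(secret)
# The buffer length is preserved, but the plaintext is gone.
```

In a managed runtime, copies made by the garbage collector or by interning may survive such a wipe; this sketch addresses only the buffer the program controls.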
The above approaches focus on a single sensitive data item, to reduce its Attack Surface. In most code, sensitive data do not work in isolation; instead, two or more sensitive data items may interact with each other, either in computation or in business-logic decisions. Computation example: two sensitive data items being added to produce a third. Business-logic example: if a certain sensitive data value is below a threshold, send an alert message.
Opportunity for Code Partitioning: For two or more Sensitive Data Elements (which interact in computation and business logic), how can their relative "Use" lines of code be restructured so that (a) they are brought closer in the execution timeline, and (b) sensitive data that are not relevant to one section of the code can be removed from the Heap altogether, i.e., partitioning the code?
While manual re-programming (using human intuition) can always be done, this invention presents an automated way of doing so. See Table 8 below.
Table 8: Deleted. See Drawings file for Table 8.
The above algorithm is an automated way of reshuffling the code, so as to cluster the code by sensitive data elements' usage and minimize the life-span of each sensitive data item, thereby reducing their respective Attack Surfaces.
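Because the Table 8 pseudo code is in the Drawings file, the clustering idea can be hedged here as a toy sketch. The per-line sets of touched sensitive variables are assumed inputs, and a stable sort groups the Use lines per variable; a real reshuffle must additionally respect data and control dependencies, which this sketch deliberately ignores:

```python
# Hypothetical sketch of clustering code lines by sensitive-data usage so that
# each variable's Use occurrences land on successive lines.
def cluster_by_sensitive_use(lines, touches):
    """touches[i] is the sorted tuple of sensitive vars used on lines[i];
    () means the line uses none and is kept, stably, after the clusters."""
    order = sorted(range(len(lines)),
                   key=lambda i: (touches[i] == (), touches[i]))
    return [lines[i] for i in order]
```

The stable sort preserves the original relative order within each cluster, which approximates the "move lines, don't redesign" property stated earlier.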
Section 2.5 presented a taxonomy of Mitigation algorithms. Algorithms marked "1", "2" and "6" are presented in this invention, while the other three Algorithms (3, 4 and 5) are parked for a follow-up research report. A summary of these algorithms can be stated with a dual perspective (dual meaning reversing the thought process: instead of looking to Mitigate the Risk, advise the programmer to write code in a way that the Mitigation is already adopted in the source code), which then becomes a programmer's guideline for writing better quality code (in the Data Security Risk reduction context). These Best Practices are intuitive guidelines. Like any Best Practice, they are ball-park, sweeping statements, and may not be precise or mathematical. They are engineering rules of thumb, reported below. Note that only those sensitive variables explicitly cited by the Application User are taken into consideration. The program may have other sensitive variables, but their Data in Use exposure is not relevant to this context.
The notion of "Atomicity" is well known in RW conflict resolution and NMI (non-maskable interrupt) processing. A similar concept appears to be effective when writing secure code (in the context of Data in Use attack surface reduction). What is really being done is: [1] identify and track the sensitive data variables, [2] capture the data flow and control flow to build the PDG, [3] if two or more sensitive data items are related in the PDG, cluster them and treat the entire cluster as a single PDG, [4] codify the PDG nodes per the dependency arrows, and [5] ensure no code extraneous to the PDG is interjected, so that the entire PDG cluster is coded as a single atomic segment. It is expected that this atomic segment would be single-entry and single-exit.
It is possible to extend this analysis to denotational semantics or mathematical programming techniques, which are outside the scope of this invention.
The Invasive Mitigating algorithms revert the source code to the design whiteboard. They start with identifying the list of sensitive variables, and then grouping them into disjoint sets. The Data Flow Graph (DFG) and Control Flow Graph (CFG) are fundamental programmatic constructs, and the Program Dependency Graph (PDG), which combines the DFG and CFG, is utilized in the current context.
Singleton Sensitive Data: If a sensitive variable SDi is not related (by any arithmetic or logical [AOL] operations) to any other sensitive data, then SDi is a singleton and its PDG can be constructed by including only the arithmetic and logical operations in which SDi participates. The PDG construction must accommodate transitive relations: if SDi has an AOL relationship with an intermediate variable (not necessarily sensitive), Interim_Var1, then all AOL operations of Interim_Var1 must also be included in the PDG of SDi. Likewise, if Interim_Var1 in turn relates to Interim_Var2, then the PDG for Interim_Var2 must also be included in the foregoing PDG construction. This process of transitive inclusion shall continue until no new interim data is detected to have any AOL relationship to any one or more of the data items in the PDG, at which time the PDG construction shall be deemed complete.
Multiple Related Sensitive Data: If two or more sensitive data items, e.g., SDi, SDj, SDk and SDm, are related by AOL operations, then their PDGs must be constructed together. All intermediate variables that have any AOL relationship to any one or more of the [SDi, SDj, SDk, SDm] set of sensitive data elements must also be progressively grouped together into the same PDG. The stopping criterion is the same as in the singleton case: the process of transitive inclusion shall continue until no new interim data is detected to have any AOL relationship to any one or more of the data items in the PDG, at which time the PDG construction shall be deemed complete.
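The disjoint-set grouping described above can be sketched with a union-find structure. This is a hedged illustration under an assumed input representation (each AOL operation listed as the tuple of variables it touches), not the invention's implementation:

```python
# Sketch of grouping sensitive and intermediate variables into disjoint sets
# for PDG construction: each AOL operation relates the variables it touches,
# and union-find merges them transitively until no new member joins, matching
# the stopping criterion stated above.
def group_variables(aol_ops):
    """aol_ops: iterable of variable tuples, one per arithmetic/logical op."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for op in aol_ops:
        roots = [find(v) for v in op]
        for r in roots[1:]:
            parent[r] = roots[0]            # union all variables in this op

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())
```

Each resulting set corresponds to one PDG to be constructed together; singleton sensitive variables fall out naturally as one-element (plus interim-variable) groups.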
Once the PDG is constructed, the Mitigation algorithm codifies the PDG nodes starting from the “START” (node marked “Entry” in
Practical Considerations (Scope for Optimization): An example best illustrates the process. Suppose in
Therefore, (a) while in principle the PDG involves solely the sensitive data and intermediate variables, and none other, and furthermore (b) while it may appear that the process of coding the PDG nodes is non-ambiguous, in reality certain PDG nodes may reference abstract, high-level tasks whose implementation may require a large number of lines of code. In such situations, it is not straightforward to determine whether or not there are opportunities for further Data in Use Risk reduction by deploying the encrypt-after and decrypt-before methods.
For these reasons, we formulate the Job Shop scheduling problem below for codifying the PDG implementation, first as a general-purpose Job Shop scheduling problem with an arbitrary number of machines, and then constrained to a Job Shop scheduling problem with a single machine.
Below is one (among many) formulation of the optimization problem.
This problem appears similar to the Job shop scheduling problem, extensively studied in Computer Science and Operations Research.
Job shop scheduling problem: A set of jobs or tasks (each represented as a node), interconnected by a precedence graph, each node having zero or more predecessor nodes and zero or more successor nodes, and each node carrying a weight indicative of its execution time. The question is how these jobs can be scheduled for execution by a set of independent machines so as to optimize the total execution time (aka Makespan). With the minimize-Makespan objective, below are a few known results.
Next, we generalize from the minimize Makespan objective. Below are a few well known variants of the Optimization objective for a single machine execution.
Of these, the “minimizing the cost of lateness” (Optimization Objective 4) is the closest to our context, with the below objective.
Section 6.2 mapped the current problem to Job shop scheduling and narrowed it to a single machine scheduling problem, with the Optimization being reducing the cost of Lateness, where Lateness is defined as a step function.
In this subsection, we review known results in this specific narrowed context. We review three special cases, two of which do not fit our problem context, and a third which may fit and may even be an exact fit. This exact-fit problem is known to be solvable in polynomial time by Lawler's algorithm, which is reviewed below.
Lawler's algorithm is repeated below to solve our PDG scheduling problem. However, this algorithm fails in its optimality claim when more than one machine (e.g., a quad-CPU) is involved. Subsection 6.3 provides a solution for such situations.
Practical consideration: The execution time may be approximated by counting the Lines of Code, as a rough measure of FLOPs scaled to a per-line basis. Therefore, it may not be necessary to compute a PDG node's execution time in FLOPs; instead, a count of the number of lines of code it would take to implement the PDG node's function may suffice.
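Lawler's rule for single-machine scheduling under precedence constraints, minimizing the maximum lateness cost, can be sketched as follows. Per the practical note above, each PDG node's processing time is approximated by its line-of-code count; the job identifiers and cost function here are illustrative assumptions:

```python
# Sketch of Lawler's algorithm for 1 | prec | max f_j(C_j): repeatedly pick,
# among jobs with no unscheduled successor, the one whose cost of finishing
# last (at the current total time T) is smallest, and schedule it last.
def lawler(jobs, p, succ, cost):
    """jobs: list of ids; p: id -> processing time (e.g., LOC count);
    succ: id -> set of successor ids; cost: (id, completion_time) -> penalty."""
    remaining, schedule = set(jobs), []
    T = sum(p[j] for j in jobs)
    while remaining:
        # candidates: jobs every successor of which is already scheduled
        cands = [j for j in remaining if not (succ.get(j, set()) & remaining)]
        j = min(cands, key=lambda j: cost(j, T))
        schedule.append(j)
        remaining.discard(j)
        T -= p[j]
    schedule.reverse()   # the sequence was built back to front
    return schedule
```

With a step-function lateness cost, as narrowed in Section 6.2, this yields an optimal single-machine sequence; the multi-machine caveat discussed in the text still applies.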
The entire optimality claim in Section 6.2 relies on a single-machine assumption, which may not be valid since modern processors have more than one CPU. To make matters more complex, some of the library calls involved in PDG node execution may reference out to other Servers with a completely different execution platform. A database-related library call is one such example. The library reference executes a piece of code, short or perhaps medium sized, on a different platform, namely the Database Server. The Database Server is a separate machine, unrelated to the host Server where the main program is running.
The DB library call may be part of either an input-parameter preparation or an output-value write operation. Other examples of third-party machines include a Web Service reference that is manifested on a remote Web Server, or a transactional batch processing system that interacts with another Server altogether, and so on.
The question therefore becomes: is the optimality claim unrealistic for practical embodiments? This invention proposes a serialization (of library references) programming concept that alleviates this problem, reinforces the single-machine assumption, and hence reinstates the optimality assertion.
To reinforce the single machine assumption, we evaluate why and where the Library references occur.
Methods to support Use Cases 2 and 3 are what we term Library Reference Serialization. As an example, with the PDG presented in FIG. 17, if the two variables n and s are sensitive, and the two references [read (n)] and [write(s)] are complex enough to require remote-machine execution, then
The net effect of the Library Reference Serialization is that, within the PDG nodes' execution timeline, the single-machine assumption is reinforced. No back-and-forth Library references to outside machines are involved. Hence, optimality using Lawler's algorithm is reestablished.
The 5-step process is described below.
In practice, a developer environment may implement these steps, as and when commercialization occurs along this line. A tool like Visual Studio may embody these development-environment steps, to assist secure code development.
Intermediate variables used by the programmer require special treatment when it comes to Data in Use protection. Ideally, the programmer would not use any intermediate variable at all, and all processing would operate directly on the sensitive data variables. However, intermediate-variable usage is a personal style for many programmers. Furthermore, in some specific cases, intermediate variables provide an easier method to implement certain code logic, such as interchanging values between two variables. Therefore, one must be able to accommodate intermediate variables.
Below
Our scope is “sensitive data” as/when held by the intermediate variables. Therefore, “what it holds” is unquestionably sensitive data. We enumerate the “how”, “where”, “why”, “when activated”, “declared how” and “destroyed how” attributes of the intermediate variables below.
We finalize the net impact to the 3 phases of Data in Use Risk reduction, namely Detection, Mitigation (Non Invasive) and Mitigation (Invasive).
Sensitive data, as lexically quoted and provided by the stakeholders, is always (100%) accurately detected. The detection process is deterministic and precise.
Intermediate variables can be either: (a) Completely irrelevant (never carries sensitive data), or (b) carries sensitive data. Data Flow Graph (DFG) and Control Flow Graph (CFG) analysis puts “sensitive or not” mark on intermediate variables.
However, the DFG/CFG process may have run-time ambiguities. Hence, the sensitive-or-not identification of an intermediate variable may be imprecise: False Positive and/or False Negative possibilities exist. It is not a precise and deterministic process, and manual code review and inspection are recommended.
Non invasive mitigation for sensitive data is: (a) Gap threshold driven, and (b) precise and deterministic.
However, non invasive mitigation for intermediate (sensitive data holding) variables may have the following characteristics:
Invasive mitigation for sensitive data utilizes the overall Program Dependency Graph (PDG) and optimizations thereof. It is precise and deterministic. It utilizes the Gap threshold based treatment, like in the non invasive mitigation approach.
However, invasive mitigation for intermediate (sensitive data holding) variables may have the following characteristics:
First, we present two metrics to capture the Data in Use risk exposure and the cost of mitigation. Then, we demonstrate that a third attribute is missing from our assessment, namely: how strong do the Crypto algorithms need to be? Is it always a fixed and static level, or does it vary based upon the amount of Ciphertext exposure available under the specific application circumstances? We demonstrate that it is the latter, and not the former. In doing so, we derive certain variability options in the selection of the Crypto algorithms. The term 'Crypto' in this context should be interpreted broadly; it may not always be Cryptography, it may even be a Data Protection Method such as Masking or Obfuscation.
We compute cumulative Exposure as the sum of all gap lines between pairwise Uses of the sensitive data.
Next, we compute cumulative Cost (as incremental execution delay) as the sum of all additional encryption and decryption operations required.
Together, these two Metrics can be defined as
Combined Metric: Because both Exposure and Cost accrue in the same direction, i.e., an increase in either indicates deterioration and a decrease in either indicates improvement, we combine the two metrics using addition, multiplication and exponent joiners (as opposed to subtraction and division). A weight (w) is used to provide a scale factor between them. Two arithmetic options are proposed below:
Finally, the combined Metric needs to be normalized, as follows:
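The metric definitions above can be hedged into a small sketch. The additive joiner and the worst-case normalization constant are illustrative assumptions consistent with the definitions given, not the invention's exact formulas:

```python
# Sketch of the Exposure and Cost metrics and an additive combined metric
# with weight w, normalized against a caller-supplied worst-case value.
def exposure(use_lines):
    """Sum of gap lines between consecutive Uses of one sensitive variable."""
    return sum(b - a - 1 for a, b in zip(use_lines, use_lines[1:]))

def cost(num_enc_dec_pairs):
    """Each inserted encrypt+decrypt pair counts as 2 unitary cost items."""
    return 2 * num_enc_dec_pairs

def combined(exp_val, cost_val, w=0.5, worst=1.0):
    """Additive joiner, then normalization by a chosen worst case."""
    return (w * exp_val + (1 - w) * cost_val) / worst
```

Both metrics shrink as mitigation improves, so the normalized combined value also shrinks, matching the same-direction argument above.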
The process of inserting an Encryption after the previous Use and a Decryption before the immediate next Use increases the Data in Use attack surface, or Risk, by (E+D) in each step, where E and D are in lines of code. In our review of the Java crypto library, E and D are each in the range of ~20, i.e., approximately 20 high-level (Java) lines of code. The Cost metric counts each Encryption and Decryption as a unitary cost item, and increases the Cost value by 2 every time the pair of Encryption and Decryption is invoked.
But, the scale factor question remains unaddressed. Is the Encryption (or, Decryption) deployed for (a) [Data in Use] Heap the same complexity as the Encryption (or, Decryption) deployed for (b) [Data at Rest] Database fields or records, (c) [Data in Motion] TLS packets? If yes, then a common scale factor can be uniformly used for all encryptions and decryptions. If not, then disparate sets of encryption(s) and decryption(s) are required.
This subsection develops the scale factor model for encryption and decryption. We demonstrate that 2 sets of encryption and decryptions are required—High complexity and cost, versus Low complexity and cost.
The amount of data available for cryptanalysis is a function of two parameters: (a) how much Ciphertext data is available (and disclosed to the attacker) at a single snapshot (the Snapshot Size), and (b) for how long (the Duration) the attacker may be able to covertly keep collecting snapshots before a security appliance detects the intrusion, cuts off the covert listen-in capability, and resets the encryption Keys.
This is shown in
We discuss below that, for the different categories of Data Security, namely Data in Use, Data at Rest and Data in Motion, the Exposure Volume (EV) varies significantly. Hence, for low EV values a lighter/simpler cryptographic solution is adequate, whereas for high EV values the regular industry-standard heavyweight crypto algorithms are required.
This "one" data value, multiplied by a small covert listen-in window, creates a rather small EV. If one extends this argument to include all the sensitive data values (not just "one"), then the EV gains a multiplier, but that multiplier is not in the millions, or tens of thousands, or even thousands. Most programmers do not deal with more than a handful of sensitive data items in a code. As an example, a database may have millions of records, but when a program reads the records, it usually reads one or a few at a time, so the number of sensitive data items inside the code will be few.
Next, we review the Data at Rest Use Cases, as shown in
Third, we review the Data in Motion Use Cases (see
Consequently, the EV can be anywhere from large to small, depending on the Use Case.
Combining the observations in the 3 sets of Use Cases above, we note that there are two tiers of EV values. Tier 1 is with large EV, and it applies to Data at Rest and certain extranet Data in Motion applications. Tier 2 is with small to tiny EV, and it applies to Data in Use and certain intranet Data in Motion applications. This is shown in
The lower part shows the EV-range-differentiated solution space, aka "Should be". Data in Use is clearly in the lightweight (or even masking) category. Data at Rest is clearly the opposite, i.e., the Heavyweight category. Data in Motion is in the "it depends" category. For extranet communication, it is clearly in the Heavyweight category. For intranet communication within the same server (i.e., intra-Server), it is clearly in the Lightweight or Masking category. For intranet but server-to-server communication (i.e., traffic on the LAN, but inside the firewall), one could argue either way: Heavyweight encryption may be deployed for safety's sake, or lightweight and masking solutions may be deployed, to benefit the performance and cost arguments.
The second algorithm is a One-Time Pad, which is theoretically known to be the most secure, as long as certain conditions are met, one of which is one-time application only. This condition is ideally suited for Heap data exposure, as each exposure carries a one-time data snapshot only, and the One-Time Pad is replaced immediately after exposure. The One-Time Pad, specific to each sensitive data item, is selected each time by a pseudorandom number generator with a byte length sufficient to exceed the byte length of the sensitive data variable.
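The per-variable One-Time Pad described above can be sketched minimally as follows (the function names are illustrative; `secrets` stands in for the pad-generating pseudorandom source mentioned in the text):

```python
import secrets

# Sketch of a per-variable One-Time Pad: a fresh random pad, as long as the
# sensitive value, is XORed in to mask and XORed again to unmask. The pad is
# never reused, matching the one-time-application condition stated above.
def otp_mask(data: bytes):
    pad = secrets.token_bytes(len(data))            # fresh pad every time
    masked = bytes(d ^ p for d, p in zip(data, pad))
    return masked, pad

def otp_unmask(masked: bytes, pad: bytes) -> bytes:
    return bytes(m ^ p for m, p in zip(masked, pad))
```

Because XOR is its own inverse, unmasking with the same pad restores the original value exactly; discarding the pad after one use is what preserves the OTP security argument.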
The third algorithm is Light Weight Cryptography, an evolving standard from NIST with original design from Korean sources. It follows an asymmetric key cryptography structure, and is an industry standard in algorithmic strength.
The fourth algorithm is a fogger-based Data Masking solution. Normal industry practice in Data Masking is a one-way flow: take the sensitive data and convert it to an unintelligible variant so that it cannot be easily comprehended if disclosed. The unintelligible version can be used for Testing, placeholder databases, etc. However, our application is a two-way flow: we not only need to transform the sensitive data to an unintelligible format, we also need to implement the reverse, i.e., bring the unintelligibly formatted data back to the original value of the sensitive data. This two-way flow is very similar to encryption and decryption, except it is much lighter weight and faster to execute.
This invention proposes a Shuffle and Unshuffle algorithm as follows. Depending on the byte size of the sensitive data, the shuffled items are either at byte level or at bit level. If the sensitive data has n bytes, then the Masking and Unmasking algorithms are as follows. The algorithm is similar to the Fisher-Yates algorithm [11].
Table 9: Deleted. See Drawings file for Table 9.
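Since the Table 9 pseudo code resides in the Drawings file, a hedged byte-level reconstruction of the Shuffle/Unshuffle idea follows. The key-seeded generator and helper names are illustrative assumptions; the structure mirrors Fisher-Yates [11], with unmasking replaying the same swap sequence in reverse:

```python
import random

# Sketch of a Fisher-Yates-style Shuffle (mask) and Unshuffle (unmask) over
# the n bytes of a sensitive value, with the swap sequence derived from a key.
def _swap_sequence(n, key):
    rng = random.Random(key)                 # key-derived, reproducible
    return [(i, rng.randint(0, i)) for i in range(n - 1, 0, -1)]

def mask(data: bytes, key: int) -> bytes:
    buf = bytearray(data)
    for i, j in _swap_sequence(len(buf), key):
        buf[i], buf[j] = buf[j], buf[i]      # Fisher-Yates swap
    return bytes(buf)

def unmask(masked: bytes, key: int) -> bytes:
    buf = bytearray(masked)
    for i, j in reversed(_swap_sequence(len(buf), key)):
        buf[i], buf[j] = buf[j], buf[i]      # undo the swaps in reverse order
    return bytes(buf)
```

Each swap is its own inverse, so replaying the key-derived sequence backwards restores the original bytes; the masked value is a permutation of the original, which is what makes this far lighter than encryption.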
The Key construction, as shown in
Two (out of the three) Encryption algorithms (Substitution Cipher and One-Time Pad) are Symmetric; their respective Keys are generated from the (User EX-OR Organization) created Key. The third algorithm is Asymmetric, with both a Public Key and a Private Key being required, which are generated by an arithmetic operation over the User Key. See
Normally and ordinarily, use of a Substitution Cipher would be entirely unacceptable due to the ease with which it can be broken. Likewise, use of a One-Time Pad may also be considered impractical, due to the repetitive nature of the encryption process around a specific data item. However, as shown in