The present disclosure generally relates to forensic data recovery and more particularly to accessing hidden data files.
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Existing methods to recover hidden data files from computational storage devices are tedious and time-consuming. Solid state drives (SSDs or SSD when discussing a singular solid state drive) are complex computational storage devices that use NAND flash memory chips. Such memory devices have a high data storage capacity; however, they are difficult to manage because existing data in a chip cannot simply be overwritten, but rather must first be erased, and then written on again. Furthermore, data must be erased in large blocks; specifically, on the order of a million bytes, but may also be written in smaller blocks, on an order of thousands of bytes. These hindrances pose a problem for forensic analysts and others seeking to recover hidden data files because they impose significant time constraints on the process and thus potentially prevent successful data recovery from ever being accomplished.
In an effort to make the above memory chips easier to use and the hidden data more accessible, a flash translation layer (FTL) of software is included in the SSD to handle the details of deleting old data and writing new data, thus taking the burden of this task away from the host operating system. A memory array of the SSD has two spaces: a logical block address (LBA) space and a physical block address (PBA) space. These spaces are overlaid spaces. The LBA space is the data structure that the host computer sees and comprises the sectors in which data is stored. The PBA space is the memory provided by flash chips, and is generally up to 20% larger than the LBA space, depending on the particular configuration. The LBA space is mapped into the PBA space by the FTL software.
A legacy hard disk drive (HDD), a similar device, has a simpler configuration in that its LBA and PBA spaces essentially have a corresponding size ratio of one-to-one, with the PBA being only a fraction of a percent larger than the LBA.
The extra PBA space in the SSD is referred to as over provisioning space and has several purposes. These purposes include storing SSD firmware (which is the firmware that runs the SSD's internal microcontroller and which is typically 100 K to 200 K bytes, though this range is not provided to be limiting), NAND flash wear leveling, bad block management, housing FTL management tables, and garbage collection.
Wear leveling is a type of software algorithm that distributes the reading and writing activities evenly among the flash chips on the SSD. This is needed because NAND flash exhibits rapid wear out mechanisms, resulting in extreme fragmentation of the data written to the SSD. FTL management tables comprise memory storage for the LBA/PBA mapping table, which can be gigabytes in size. They also include other general task, or housekeeping, information. Garbage collection is a software algorithm which collects and erases currently unused but previously written areas in flash memory in order to prepare clean sections for future writes and avoid delays in erasing. All of the above functions are well known.
A problem occurs as a side effect in the operation of the SSD, which is that forensically valuable data gets moved out of the LBA space to where it cannot be accessed via the host computer interface. The management complexity and non-one-to-one LBA and PBA memory spaces of the SSD (as contrasted with the legacy HDD) further impede successful data recovery and force individuals who want to recover the data, such as forensic analysts, to attempt to reverse engineer the algorithms in the SSD to obtain the hidden data, which can be very time consuming.
Referring now to
The data that is stored on SSDs and HDDs is in aggregations known as sectors. A sector is typically 512 bytes in length, but may be larger. The NAND flash chips include this form of data storage, considered sectors, and so, in most cases, integral numbers of LBA sectors are stored in physical flash memory pages.
There are currently two main methods to read the over provisioning space as a first step to recover hidden data. The first method consists of using custom read commands over the host interface port of the SSD. However, these commands are not standardized and are proprietary, and do not even exist for most SSD models, or are password protected or encrypted. These characteristics make it hard for individuals to access the hidden data.
The second method of reading the over provisioning space consists of reading the flash chips directly. This can be done by removing the flash chips and inserting them into a reading device that reads and stores their contents. Removing the flash chips involves desoldering them from the memory array. Reading may also be accomplished by electronic means while the flash chips remain installed on the SSD circuit.
There are several remaining steps currently required to recover hidden data from an SSD. After the flash memory chips are read and the data is saved as a PBA image, the LBA space is read over the host computer interface and the data is saved as an LBA image. Next, the flash memory errors in the PBA image are corrected, if possible. The error correction information is deleted from the image, leaving only data. The PBA and LBA images are then compared, noting which sectors match in each image. Finally, the unmatched PBA sectors are separated and stored as hidden data.
The described existing process has several issues. First, the format of the data within a flash memory chip varies greatly depending on the make and model of the SSD, as well as the make and model of the memory chip. This format must be determined before any hidden data can be recovered, which takes time. Additionally, the error correction code (ECC) that is used to prevent flash memory bit errors is typically unknown and is not published by the SSD manufacturer. It therefore may not be possible to correct errors in the raw data that is read from the flash chips. Further complicating matters is that the amount of data may be huge, reaching as high as the terabyte range. This means that the algorithms that are used must be of low complexity and have a low run time for large data sets; otherwise, processing could take days, weeks, or longer to complete. The standard approach to this problem is to represent each sector with a short hash value. For example, the use of an eight-byte hash value for each sector would reduce the data storage requirements by about 98.4% compared to handling the raw 512-byte sectors. Provisions would need to be made to handle hash collisions.
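As a non-limiting illustration of the per-sector hash table described above, the following sketch hashes each 512-byte sector to eight bytes and computes the resulting reduction in storage. The use of BLAKE2b and the names SECTOR_SIZE, HASH_SIZE, and hash_sectors are assumptions made for illustration only; no particular hash function is named for this step.

```python
import hashlib

SECTOR_SIZE = 512  # bytes per sector, as stated in the text
HASH_SIZE = 8      # bytes per hash value, as stated in the text


def hash_sectors(image: bytes) -> list[bytes]:
    """Return one short hash per whole sector in the image.

    BLAKE2b is an assumed stand-in; any short hash would serve here.
    """
    return [
        hashlib.blake2b(image[off:off + SECTOR_SIZE], digest_size=HASH_SIZE).digest()
        for off in range(0, len(image) - SECTOR_SIZE + 1, SECTOR_SIZE)
    ]


# Sample data: 2048 bytes, i.e. four whole sectors.
image = bytes(range(256)) * 8
table = hash_sectors(image)

# Fraction of storage saved by keeping hashes instead of raw sectors.
reduction = 1 - (len(table) * HASH_SIZE) / len(image)
print(f"{len(table)} hashes, storage reduced by {reduction:.1%}")
```

Because identical sectors produce identical hashes, and distinct sectors may occasionally do so as well, a byte-for-byte comparison of the source sectors would still be needed to rule out collisions.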
However, even with the above hash optimization, the LBA and PBA images still bear no relation to one another due to the wear leveling algorithm used in the SSD, which significantly fragments stored files. This means that an LBA image file that is stored in contiguous sectors will be distributed over a large area of the PBA image with no simple mapping relationship, that mapping relationship being different for every make and model of SSD, as well as changing as a given SSD is used. This means that the matching process could potentially be an order of n² (noted as O(n²)) process, which would be quite slow.
To speed up the process, the LBA and PBA hash tables must be sorted. The tables are huge, containing millions of entries; however, there are sorting algorithms of order n (noted as O(n)), such as radix sort, which may be used to sort them quickly.
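As a non-limiting illustration, the linear-time sorting step may be sketched as a least-significant-digit radix sort over fixed-width hash keys. The sketch assumes 64-bit integer hash values paired with sector offsets; the name radix_sort_records is hypothetical.

```python
def radix_sort_records(records, key_bits=64):
    """Sort (hash, offset) records by hash using an LSD radix sort.

    One stable bucket pass is made per key byte, so the run time grows
    linearly with len(records) for a fixed key width.
    """
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            # Distribute on the current byte of the hash key.
            buckets[(rec[0] >> shift) & 0xFF].append(rec)
        # Concatenating buckets in order preserves stability.
        records = [rec for bucket in buckets for rec in bucket]
    return records


table = [
    (0xDEADBEEF00000001, 0),
    (0x01, 512),
    (0xDEADBEEF00000001, 1024),
    (0xFF00, 1536),
]
sorted_table = radix_sort_records(table)
```

Records with equal hash values keep their original relative order, which keeps duplicate hashes adjacent in the sorted table.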
After the LBA and PBA hash tables are sorted, a searching process must examine each LBA hash, find it in the PBA hash table, compare the LBA and PBA source data byte-for-byte (to avoid the effects of hash collisions), and mark the sectors as matched, if they so are. This searching process is difficult to perform because there are many duplicate hashes in the PBA hash table, which prevents the use of a fast binary search. Therefore, an intermediate PBA index table is created that contains each hash value only once, along with a count number for the duplicate hashes in the PBA hash table, as well as an index into said table.
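As a non-limiting illustration, the intermediate index table may be sketched as three parallel lists built from the sorted PBA hash table: the unique hash values, the number of duplicates of each, and the position of the first occurrence in the hash table. The names build_index and find, and the use of small integers as hash values, are assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(sorted_hashes):
    """Collapse a sorted hash table into (unique hashes, counts, first positions)."""
    keys, counts, firsts = [], [], []
    for pos, h in enumerate(sorted_hashes):
        if keys and keys[-1] == h:
            counts[-1] += 1          # another duplicate of the current hash
        else:
            keys.append(h)           # first time this hash is seen
            counts.append(1)
            firsts.append(pos)
    return keys, counts, firsts


def find(index, h):
    """Binary-search the unique hashes; return (first position, count) or None."""
    keys, counts, firsts = index
    i = bisect_left(keys, h)
    if i < len(keys) and keys[i] == h:
        return firsts[i], counts[i]
    return None


pba_hashes = [3, 7, 7, 7, 9, 12, 12]   # already sorted
index = build_index(pba_hashes)
```

Because each hash value appears only once in the index, a binary search is unambiguous, and the returned position and count identify the full run of duplicates to be compared byte-for-byte.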
The above process works well when the errors in the PBA image have been corrected. If they have not been corrected, then any bit errors in the PBA sector source data will skew hash values and prevent them from matching to corresponding LBA sectors.
Given the foregoing, what is needed are methods which facilitate identifying and recovering data that is normally hidden in NAND flash memory arrays in SSDs and is normally inaccessible using host computer interfaces, without having to reverse engineer the algorithms in the SSD, using a hash value that is tolerant of some small percentage of bit errors in the source data.
This Summary is provided to introduce a selection of concepts. These concepts are further described below in the Detailed Description section. This Summary is not intended to identify key features or essential features of this disclosure's subject matter, nor is this Summary intended as an aid in determining the scope of the disclosed subject matter.
A method of isolating hidden data in a solid state memory system is disclosed. The method comprises obtaining a logical block address (LBA) image from the memory system, obtaining a physical block address (PBA) image, and determining whether an error exists in the PBA image and correcting the error. The method also comprises calculating an error tolerant cyclic redundancy check (ETCRC) value on each sector of the LBA image and building a search tree indexed on the ETCRC value. For each sector in the PBA image, the method also comprises computing an ETCRC value and searching for the ETCRC value in the LBA search tree. If the ETCRC value is found, the method compares the cyclic redundancy check (CRC) values of the LBA and PBA sectors. The method also provides for outputting to an output file the PBA sector as hidden data if either the ETCRC value is not found in the LBA search tree or the CRC comparison fails.
Another method of computing a total hash function of an array of values which limits the impact of a change in the array of values to one subfield in the hash value is also disclosed. The method comprises dividing the array of values into a number of sections, computing a hash function over each section, and concatenating the computed hash function values to create a total hash function of the array of values, wherein a change in one of the array of values is reflected as a change in the total hash function in only one subfield.
A hidden data determination system for locating hidden data in a memory array of a solid state device is also disclosed. The system comprises an interface to access memory space on a memory device and a graphics processing unit to create a plurality of hash values for a logical block address (LBA) memory space of the memory device and to create a plurality of hash values for a physical block address (PBA) memory space of the memory device to identify data hidden within the PBA memory space from view of the LBA memory space. The graphics processing unit creates a PBA index table associated with the plurality of PBA hash values, compares the plurality of LBA hash values with the PBA index table, identifies matches of any of the plurality of LBA hash values and any of the plurality of PBA hash values resulting from the step of comparing, and identifies data hidden within the PBA memory space when data identified in the PBA index table has no identified match with any of the plurality of LBA hash values. The system also includes a display device to show the hidden data.
A more particular description briefly stated above will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting of its scope, the embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Embodiments are described herein with reference to the attached figures wherein like reference numerals are used throughout the figures to designate similar or equivalent elements. The figures are not drawn to scale and they are provided merely to illustrate aspects disclosed herein. Several disclosed aspects are described below with reference to non-limiting example applications for illustration. It should be understood that numerous specific details, relationships, and methods are set forth to provide a full understanding of the embodiments disclosed herein. One having ordinary skill in the relevant art, however, will readily recognize that the disclosed embodiments can be practiced without one or more of the specific details or with other methods. In other instances, well-known structures or operations are not shown in detail to avoid obscuring aspects disclosed herein. The embodiments are not limited by the illustrated ordering of acts or events, as some acts may occur in different orders and/or concurrently with other acts or events. Furthermore, not all illustrated acts or events are required to implement a methodology in accordance with the embodiments.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope are approximations, the numerical values set forth in specific non-limiting examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Moreover, all ranges disclosed herein are to be understood to encompass any and all sub-ranges subsumed therein. For example, a range of “less than 10” can include any and all sub-ranges between (and including) the minimum value of zero and the maximum value of 10, that is, any and all sub-ranges having a minimum value of equal to or greater than zero and a maximum value of equal to or less than 10, e.g., 1 to 4.
Embodiments disclosed herein are directed to a method that facilitates forensic data recovery. The embodiments include a method which facilitates a process wherein the physical block address (PBA) and logical block address (LBA) memory spaces are examined to identify and collect hidden data without having to reverse engineer the algorithms in the solid state drives (SSDs) (SSD is used herein as an abbreviation for a single solid state drive), by providing a hash value that is tolerant of a small percentage of bit errors in the source data.
The term “hidden data” and/or the plural form of this term are used throughout herein to refer to data stored on an SSD that is not accessible through the SSD's host computer interface, and the like.
It is possible that two sectors will have the same data, and therefore the same ETCRC values, and will indicate the same node of the tree. To track all the sectors with the same ETCRC, each tree node contains a linked list of associated LBA sector information, comprised of the byte offset of each sector within the LBA image 110, and a Cyclic Redundancy Check (CRC) of each sector, typically a standard 32-bit CRC such as the CRC32 (commonly used in Ethernet).
The inventors have found that this process works well, with the stated assumption that the errors present in the PBA image 120 have been corrected. If they have not been corrected, then any bit errors in the PBA sector source data will foul the hash and CRC values and prevent matches to corresponding LBA sectors.
More specifically, if bit errors in the PBA image 120 are not corrected, hash values computed on corresponding LBA sectors will not match. What is needed is a hash value that is tolerant of some small percentage of bit errors in the source data. This operation is antithetical to the standard definition of a hash function, where the hash value should change greatly with only one bit change in the source data.
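As a non-limiting illustration of why a conventional hash is unsuitable in the presence of bit errors, the following sketch flips a single bit in a 512-byte sector and counts how many bits of a standard hash change. SHA-256 is an assumed stand-in for a conventional hash function, and the name bit_difference is hypothetical.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


sector = bytes(512)                  # an all-zero sample sector
damaged = bytearray(sector)
damaged[100] ^= 0x01                 # simulate one uncorrected bit error

h1 = hashlib.sha256(sector).digest()
h2 = hashlib.sha256(bytes(damaged)).digest()

changed = bit_difference(h1, h2)
print(f"{changed} of 256 hash bits changed after a single-bit error")
```

With a conventional hash, the two values bear no usable relationship to one another, so the damaged sector cannot be matched to its undamaged counterpart by hash alone.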
The ETCRC 161 can have a format and size suited to the task, not only eight bytes, but more or fewer, to accommodate various sizes of data to be hashed. Furthermore, the components of the ETCRC 161 can be other than 8 bits in length, as demanded by project requirements. Though the generator polynomial, x^8 + x^7 + x^4 + x^2 + x + 1, was chosen, other generator polynomials may be used. The inventors found that the generator polynomial produces an acceptable collision rate of approximately 1E-3.
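As a non-limiting illustration, an eight-byte ETCRC may be sketched by dividing a 512-byte sector into eight 64-byte sections, computing an 8-bit CRC over each section with the generator polynomial x^8 + x^7 + x^4 + x^2 + x + 1 (0x97 with the leading term dropped), and concatenating the results. The zero initial value, most-significant-bit-first processing, absence of a final XOR, and the names crc8 and etcrc are assumptions made for illustration.

```python
POLY = 0x97  # x^8 + x^7 + x^4 + x^2 + x + 1, leading x^8 term implied


def crc8(data: bytes, poly: int = POLY) -> int:
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR (assumed parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def etcrc(sector: bytes, sections: int = 8) -> bytes:
    """Concatenate one CRC-8 per equal-sized section of the sector."""
    size = len(sector) // sections
    return bytes(crc8(sector[i * size:(i + 1) * size]) for i in range(sections))


# A sample sector and a copy with one bit error in byte 100 (second section).
sector = bytes((i * 7 + 3) & 0xFF for i in range(512))
damaged = bytearray(sector)
damaged[100] ^= 0x10

clean_value = etcrc(sector)
damaged_value = etcrc(bytes(damaged))
differing = [i for i in range(8) if clean_value[i] != damaged_value[i]]
print(differing)  # only the subfield covering bytes 64-127 is affected
```

Because each subfield depends only on its own section, a bit error confined to one section can change at most one byte of the ETCRC.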
By using the ETCRC 161 as explained above, if there are uncorrected bit errors in a PBA sector, then the LBA and PBA ETCRC values 112, 122 will not match, but usually only in one or two bytes of the ETCRC value. This is handled by a process called puncturing, which involves selectively ignoring combinations of bytes within the ETCRC during the hidden data discovery process.
The discovery process is run the first time with the full ETCRCs as computed. Once the first pass is complete, the hidden data output file contains all the PBA sectors that do not appear in the LBA, but also many falsely mismatched sectors that are different simply because of a few bit errors. The hidden data set is then run through the process again, but with each hidden sector ETCRC value punctured, as shown in
Puncturing may be performed using combinations of one, two, or more bytes of the ETCRC. To accommodate sectors with just a few bit errors in one 64-byte subset of the 512-byte sector, a single byte of the eight in an ETCRC is set to zero. There are eight such ETCRC puncturing patterns. To accommodate sectors with bit errors in two 64-byte subsets, two bytes of the eight in an ETCRC are set to zero. There are C(8, 2)=28 such ETCRC puncturing patterns. This level of puncturing is typically good enough to recover the vast majority of hidden data in the presence of PBA errors, though higher levels could certainly be used.
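As a non-limiting illustration, the puncturing patterns may be enumerated as combinations of byte positions, and a pattern may be applied by setting the selected ETCRC bytes to zero. The names puncture_patterns and puncture, and the placement of an empty pattern first in the list, are assumptions made for illustration.

```python
from itertools import combinations


def puncture_patterns(n_bytes: int = 8, max_level: int = 2):
    """List the byte-position tuples to puncture: none, then singles, then pairs."""
    patterns = [()]  # empty pattern: no puncturing
    for level in range(1, max_level + 1):
        patterns.extend(combinations(range(n_bytes), level))
    return patterns


def puncture(value: bytes, pattern) -> bytes:
    """Return a copy of the ETCRC with the selected bytes set to zero."""
    out = bytearray(value)
    for i in pattern:
        out[i] = 0
    return bytes(out)


patterns = puncture_patterns()
singles = [p for p in patterns if len(p) == 1]
pairs = [p for p in patterns if len(p) == 2]

# Two ETCRC values that differ only in byte 1 agree once byte 1 is punctured.
lba_value = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
pba_value = bytes([0x11, 0xAB, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
agree = puncture(lba_value, (1,)) == puncture(pba_value, (1,))
```

With the defaults shown, the list holds the empty pattern, the eight single-byte patterns, and the 28 two-byte patterns.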
For convenience, the patterns used to puncture ETCRCs may be tabulated for use in the hidden data discovery process. As a non-limiting example,
The entire hidden data discovery process is shown in flowchart form in
Within the main loop, the first main task 303-305 is to put the AVL tree 113 into a compatible format, depending on the puncturing pattern selected. If the pattern is “empty”, meaning no puncturing is occurring (line 1 in the table of
The flowchart 700 continues in
An operation, at 308, closes the INPUTFILE and OUTPUTFILE files after all sectors are processed. If the PBA data is correct as supplied, a decision, at 309, terminates the process as no puncturing is required to complete hidden data discovery. If the PBA data is not correct as supplied, the PUNCTURE INDEX is incremented, at 310, and tested, at 311, for maximum value, terminating the process if so. Else, the INPUTFILE is closed, and the OUTPUTFILE is reopened as the new INPUTFILE, at 312, feeding the latest hidden data back into the process for further examination with a different puncturing pattern. This completes the description of the main loop in
First, an empty working tree is created, at 330, and the LBA image 110 is opened, at 331, as INPUTFILE. The top of the loop in this subroutine reads, at 332, a sector of data from INPUTFILE. The ETCRC and CRC values are computed, at 333. The ETCRC is searched for, at 334, in the tree. If it is not found, a node corresponding to the ETCRC is inserted, at 326, into the tree 113.
The node in the tree corresponding to the ETCRC then has the INPUTFILE byte offset and CRC of the sector stored, at 337, into the linked list. A decision block, at 338, at the end of the loop terminates the loop after the last LBA sector has been examined. The INPUTFILE is then closed, at 339.
The LBA AVL tree is now ready for use.
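As a non-limiting illustration, the tree-building loop may be sketched as follows. A Python dictionary stands in for the AVL tree, a Python list stands in for each node's linked list, and a simple sectioned checksum (section_sums) stands in for the ETCRC so that the sketch stays short; all three substitutions, and the names used, are assumptions made for illustration.

```python
import io
import zlib

SECTOR_SIZE = 512


def section_sums(sector: bytes) -> bytes:
    """Placeholder key: one checksum byte per 64-byte section (not the ETCRC itself)."""
    return bytes(sum(sector[i:i + 64]) & 0xFF for i in range(0, SECTOR_SIZE, 64))


def build_lba_tree(image, key_fn=section_sums) -> dict:
    """Map each sector key to a list of (byte offset, CRC32) entries."""
    tree = {}
    offset = 0
    while True:
        sector = image.read(SECTOR_SIZE)
        if len(sector) < SECTOR_SIZE:
            break                                  # last whole sector processed
        node = tree.setdefault(key_fn(sector), []) # insert node if key is new
        node.append((offset, zlib.crc32(sector)))  # record offset and CRC
        offset += SECTOR_SIZE
    return tree


# A three-sector sample image in which the first and third sectors are identical.
sector_a = b"A" * SECTOR_SIZE
sector_b = b"B" * SECTOR_SIZE
lba_tree = build_lba_tree(io.BytesIO(sector_a + sector_b + sector_a))
```

In this sample, the two identical sectors share one node whose list holds both byte offsets, mirroring the handling of sectors with the same ETCRC value.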
The node in the working tree corresponding to the punctured ETCRC then is updated, at 326, with a reference to the linked list from the original LBA AVL tree 113. The puncturing process can have the effect of combining two or more LBA AVL tree nodes 114. Rather than copying all the linked lists associated with those nodes into the working tree, references to the original LBA tree tables are stored, saving memory. A decision block, at 327, at the end of the loop terminates the loop after the last LBA node has been loaded into the working tree.
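As a non-limiting illustration, loading the working tree for a given puncturing pattern may be sketched as follows. A Python dictionary again stands in for each tree, and the sample keys, offsets, and CRC values are hypothetical. Each working-tree node stores references to the original lists rather than copies of them.

```python
def puncture(key: bytes, pattern) -> bytes:
    """Return a copy of the key with the selected bytes set to zero."""
    out = bytearray(key)
    for i in pattern:
        out[i] = 0
    return bytes(out)


def build_working_tree(lba_tree: dict, pattern) -> dict:
    """Re-index the LBA tree on punctured keys, keeping references to the original lists."""
    working = {}
    for key, sector_list in lba_tree.items():
        # Nodes whose keys become equal after puncturing are combined.
        working.setdefault(puncture(key, pattern), []).append(sector_list)
    return working


key_1 = bytes([1, 2, 3, 4, 5, 6, 7, 8])
key_2 = bytes([1, 2, 3, 9, 5, 6, 7, 8])   # differs from key_1 only in byte 3
key_3 = bytes([8, 7, 6, 5, 4, 3, 2, 1])

lba_tree = {
    key_1: [(0, 0x1111)],
    key_2: [(512, 0x2222)],
    key_3: [(1024, 0x3333)],
}
working = build_working_tree(lba_tree, pattern=(3,))
merged = working[bytes([1, 2, 3, 0, 5, 6, 7, 8])]
```

Here the first two nodes collapse into a single working-tree node that refers to both original lists, so no sector information is duplicated in memory.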
After isolation of the hidden data, commercial tools can be applied to identify interesting information, such as word processing documents, spreadsheets, videos, and images.
As disclosed above, the graphics processing unit 1520 may create a PBA index table associated with the plurality of PBA hash values. The graphics processing unit 1520 may also compare the plurality of LBA hash values with the PBA index table using a fast binary search, and identify matches of any of the plurality of LBA hash values and any of the plurality of PBA hash values resulting from the step of comparing. The graphics processing unit 1520 may also identify data hidden within the PBA memory space when data identified in the PBA index table has no identified match with any of the plurality of LBA hash values. Thus, as also disclosed above, creating the hash value for both the LBA and PBA memory spaces comprises creating an error tolerant cyclic redundancy check (ETCRC) table for both the LBA memory space and PBA memory space.
The disclosed embodiments are conformable to parallel processing on a graphics processing unit (GPU). Radix and merge sort algorithms exist that are readily available for GPU application. The ETCRC process is performed on each LBA and PBA sector independently and therefore may be parallelized. Furthermore, the matching process for each LBA sector may be parallelized with appropriate record locking mechanisms for access to individual records in the PBA index table. The embodiments may be designed to allow for extensive parallelism and commensurate acceleration.
As another non-limiting example, the embodiments disclosed herein may be used with the on-board chip reader adapter disclosed in U.S. Patent Application No. ______, which claims priority to U.S. Provisional Application 62/000,475 filed May 19, 2014, both of which are incorporated herein by reference in their entirety.
Even though the disclosed embodiments do not match the PBA and LBA sectors the same way that the SSD FTL software would, it does not matter because the output that is valuable is the unique, hidden PBA data, without regard for the PBA/LBA mapping relationship maintained by the FTL. Specifically, as a non-limiting example, for four PBA sectors containing data values A, B, B, and C, with the LBA showing sectors with values A, B, and C, an embodiment disclosed herein outputs as hidden data one sector with data value B. It does not matter what the PBA/LBA mapping was for that sector. It only matters that a sector with that value was recovered from the hidden PBA space. This is advantageous in that a determination of the PBA/LBA mapping relationship, which is different for most types of SSD and FTL algorithms, is not required.
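As a non-limiting illustration of the outcome described in this example, the following sketch treats sector contents as plain values and pairs each PBA sector with at most one unmatched LBA sector of the same value. It illustrates only the result, not the hash-based matching mechanism; the name hidden_sectors is hypothetical.

```python
from collections import Counter


def hidden_sectors(pba, lba):
    """Return PBA values left over after each is paired with one matching LBA value."""
    remaining = Counter(lba)       # LBA values not yet matched
    hidden = []
    for value in pba:
        if remaining[value] > 0:
            remaining[value] -= 1  # mark one LBA sector as matched
        else:
            hidden.append(value)   # no unmatched LBA sector holds this value
    return hidden


hidden = hidden_sectors(["A", "B", "B", "C"], ["A", "B", "C"])
print(hidden)  # ['B']
```

No mapping between individual PBA and LBA sectors is consulted at any point; only the values themselves are compared.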
Several general advantages of this invention include, but are not limited to, the following: hidden data, that is, data not accessible over the usual computer interface for the storage device, is discovered; only the most basic knowledge of the storage format is required, and no information about the FTL mapping between LBA and PBA is needed; knowledge of the error correction methods is optional, and the error tolerant cyclic redundancy check makes hidden data discovery possible in a reasonable time frame; and identification of the hidden data is accomplished in a reasonable amount of time, using a reasonable amount of storage that is approximately 10% of the size of the LBA image.
While various aspects of the present disclosure have been described above, it should be understood that they have been presented by way of example and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope of the present disclosure. Thus, the present disclosure should not be limited by any of the above described exemplary aspects.
In addition, it should be understood that the figures in the attachments, which highlight the structure, methodology, functionality and advantages of the present disclosure, are presented for example purposes only. The present disclosure is sufficiently flexible and configurable, such that it may be implemented in ways other than that shown in the accompanying figures (e.g., implementation within computing devices and environments other than those mentioned herein). As will be appreciated by those skilled in the relevant art(s) after reading the description herein, certain features from different aspects of the method of the present disclosure may be combined to form yet new aspects of the present disclosure.
Further, the purpose of the foregoing Abstract is to enable the U.S. Patent and Trademark Office and the public generally and especially the scientists, engineers and practitioners in the relevant art(s) who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of this technical disclosure. The Abstract is not intended to be limiting as to the scope of the present disclosure in any way.
This application claims the benefit of U.S. Provisional Application No. 62/000,478 filed May 19, 2014, and incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62000478 | May 2014 | US