Storage devices such as hard disk drives and solid state drives are widely used in a variety of computing systems. During operation, such storage devices will generally experience some number of errors, which may be due to defective storage elements, among other causes. Some types of errors may be reported to the host and may also be handled internally by the drive itself, such as by relocating data from defective storage elements to new storage elements. In some cases, errors may be stored to an error log. Analysis of the error log may enable a user or technician to diagnose problems that may exist in the drive. Often, such error logs are limited in the number of errors recorded and the type of information recorded for each error.
Certain embodiments are described in the following detailed description and in reference to the drawings, in which:
Exemplary embodiments relate to systems and methods for generating a check condition log page. Various embodiments described herein provide a storage drive configured to store a check condition log page to the main storage media of the drive. As used herein, a “check condition” error is an error that is reported by a Small Computer System Interface (SCSI) device to the host, also known as the “initiator,” in response to a command from the host. The check condition log page can include a variety of information related to each check condition error reported to the host, and this information may be gathered at the time that the error occurs.
Hard disk drives will often reallocate recoverable media errors by moving the related logical block address (LBA) to a different physical memory address. Solid state drives regularly relocate logical block addresses to different physical memory addresses, for example, as part of a wear-leveling routine. Thus, merely knowing the logical block address corresponding to an error does not always reveal the physical memory address corresponding to the error, because the logical block address may have been remapped to a new physical location. In an embodiment, the check condition log page includes the physical memory address of the storage element corresponding to the reported error.
The check condition log page described herein can have a number of beneficial uses. For example, a drive manufacturer or computer system manufacturer may perform various tests on a drive, including a stress test of the drive's firmware. Such a stress test will generally return a large number of errors, many of which may be media-related errors. Upon detection of an error during such a test, the check condition log page can be accessed to find the last error with a matching initiator address, command descriptor block (CDB), and sense data, retrieve the physical memory address for that error, and check that physical memory address against the list of known bad media locations. The list of known bad media locations may be obtained using a SCSI Read Defect Data command, which can be used to retrieve a defect list in the form of physical addresses. Errors that occur on known defective media locations can be forgiven, which saves time and effort that would otherwise be spent investigating each error. Furthermore, the check condition log page can be a valuable tool for analyzing drive errors that occur during normal use. For example, the check condition log page on a drive returned by a customer or in use at a customer site can be used to provide a list of a large number of recent errors. Since the physical memory address corresponding to each error is stored to the check condition log page, the media location of each error can be determined even if the logical block address corresponding to the error has been relocated by the drive. Therefore, the check condition log page may provide better insight into the field behavior of the drive.
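The stress-test triage described above can be illustrated with a minimal sketch. The `LogEntry` class, its field names, and the helper functions are hypothetical and are introduced only for illustration; the sketch assumes the log entries have already been read from the drive and are ordered oldest first, and that the defect list has already been retrieved as a set of physical addresses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical, simplified view of one check condition data parameter."""
    initiator_address: int
    cdb: bytes
    sense_data: bytes
    physical_address: int  # zero is assumed to mean "not a media error"


def find_last_match(entries: Iterable[LogEntry], initiator_address: int,
                    cdb: bytes, sense_data: bytes) -> Optional[LogEntry]:
    """Return the most recent entry matching the failed command, or None."""
    wanted = (initiator_address, cdb, sense_data)
    for entry in reversed(list(entries)):  # entries are assumed oldest first
        if (entry.initiator_address, entry.cdb, entry.sense_data) == wanted:
            return entry
    return None


def can_forgive(entry: Optional[LogEntry], known_bad: Set[int]) -> bool:
    """An error may be forgiven if it occurred on a known defective location."""
    return (entry is not None
            and entry.physical_address != 0
            and entry.physical_address in known_bad)
```

In such a sketch, an error that cannot be forgiven would remain on the list of errors to be investigated.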
The computer system 100 may include a computing device referred to as host 102, which may be a general purpose computer such as a desktop or a server, for example. The host 102 includes a processor 104 coupled to a system memory 106. The processor 104, which may be a central processing unit (CPU), is adapted to control the overall operation of the host 102 and may be coupled to the system memory 106 through a memory controller. The system memory 106 may include a volatile memory region such as RAM used by the processor 104 to execute the various software programs running on the processor 104, such as an operating system, applications, device drivers, and the like. The host 102 may also include user interface devices, such as a monitor and a keyboard to provide user control of the host 102.
As shown in
As shown in
The drive 108 may also include a non-transitory, computer-readable memory such as memory 204 for storing code such as device firmware. The memory 204 may include any type of volatile or non-volatile memory such as NAND flash memory, among others. The drive controller 200 may retrieve and execute instructions stored in the memory 204 to process various operations of the drive 108, including operations in accordance with an embodiment described herein, such as generating the check condition log page 114. The drive 108 may also include volatile memory such as a cache used to store frequently accessed data or to buffer data being sent to or retrieved from the storage medium 202.
The check condition log page 114 may be stored to the storage medium 202 and may include a check condition log status parameter 208 and one or more check condition data parameters 210. The check condition log status parameter 208 provides higher-level status information about the state of the check condition log page 114, including information about the subsequent check condition data parameters 210. For example, the check condition log status parameter 208 includes information such as the number of recorded check condition data parameters, the maximum number of recordable check condition data parameters, and the like. An example of a check condition log status parameter 208 is described below in reference to
Each check condition data parameter 210 includes data relating to specific errors or conditions encountered by the drive 108. In an embodiment, each check condition data parameter 210 contains data relating to check condition errors reported to the host 102 or storage controller 112. For example, a check condition error may be caused by various conditions which prevent the drive 108 from executing an operation requested by the host 102 or storage controller 112, such as a defective media element, among others. A check condition error may also be a notification that the command succeeded, but the drive has detected some unrelated problem. For example, the drive may have detected a temperature excursion or early signs of an impending problem. In an embodiment, only check condition errors are stored to the check condition log page. For example, other error conditions such as BUSY status, RESERVATION CONFLICT status, or internal errors that are not reported to the host 102 may not be stored to the check condition log page 114. The check condition log page 114 may include any suitable number of check condition data parameters 210, for example, 1000 to 65,535 or more. An example of a check condition data parameter 210 is described below in reference to
Each time that a check condition occurs, the drive controller 200 reports the check condition to the host 102 and gathers the data to be stored to the check condition log page 114, including the physical memory address corresponding to the check condition. The drive controller 200 may then generate a new check condition data parameter 210 and populate the check condition data parameter 210 with the gathered data. In an embodiment, the check condition data parameter 210 may be written to the check condition log page 114 upon the occurrence of the check condition error. In another embodiment, the data corresponding to the check condition error may first be stored to a temporary storage location, such as the memory 204 or a cache. The check condition log page 114 may then be periodically updated based on the check condition data held in the temporary storage location. For example, the check condition data may be transferred from the temporary storage location to the check condition log page 114 after a specified time interval, for example, one minute, ten minutes, thirty minutes, one hour, twelve hours, or more. In either case, the check condition data may be gathered at or near the time that the error occurs and before any potential remapping of the logical block addresses corresponding to the error can take place.
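The buffered embodiment described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the firmware of the drive 108: the class name, the dictionary fields, and the injected `clock` callable are all hypothetical, and two Python lists stand in for the temporary storage location and for the check condition log page on the storage medium.

```python
from typing import Callable, Dict, List


class BufferedCheckConditionLog:
    """Hypothetical model: gather error data immediately, persist it periodically."""

    def __init__(self, flush_interval_s: float, clock: Callable[[], float]):
        self.flush_interval_s = flush_interval_s
        self.clock = clock
        self.pending: List[Dict[str, float]] = []   # stands in for temporary storage
        self.log_page: List[Dict[str, float]] = []  # stands in for the stored log page
        self.last_flush = clock()

    def record(self, lba: int, physical_address: int) -> None:
        # The physical address is captured at the time of the error,
        # before any later remapping of the LBA can take place.
        self.pending.append({"time": self.clock(), "lba": lba,
                             "physical_address": physical_address})
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Move pending entries to the log page once the interval has elapsed."""
        now = self.clock()
        if self.pending and now - self.last_flush >= self.flush_interval_s:
            self.log_page.extend(self.pending)
            self.pending.clear()
            self.last_flush = now
            return True
        return False
```

Passing the clock in as a parameter is a design choice made for the sketch so that the interval logic can be exercised without waiting; a real device would use its own timer.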
As shown in
In an embodiment, if the number of check condition errors encountered reaches the maximum number of check condition data parameters 210, or a multiple thereof, future check condition data parameters 210 may be wrapped around to the beginning of the check condition log page 114 such that the oldest check condition data parameter 210 is overwritten with the new check condition data. For example, if a check condition error is encountered after the check condition log page 114 is full (number of errors 304 equals maximum number of data parameters 302), the drive controller 200 may overwrite the first check condition data parameter 210 with a new check condition data parameter 210 corresponding to the most recent check condition error. In this example, last parameter written 306 will equal one and number of errors 304 will equal maximum number of data parameters 302 plus one. The next check condition data parameter 210 would be stored to position two, again overwriting the oldest check condition data parameter 210, and so on. In this way, the check condition log page 114 may store the most recent check condition errors, and number of errors 304 will provide an indication of how many check condition errors actually occurred, including those that may have been overwritten. The check condition data parameters 210 may be wrapped any number of times.
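The wrap-around behavior described above can be expressed as a short sketch. The class and attribute names are hypothetical stand-ins for the maximum number of data parameters 302, the number of errors 304, and the last parameter written 306; the use of zero to mean that nothing has been written yet is an assumption made for the sketch.

```python
class CheckConditionLogPage:
    """Hypothetical in-memory model of the wrap-around log described above."""

    def __init__(self, max_parameters: int):
        if max_parameters < 1:
            raise ValueError("max_parameters must be at least 1")
        self.max_parameters = max_parameters      # corresponds to 302
        self.number_of_errors = 0                 # corresponds to 304, never wraps
        self.last_parameter_written = 0           # corresponds to 306, 1-based
        self.parameters = [None] * max_parameters

    def add(self, parameter) -> int:
        """Store a parameter, overwriting the oldest one once the log is full."""
        # Positions run from 1 to max_parameters and then wrap back to 1.
        position = self.last_parameter_written % self.max_parameters + 1
        self.parameters[position - 1] = parameter
        self.last_parameter_written = position
        self.number_of_errors += 1
        return position
```

With a maximum of three parameters, a fourth error in this sketch is stored at position one, leaving the last parameter written equal to one and the number of errors equal to four, which matches the example given above.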
The power on time value 400 may be used as a time stamp to indicate the time at which the corresponding error occurred. The power on time value 400 may indicate the number of seconds that the drive 108 had been powered on at the time of the error. The initiator address 402 is the address of the initiator that sent the command associated with the check condition error and may be a Serial Attached SCSI (SAS) address, for example. The port value 404 is a value that indicates the port that returned the error.
The mark bad value 408 may be a bit that is set if the media error is returned due to the logical block address being marked “bad” due to a previous media error at a different physical location. For example, a solid state drive may return an unrecoverable read error at LBA (logical block address) x due to a NAND media error at physical location y, in which case the MARK BAD bit for this error may be 0. The solid state drive may later move that LBA x to a new physical location z, due to garbage collection, or other LBAs in the page being written, or some other reason. The solid state drive will generally return an unrecoverable read error for any reads to LBA x until LBA x has been re-written. Any subsequent unrecoverable read errors for LBA x, before LBA x has been re-written with user data, may have the MARK BAD bit set to 1.
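The MARK BAD example above can be traced with a small model. This is a sketch under stated assumptions: the class, its methods, and the choice to report the bit as 1 only when the LBA no longer maps to a failing physical location are illustrative and are not taken from any particular drive implementation.

```python
from typing import Dict, Iterable, Optional


class TinySsdModel:
    """Hypothetical model of how the MARK BAD bit could be derived."""

    def __init__(self, mapping: Dict[int, int], defective: Iterable[int]):
        self.mapping = dict(mapping)      # LBA -> current physical location
        self.defective = set(defective)   # physical locations with media errors
        self.marked_bad = set()           # LBAs unreadable until re-written

    def read(self, lba: int) -> Optional[int]:
        """Return None on success, else the MARK BAD bit for the read error."""
        if self.mapping[lba] in self.defective:
            self.marked_bad.add(lba)
            return 0                      # genuine media error at this location
        if lba in self.marked_bad:
            return 1                      # error inherited from an earlier location
        return None

    def relocate(self, lba: int, new_physical: int) -> None:
        """Move an LBA, for example during garbage collection."""
        self.mapping[lba] = new_physical

    def rewrite(self, lba: int, new_physical: int) -> None:
        """Re-writing the LBA with user data makes it readable again."""
        self.mapping[lba] = new_physical
        self.marked_bad.discard(lba)
```

For instance, with LBA 7 mapped to a defective location 100, the first read in this model yields a bit of 0; after relocation to location 200 a read yields 1; and after the LBA is re-written the read succeeds.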
The CDB value 410 includes the command descriptor block (CDB) associated with the error. The CDB is used to communicate storage commands to SCSI devices, and may include a one byte operation code followed by some command-specific parameters. Each CDB can be a total of 6, 10, 12, or 16 bytes, but later versions of the SCSI standard also allow for variable-length CDBs. The sense data 412 can include information about the error, such as the error code, the number of retry steps, and the LBA where the error was encountered, among others.
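As an illustration of the fixed CDB lengths mentioned above, the following sketch derives the CDB length from the operation code using the conventional SCSI group code, which occupies the upper three bits of the operation code. The function name is hypothetical, and the sketch returns `None` for group codes that do not imply one of the fixed lengths (for example, variable-length or vendor-specific commands) rather than attempting to handle them.

```python
from typing import Optional

# Group code (upper three bits of the operation code) -> fixed CDB length in bytes.
_GROUP_LENGTHS = {0: 6, 1: 10, 2: 10, 4: 16, 5: 12}


def fixed_cdb_length(opcode: int) -> Optional[int]:
    """Return the fixed CDB length implied by the opcode, or None if not fixed."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("operation code must fit in one byte")
    return _GROUP_LENGTHS.get(opcode >> 5)
```

A log reader could use such a helper to determine how many bytes of a padded CDB field are valid when no explicit length descriptor is available.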
The address descriptor 414 is a physical memory address that indicates the actual physical location of the media error corresponding to the check condition error. If the check condition error did not result from a media error, then the address descriptor 414 may be set to zero. For any media error, such as a bad sector, the address descriptor 414 may be set to a non-zero value that indicates the physical memory address of the media location that returned the error.
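Because a zero address descriptor marks errors that did not result from a media error, a host-side tool can separate the two cases with a simple filter. The following sketch, whose function name is hypothetical, assumes the address descriptors have already been extracted from the log as integers and counts the media errors recorded at each physical location.

```python
from collections import Counter
from typing import Iterable


def media_error_counts(address_descriptors: Iterable[int]) -> Counter:
    """Count media errors per physical location, skipping non-media errors."""
    # A zero descriptor is treated as "no media location", as described above.
    return Counter(addr for addr in address_descriptors if addr != 0)
```

Locations with high counts in such a tally may indicate media regions that merit closer inspection.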
It will be appreciated that the check condition data parameter 210 may include various other data values in addition to those described above. For example, the check condition data parameter 210 may include some padding in various data fields to accommodate data fields of varying length. The check condition data parameter 210 may also include descriptors that indicate the length of fields including padding, the length of the valid portion of the field, and the like.
At block 504, an error is generated in response to the storage command. For example, the error may be a check condition error, which may be reported to the initiator. Other types of errors may include a BUSY status, RESERVATION CONFLICT status, and internal errors that are not reported to the initiator, among others.
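The distinction drawn at block 504 can be sketched as a filter on the SCSI status byte returned for the command. The status code values below are the standard SCSI status codes; the function name is hypothetical, and the sketch assumes the embodiment in which only check condition errors are logged. Internal errors that are never reported to the initiator produce no status byte and therefore never reach this check.

```python
# Standard SCSI status codes.
GOOD = 0x00
CHECK_CONDITION = 0x02
BUSY = 0x08
RESERVATION_CONFLICT = 0x18


def should_log(status: int) -> bool:
    """Return True only for statuses that belong in the check condition log."""
    return status == CHECK_CONDITION
```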
At block 506, a check condition data parameter may be added to the check condition log page 114 shown in
In an embodiment, adding the check condition data parameter to the check condition log stored to the storage media may include adding the check condition data parameter based on check condition data temporarily held in a cache or buffer. Accordingly, data corresponding to the error may be gathered at the time of the error and stored to the cache. The check condition log may be periodically updated from the check condition data stored to the cache. Furthermore, as discussed above, upon reaching the last check condition data parameter that the check condition log can hold, new check condition data parameters may be wrapped around to the beginning of the check condition log.
At block 508, the drive controller may update the check condition log status parameter. For example, the drive controller may increment the number of errors 304 and the last parameter written 306 by one.
A processor 602, which may be the drive controller 200 if
Although shown as contiguous blocks, the software components can be stored in any order or configuration. For example, if the non-transitory, computer-readable medium 600 is a hard drive, the software components can be stored in non-contiguous, or even overlapping, sectors.