The complexity of firmware for Data Storage Devices (DSDs) has increased significantly due to the growing number of features and additional processing being performed locally at DSDs. Such increases in firmware complexity create more issues with detecting and preventing faults or other exceptions during execution of the firmware. Logs can generally record a fault or other exception as either a time-independent statistics counter or as a time-series event log. Logs that serve as a statistics counter count the number of predefined exceptions, such as a predefined fault or the cumulative use of a component of the DSD, such as a total number of write operations performed in a flash memory. Logs that serve as a time-series event log can record the occurrence of exceptions.
However, current analytics for DSDs lack insight into the events or states leading up to an exception. Some firmware analysis approaches may use firmware traces, but such traces are not available in many end-user environments and cover a very short time window. As a result, firmware traces are typically limited to failure analysis scenarios, rather than being used in large-scale or preventative analysis of DSDs. In addition, the use of firmware traces can lead to the collection of a large amount of useless data.
The features and advantages of the embodiments of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings. The drawings and the associated descriptions are provided to illustrate embodiments of the disclosure and not to limit the scope of what is claimed.
In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one of ordinary skill in the art that the various embodiments disclosed may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail to avoid unnecessarily obscuring the various embodiments.
Hosts 101 can include client computers or may include processing nodes, for example. As used herein, a host can refer to a device that is capable of issuing commands to a DSD to store data or retrieve data. In this regard, host 101 may include another storage device such as a smart DSD that is capable of executing applications and communicating with other DSDs. In other implementations, a host 101 may be included in the same device as a DSD 102.
Exception analysis unit 112 can include, for example, a computer or system of the DSD manufacturer and is configured to analyze exception logs received from DSDs 102 via network 103. As shown in
Network 103 can include, for example, a Local Area Network (LAN) or a Wide Area Network (WAN), such as the internet. In this regard, some or all of hosts 101, DSDs 102, and exception analysis unit 112 may not be physically co-located or may be located within the same data center. For example, in some implementations, DSDs 102 can provide a cloud storage for hosts 101. Exception analysis unit 112 can include a manufacturer's system for assessing operation of the firmware of DSDs 102 in the field under actual end-user workloads. As discussed in more detail below, the analysis of such actual use cases can ordinarily provide more information for firmware updates or future firmware design.
Processor 116 of exception analysis unit 112 can include circuitry such as one or more processors for executing instructions and can include, for example, a microcontroller, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), hard-wired logic, analog circuitry and/or a combination thereof. In some implementations, processor 116 can include a System on a Chip (SoC). Processor 116 can access memory 120 to execute instructions, such as those from analysis module 18, or to access data used while executing such instructions, such as lists 20.
Memory 120 can include, for example, a volatile RAM such as Dynamic RAM (DRAM), a non-volatile RAM, or other solid-state memory. While the description herein refers to solid-state memory generally, it is understood that solid-state memory may comprise one or more of various types of memory devices such as flash integrated circuits, Chalcogenide RAM (C-RAM), Phase Change Memory (PC-RAM or PRAM), Programmable Metallization Cell RAM (PMC-RAM or PMCm), Ovonic Unified Memory (OUM), Resistive RAM (RRAM), NAND memory (e.g., Single-Level Cell (SLC) memory, Multi-Level Cell (MLC) memory (i.e., two or more levels), or any combination thereof), NOR memory, EEPROM, Ferroelectric Memory (FeRAM), Magnetoresistive RAM (MRAM), other discrete NVM chips, or any combination thereof.
Exception analysis unit 112 can also include storage 118, which can include a non-volatile storage, such as a solid-state memory (e.g., an SSD) or a rotating magnetic disk (e.g., an HDD) that may store data not in current use by processor 116. Such data can include, for example, exception logs 16 collected from DSDs 102 over time. As discussed in more detail below, each of the DSDs 102 can store an exception log 14a that records the occurrence of exceptions, such as an error, resource level, or other predefined event encountered while performing a task.
In the present disclosure, exception logs 14 can also include one or more lists or data structures 12 that indicate the order in which portions of code or firmware for the DSDs are executed in performing a task. The use of such data structures can ordinarily improve the ability to analyze exceptions, such as errors or other states of the DSD, so that sequences of execution related to certain exceptions can be identified during a failure analysis of the DSD or to optimize performance of DSDs 102. In some cases, a firmware may be updated in the field or changed for DSDs leaving the factory based on the identification of a particular sequence of code execution.
In the example of
As shown in
Memory 110 can include, for example, a volatile RAM such as DRAM, a non-volatile RAM, or other solid-state memory. In the example of
Such data may include one or more lists or data structures 12 that indicate an order in which portions of firmware 10 or other code are executed for a task. In this regard, firmware or code executed by controller 106 can be treated as a state machine or stateful system so that lists or data structures 12 can be reviewed to identify causes of an exception. Such state information of firmware 10 would not be available with conventional exception logs to show what occurred or which code portions executed leading up to an exception.
For example, a particular event, such as the use of a semaphore for a task, may occur during a second state, which can correspond to the execution of a second portion of firmware or code for performing the task. Later, when a portion of the code or firmware is executed as a fifth state during the task, an error may occur, such as a stall or running out of memory for a sub-process. The associated state information and/or the sequence of firmware or code execution can be reviewed or analyzed to identify the significance of one or more causes for the error, such as the use of the semaphore in the second state. In this regard, an order of code execution indicated by a list or data structure 12 and/or state information associated with different states indicated by the list or data structure 12 can serve as a signature for identifying the causes of exceptions.
Lists or data structures 12 can also allow for more state information to be retained for more complex tasks that involve a greater number of code or firmware portions, or a greater amount of switching among different code or firmware portions, than for simpler tasks that involve fewer code or firmware portions or do not switch as much between execution of different code or firmware portions. This can ordinarily provide for a data collection proportionality so that more information is collected for tasks that are more complex and are therefore more likely to need more information to determine the cause of an exception. In some implementations, the amount of data retained for a task may be limited to a predetermined number of code or firmware execution transitions or states. In such implementations, lists or data structures 12 can operate as a first-in, first-out queue so that the oldest collected information is purged to make room for the most recently collected information when the list or data structure has reached the predetermined number of states or transitions of code or firmware execution for the task.
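The first-in, first-out behavior described above can be illustrated with a minimal Python sketch. The limit `MAX_TRANSITIONS`, the helper names `new_task_list` and `record`, and the numeric identifiers are assumptions made for illustration only and do not appear in the disclosure.

```python
from collections import deque

MAX_TRANSITIONS = 4  # hypothetical predetermined number of retained states


def new_task_list(max_transitions=MAX_TRANSITIONS):
    """Create a bounded list for one task.

    A deque with maxlen discards its oldest entry when a new entry is
    appended to a full deque, which gives first-in, first-out purging.
    """
    return deque(maxlen=max_transitions)


def record(task_list, portion_id, state=None):
    """Append the identifier (and optional state) of the portion that ran."""
    task_list.append((portion_id, state))


task_list = new_task_list()
for portion_id in (0x01, 0x02, 0x03, 0x04, 0x05, 0x06):
    record(task_list, portion_id)

# Only the most recent MAX_TRANSITIONS identifiers remain, oldest first.
retained_ids = [pid for pid, _ in task_list]
```

In this sketch a more complex task simply produces more `record` calls, so the amount of information collected grows with the task until the limit is reached.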
Lists or data structures 12 can also include state information for DSD 102 while the task is executed, such as information about resource allocation, path overhead, semaphores, locks, and/or state flags used to determine or control the flow or sequence of code execution. In some cases, the state information can include context information that may be used to restore or continue execution of the task if execution of the task is interrupted. In other cases, the state information can include physical conditions of DSD 102, such as a high operating temperature. As discussed in more detail below, lists or data structures 12 provide insight into the sequence of events and/or state information of DSD 102 leading up to an exception, which can be used for large-scale, preventative analysis of DSD firmware, for example.
As shown in the example of
Those of ordinary skill in the art will also appreciate that other implementations can include more or less than the elements shown in
In the example of
The command is then queued by a queue manager portion of firmware 10, following the execution of the interface module. A new entry is added to list 12 including a unique identifier for the queue manager portion of firmware 10, together with a queue state and queue identifier for the command. The queue state can include, for example, the number of pending commands in the queue when the command is added to the queue. The queue identifier can include an identifier assigned to the command by the queue manager.
The command is then activated by a dispatch manager portion of firmware 10. An identifier for the dispatch manager is added to a third entry in list 12 with a task identifier as state information that is assigned by the dispatch manager for the command.
The command is then processed by a drive-side I/O manager to access data from a memory of DSD 102 (e.g., storage 108). A unique identifier for the drive-side manager portion of firmware 10 is added to a new entry with state information including the task identifier and a current state of the command, such as a physical address to be accessed on a storage media.
In the example of
The command status (e.g., command failed) is then sent to the host and a unique identifier for a host interface portion of firmware 10 is added in a new entry in list 12. A host interface status, such as available or busy, may also be included in the entry as state information.
A tracking manager portion of firmware 10 adds a final entry to list 12 to indicate the unique identifier of the tracking manager and an exception flag for the error. The exception flag may indicate, for example, the type of exception (e.g., I/O error) and/or may serve to indicate that list 12 includes an exception and should be retained in exception log 14 for DSD 102.
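As a rough illustration of the entries described in this example, the following Python sketch builds a list in the same order of execution. The class name `Entry`, the numeric identifiers, the field names, and all state values are hypothetical; the state information for the interface module entry is left empty here as an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entry:
    portion_id: int  # unique identifier of the firmware portion
    state: Dict[str, Any] = field(default_factory=dict)  # optional state info


# Hypothetical identifiers for the firmware portions named in the example.
INTERFACE_MODULE = 0x10
QUEUE_MANAGER = 0x20
DISPATCH_MANAGER = 0x30
DRIVE_IO_MANAGER = 0x40
HOST_INTERFACE = 0x50
TRACKING_MANAGER = 0x60


def has_exception(entries: List[Entry]) -> bool:
    """Return True if any entry carries an exception flag."""
    return any("exception_flag" in e.state for e in entries)


list_12: List[Entry] = [
    Entry(INTERFACE_MODULE),
    Entry(QUEUE_MANAGER, {"queue_state": 7, "queue_id": 112}),
    Entry(DISPATCH_MANAGER, {"task_id": 9}),
    Entry(DRIVE_IO_MANAGER, {"task_id": 9, "physical_address": 0x0004F200}),
    Entry(HOST_INTERFACE, {"host_interface_status": "available"}),
    Entry(TRACKING_MANAGER, {"exception_flag": "io_error"}),
]

# The order of the entries is the order in which the portions executed.
execution_order = [e.portion_id for e in list_12]
```

A list for which `has_exception` returns true would be a candidate for retention in the exception log, while the order of `execution_order` preserves the sequence leading up to the error.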
In some implementations, the entries in list 12 may form a blockchain in which the entries are cryptographically linked. For example, each entry may include a pointer address, a timestamp, and a hash result derived from the pointer address of the previous entry, with the exception of the first entry, which would not have a hash result for a previous entry. The use of a blockchain can ordinarily provide security for users of DSDs 102. In some implementations, an end-to-end encryption may be used so that only the DSD and exception analysis unit 112 can decrypt lists or data structures 12.
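One way such linking could work is sketched below in Python. As an assumption that goes beyond the text, each hash here is SHA-256 computed over all fields of the previous entry (pointer address, timestamp, and that entry's own hash) rather than over the pointer address alone; the names `ChainedEntry`, `digest`, `append_entry`, and `verify` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChainedEntry:
    pointer_address: int
    timestamp: int
    prev_hash: Optional[str]  # None for the first entry in the list


def digest(entry: ChainedEntry) -> str:
    """Hash every field of an entry so later entries depend on earlier ones."""
    payload = f"{entry.pointer_address}:{entry.timestamp}:{entry.prev_hash}"
    return hashlib.sha256(payload.encode()).hexdigest()


def append_entry(chain: List[ChainedEntry], pointer_address: int, timestamp: int) -> None:
    """Add an entry that stores the hash of the entry before it."""
    prev_hash = digest(chain[-1]) if chain else None
    chain.append(ChainedEntry(pointer_address, timestamp, prev_hash))


def verify(chain: List[ChainedEntry]) -> bool:
    """Check that the first entry has no hash and every later link matches."""
    if chain and chain[0].prev_hash is not None:
        return False
    return all(chain[i].prev_hash == digest(chain[i - 1]) for i in range(1, len(chain)))


chain: List[ChainedEntry] = []
for address, ts in ((0x1000, 1), (0x1040, 2), (0x1080, 3)):
    append_entry(chain, address, ts)

# Altering an earlier entry breaks the link stored by its successor.
tampered = list(chain)
tampered[1] = ChainedEntry(0xDEAD, 2, chain[1].prev_hash)
```

Under these assumptions, `verify` accepts the original list and rejects the altered copy, which is the tamper-evidence property that the linking is intended to provide.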
As will be appreciated by those of ordinary skill in the art, other implementations can include a different arrangement for list 12 to provide a different type of data structure. For example, list 12 may include indices to indicate the order of execution of different portions of firmware 10. In some implementations, list 12 may include a single metadata container that includes an internal structure, such as a linked list of entries within the container. Other implementations may use a different type of data structure, such as where each entry is a separate container that may be linked or otherwise indicate an order of execution of firmware or code portions for a task. In addition, other examples may include different portions of firmware or code than those discussed above for the example task of
In block 302, unique identifiers are assigned to respective code portions. In some implementations, firmware 10 may include unique identifiers in its code, such as in headers for different functions or modules. In such implementations, controller 106 may assign unique identifiers by accessing such portions of the code when it is loaded into memory 110. In other implementations, controller 106 may assign unique identifiers such as a value for code portions that have been loaded into memory 110 or that have been otherwise identified for execution.
In block 304, controller 106 creates a data structure, such as a list similar to list 12 in
In block 306, a respective unique identifier and optional state information is added to the data structure for each code portion that executes for the task. The data structure indicates an order in which the code portions are executed for the task. As noted above, the order in which the code portions are executed may be provided with a numbering or linking of entries or blocks in the data structure.
In block 308, it is determined whether an exception occurred during execution of the task or if there is a high probability (e.g., greater than a threshold probability) of a future exception based on the sequence of code portions or state information collected in the data structure. An exception can include a predefined event, such as an error or a state of DSD 102. For example, the exception can include a latency in performing a portion of the task that exceeds a threshold time or a lack of resources such as available memory or available processing resources while performing the task.
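The determination in block 308 might be sketched as follows in Python. The threshold values, the `TaskObservation` fields, and the function name are all assumptions chosen for illustration; how the probability of a future exception is estimated is outside the scope of this sketch.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 50.0    # hypothetical threshold time
MIN_FREE_MEMORY_BYTES = 4096   # hypothetical minimum available memory
PROBABILITY_THRESHOLD = 0.8    # hypothetical threshold probability


@dataclass
class TaskObservation:
    latency_ms: float
    free_memory_bytes: int
    future_exception_probability: float = 0.0


def exception_occurred_or_likely(obs: TaskObservation) -> bool:
    """Return True if a predefined exception occurred or is likely to occur."""
    occurred = (
        obs.latency_ms > LATENCY_THRESHOLD_MS
        or obs.free_memory_bytes < MIN_FREE_MEMORY_BYTES
    )
    likely = obs.future_exception_probability > PROBABILITY_THRESHOLD
    return occurred or likely


normal_task = TaskObservation(latency_ms=12.0, free_memory_bytes=65536)
slow_task = TaskObservation(latency_ms=75.0, free_memory_bytes=65536)
risky_task = TaskObservation(12.0, 65536, future_exception_probability=0.9)
```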
In some implementations, controller 106 may identify state information in the data structure or determine that the sequence of execution for the code portions has more than a threshold probability of resulting in an exception after completion of the task. For example, controller 106 may use a mathematical or algebraic algorithm, such as a checksum, with the unique identifiers for the code portions as inputs to identify a sequence of execution that is invalid or that may be associated with a higher probability of a future exception, such as with the use of a probabilistic data structure. In other examples, state information in the data structure may indicate a high probability of a future exception.
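A checksum over the ordered identifiers can be sketched in Python as follows. This example assumes CRC-32 as the checksum, 16-bit identifiers, and a precomputed set of checksums for sequences known to be valid; none of these choices, nor the names `sequence_checksum` and `is_suspect`, come from the disclosure.

```python
import zlib
from typing import Iterable, Set, Tuple


def sequence_checksum(portion_ids: Iterable[int]) -> int:
    """Compute CRC-32 over the identifiers in their order of execution."""
    data = b"".join(pid.to_bytes(2, "big") for pid in portion_ids)
    return zlib.crc32(data)


# Hypothetical sequences of identifiers that are known to be valid.
VALID_SEQUENCES: Tuple[Tuple[int, ...], ...] = (
    (0x10, 0x20, 0x30, 0x40, 0x50),
)
VALID_CHECKSUMS: Set[int] = {sequence_checksum(s) for s in VALID_SEQUENCES}


def is_suspect(portion_ids: Iterable[int]) -> bool:
    """Flag a sequence whose checksum matches no known-valid sequence."""
    return sequence_checksum(portion_ids) not in VALID_CHECKSUMS


normal = (0x10, 0x20, 0x30, 0x40, 0x50)
reordered = (0x10, 0x30, 0x20, 0x40, 0x50)
```

Because the checksum depends on the order of the bytes, a sequence containing the same identifiers in a different order produces a different value here. A probabilistic data structure, such as a Bloom filter, could replace the exact set when the number of known sequences is large.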
If it is determined in block 308 that an exception occurred during the execution of the task or that there is greater than a threshold probability of a future exception, the data structure for the task is added to an exception log, such as exception log 14 in the example of
In deleting the data structure, storage space in DSD 102 can ordinarily be conserved. However, in some implementations, a sampling of data structures for tasks can be analyzed by controller 106 or exception analysis unit 112 in
As discussed in more detail below with reference to the sequence identification process of
In block 402, a plurality of data structures or lists are analyzed for tasks. The tasks may all be of the same type, such as the performance of a write command from a host, or the tasks may be for a random sampling of different tasks. In some implementations, the tasks corresponding to the analyzed data structures or lists may be only for those tasks that have experienced a particular type of error or exception. In this regard, the process of
In block 404, one or more sequences of execution are identified as being associated with a higher probability of resulting in an exception or as typical execution paths. For example, certain sequences of execution may be common among tasks that encountered a particular exception or may be a typical execution path or order for certain task types. In some cases, exception analysis unit 112 may input the plurality of data structures into a neural network, such as an auto-encoder used for abnormality detection. Such analysis may provide a predictive result for modifying DSD firmware to avoid an exception by identifying a sequence of execution that may result in a higher probability of encountering an exception during the task or after the task. In addition, new outlier events or exceptions may be defined based on such analysis.
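As a simplified stand-in for the analysis in block 404, the following Python sketch uses a plain frequency count rather than a neural network: it computes, for each observed sequence of execution, the fraction of tasks that encountered an exception. The function names, the threshold, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sequence = Tuple[int, ...]


def exception_rate_by_sequence(tasks: List[Tuple[Sequence, bool]]) -> Dict[Sequence, float]:
    """Map each execution sequence to its observed exception rate."""
    totals = Counter(seq for seq, _ in tasks)
    failures = Counter(seq for seq, had_exception in tasks if had_exception)
    return {seq: failures[seq] / totals[seq] for seq in totals}


def high_risk_sequences(tasks: List[Tuple[Sequence, bool]], threshold: float) -> List[Sequence]:
    """Return sequences whose exception rate exceeds the threshold."""
    rates = exception_rate_by_sequence(tasks)
    return sorted(seq for seq, rate in rates.items() if rate > threshold)


# Each item pairs a sequence of portion identifiers with an exception flag.
tasks = [
    ((1, 2, 3), False),
    ((1, 2, 3), False),
    ((1, 3, 2), True),
    ((1, 3, 2), True),
    ((1, 3, 2), False),
    ((1, 2, 3), True),
]
rates = exception_rate_by_sequence(tasks)
risky = high_risk_sequences(tasks, threshold=0.5)
```

With this sample data, the sequence `(1, 3, 2)` has the higher exception rate and is the only one returned by `high_risk_sequences`. The same counts also show which sequence is the most typical execution path for the sampled tasks.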
As discussed above, the analysis of a plurality of data structures or lists for tasks can allow controller 106 to perform self-diagnostics to change the execution of firmware or can provide large scale analytics or fleet analysis for exposure analytics, risk assessment, or insights into critical code execution paths. Such critical code execution paths can identify resources that should not be accessed by more than one process at a time, since concurrent access may lead to an exception or other unexpected result.
In block 502, a task type is determined for a task. For example, the task type can be a host-initiated task type for tasks initiated by a host, such as a host 101 in
Based on the task type determined in block 502, it is further determined in block 504 how long to retain a corresponding data structure for the task and/or whether to retain particular state information for the task. For example, a data structure, such as a list 12, can be retained for a longer time period if it is determined in block 502 that the task type is a host-initiated task type rather than a DSD-initiated task type. An example of such an implementation may store data structures for host-initiated tasks that execute without an exception for one day, while data structures for DSD-initiated tasks that execute without an exception may be deleted following completion of the task. In some cases, host-initiated tasks may be kept longer to provide information that can be used to improve the performance of the DSD in terms of a Quality of Service (QoS) for performing host commands, such as by increasing a number of Input/Output Operations Per Second (IOPS).
In other cases, state information that has been collected in a data structure may be discarded or have a shorter retention period determined in block 504 based on the task type determined in block 502. For example, state information indicating addresses for data (e.g., Logical Block Addresses (LBAs), Physical Block Addresses (PBAs)) may be kept longer for tasks that are determined to be a write command task type than for tasks determined to be a read command task type.
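The retention decisions described above could be expressed as simple lookup tables, as in the following Python sketch. The one-day period for host-initiated tasks and the deletion of DSD-initiated tasks on completion follow the example in the text; the one-hour period for read command addresses, the enumeration names, and the function names are assumptions.

```python
from enum import Enum

ONE_DAY_S = 24 * 60 * 60
ONE_HOUR_S = 60 * 60  # hypothetical shorter period for read command addresses


class TaskType(Enum):
    HOST_INITIATED = "host"
    DSD_INITIATED = "dsd"


class CommandType(Enum):
    WRITE = "write"
    READ = "read"


# Retention for data structures of tasks that completed without an exception.
# A value of 0 means the data structure is deleted when the task completes.
LIST_RETENTION_S = {
    TaskType.HOST_INITIATED: ONE_DAY_S,
    TaskType.DSD_INITIATED: 0,
}

# Retention for address state information (e.g., LBAs or PBAs).
ADDRESS_RETENTION_S = {
    CommandType.WRITE: ONE_DAY_S,
    CommandType.READ: ONE_HOUR_S,
}


def list_retention_seconds(task_type: TaskType) -> int:
    """Return how long to keep a data structure for the given task type."""
    return LIST_RETENTION_S[task_type]


def address_retention_seconds(command_type: CommandType) -> int:
    """Return how long to keep address state information for a command."""
    return ADDRESS_RETENTION_S[command_type]
```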
As discussed above, by using data structures or lists that indicate an execution sequence of code or firmware portions and optionally recording state information, it is ordinarily possible to identify sequences of execution in increasingly complex DSD firmware that have a higher probability of causing an exception. The collection of such lists or data structures for multiple DSDs over a longer period of time can allow for large-scale failure analysis or preventative analysis of DSD firmware, while reducing the amount of information collected for tasks that do not result in an exception. In addition, the foregoing use of lists or data structures for tasks can allow for the collection of information, such as an order of code execution and/or state information, for actual end-user tasks in the field, as opposed to a simulated testing performed at a manufacturer's facility.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks, modules, and processes described in connection with the examples disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Furthermore, the foregoing processes can be embodied on a computer readable medium which causes a processor or computer to perform or execute certain functions.
To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, and modules have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Those of ordinary skill in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, units, modules, and controllers described in connection with the examples disclosed herein may be implemented or performed with a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, an SoC, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The activities of a method or process described in connection with the examples disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The steps of the method or algorithm may also be performed in an alternate order from those provided in the examples. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable media, an optical media, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC or an SoC.
The foregoing description of the disclosed example embodiments is provided to enable any person of ordinary skill in the art to make or use the embodiments in the present disclosure. Various modifications to these examples will be readily apparent to those of ordinary skill in the art, and the principles disclosed herein may be applied to other examples without departing from the spirit or scope of the present disclosure. The described embodiments are to be considered in all respects only as illustrative and not restrictive. In addition, the use of language in the form of “at least one of A and B” in the following claims should be understood to mean “only A, only B, or both A and B.”
Number | Date | Country | |
---|---|---|---|
20210081238 A1 | Mar 2021 | US |