The present application claims priority to Chinese Patent Application No. 202110105459.8, titled “FAILURE ANALYSIS METHOD, COMPUTER EQUIPMENT, AND STORAGE MEDIUM”, filed on Jan. 26, 2021, which is incorporated herein by reference in its entirety.
The present application relates to the field of storage technology, and in particular, to a failure analysis method, computer equipment, and a storage medium.
A storage system is one of the important components of a computer. The storage system provides the ability to write and read information (programs and data) needed by the operation of the computer to achieve an information memory function of the computer.
The storage failure type of each chip particle in the storage system is usually determined manually, which consumes considerable time and labor and limits the speed and efficiency of analysis.
Based on this, in response to the above technical problems, it is necessary to provide a failure analysis method, computer equipment, and a storage medium that can improve the efficiency of failure analysis.
According to multiple embodiments, the first aspect of the present application provides a failure analysis method, including:
According to multiple embodiments, the second aspect of the present application provides computer equipment, including a memory and a processor, the memory storing a computer program, wherein when the processor executes the computer program, the steps of any one of the above-mentioned failure analysis methods are implemented.
According to multiple embodiments, the third aspect of the present application provides a computer-readable storage medium, storing a computer program thereon, wherein when the computer program is executed by a processor, the steps of any one of the above-mentioned failure analysis methods are implemented.
According to the above-mentioned failure analysis method, computer equipment and storage medium, failure data of IO channels in a target chip particle is obtained and split according to physical modules, so that a storage failure type of the target chip particle can be quickly, effectively and automatically determined according to the characteristics of the physical modules.
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings used for describing the embodiments of the present application or the prior art are briefly introduced below. Obviously, the accompanying drawings described below show merely some embodiments of the present application. A person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
In order to facilitate the understanding of the present application, the present application will be described more comprehensively below with reference to the relevant accompanying drawings. Embodiments of the present application are shown in the drawings. However, the present application may be implemented in many different forms, and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the present application is more thorough and comprehensive.
Unless otherwise defined, all technological and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the technical field of the present application. The terms used in the description of the present application are only for the purpose of describing specific embodiments, but are not intended to limit the present application.
It may be understood that the terms “first”, “second”, etc. may be used herein to describe various features, but these features are not limited by these terms. These terms are only used to distinguish one feature from another.
When used herein, the singular forms of “a”, “an” and “the” may also include plural forms, unless the context clearly indicates otherwise. It should also be understood that the terms “comprise/include” or “have” and the like designate the existence of the stated features, wholes, steps, operations, components, parts, or combinations thereof, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, components, parts, or combinations thereof. Meanwhile, the term “and/or” used in the description includes any and all combinations of relevant items listed.
The failure analysis method, computer equipment, and storage medium provided in the present application may be applied to failure analysis of system-level failure data of various different types of storage systems, for example, applied to failure analysis of system-level failure data of a double data rate synchronous dynamic random access memory (DDR) system.
Alternatively, the failure analysis method, computer equipment, and storage medium provided in the present application may also be applied to failure analysis of failure data of various different types of monolithic chip particles.
In one embodiment, referring to
As an example, the method in this embodiment may be used for failure analysis of system-level failure data of a storage system. The storage system may include a plurality of chip particles. The number of chip particles is N, where N is a positive integer greater than or equal to 2.
Specifically, each chip particle may include a plurality of physical modules, and each physical module may be a bank. Each physical module may also include a plurality of storage units. Each storage unit can output one bit of test data each time it is turned on. Understandably, when the chip particles are tested, the test data of all the storage units contains both normal data and failure data. The “failure data” mentioned herein is the failure data among the test data.
In addition, each physical module may include a plurality of IO channels. Each IO channel is connected to part of the storage units correspondingly, so that each storage unit outputs data through the corresponding IO channel.
Understandably, “a plurality of” here may be one or more. When the number of IO channels in the physical module is one, the storage units in the physical module are connected to the same IO channel.
Then, the target chip particle in step S100 is each chip particle in the storage system. Before the failure data of IO channels in the target chip particle is obtained, an original test document of the storage system may be read first. The original test document of the storage system may be an original document of all failure data in the storage system. Then, the original test document of the storage system is split to obtain failure data of each chip particle.
Understandably, the original test document of the storage system includes failure data of all chip particles in a test. Splitting the original test document of the storage system to obtain failure data of a chip particle may be extracting the failure data of the chip particle from the original test document.
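The splitting described above can be sketched as follows. This is a minimal illustration only, assuming that each failure record has already been parsed into a dictionary and carries a hypothetical `chip` key identifying its chip particle; the actual layout of the original test document is not specified in the present application.

```python
from collections import defaultdict


def split_by_chip(records):
    """Group failure records by chip particle.

    Assumption: each record is a dict with a hypothetical "chip" key.
    """
    per_chip = defaultdict(list)
    for rec in records:
        per_chip[rec["chip"]].append(rec)
    return dict(per_chip)


# Hypothetical failure records of one storage-system test.
records = [
    {"chip": 0, "bank": 1, "io": 3, "row": 10, "col": 4},
    {"chip": 1, "bank": 0, "io": 0, "row": 7, "col": 2},
    {"chip": 0, "bank": 2, "io": 3, "row": 11, "col": 4},
]
per_chip = split_by_chip(records)
```

Each value of `per_chip` then plays the role of the failure data of one target chip particle.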
Of course, the method in this embodiment may be used for failure analysis of failure data of a single chip particle.
In step S400, the failure data of IO channels in the target chip particle may be split according to the physical modules inside the target chip particle to form M groups of module failure data corresponding to the physical modules.
Also understandably, the failure data of IO channels in the target chip particle here includes failure data of all physical modules in a test. Splitting the failure data of IO channels in the target chip particle to form a group of module failure data corresponding to a physical module may be extracting a group of failure data of the physical module from the failure data of IO channels in the target chip particle.
The target chip particle includes a plurality of physical modules, and associated storage units are usually located in the same physical module. Therefore, the failure data of IO channels in the target chip particle is split according to the physical modules, so that failure analysis can be performed more accurately.
Meanwhile, when the target chip particle is tested, there may be repeated tests. Therefore, while the failure data of IO channels in the target chip particle is split, repeated failure data can also be removed.
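As a hedged sketch of step S400, the following code splits the failure data of one chip particle by physical module and drops records repeated across tests. The record keys (`bank`, `io`, `row`, `col`) and the rule that two records are duplicates when all four values match are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_module(chip_records):
    """Group one chip's failure records by physical module (bank),
    skipping records that repeat an already seen address."""
    modules = defaultdict(list)
    seen = set()
    for rec in chip_records:
        key = (rec["bank"], rec["io"], rec["row"], rec["col"])
        if key in seen:
            continue  # repeated test result, already recorded
        seen.add(key)
        modules[rec["bank"]].append(rec)
    return dict(modules)


# Hypothetical failure data of one chip particle; the second record repeats the first.
chip_records = [
    {"bank": 0, "io": 3, "row": 10, "col": 4},
    {"bank": 0, "io": 3, "row": 10, "col": 4},
    {"bank": 1, "io": 5, "row": 2, "col": 9},
]
module_failure_data = split_by_module(chip_records)
```

The result holds one group of module failure data per physical module that has at least one failure.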
In step S500, partial failure types of all physical modules may be determined according to the corresponding module failure data.
In step S600, after the partial failure types of all the physical modules are determined, the storage failure type of the target chip particle may be comprehensively determined according to the partial failure types of all the physical modules.
As an example, the final storage failure type of the target chip particle may be determined according to the priority levels of various partial failure types of the physical modules, and outputted.
Specifically, the priority levels of various partial failure types may be determined according to actual conditions. For example, larger failure area, failure data related to IO channel problems, more failed physical modules, etc. are all serious failures. Therefore, the priority levels of various partial failure types can be comprehensively determined according to the size of the failure area, whether the failure data is related to IO channel problems, the number of failed physical modules, etc.
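One possible way to apply such priority levels is sketched below. The type names and priority values in `PRIORITY` are hypothetical placeholders chosen for illustration; the present application leaves the actual levels to be set according to actual conditions.

```python
# Hypothetical priority table: a larger value means a more serious failure.
PRIORITY = {
    "single_bit": 1,
    "row": 2,
    "io_related": 3,
    "large_area": 4,
}


def resolve_storage_failure_type(partial_types, priority):
    """Return the highest-priority partial failure type among all modules.

    partial_types maps a physical module index to its partial failure
    type, or to None when the module has no failure.
    """
    found = [t for t in partial_types.values() if t is not None]
    if not found:
        return None
    return max(found, key=lambda t: priority[t])


partial_types = {0: "single_bit", 1: None, 2: "io_related", 3: "row"}
storage_failure_type = resolve_storage_failure_type(partial_types, PRIORITY)
```

A fuller version might also weigh the number of failed physical modules, which this sketch ignores.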
In the method of this embodiment, failure data of IO channels in a target chip particle is obtained and split according to physical modules, so that a storage failure type of the target chip particle can be quickly, effectively and automatically determined according to the characteristics of the physical modules.
In one embodiment, referring to
If the failure data does not satisfy the whole failure type criterion, the storage failure type of the target chip particle is determined according to the partial failure type of each physical module.
Specifically, if the failure data does not satisfy the whole failure type criterion, steps S400, S500, and S600 may be performed in order.
Of course, steps S200, S300, S400, and S500 are not necessarily performed in this order, and there is no strict order restriction. These steps may be performed in other reasonable order, which is not limited in the present application.
In step S200, the whole failure type criterion is a criterion for determining whether the failure data of IO channels of the target chip particle has certain regular characteristics as a whole, so that the storage failure type of the target chip particle can be determined quickly.
In the method of this embodiment, when the failure data of IO channels in the target chip particle satisfies the whole failure type criterion, the storage failure type of the target chip particle is determined according to the whole failure type.
When the failure data of IO channels in the target chip particle does not satisfy the whole failure type criterion, the storage failure type of the target chip particle is determined according to the partial failure type of each physical module.
Therefore, this embodiment can accurately and efficiently determine the storage failure type of the target chip particle through a combination of whole and partial analysis.
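The combination of whole and partial analysis can be sketched as a simple fallback. The function and parameter names are hypothetical, and the example criterion and its limit of 6 channels are placeholders rather than values taken from the present application.

```python
def determine_storage_failure_type(failure_data, whole_criteria, partial_analysis):
    """Try each whole failure type criterion first; fall back to the
    per-module (partial) analysis when none is satisfied."""
    for failure_type, is_satisfied in whole_criteria:
        if is_satisfied(failure_data):
            return failure_type
    return partial_analysis(failure_data)


# Placeholder whole criterion: many distinct IO channels fail together.
whole_criteria = [
    ("contact", lambda data: len({rec["io"] for rec in data}) > 6),
]


def partial_analysis(data):
    # Stand-in for steps S400 to S600.
    return "partial"


many_channels = [{"io": i} for i in range(8)]
one_channel = [{"io": 0}, {"io": 0}]
result_whole = determine_storage_failure_type(many_channels, whole_criteria, partial_analysis)
result_partial = determine_storage_failure_type(one_channel, whole_criteria, partial_analysis)
```

Keeping the criteria in an ordered list makes it easy to add further whole failure types without changing the control flow.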
In one embodiment, the failure analysis method is applied to failure analysis of a storage system. The storage system includes a plurality of chip particles, and the number of chip particles is N, where N is a positive integer greater than or equal to 2. The whole failure type criterion includes a system-level whole failure criterion and a particle-level whole failure criterion.
The system-level whole failure criterion is a basis for representing whole failure of failure data of IO channels on chip particles including the target chip particle in the storage system, for example, a basis for representing a whole contact failure.
The particle-level whole failure criterion is a basis for representing whole failure of failure data of IO channels in the target chip particle, for example, a basis for representing a whole block move failure.
At this point, failure analysis can be performed more comprehensively in combination with system characteristics and particle characteristics.
In one embodiment, the system-level whole failure criterion includes a contact failure type criterion.
When the storage system is tested, it is usually put into a test carrier such that each chip particle thereon is in electrical contact with the carrier for testing. The contact failure may specifically be a failure caused by poor contact between the chip particles and the carrier.
The contact failure type criterion indicates simultaneous failure of failure data of a preset number of IO channels of other chip particles.
Then, referring to
That is, if more than a preset number of IO channels in the target chip particle all have failure data, it is determined that the storage failure type of the target chip particle is the contact failure type.
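A minimal sketch of this check is given below, assuming each failure record carries a hypothetical `io` key naming its IO channel. It reads the criterion as counting the distinct failing IO channels of the target chip particle, which is one interpretation of the text; the value 6 is a placeholder for the preset number.

```python
def is_contact_failure(chip_records, preset_number):
    """Return True when more than preset_number distinct IO channels
    of the chip particle have failure data."""
    failing_channels = {rec["io"] for rec in chip_records}
    return len(failing_channels) > preset_number


# Hypothetical data: one chip failing on 8 channels, another on 1 channel.
all_channels_fail = [{"io": i, "row": 0, "col": 0} for i in range(8)]
single_channel_fail = [{"io": 2, "row": 5, "col": 1}, {"io": 2, "row": 6, "col": 1}]

contact_a = is_contact_failure(all_channels_fail, preset_number=6)
contact_b = is_contact_failure(single_channel_fail, preset_number=6)
```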
In one embodiment, on the basis of the foregoing embodiment, after step S210, if the failure data does not satisfy the contact failure type criterion, a particle-level whole failure type determination is performed. The particle-level whole failure criterion includes a block move failure type criterion.
Specifically, when the storage system is tested, various pattern tests are usually performed, and these may include a block move pattern test. The block move pattern moves data blocks between different chip particles.
Therefore, the failure data of IO channels of each chip particle in the storage system usually includes various types of failure data. The various types of failure data include failure data of the block move pattern test.
The block move failure type criterion indicates that the failure data is test data of the block move pattern test.
Then, referring to
At this time, if the failure data does not satisfy the block move failure type criterion, the storage failure type of the target chip particle is determined according to the partial failure type of each physical module.
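The block move check can be sketched as follows. This assumes each failure record carries a hypothetical `pattern` key naming the pattern test that produced it, and it treats the criterion as satisfied only when every record comes from the block move pattern test; both points are assumptions, since the text does not fix them.

```python
BLOCK_MOVE = "block_move"  # hypothetical pattern-test label


def is_block_move_failure(chip_records):
    """Return True when the chip has failure data and all of it was
    produced by the block move pattern test."""
    return bool(chip_records) and all(
        rec["pattern"] == BLOCK_MOVE for rec in chip_records
    )


only_block_move = [{"pattern": "block_move"}, {"pattern": "block_move"}]
mixed_patterns = [{"pattern": "block_move"}, {"pattern": "march"}]

block_move_a = is_block_move_failure(only_block_move)
block_move_b = is_block_move_failure(mixed_patterns)
```

When the function returns False, the analysis would continue with the partial failure type of each physical module, as described above.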
It may be understood that, in the embodiments of the present application, the whole failure type is not limited to the contact failure type and the block move failure type in the foregoing embodiments. The whole failure type may also be or include other forms of whole failure types. Correspondingly, steps S200, S300, etc. are not limited to the form in the foregoing embodiments.
In one embodiment, referring to
As an example, in step S510, for each physical module, the method for determining the module failure category may include:
This embodiment first determines the module failure category of the physical module according to the situation of the IO channel corresponding to the failure data in the physical module. Then, the partial failure type of each physical module is determined according to a different method for determining the corresponding module failure category, so that the determination on the partial failure type of each physical module is more accurate.
In one embodiment, the physical module includes a plurality of storage units arranged in an array. Each storage unit outputs data through a corresponding IO channel.
Referring to
In one embodiment, the first determination parameter includes at least a maximum row spacing, a minimum row spacing, a maximum column spacing, a minimum column spacing, a row continuous spacing ratio, and a column continuous spacing ratio between the storage units corresponding to each failure data. The row continuous spacing ratio is the ratio of failure data whose row spacing between the corresponding storage units is less than or equal to a row spacing threshold, and the column continuous spacing ratio is the ratio of failure data whose column spacing between the corresponding storage units is less than or equal to a column spacing threshold.
The first determination parameter may be obtained before step S13. For example, but without limitation, it may be obtained after step S4 and before step S520.
In this embodiment, referring to
It may be understood that the “row spacing threshold”, “column spacing threshold”, and “ratio thresholds” in the steps may be set according to actual conditions. As an example, the “row spacing threshold” may be set to 2 uniformly, and the “column spacing threshold” may be set to 8 uniformly.
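The spacing parameters can be computed as in the sketch below. It assumes that "spacing" means the difference between neighbouring row (or column) addresses after sorting, and that the continuous spacing ratio is the share of those differences at or below the threshold; the text does not define these details, so this is one possible reading. The default thresholds 2 and 8 are the example values given above.

```python
def spacing_stats(addresses, threshold):
    """Return (max spacing, min spacing, continuous spacing ratio) for
    neighbouring addresses after sorting. Needs at least two addresses."""
    ordered = sorted(addresses)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    ratio = sum(g <= threshold for g in gaps) / len(gaps)
    return max(gaps), min(gaps), ratio


def first_determination_parameter(module_records, row_threshold=2, col_threshold=8):
    """Collect the row and column spacing parameters of one physical module."""
    max_r, min_r, ratio_r = spacing_stats([r["row"] for r in module_records], row_threshold)
    max_c, min_c, ratio_c = spacing_stats([r["col"] for r in module_records], col_threshold)
    return {
        "max_row_spacing": max_r,
        "min_row_spacing": min_r,
        "max_col_spacing": max_c,
        "min_col_spacing": min_c,
        "row_continuous_ratio": ratio_r,
        "col_continuous_ratio": ratio_c,
    }


# Hypothetical failure records of one physical module.
module_records = [
    {"row": 10, "col": 4},
    {"row": 11, "col": 4},
    {"row": 12, "col": 20},
    {"row": 40, "col": 4},
]
params = first_determination_parameter(module_records)
```

In this example three of the four failing rows are adjacent, so two of the three row gaps fall within the row spacing threshold.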
In this embodiment, the maximum row spacing, the minimum row spacing, the maximum column spacing, and the minimum column spacing are used as the first determination parameter to effectively determine the partial failure type of the physical module.
Of course, in other embodiments, the first determination parameter is not limited to the determination parameter in this embodiment. Correspondingly, the process of determining the partial failure type of the physical module is not limited to the form of this embodiment.
In one embodiment, the physical module includes a plurality of storage units arranged in an array. Each storage unit outputs data through a corresponding IO channel.
The method for determining the multi-channel failure category in step S530 includes:
The second determination parameter may be obtained before step S21. For example, but without limitation, it may be obtained after step S4 and before step S520.
In one embodiment, the second determination parameter includes at least a minimum row spacing, a maximum row spacing, a maximum column spacing, and a row continuous spacing ratio between the storage units corresponding to each failure data. The row continuous spacing ratio is the ratio of failure data whose row spacing between the corresponding storage units is less than or equal to a row spacing threshold.
Then, referring to
The failure data in the sudden failure type and the random failure type here are data related to IO channel problems.
It is understandable that the “row spacing threshold”, “column spacing threshold”, “ratio threshold”, “first threshold”, and “second threshold” in the steps may be set according to actual conditions. As an example, the “row spacing threshold” may be set to 2 uniformly, and the “column spacing threshold” may be set to 8 uniformly.
In this embodiment, the maximum row spacing, the minimum row spacing, and the maximum column spacing are used as the second determination parameter, which can effectively determine the partial failure type of the physical module.
Of course, in other embodiments, the second determination parameter is not limited to the determination parameter in this embodiment. Correspondingly, the process of determining the partial failure type of the physical module is not limited to the form of this embodiment.
In one embodiment, referring to
The repair can effectively improve the yield of the target chip particle.
The repairable type may include, for example, a single-bit failure type. A plurality of replacement units may be provided in the target chip particle. When a storage unit corresponding to the single-bit failure type has an error, the faulty storage unit may be replaced with a replacement unit, so as to repair the target chip particle according to the failure data of the IO channel.
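The replacement idea can be illustrated with the sketch below, which assigns one replacement unit to each failed storage unit. Identifying a storage unit by a `(bank, row, col)` tuple and a replacement unit by a string label are assumptions for illustration; the real repair mechanism is hardware-specific.

```python
def plan_repair(failed_units, replacement_units):
    """Map each failed storage unit to one replacement unit.

    Returns None when there are not enough replacement units.
    """
    if len(failed_units) > len(replacement_units):
        return None
    return dict(zip(failed_units, replacement_units))


# Hypothetical single-bit failures and available replacement units.
failed_units = [(0, 10, 4), (2, 7, 31)]
replacement_units = ["spare_0", "spare_1", "spare_2"]

repair_plan = plan_repair(failed_units, replacement_units)
```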
In an embodiment, after step S700, the method further includes:
As an example, step S900 may include:
By analyzing the failure cause, possible problems of the target chip particles may be found, and the yield control method of the chip particle can thus be obtained, which can effectively improve the yield of chip particles produced later.
Specifically, the analysis system may store a plurality of failure causes and a plurality of yield control methods. In addition, the analysis system may store corresponding relationships between the failure causes and the yield control methods. As such, the yield control method of the target chip particle may be obtained according to the failure cause of the target chip particle.
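The stored correspondence can be pictured as two lookup tables, as in the sketch below. Every entry is a hypothetical placeholder, and deriving the failure cause directly from the storage failure type is an assumption made for illustration.

```python
# Placeholder tables; real entries would be maintained in the analysis system.
CAUSE_BY_FAILURE_TYPE = {
    "contact": "poor contact between chip particle and test carrier",
}
METHOD_BY_CAUSE = {
    "poor contact between chip particle and test carrier": "placeholder yield control method A",
}


def yield_control_method(failure_type):
    """Look up the failure cause and its yield control method.

    Returns (cause, method); either may be None when no entry is stored.
    """
    cause = CAUSE_BY_FAILURE_TYPE.get(failure_type)
    method = METHOD_BY_CAUSE.get(cause)
    return cause, method


cause, method = yield_control_method("contact")
unknown = yield_control_method("unlisted_type")
```

Returning `None` for unknown entries leaves room for the engineering analysis mentioned in the text to fill the gaps.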
Here, in practical applications, while the yield control method is automatically obtained by an analysis system, engineers can also perform engineering analysis, thereby improving the yield of chip particles more effectively by combining the engineering analysis results with the results obtained by the analysis system.
It should be understood that although various steps in the flowcharts of
In one embodiment, computer equipment is provided, including a memory and a processor, the memory storing a computer program therein, wherein when the processor executes the computer program, the following steps are implemented:
In an embodiment, when the processor executes the computer program, the following steps are further implemented:
If the failure data does not satisfy the whole failure type criterion, the storage failure type of the target chip particle is determined according to the partial failure type of each physical module.
In one embodiment, a computer-readable storage medium is provided, storing a computer program thereon, wherein when the computer program is executed by a processor, the following steps are implemented:
In an embodiment, when the computer program is executed by the processor, the following steps are further implemented:
If the failure data does not satisfy the whole failure type criterion, the storage failure type of the target chip particle is determined according to the partial failure type of each physical module.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium. The computer program, when executed, may include the processes of the embodiments of the above methods. Any reference to the memory, storage, database or other media used in the embodiments provided by the present application may include at least one of non-volatile and volatile memories. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, or an optical memory. The volatile memory may include a Random Access Memory (RAM) or an external cache memory. Illustratively, rather than limiting, the RAM may be in various forms, such as a Static Random Access Memory (SRAM) or a Dynamic Random Access Memory (DRAM).
In the description of this specification, the description with reference to the terms “one embodiment”, “other embodiment”, etc. means that the specific feature, structure, material or characteristic described in conjunction with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic description of the above terms does not necessarily refer to the same embodiment or example.
The technical features of the above embodiments may be combined arbitrarily. For the purpose of simplicity in description, all the possible combinations of the technical features in the above embodiments are not described. However, as long as the combinations of these technical features do not have contradictions, they shall fall within the scope of the specification.
The foregoing embodiments only describe several implementations of the present application, and their descriptions are specific and detailed, but cannot therefore be understood as limitations to the patent scope of the present invention. It should be noted that a person of ordinary skill in the art may further make variations and improvements without departing from the conception of the present application, and these all fall within the protection scope of the present application. Therefore, the patent protection scope of the present application should be subject to the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202110105459.8 | Jan 2021 | CN | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2021/101736 | 6/23/2021 | WO | |