This invention relates generally to data backup and recovery processes and, more particularly, to automated data processing systems and methods for detecting incongruous, inconsistent, or incorrect media in a data recovery process.
Data recovery services have become an important part of today's digital world. Many entities, especially those that create and store significant amounts of electronic data, rely on data recovery services to recover data from data backup storage media such as tapes, discs, disk drives, or other removable storage devices.
Recovering a large volume of electronic data typically involves processing many items of backup storage media, in some cases thousands. In practice, on the production floor, multiple large data recovery jobs may be processed concurrently. Factors such as close temporal and spatial relationships among thousands of data media may cause incongruous, inconsistent, incorrect, or improper media items to be loaded into a job to which they do not belong.
Currently, most data recovery services employ manual inspections to detect out-of-place media in a data recovery job. Human inspectors must manually check every physical media label and cross-reference it against a master list for a particular data recovery job. This process is time-consuming, tedious, and potentially inaccurate. Moreover, even after manually checking every label, there is no guarantee that all data backup media loaded into a data recovery job are correct because a correct label could have been mistakenly applied on or otherwise attached to a wrong medium (e.g., a tape that is not a member of the backup job being recovered). In the event that a mistake (e.g., an incorrect medium) is found, it can cause an entire job to be re-executed, wasting time and money.
A need exists for a computer-implemented, automated data recovery system and method that can detect mistakes at various stages of a data recovery process, avoiding entirely or substantially reducing the probability of loading incongruous or incorrect media into a job to which they do not belong. Embodiments of the present invention address this need and more.
Embodiments of the present invention provide an automated system and method for detecting and preventing wrong media from becoming part of a job to which they do not belong. A job is a physical and/or logical set of data that belongs to a customer order for data recovery, which can be classified into native data recovery and non-native data recovery. As one skilled in the art will appreciate, while embodiments of the invention disclosed herein can be utilized to process native data recovery jobs, they can be particularly useful for non-native data recovery jobs. In addition, embodiments of the invention disclosed herein can be particularly useful in processing large volumes of data recovery jobs.
One embodiment of the invention can detect when a wrong data backup storage medium (e.g., a tape or cartridge) is incorrectly placed into a data recovery job (e.g., a set of tapes) on the production floor. Detecting such an error early in a data recovery process can save time, reduce costs, and help prevent wrong data from being returned to a customer.
One embodiment of the invention can detect physical characteristics, for example, tape type, density, identifying tags or labels, etc., of a data backup storage medium and, based on the detected physical characteristics, compare the data backup storage medium with a predefined set of acceptable media for a particular job or job lot. This method can detect the most blatant media errors and can be combined with other methods to detect and eliminate most errors.
By way of example, a first method restricts media in a job based upon source data type. For example, if a job to be processed contains Tivoli® tapes only (i.e., tapes recorded using a Tivoli® data-backup system), then only Tivoli® loading tools are allowed to run. An attempt to run other types of loading tools, e.g., NetBackup®, will result in an error message. One factor to be considered is that two different jobs may involve the same source data type.
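The source-data-type restriction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the names `Job`, `LoadingToolNotAllowed`, and `check_loading_tool`, and the string labels used for data types, are hypothetical.

```python
from dataclasses import dataclass


class LoadingToolNotAllowed(Exception):
    """Raised when a loading tool does not match the job's source data type."""


@dataclass(frozen=True)
class Job:
    job_id: str
    source_data_type: str  # illustrative label only, e.g. "tivoli"


def check_loading_tool(job: Job, tool_data_type: str) -> None:
    """Allow a loading tool to run only if it matches the job's source data type."""
    if tool_data_type.lower() != job.source_data_type.lower():
        # Corresponds to the error message produced when a wrong tool is run.
        raise LoadingToolNotAllowed(
            f"{tool_data_type} loading tool is not permitted for job "
            f"{job.job_id} ({job.source_data_type} media only)"
        )
```

Because two different jobs may involve the same source data type, a check of this kind can only catch a mismatched tool; it cannot tell apart media from two jobs of the same type.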
A second method employs a predefined set of media identifications to determine whether a data backup storage medium belongs to a particular job. Media identifications can be arbitrary and, in one embodiment, can be configured and set by customers. For example, a job may contain a set of unique media identifications, each of which can include, for example, a serial number, a physical label, or a combination thereof. On most electronic data backup systems, media are usually identifiable by physical labels bearing barcodes positioned on the outside of the media. The media identifications contained in the barcodes can also be electronically written to some type of media header (e.g., on the first block or two of the data). Other types of media identifications can be adapted to implement this method. Assuming that a list of media identifications can be generated (e.g., from barcodes) to identify all of the media belonging to the same job, this method can systematically and programmatically determine whether a particular data backup storage medium is incongruous (i.e., out of place, inappropriate, inconsistent, or incorrect). One factor to consider is that, although current data backup systems commonly require media barcodes to be entered in sequence, human errors may still be possible (e.g., a correct barcode label is placed on a wrong backup medium, and vice versa).
One embodiment of the invention comprises a software tool or a set of tools (referred to herein as the “Fingerprinter”) that can develop/generate a fingerprint, signature, or profile for a job by creating a signature for each storage medium in the job and comparing these signatures. In this embodiment, the Fingerprinter operates to analyze directory structure patterns and naming conventions and apply a pattern representation formula to them that yields a signature for that job. As each file or item is read, a new signature is correspondingly calculated and compared to the overall signature of the job. In one embodiment, a set of media signatures obtained or generated thus far for the job represents the overall signature of the job. This overall job signature can be updated as additional media are analyzed and/or recovered while processing the job. Media signatures for each storage medium must match defined criteria to be considered part of the same job. The Fingerprinter can be customized so that if the signature of the new storage medium is different from the rest by some preset margin, a human inspector or operator may be notified or warned and asked to double check whether the storage medium is correct (i.e., it is part of that job). If it is correct, the overall signature for the job can be revised to integrate the new signature. If it is incorrect, the Fingerprinter or another software tool can operate to remove all data in the recovery job that had been incorrectly loaded from the wrong storage medium. In this manner, the accuracy of this method of ensuring correct media increases with the number of media processed. After several media have been processed, the anomaly (i.e., wrong media) can be analyzed more accurately.
This invention represents a significant improvement in detecting incongruous or incorrect media in data recovery processes. Embodiments disclosed herein can deliver fast and timely results, can detect the presence of a wrong medium quickly and early in a data recovery process, and can automatically stop processing an incongruous or incorrect medium. Any medium identified as incorrect or potentially incorrect can either be removed or confirmed to be valid. Other objects and advantages of the present invention will become apparent to one skilled in the art upon reading and understanding the detailed description of the preferred embodiments described herein with reference to the following drawings.
The invention and various features and advantageous details thereof are explained more fully with reference to the exemplary, and therefore non-limiting, embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of known programming techniques, computer software, hardware, operating platforms and protocols may be omitted so as not to unnecessarily obscure the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
Embodiments of the invention provide an automated system and method for detecting and identifying that a storage medium (e.g., removable tape or disk) in a data recovery job may be incorrect and/or may not belong to a physical and/or logical set (e.g., a media pool of a particular job). According to one embodiment of the invention, the automation disclosed herein can be described as semi-automatic, for it can employ limited human inspection for verification purposes.
Media type and format validation. As illustrated in
In step 120, a media processing application or tool (e.g., software programming) is run to read a data backup storage medium (e.g., a DLT® tape cartridge) and determine the media type and media format of the storage medium. In step 130, the media processing tool performs an environmental check to validate or invalidate the storage medium. If the media type and media format environmental setting matches the function of the media processing tool, it advances to the next stage via step 201. Otherwise, the media processing tool sends an error message indicating that the storage medium is not permitted to run in the current environment (step 140). In the example above, the media processing tool would examine the storage medium to determine if the storage medium was a DLT® tape containing data stored in the Legato® backup software data format. If the storage medium was a DLT® tape containing Legato® formatted data, the media processing tool would indicate that the storage medium “passed” at step 130. If the tape was of a different type or of a different media format, the media processing tool would indicate that the storage medium did not pass and could move to step 140. In one embodiment, a human inspection may be employed to verify the media type of the rejected medium (step 150). This can safeguard against machine errors and media irregularities and allow correct media types that might otherwise have been deemed invalid by the media processing tool to be inserted into a job. In one embodiment, no human inspection is utilized, in which case steps 140 and 150 are skipped. In one embodiment, in a case where a first media processing tool exits on the first try or the first few tries, a second (or more if necessary) media processing tool may be run to read the data backup storage medium. In one embodiment, once a media processing tool passes the environmental check, it may be employed to validate/invalidate media type and format for all media in a data recovery job.
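The environmental check of steps 130 through 150 can be sketched as follows. This is a minimal sketch under stated assumptions: `MediaInfo`, `environment_check`, and the optional `inspector` callback are hypothetical names introduced for illustration, and the way a real tool detects media type and format is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MediaInfo:
    media_type: str    # illustrative label, e.g. "DLT"
    media_format: str  # illustrative label, e.g. "Legato"


def environment_check(
    detected: MediaInfo,
    expected: MediaInfo,
    inspector: Optional[Callable[[MediaInfo], bool]] = None,
) -> bool:
    """Return True if the medium may advance to media pool verification."""
    # Step 130: media type and format must match the tool's environment.
    if detected == expected:
        return True
    # Step 140: report that the medium is not permitted in this environment.
    print(
        f"ERROR: {detected.media_type}/{detected.media_format} medium is not "
        f"permitted; expected {expected.media_type}/{expected.media_format}"
    )
    # Step 150 (optional): a human inspector may override the rejection.
    if inspector is not None:
        return inspector(detected)
    return False
```

Passing `inspector=None` models the embodiment without human inspection, in which a mismatched medium is simply rejected.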
In one embodiment, process 100 may include step 160 to determine whether the job has been completed when an incorrect storage medium is detected. If not, the media processing tool may operate to read the next data backup storage medium. If so, the process validation flow 100 is completed for the particular job or subset of a particular job.
Media pool membership. As discussed above, once the media processing tool passes the environmental check, it can start on the next stage, media pool membership verification process 200, via step 201. According to one embodiment of the invention, media pool membership verification 200 is the second major stage in the media validation process. Generally, when media are received by a processing plant, the first step is media registration. This means that, for a given set of media, each physical medium of the set of media is inspected for a unique identifier (ID), such as a barcode, radio frequency ID (RFID) or the like, that can be recorded for later verification. These identifiers can generally be found on the physical media themselves (e.g., a barcode label). During media registration, each storage medium is visually and/or electronically inspected and its physical identifier is manually or electronically entered into an electronic list or database. Hence, in one embodiment, a media pool refers to a list of members of a particular set (e.g., a customer's job or a subgroup thereof). This list can be used to identify/verify media pool membership. In a simple example, a job for a customer could consist of 100 Legato® formatted DLT® tapes with identifications 1 through 100.
Referring to
Another type of media identification is associated with the initial data on the media; for example, the header of a storage medium. The header information can often be analyzed to ascertain a media identifier. Once at least one of these two types of media identifications is obtained at step 210, multiple distinct checks can be performed. In step 220, as an example, a barcode scanned from a physical label of the storage medium (i.e., first type of media identification) can be compared with the media identifier derived from the header (i.e., second type of media identification). This confirms that the physical label of the storage medium matches the media identification of the storage medium. If they match, then the media identifier is compared with the media pool membership list at step 230 to verify whether the current storage medium is a member of the media pool that is being processed. In alternative embodiments, only one or the other media identification type from a particular storage medium is verified against the media pool membership list at step 220.
If the media identifier(s) matches an entry on the media pool membership list, the probability is high that the storage medium is valid (i.e., that the current storage medium belongs to the given set of media). Accordingly, it is allowed to proceed to the next stage at step 301. If the media identifier does not match an entry on the media pool membership list, processing is suspended and optionally a request can be made for a human inspection at step 240. In this embodiment, a human operator or inspector may be asked to validate the storage medium. In step 250, if the inspector validates the storage medium as being correct (i.e., valid), the information is logged for future reference (e.g., updating the media pool membership list with the media identifier) and the media processing proceeds to the next stage via step 301. If the inspector confirms that the storage medium is incorrect or invalid, the media processing tool ejects the current storage medium and proceeds to step 160 to process the next storage medium until all of the media in the given set of media (i.e., the media pool or the job) are processed. In one embodiment, human inspection steps 240 and 250 may be skipped so that, if there is no match, the media processing tool simply ejects the invalid storage medium without human inspection and proceeds to step 160.
Referring back to step 220, in the case that the physical medium identifier (e.g., a barcode from the media container) and the digital media identifier (e.g., data identification in the header) do not match, both can be independently checked against the media pool membership list in step 230, according to one embodiment of the invention. If both appear to be on the media pool membership list or either one matches an entry on the media pool membership list, then the storage medium can potentially be valid. To be sure, in one embodiment, a human operator or inspector may be asked to confirm the current storage medium's membership in the media pool. The inspector may accept (approve) or reject it. If the storage medium is accepted, the media pool membership list can be updated accordingly and the media processing tool proceeds to the next stage via step 301. If the storage medium is rejected, or neither the barcode from the media container nor the media identifier matches an entry on the media pool membership list, the storage medium is ejected and the media processing tool proceeds to step 160.
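The membership decisions of steps 220 through 250 can be summarized in a short sketch. It is illustrative only: `Decision`, `verify_membership`, and the `inspector` callback are hypothetical names, and two behaviors are assumptions not stated in the text, namely ejecting a medium that yields no identifier at all and ejecting a doubtful medium when no inspector is available.

```python
from enum import Enum
from typing import Callable, Optional, Set


class Decision(Enum):
    PROCEED = "proceed"  # continue to content fingerprinting (step 301)
    EJECT = "eject"      # eject and move to the next medium (step 160)


def verify_membership(
    label_id: Optional[str],
    header_id: Optional[str],
    pool: Set[str],
    inspector: Optional[Callable[[Optional[str], Optional[str]], bool]] = None,
) -> Decision:
    """Decide whether a medium belongs to the media pool."""
    ids = {i for i in (label_id, header_id) if i is not None}
    if not ids:
        return Decision.EJECT  # assumption: nothing to verify against
    consistent = len(ids) == 1  # step 220: label and header agree (or only one exists)
    on_list = ids & pool        # step 230: compare with the membership list
    if consistent and on_list:
        return Decision.PROCEED
    if not consistent and not on_list:
        return Decision.EJECT   # mismatched identifiers, neither on the list
    # Doubtful cases: consistent but unlisted, or mismatched with one listed.
    if inspector is not None and inspector(label_id, header_id):
        pool.update(ids)        # log the accepted identifier(s) in the pool
        return Decision.PROCEED
    return Decision.EJECT
```

The `pool` set is updated in place when an inspector accepts a medium, mirroring the update of the media pool membership list described above.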
Content fingerprinting. Once a storage medium's media pool membership is confirmed, whether it is through matching an entry on a media pool membership list or by the approval of a human inspector, it is ready for the next stage, content fingerprinting 300, one embodiment of which is illustrated in
In one embodiment, if a storage medium contains a certain percentage of terms that appear on the master list, that particular storage medium is considered valid. If none or very few terms can be found to match terms on the master list, a human user is asked to verify the storage medium's validity. If the human user does confirm that the particular storage medium is valid, the master list can be updated to include new term(s) extracted from the file paths of the human inspected storage medium. If the inspected storage medium is deemed invalid for this particular media pool, then any data thus far processed from the storage medium is marked as invalid and purged from the processing system. In one embodiment, human inspection for the first storage medium to be processed for a media pool is always requested to ensure that the initial master list is not an invalid list.
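The percentage test described above can be sketched as follows. This is a sketch under stated assumptions: the rule for splitting a file path into terms (on path separators, lower-cased) and the 50% default are illustrative choices, and `extract_terms`, `match_ratio`, and `is_valid` are hypothetical names.

```python
import re
from typing import Iterable, List, Set


def extract_terms(file_path: str) -> List[str]:
    """Split a file path into lower-cased terms on path separators (assumed rule)."""
    return [t.lower() for t in re.split(r"[\\/]+", file_path) if t]


def match_ratio(file_paths: Iterable[str], master_list: Set[str]) -> float:
    """Fraction of a medium's unique path terms that appear on the master list."""
    terms = {t for p in file_paths for t in extract_terms(p)}
    if not terms:
        return 0.0
    return len(terms & master_list) / len(terms)


def is_valid(file_paths: Iterable[str], master_list: Set[str],
             min_ratio: float = 0.5) -> bool:
    """Treat the medium as valid when enough of its terms match the master list."""
    return match_ratio(file_paths, master_list) >= min_ratio
```

A medium that fails this test would not be rejected outright in the embodiment described; it would be referred to a human user for verification.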
Referring to
In one embodiment, a total term counter is initially set to zero. In one embodiment, if a term identified in the file path at step 311 is a new term (see step 312), it is counted once and added to a candidate list at step 313. A candidate list is a list of unique terms identified and extracted from file paths of members of the media pool. The total number of terms on the candidate list is increased by one at step 318 and checked against a preset goal (e.g., 1000 terms) at step 319. If the term identified at the file path at step 311 is not a new term (see step 312), then that term's count is increased by one at step 314 and compared with a predetermined threshold (e.g., 10 repeats). In one embodiment, once a term has been found to repeat the same as or more times than the predetermined threshold at step 315, it may be added to a working list of common terms at step 317 if it is not already on the working list (step 316). A working list is a list of repeat terms, each exceeding the predetermined threshold (e.g., 10 or more repeats). Each time a term is extracted, be it a new term or a repeat, the term total is increased by one (step 318). At step 319, if the term total does not meet the predetermined goal (e.g., 1000 terms), the next term from the file path is extracted and analyzed at step 311. If there are no more terms that can be extracted from the file path, the next file or item is read from the storage medium and its file path scanned for terms at step 310. One of ordinary skill in the art will appreciate that the preset goal can be any number and thus is not limited to 1000. Similarly, the predetermined threshold can be any number of repeats and is not limited to 10. As will be explained below with reference to
Referring to
Once the master list is compiled, it can represent a unique profile or “fingerprint” for the given media pool. The lists (i.e., candidate list, working list, and master list) can be continuously revised as each storage medium of a given media pool is read. If the media pool represents a subset of one or more media pools, the lists can be reused for other media pools that may be of a different media type and/or format.
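The term counting of steps 311 through 319 can be sketched as follows, assuming terms have already been extracted from file paths. The function name `build_term_lists` and its parameter names are hypothetical, and the sketch follows the reading in which the term total grows by one for every extracted term, new or repeated. Compiling the master list from these lists is not shown.

```python
from typing import Dict, Iterable, List, Tuple


def build_term_lists(
    terms: Iterable[str],
    repeat_threshold: int = 10,
    term_goal: int = 1000,
) -> Tuple[Dict[str, int], List[str]]:
    """Return (candidate list with counts, working list of common terms)."""
    candidate: Dict[str, int] = {}  # unique terms and their counts
    working: List[str] = []         # terms that reached the repeat threshold
    term_total = 0                  # total term counter, initially zero
    for term in terms:              # step 311: next term from a file path
        if term not in candidate:   # step 312: is this a new term?
            candidate[term] = 1     # step 313: add to the candidate list
        else:
            candidate[term] += 1    # step 314: increase the term's count
            # Steps 315-317: promote once the threshold is reached, only once.
            if candidate[term] >= repeat_threshold and term not in working:
                working.append(term)
        term_total += 1             # step 318: count every extracted term
        if term_total >= term_goal: # step 319: stop when the goal is met
            break
    return candidate, working
```

The defaults of 10 repeats and 1000 terms mirror the examples in the text; as noted there, both values can be any number.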
Flow diagram 375 of
In one embodiment, if the storage medium cannot be validated at step 335, it is ejected and the media processing tool proceeds to process the next storage medium. In one embodiment, if the storage medium cannot be validated at step 335, then all of the data (e.g., files) that are thus far processed from that invalid storage medium can be marked as invalid and removed from the job. In one embodiment, this can be done by utilizing the global media identifier that is associated with content from that storage medium (e.g., a volume ID or other equivalent identifier) to flag the content as invalid. Flagged invalid content can be purged from the processing system during a garbage collection cycle. In one embodiment, marking and purging invalid content can be done as part of step 335. In addition, all of the counts and new terms from the invalid storage medium may be removed from the candidate and master lists at step 335. The media processing tool then proceeds to process the next storage medium or concludes the job (step 160).
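Flagging content by its global media identifier and purging it later can be sketched as follows. The record layout and the names `RecoveredItem`, `flag_invalid`, and `purge_invalid` are hypothetical; an actual processing system would likely keep such records in a database rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveredItem:
    path: str
    volume_id: str         # global media identifier of the source medium
    invalid: bool = False  # set when the source medium is rejected


def flag_invalid(items: List[RecoveredItem], volume_id: str) -> int:
    """Mark every item recovered from the given medium as invalid."""
    flagged = 0
    for item in items:
        if item.volume_id == volume_id and not item.invalid:
            item.invalid = True
            flagged += 1
    return flagged


def purge_invalid(items: List[RecoveredItem]) -> List[RecoveredItem]:
    """Garbage-collection pass: keep only items not flagged as invalid."""
    return [item for item in items if not item.invalid]
```

Separating the flagging step from the purge matches the description of purging during a later garbage collection cycle; the two could also be combined in a single step.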
In this example, HD 614 is programmed with computer-executable program instructions embodied on a computer-readable medium 620. These computer-executable program instructions, when executed by CPU 611, can implement the methods disclosed herein with reference to
Although the present invention has been described and illustrated in detail, it should be understood that the embodiments and drawings are not meant to be limiting. Various alterations and modifications are possible without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be determined by the following claims and their legal equivalents.
This application claims priority from U.S. Provisional Patent Application Ser. No. 60/634,352, filed Dec. 8, 2004, which is hereby incorporated herein by reference in its entirety for all purposes.