The present invention relates in general to a data management system and method, and more particularly, to an automated data management system and method for organizing and processing a large volume of various types of data files.
With more and more information being stored electronically, it is found that the information is often stored in different formats, i.e., different types of files, on different storage media, using different versions of applications, or run by different operating systems. For example, some data may be in Microsoft Word format, while other data may be in WordPerfect format. Some data is in Microsoft Excel format, while others are in a variety of formats including, but not limited to, Microsoft Mail, Outlook, GroupWise, Lotus Notes, etc. Further, data may be stored in a hard drive, a floppy disk, a backup tape, a CD, or an optical device, etc. Furthermore, data may be operated by a UNIX, NOVELL, NT, or DOS system, etc.
To review and/or manipulate data that are stored in different file types, using different versions, on different media, and run by different operating systems, a customer often needs to open and close the corresponding software programs, such as Word, WordPerfect, Excel, Outlook, etc. This is a very inefficient way of reviewing and manipulating the stored data. Further, one has to have these software programs and their updated versions to review and/or manipulate the stored data.
In the area of litigation support, in particular, a huge number of documents and/or exhibits may have to be produced, organized, reviewed, reproduced, etc., for example, in merger and acquisition, intellectual property, anti-trust, and class action cases. The documents and/or exhibits may come from different locations in different file types using different versions. The existing methods of handling documents and/or exhibits include hand-coding or bar-coding. The hand-coding or bar-coding methods are not truly automated, and they are inefficient, particularly in handling a large volume of documents and/or exhibits.
Many litigation support companies send huge amounts of electronic documents to a developing country or hire scores of temporary workers. These workers open documents, print them, and enter information about each document by hand into an organized file. These methods are often time consuming, labor intensive, and prone to human mistakes. The sheer volume of data that one needs to review under strict discovery deadlines makes the review a challenging and time demanding task. As a reviewer gathers electronic information, the reviewer is required to be confident that s/he has thoroughly searched, found, and reviewed all of the information residing on laptops, desktops, servers, and backup tapes, and sometimes in multiple locations.
The existing data management systems use data paths, such as data source paths and data destination paths, to organize and/or log or access data files. When one processes the data files, s/he has to find the data paths. Further, the number of data paths is limited. For example, to administer and process three data files, i.e. two generated by John Smith at ABC company on Sep. 12, 2000 in its two New York branch offices and one by Jay Smith at ABC company on Sep. 12, 2000 in one of its New York branch offices, the existing data management systems have used data paths such as ABC\9/12/2000\NY\JohnSmith\file name; ABC\9/12/2000\JohnSmith\NY2\file name; and ABC\9/12/2000\NY\JaySmith\file name. These data paths are closely tied to a specific user, location, etc. The quality and efficiency of processing data files depend significantly on a process controller's experience and knowledge of data path structures.
Accordingly, there is a need for an efficient, automated data management system and method for organizing and processing a large volume of various types of data files. Further, improvements on administering and controlling the automated data management process are desired.
It is with respect to these or other considerations that the present invention has been made.
In accordance with this invention, the above and other problems are solved by providing an efficient, automated data management system for logging, processing, and reporting a large volume of data capable of being of any type.
In one embodiment, a data management system in accordance with the principles of the present invention provides a data slice which is used to describe and categorize a unique set of data where every data file in that set of data has common characteristics, such as, but not limited to, owner/creator, location, backup date, or data type, etc., that are important in describing and labeling the data files. In other words, a data slice is a label assigned to a set or collection of data, and a data slice generally includes data descriptors or characteristics, such as company, user, date, location, etc. A data slice preferably has an ID number that is stored in a database.
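By way of non-limiting illustration, a data slice can be pictured as a small record of descriptors plus an ID, rather than as a directory path. The following is a minimal sketch only; the class name `DataSlice`, the field names, the in-memory ID counter, and the helper `select` are assumptions made for illustration and are not taken from the disclosure, which states only that the ID number is preferably stored in a database.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ID source; the disclosure only says the ID is stored in a database.
_next_id = count(1)


@dataclass(frozen=True)
class DataSlice:
    """A label for a set of data files that share the same descriptors."""
    company: str
    user: str
    date: str
    location: str
    slice_id: int = field(default_factory=lambda: next(_next_id))


# The three files from the background example, expressed as labels instead of paths.
slices = [
    DataSlice("ABC", "John Smith", "2000-09-12", "NY"),
    DataSlice("ABC", "John Smith", "2000-09-12", "NY2"),
    DataSlice("ABC", "Jay Smith", "2000-09-12", "NY"),
]


def select(slices, **criteria):
    """Return the slices whose descriptors match every given criterion."""
    return [s for s in slices
            if all(getattr(s, key) == value for key, value in criteria.items())]
```

Under these assumptions, `select(slices, user="John Smith")` returns the two slices for that user regardless of the order in which a path would have listed the descriptors.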
One embodiment of a data management system in accordance with the principles of the present invention includes: a first processor for restoring a plurality of received data files, the data files being capable of being different file types; a file organizing/categorizing processor for organizing the received data files into data slices, each data slice including an identification number and a descriptor that describes characteristics of the received data file; a file logging processor for logging the received data files into a first database based on the data slices; a data uploading processor for uploading the first database to a second database; a de-duplicate processor for calculating a SHA value of the received data files to determine whether the received data files have duplicates and flagging duplicated data files in the second database; an image conversion processor for converting at least a portion of the received data files into image files; and a second processor for exporting the image files.
In one embodiment, the first database is a local database for a specific data slice or a predetermined number of data slices, and the second database is a global database for the data slices in combination. The image files are preferably stored in the global database to be viewed.
Further in one embodiment, the image files that are converted from the data files are in a standardized image format, such as tiff format, PDF format, etc. The image files can then be exported/outputted, e.g. printed, etc.
Yet in one embodiment, the data files are in a variety of e-mail formats including, but not limited to, Microsoft Mail, Outlook, GroupWise, Lotus Notes, etc. Also, the data files may be in a variety of document formats including Word, Excel, PowerPoint, and Access. The data files may include an attachment data file, which in turn may contain an additional attachment data file. The process is designed to handle an endless number of levels of embedded data files.
Additionally in one embodiment, an attachment data file is generally associated with a data file such that image files for the data file and the corresponding attachment data file can be viewed together.
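As an illustrative sketch of how arbitrarily nested attachments might be traversed while keeping each attachment associated with its parent, the following uses an explicit stack so that nesting depth is bounded by memory rather than by a recursion limit. The `DataFile` class, the `walk` function, and the sample file names are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class DataFile:
    name: str
    attachments: List["DataFile"] = field(default_factory=list)


def walk(root: DataFile) -> Iterator[Tuple[DataFile, Optional[DataFile], int]]:
    """Yield (file, parent, depth) for a file and every nested attachment.

    Files are visited parent first, then attachments in their original order.
    """
    stack = [(root, None, 0)]
    while stack:
        node, parent, depth = stack.pop()
        yield node, parent, depth
        # Push in reverse so the first attachment is popped (visited) first.
        for child in reversed(node.attachments):
            stack.append((child, node, depth + 1))


# Hypothetical message with an archive that itself contains a document.
mail = DataFile("message.msg", [
    DataFile("archive.zip", [DataFile("report.doc")]),
    DataFile("sheet.xls"),
])
```

Because each yielded item carries its parent, image files for a data file and its attachments could later be grouped for viewing together.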
Still in one embodiment, the file logging processor, the image conversion processor, and the second processor are parallel processors such that the data files are parallel-processed in a data file logging stage, an image conversion stage, and an image file output stage.
Further in one embodiment, the data files having the same file type are preferably converted into the image files together.
Yet in one embodiment, the data management system includes a plurality of image conversion processors, each of the image conversion processors being capable of converting the data files having the same file type into the corresponding image files.
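The grouping of data files by file type and the use of several conversion processors can be sketched as follows. This is a sketch under stated assumptions: the file extension stands in for the file type (the disclosure identifies the type from the file header instead), `convert_batch` is a placeholder that only renames files rather than rendering images, and a thread pool stands in for the plurality of image conversion processors.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath


def group_by_type(names):
    """Group file names by lower-cased extension (a stand-in for file type)."""
    groups = defaultdict(list)
    for name in names:
        groups[PurePath(name).suffix.lower()].append(name)
    return dict(groups)


def convert_batch(file_type, names):
    """Placeholder converter: a real one would render each file to TIFF or PDF."""
    return [PurePath(name).with_suffix(".tif").name for name in names]


def convert_all(names, max_workers=4):
    """Convert each file-type group as one batch, with batches running in parallel."""
    groups = group_by_type(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {ftype: pool.submit(convert_batch, ftype, batch)
                   for ftype, batch in groups.items()}
        return {ftype: future.result() for ftype, future in futures.items()}
```

Batching by type means a converter for a given format is set up once per batch instead of once per file, which is one plausible reason for converting same-type files together.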
Additionally in one embodiment, the file logging processor identifies the file type of the data files based on the SHA value and a file header of each of the data files.
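Identification from a file header is commonly done by comparing the first bytes of a file against known signatures. The sketch below illustrates only that header portion; how the SHA value contributes to type identification is not detailed in this summary and is not modeled. The table `SIGNATURES`, the labels, and the function `sniff_type` are illustrative assumptions, and the table is far from exhaustive.

```python
# A few well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word/Excel/PowerPoint container
    (b"PK\x03\x04", "zip"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
]


def sniff_type(header: bytes) -> str:
    """Return a coarse type label based on the leading bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

A header check of this kind does not depend on the file name, so a renamed or mislabeled file would still be logged under its actual type.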
Still in one embodiment, the data management system may include a keyword search processor for searching a keyword from the received data files or processed image files. The keyword search can be performed either before processing the data files or after processing the data files. If a pre-processing keyword search, i.e. a keyword search performed before processing the data files, is desired and performed, each data file with a hit is retained for processing, and each data file without a hit is discarded without being processed. If a post-processing keyword search, i.e. a keyword search performed after processing the data files, is desired and performed, each image file with a hit is exported, and each image file without a hit is not exported.
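The pre-processing and post-processing searches apply the same rule at different stages: items with a hit continue, items without a hit do not. A minimal sketch follows, assuming that text has already been extracted for each item and that matching is a case-insensitive substring test; the function names and the sample items are hypothetical.

```python
def has_hit(text, keywords):
    """True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_by_keyword(items, keywords):
    """Split items into those with and without a keyword hit.

    items maps an item name to its extracted text. Used before processing,
    'kept' are the data files retained for processing; used after
    processing, 'kept' are the image files to export.
    """
    kept = [name for name, text in items.items() if has_hit(text, keywords)]
    dropped = [name for name in items if name not in kept]
    return kept, dropped


items = {"memo.doc": "Merger terms attached", "lunch.msg": "See you at noon"}
kept, dropped = filter_by_keyword(items, ["merger"])
```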
The present invention also provides a method of logging, processing, and reporting a large volume of data capable of being in different types.
In one embodiment, the method in accordance with the principles of the present invention includes the steps of: restoring a plurality of received data files, the data files being capable of being different file types; organizing/categorizing the received data files into data slices, each data slice including an identification number and a descriptor that describes characteristics of the received data file; logging the received data files into a first database based on the data slices; uploading the first database to a second database; de-duplicating duplicates in the received data files by calculating a SHA value of the received data files to determine whether the received data files have duplicates and flagging duplicated data files in the database; converting at least a portion of the received data files into image files, respectively; and exporting the image files.
Still in one embodiment, the method further includes the step of viewing the image files stored in the second database.
Further in one embodiment, the converting of the data files includes converting the data files into the corresponding image files in a standardized image format, such as a PDF format, a tiff format, etc.
One of the advantages of the present invention is that the data files are organized and processed in an efficient automated manner. The turnaround time for generating a report containing the organized image files is substantially shortened. The quality and efficiency of processing data files are improved.
Another advantage of the present invention is that the duplicates in the data files can be eliminated (i.e. de-duplicating). The total size of the data files can be substantially reduced.
A further advantage of the present invention is that the parallel processing of the data files allows the processing of the data files to be scalable.
An additional advantage of the present invention is that the converted image files are organized such that they readily allow further processing of the data files.
Yet another advantage of the present invention is that every data file logged is associated with a data slice ID, which allows the processes, such as de-duplication, image conversion, and image output, to be performed on the data slice level.
These and various other features as well as advantages which characterize the present invention will be apparent from a reading of the following detailed description and a review of the associated drawings.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
The present invention discloses an efficient, automated data management system for logging, processing, and reporting a large volume of data capable of being in different types, using different versions, stored on different media, and/or run by different operating systems.
A preferred embodiment of a data management system 20 in accordance with the principles of the present invention is shown in
An example of a data slice structure or database is shown in
As shown in
A de-duplicate processor 34 is coupled to the data upload processor 32. The de-duplicate processor 34 flags duplicates of the data files, i.e. de-duplicates the data files by creating a unique subset of data files, flagging duplicated files as such, and storing this information in the global database 30. Generally, the de-duplicate processor 34 calculates a SHA value of the received data files to determine whether the received data files have duplicates and flags duplicated data files in the global database 30. The data slice structure of the system 20 allows one to have options of de-duplicating the entire database, not de-duplicating at all, or de-duplicating per data slice or a set of data slices.
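The de-duplication options can be sketched as follows. This is an illustration under stated assumptions: SHA-256 from Python's `hashlib` is used because the text says only "a SHA value", records are plain dictionaries rather than rows of the global database 30, the first occurrence of a file is treated as the unique copy, and the scope names `"global"`, `"slice"`, and `"none"` are invented for the example.

```python
import hashlib


def flag_duplicates(records, scope="global"):
    """Add 'sha' and 'duplicate' keys to each record in place.

    records: list of dicts with 'slice_id' and 'content' (bytes).
    scope:   'global' compares across all slices, 'slice' compares only
             within the same data slice, 'none' flags nothing.
    """
    if scope not in ("global", "slice", "none"):
        raise ValueError(f"unknown scope: {scope!r}")
    seen = set()
    for rec in records:
        rec["sha"] = hashlib.sha256(rec["content"]).hexdigest()
        if scope == "none":
            rec["duplicate"] = False
            continue
        key = rec["sha"] if scope == "global" else (rec["slice_id"], rec["sha"])
        rec["duplicate"] = key in seen  # the first occurrence stays unflagged
        seen.add(key)
    return records
```

Duplicates are flagged rather than deleted, which matches the description of storing the duplicate information in the database and leaves later phases free to include or skip flagged files.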
An image conversion processor 36 is coupled to the de-duplicate processor 34. The image conversion processor 36 converts the data files into image files. The data slice structure of the system 20 allows one to convert the desired data slice.
A data file output processor 38 is coupled to the image conversion processor 36. The data file output processor 38 exports the image files. The data slice structure of the system 20 allows one to have options of exporting the entire converted image files or exporting a set of converted image files. The exporting may include, but is not limited to, printing the image files, or sending the image files to a device, etc.
The application of the data management system 20 may include three phases of data processing. Phase 1 is the file logging/uploading/de-duplicating process. Phase 2 is the file converting process. Phase 3 is the file exporting process. The details of three phases are discussed in operational flows shown in
Next in an operation 50, the received data files are de-duplicated by calculating a SHA value of the received data files so as to determine whether the received data files have the same SHA value. If the data files have the same SHA value, then the data files are duplicates. If duplicates of the data files are found, they are flagged in the global database. Data files are then converted into image files in an operation 52. The control of the operational flow 40 allows one to have the options of converting the de-duplicated data files, i.e. the data files without duplicates, or converting the data files regardless of the duplicates, i.e. no de-duplication, or converting a part of the de-duplicated data files. Next in an operation 54, the converted image files are exported to a device, e.g. a printer, a viewer program, a PDA (Personal Digital Assistant), etc.
The QA operation 68 can be implemented in a user interface to the system. The user interface may provide the status of operations in each phase. For example, the user interface may indicate whether the selected or current data file is in a New status, In-Progress status, Done status, Error status, Ignore status, Check/Search status, QA In-Process status, or No Data status, etc.
Next, the image file is stored in the global database in an operation 92. Then, an operation 94 determines whether there is another file of this file type category left to convert. If “Yes”, then the operational flow 80 goes to the operation 88 to select a new data file under the selected file type. If “No”, then an operation 96 determines whether there is another file type left to select. If “Yes”, then the operational flow 80 goes to the operation 86 to select a new file type. If “No”, then an operation 98 determines whether there is another file status left to select. If “Yes”, then the operational flow 80 goes to the operation 84 to select a new file status. If “No”, then the image conversion operational flow 80 ends.
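The three "is there another one left" decisions above amount to three nested loops: over file statuses, over file types within a status, and over data files within a type. A minimal sketch of the resulting visiting order is given below; the function name, the tuple layout, and the sample files are assumptions for illustration, and the conversion and storage steps themselves are omitted.

```python
def conversion_order(files, statuses, file_types):
    """Yield file names in the order the nested selection loops would visit them.

    files: list of (name, status, file_type) tuples.
    """
    for status in statuses:                    # select a file status (operation 84)
        for file_type in file_types:           # select a file type under that status
            for name, f_status, f_type in files:   # select a data file (operation 88)
                if f_status == status and f_type == file_type:
                    yield name                 # convert, then store (operation 92)


files = [
    ("a.doc", "New", "Word Processing"),
    ("b.xls", "New", "Spreadsheet"),
    ("c.doc", "In Progress", "Word Processing"),
    ("d.doc", "New", "Word Processing"),
]
order = list(conversion_order(files,
                              ["New", "In Progress"],
                              ["Word Processing", "Spreadsheet"]))
```

With this ordering, all files of one type under one status are converted consecutively before the next type is selected.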
Also shown in
It is appreciated that the sequence or order of the operational flows 40, 56, 70, 80, and 100 can be varied within the scope of the present invention. Also, it is appreciated that some steps in the operation flows 40, 56, 70, 80, and 100 can be added, merged, and/or eliminated depending on a customer's needs without departing from the scope of the present invention.
In box 124, the user selects the status of data slices that s/he wants to process, for example, New, In Progress, etc. As described above, usually status “New” is selected for processing. If a data slice had a problem, for example because the machine it was running on was shut down, that data slice would have the status “In Progress”. To view such a problematic data slice and select it for processing, the status is set to “In Progress”.
Then, the system displays all data slices that have the selected phase and status as shown in box 126. Next, the user selects a data slice for processing in box 128. If phase 2, i.e. image conversion, is selected from box 130, i.e. “Yes” path, it is determined whether to process specific file types or file status in box 132. If “Yes”, the user selects status (e.g. New, In Progress, etc.) of the files that s/he wants to process in box 134 and selects category or file type (Word Processing, Spreadsheet, etc.) of the files that s/he wants to process in box 136. Then, the system sets the status of the selected data slice to “In Progress” in box 138. If no specific file type or file status is processed from box 132, or if the user does not want to process phase 2, i.e. the image conversion phase, from box 130, the system sets the data slice status to “In Progress” as shown in box 138. Then, the system processes the data slice in box 140 as shown in
Next, the system checks for processing problems for quality assurance (QA) and posts QA information in box 142 as described above. Then, the system sets data slice status to “Done” in box 144. The user determines whether the QA results are good in box 146. If “No”, then the system sets data slice status to “Error” in box 148 and determines whether to continue processing data slices with the same phase and status in box 150. If it is to continue, i.e. “Yes” path, then the operational flow 120 goes to the operation 128 to select a data slice for processing. If it is not to continue, i.e. “No” path, then the operational flow 120 goes to the operation 122 to select a phase of data slices that the user wants to process.
If the QA results are good from the box 146, i.e. “Yes” path, then the user sets the data slice Phase to the next Phase and the Status to “New” in box 152. Then, the operational flow 120 goes to the operation 150 as described above.
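The status and phase transitions of boxes 138 through 152 can be summarized as a small state machine. The following is a sketch under stated assumptions: `SliceState`, `run_phase`, and the callables `process` and `qa_ok` are hypothetical names; the QA decision, which the text assigns to the user, is modeled as a callable; and a slice that passes QA in the last phase is assumed to remain “Done”, which the text does not state.

```python
from dataclasses import dataclass

LAST_PHASE = 3  # the description names three phases of data processing


@dataclass
class SliceState:
    phase: int = 1
    status: str = "New"


def run_phase(state, process, qa_ok):
    """Run one phase of a data slice and apply the status transitions."""
    state.status = "In Progress"       # box 138
    process(state)                     # box 140
    state.status = "Done"              # box 144
    if not qa_ok(state):               # box 146, "No" path
        state.status = "Error"         # box 148
    elif state.phase < LAST_PHASE:     # box 146, "Yes" path
        state.phase += 1               # box 152: next phase, status "New"
        state.status = "New"
    # Assumption: after the last phase a good slice simply stays "Done".
    return state
```

If `process` is interrupted, for example by a machine shutdown, the status is left at “In Progress”, which is how such a slice can later be found and reselected as described above.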
It will be clear that the present invention is well adapted to attain the ends and advantages mentioned as well as those inherent therein. While presently preferred embodiments have been described for purposes of this disclosure, various changes and modifications may be made which are well within the scope of the present invention. For example, in
Relation | Number | Date | Country |
---|---|---|---|
Parent | 09894373 | Jun 2001 | US |
Child | 10941065 | Sep 2004 | US |