The present invention relates in general to a data management system and method, and more particularly, to an automated data management system and method for organizing and processing a large volume of various types of data files.
With more and more information being stored electronically, the information is often stored in different formats (i.e., different types of files), on different storage media, or under different operating systems. For example, some data may be stored in Microsoft Word format, some in WordPerfect format, some in Microsoft Excel format, and some in a variety of email formats including, but not limited to, Microsoft Mail, Outlook, GroupWise, Lotus Notes, etc. Also, data may be stored on a hard drive, a floppy disk, a backup tape, a CD, or an optical device, etc. Further, data may be managed by a UNIX, NOVELL, NT, or DOS system, etc.
To review and/or manipulate data that are stored in different file types, on different media, or under different operating systems, a customer often needs to open and close the corresponding software programs, such as Word, WordPerfect, Excel, Outlook, etc. This is a very inefficient way of reviewing and manipulating the stored data. Further, one has to have these software programs and their updated versions to review and/or manipulate the stored data.
In the area of litigation support in particular, a huge number of documents and/or exhibits may have to be produced, organized, reviewed, reproduced, etc., for example, in merger and acquisition, intellectual property, antitrust, and class action cases. The documents and/or exhibits may come from different locations in different file types. The existing methods of handling documents and/or exhibits include hand-coding or bar-coding. These methods are not truly automated, and they are inefficient, particularly in handling a large volume of documents and/or exhibits.
Many litigation support companies send huge amounts of electronic documents to developing countries or hire scores of temporary workers. These workers open documents, print documents, and enter information about each document by hand into an organized file. These methods are often time consuming, labor intensive, and prone to human error. Reviewing the sheer volume of data under strict discovery deadlines is a challenging and time-consuming task. As a reviewer gathers electronic information, the reviewer must be confident that he or she has thoroughly searched, found, and reviewed all of the information residing on laptops, desktops, servers, and backup tapes, sometimes in multiple locations.
Accordingly, there is a need for an efficient, automated data management system and method for organizing and processing a large volume of various types of data files.
It is with respect to these or other considerations that the present invention has been made.
In accordance with this invention, the above and other problems were solved by providing an efficient, automated data management system for logging, processing, and reporting a large volume of data capable of being in different types.
In one embodiment, a data management system in accordance with the principles of the present invention includes: a first server processor for restoring a plurality of received data files, the data files being capable of being different file types; a file organizing/categorizing processor for organizing the received data files, based on a predetermined user list, into a source directory structure and a destination directory structure; a file logging processor for logging the received data files into a database formed by the source and destination directory structures and identifying a file type of the received data files; a de-duplicate processor for calculating a SHA value of the received data files to determine whether the received data files have duplicates and flagging duplicated data files in the database; an image conversion processor for converting the remaining subset of de-duplicated data files into image files, respectively; and a second server processor for exporting the image files.
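By way of illustration only, the de-duplication performed by the de-duplicate processor can be sketched as follows. The description specifies only "a SHA value", so the use of SHA-1 through Python's `hashlib`, the function names `sha_value` and `flag_duplicates`, and the in-memory dictionaries standing in for the database are assumptions made for this sketch.

```python
import hashlib


def sha_value(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flag_duplicates(paths):
    """Map each path to True if an earlier file produced the same digest.

    The first file seen with a given digest is treated as the original;
    later files with the same digest are flagged, not deleted.
    """
    seen = {}   # digest -> first path that produced it
    flags = {}  # path -> duplicate flag (stand-in for a database column)
    for path in paths:
        digest = sha_value(path)
        flags[str(path)] = digest in seen
        seen.setdefault(digest, str(path))
    return flags
```

Flagging rather than deleting keeps every received file logged, so only the unflagged subset needs to be passed on to image conversion.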
Still in one embodiment, the image files are stored in the database to be viewed.
Further in one embodiment, the image files converted from the data files are in a tiff format to be printed.
Yet in one embodiment, the data files include email data files and user data files. The email data files are in a variety of formats including, but not limited to, Microsoft Mail, Outlook, GroupWise, Lotus Notes, etc. The user data files have a variety of formats including Word, Excel, PowerPoint, and Access. The email data files may include attachment email or data files, which in turn may contain additional attachment or email files. The process is designed to handle an endless number of levels of embedded files.
Additionally in one embodiment, the attachment data and email files are associated with the email data files such that the image data files for the email data files and the corresponding attachment data and email files can be viewed together.
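As a non-limiting sketch, the association between an email data file and its attachments, including attachments nested inside attached emails, can be modeled as a tree that is walked recursively. The `LoggedFile` class and the `walk` function are hypothetical names introduced here for illustration; the description does not specify a data structure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class LoggedFile:
    """One logged item: an email, an attachment, or an email attached to an email."""
    name: str
    attachments: List["LoggedFile"] = field(default_factory=list)


def walk(item: LoggedFile, parent: Optional[str] = None,
         depth: int = 0) -> Iterator[Tuple[str, Optional[str], int]]:
    """Yield (name, parent_name, depth) for an item and everything nested in it.

    Recursion follows attachments to whatever depth the data contains
    (bounded in practice by Python's recursion limit), so each embedded
    file stays linked to its immediate parent.
    """
    yield item.name, parent, depth
    for child in item.attachments:
        yield from walk(child, item.name, depth + 1)
```

Because every row carries the name of its parent, the image files for an email and for its attachments can later be grouped and viewed together.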
Still in one embodiment, the file logging processor, the image conversion processor, and the second server processor are parallel processors such that the data files are parallel-processed in a data file logging stage, an image conversion stage, and an image file output stage.
Further in one embodiment, the data files having the same file type are converted into the image files together.
Yet in one embodiment, the data management system includes a plurality of image conversion processors, each of the image conversion processors being capable of converting the data files having the same file type into the corresponding image files.
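For illustration, converting data files of the same file type together, with several converters working in parallel, can be sketched as below. Worker threads stand in for the plurality of image conversion processors, and `convert_batch` is a hypothetical placeholder that only names the image file each input would produce; neither is taken from the description.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_type(logged_files):
    """Group (path, file_type) pairs so each batch holds a single file type."""
    batches = defaultdict(list)
    for path, file_type in logged_files:
        batches[file_type].append(path)
    return dict(batches)


def convert_batch(file_type, paths):
    """Hypothetical converter: returns the image name each file would produce.

    A real converter would render the files to TIFF pages; that part is
    outside the scope of this sketch.
    """
    return [f"{path}.tif" for path in paths]


def convert_all(logged_files, workers=4):
    """Convert every batch, submitting one file type per worker task."""
    batches = group_by_type(logged_files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            file_type: pool.submit(convert_batch, file_type, paths)
            for file_type, paths in batches.items()
        }
        return {file_type: future.result() for file_type, future in futures.items()}
```

Batching by type means each worker handles one kind of file at a time, and adding workers raises throughput without changing the logging or export stages.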
Additionally in one embodiment, the file logging processor identifies the file type of the data files based on the SHA value and a file header of each of the data files.
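A minimal sketch of the file-header part of this identification is given below. It compares the leading bytes of a file against a few widely documented signatures; the table contents, the labels, and the function names are assumptions for illustration. The description does not state how the SHA value contributes to type identification, so that part is not sketched.

```python
# Leading-byte signatures for a few common container formats.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # OLE2 compound file (legacy Word, Excel, PowerPoint)
    (b"PK\x03\x04", "zip"),                          # ZIP container
    (b"%PDF", "pdf"),                                # PDF document
]


def identify_by_header(header: bytes) -> str:
    """Return a label for the first signature the header starts with."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path) -> str:
    """Read only the first 8 bytes of a file and identify it by header."""
    with open(path, "rb") as handle:
        return identify_by_header(handle.read(8))
```

Identifying by header rather than by file extension means a renamed file is still routed to the correct converter.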
The present invention also provides a method of logging, processing, and reporting a large volume of data capable of being in different types.
In one embodiment, the method in accordance with the principles of the present invention includes the steps of: restoring a plurality of received data files, the data files being capable of being different file types; organizing/categorizing the received data files, based on a predetermined user list, into a source directory structure and a destination directory structure; logging the received data files into a database formed by the source and destination directory structures and identifying a file type of the received data files; de-duplicating duplicates in the received data files by calculating a SHA value of the received data files to determine whether the received data files have duplicates and flagging duplicated data files in the database; converting the remaining data files into image files, respectively; and exporting the image files.
Still in one embodiment, the method further includes the step of viewing the image files stored in the database.
Further in one embodiment, the converting of the data files includes tiffing the data files into the corresponding image files.
Yet in one embodiment, the identifying of the data files includes identifying email data files and user data files. The email data files are in a variety of formats including, but not limited to, Microsoft Mail, Outlook, GroupWise, Lotus Notes, etc. The user data files have a variety of formats including Word, Excel, PowerPoint, and Access. The email data files may include attachment data and email files.
Additionally in one embodiment, the method includes associating the email data files with the corresponding attachment data and email files such that the image data files for the email data files and the corresponding attachment data and email files can be viewed together.
Still in one embodiment, the method includes parallel processing the steps of logging, converting, and exporting such that the data files are parallel-processed in a data file logging stage, an image conversion stage, and an image file output stage.
Further in one embodiment, the converting of the data files includes converting the data files having the same file type into the image files together.
Yet in one embodiment, the converting of the data files is processed by a plurality of image conversion processors, each of the image conversion processors being capable of converting the data files having the same file type into the corresponding image files.
Additionally in one embodiment, the identifying of the file type of the data files is based on the SHA value and a file header of each of the data files.
One of the advantages of the present invention is that the data files are organized and processed in an efficient, automated manner. The turnaround time for generating a report containing the organized image files is substantially shortened.
Another advantage of the present invention is that the duplicates in the original data files can be eliminated. The total size of the data files to be processed is thereby substantially reduced.
A further advantage of the present invention is that the parallel processing of the data files allows the processing of the data files to be scalable.
An additional advantage of the present invention is that the converted image files are organized in a way that readily allows further processing of the data files.
These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and form a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to accompanying descriptive matter, in which there are illustrated and described specific examples of an apparatus in accordance with the invention.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
The present invention discloses an efficient, automated data management system for logging, processing, and reporting a large volume of data capable of being in different types, stored on different media, and/or run by a different operating system.
In
Also shown in
The details of logging, de-duplicating, and converting the data files and outputting the corresponding image files are discussed in operation flows shown in
It is appreciated that the sequence or order of the operation flows 36, 50, 62, 72, and 86 can be varied within the scope of the present invention. Also, it is appreciated that some steps in the operation flows 36, 50, 62, 72, and 86 can be added, merged, and/or eliminated depending on a customer's needs without departing from the scope of the present invention.
The data management system and methodology for a specific application in accordance with the principles of the present invention described below is just an example. The specific application of the data management system and method includes a pre-processing/data massaging step and five phases of data processing.
The pre-processing/data massaging step includes storing and restoring data from any media, file system, or backup system. It is appreciated that the pre-processing/data massaging step may also include recovering corrupted data if the data on the media, file system, or backup system is corrupted, lost, or damaged.
The original data files can be received via email, mail, the Internet, or any other network or server systems. Also, the original data files can be obtained on-site via backups. Further, the data files can be in any form or on any media, for example, backup tapes, hard drives, floppies, CDs, opticals, etc. The data files can be extracted from any file system including UNIX, NOVELL, NT, DOS, etc.
The received data files are then copied and moved into an appropriate directory structure. The directory structure is based on a master user list, e.g., a folder or directory and subsequent sub-directories, etc. The data files can be converted into a standard format, such as GroupWise, Lotus Notes, or a Microsoft format, if desired. The data files can also be broken up into sub-categories, such as email data files and user data files. Accordingly, all email data files, such as personal folders and email messages, are moved to a special directory for a specific user. Then, sub-directories, such as location or time-slice, are used to better delineate the data files. For example, the directory and sub-directories are created for Joe Smith's email as: Source\Minneapolis\Email\9-12-88\Joe Smith\.
Meanwhile, an example of a destination directory and sub-directories for storing image files for an output report is created for Joe Smith's email as: Destination\Minneapolis\Email\9-12-88\Joe Smith\.
Accordingly, with the source and destination directories and sub-directories, the breaking up of the received data files is used to help process Joe Smith's and others' data files.
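The parallel source and destination layouts in the Joe Smith example can be sketched as follows. The function name `user_directories` and its parameter names are hypothetical; the only thing taken from the description is that the two trees share the same location, category, time-slice, and user components under different roots.

```python
from pathlib import PureWindowsPath


def user_directories(location, category, time_slice, user):
    """Return (source, destination) paths that differ only in their root."""
    tail = PureWindowsPath(location, category, time_slice, user)
    return PureWindowsPath("Source") / tail, PureWindowsPath("Destination") / tail


source, destination = user_directories("Minneapolis", "Email", "9-12-88", "Joe Smith")
print(source)       # Source\Minneapolis\Email\9-12-88\Joe Smith
print(destination)  # Destination\Minneapolis\Email\9-12-88\Joe Smith
```

Keeping the two trees aligned lets an image file be placed in the destination tree at the same relative position as the data file it was converted from.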
The five phases of data processing include Logging/Extracting (Phase 1), Processing/Tiffing (Phase 2), Reporting/Exporting (Phase 3), Delivery/Printing (Phase 4), and Review/Second Print (Phase 5). The use of five phases allows one to control the quality and speed of data processing in each phase.
Phase 1 is to gather and log information about all data files. Based on a master list of users, i.e., the directories and sub-directories as described above, the directories corresponding to a user from the master list are selected, and the master list is updated to indicate that this user is in Phase 1. The master list of users can be stored as part of the database to increase automation. Because the master list records where each user's data currently is in the process, it prevents a user from accidentally being processed twice or skipped. It also allows easy reporting on the progress of the process as a whole. A list of file types to process is also used. The information on the selected source directories is uploaded directory by directory and file by file for processing. The following steps are implemented:
STEP 1:
STEP 2:
STEP 3:
STEP 4 (if email data files are being processed):
STEP 5:
STEP 6:
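The master-list bookkeeping described for Phase 1 can be illustrated with the following sketch. The `MasterList` class is a hypothetical in-memory stand-in for a database table, and the rule that a user may only advance to the next consecutive phase is an assumption chosen to show how double processing and skipped users could be caught.

```python
class MasterList:
    """Tracks which phase each user is in (a stand-in for a database table)."""

    def __init__(self, users):
        # 0 means the user has not entered Phase 1 yet.
        self.phase = {user: 0 for user in users}

    def start_phase(self, user, phase):
        """Move a user into `phase`, refusing repeated or out-of-order phases."""
        current = self.phase[user]
        if phase != current + 1:
            raise ValueError(f"{user} is in phase {current}; cannot start phase {phase}")
        self.phase[user] = phase

    def not_started(self):
        """List users who have not entered Phase 1, so that none is skipped."""
        return [user for user, phase in self.phase.items() if phase == 0]

    def progress(self):
        """Count users per phase for a simple progress report."""
        counts = {}
        for phase in self.phase.values():
            counts[phase] = counts.get(phase, 0) + 1
        return counts
```

Attempting to start Phase 1 a second time for the same user raises an error, and `progress()` gives the per-phase counts from which an overall status report could be built.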
Phase 2 is the step where image files (e.g. Tiff format files) of the logged data files are generated.
STEP 1:
STEP 2:
STEP 3:
STEP 4:
Repeat STEPS 1 to 3 for all file types.
STEP 5:
Phase 3 is to generate ordered output for a customer or a print shop. Based on a master list of users, the directories and sub-directories that correspond to a particular user are selected for processing in Phase 3. The master list is updated to indicate that the particular user is in progress for Phase 3. Based on the files tiffed (i.e., the image files) in Phase 2, a report can be generated that lists all tiffed files. These image files are arranged in a hierarchical relationship. For example, email data files are arranged to be associated with their attachments.
STEP 1:
STEP 2:
STEP 3:
STEP 4:
STEP 5
STEP 6
STEP 7
STEP 8
Once the report is generated, the report can be delivered to a customer. It is appreciated that the delivery of the report can be in a paper print format or in an electronic viewer format. It is appreciated that other methods of delivery can be used without departing from the present invention. For example, the report or print can be delivered via emails, the Internet, etc., or hardware such as CDs, etc.
STEP 1
STEP 2
The review log information is uploaded into the database, and all files that are responsive are flagged.
After a customer reviews the generated report, the customer may want to exclude and/or include some data files. The data files that are relevant are flagged. In this case, the data management system generates a new list of users and produces/prints only those image files that are flagged as relevant. A new set of sequential Bates numbers is assigned. Slip sheets can be re-generated as described above if desired.
A process similar to Phase 3 is performed here, whereby only those documents that are marked as responsive are produced for print or export. A new set of Bates numbers is assigned to the new subset of pages. Non-responsive documents are not considered for this re-print.
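As an illustrative sketch only, assigning a fresh run of sequential Bates numbers to the pages of responsive documents might look like the following. The `ABC` prefix, the seven-digit zero padding, and the `(name, page_count, responsive)` tuple layout are hypothetical choices; the description states only that a new set of sequential numbers is assigned to the new subset of pages.

```python
def assign_bates(documents, prefix="ABC", start=1, width=7):
    """Assign consecutive Bates ranges to responsive documents only.

    `documents` is an ordered list of (name, page_count, responsive) tuples.
    Returns (name, first_number, last_number) for each responsive document;
    non-responsive documents receive no numbers and consume none.
    """
    assigned = []
    next_number = start
    for name, page_count, responsive in documents:
        if not responsive or page_count < 1:
            continue
        first = next_number
        last = next_number + page_count - 1
        assigned.append((name, f"{prefix}{first:0{width}d}", f"{prefix}{last:0{width}d}"))
        next_number = last + 1
    return assigned
```

Because skipped documents do not consume numbers, the re-print carries one unbroken sequence even though it covers only a subset of the original pages.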
The foregoing description of the exemplary embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not with this detailed description, but rather by the claims appended hereto.
This application claims the benefit of U.S. Provisional Application No. 60/229,874 filed Aug. 31, 2000, entitled SYSTEM AND METHOD FOR DATA MANAGEMENT, and which is in its entirety incorporated herewith by reference.
Number | Date | Country |
---|---|---|
20020059317 A1 | May 2002 | US |

Number | Date | Country |
---|---|---|
60229874 | Aug 2000 | US |