1. Technical Field
The invention relates to electronic data discovery. More particularly, the invention relates to the evaluation of the processing, i.e. full text indexing and archive extraction, status of digital content collected for electronic data discovery purposes. Still more particularly, the invention relates to a method and apparatus for providing collection transparency information to an end user to achieve a guaranteed quality document search and production in electronic data.
2. Description of the Prior Art
Electronic discovery, also referred to as e-discovery or EDiscovery, concerns discovery in civil litigation, as well as in tax, government investigation, and criminal proceedings, that deals with information in electronic form. In this context, electronic form is the representation of information as binary numbers. Electronic information is different from paper information because of its intangible form, volume, transience, and persistence. Such information is typically stored in a content repository. Also, electronic information is usually accompanied by metadata, which is rarely present in paper information. Electronic discovery poses new challenges and opportunities for attorneys, their clients, technical advisors, and the courts, as electronic information is collected, reviewed, and produced. Electronic discovery is the subject of amendments to the Federal Rules of Civil Procedure that took effect Dec. 1, 2006. In particular, Rules 16 and 26 are of interest to electronic discovery.
Examples of the types of data included in e-discovery include e-mail, instant messaging chats, Microsoft Office files, accounting databases, CAD/CAM files, Web sites, and any other electronically-stored information which could be relevant evidence in a lawsuit. Also included in e-discovery is raw data which forensic investigators can review for hidden evidence. The original file format is known as the native format. Litigators may review material from e-discovery in one of several formats: printed paper, native files, or TIFF images.
Content Repository Uncertainty with File Indexing Status
A typical content repository, i.e. content storage, has certain problems that impair search results and that may cause problems in EDiscovery.
Uncertainty with File Indexing Status
Usually, indexing status of a content repository is estimated in the following ways:
Some systems try to go beyond these two approaches by warning the user what files are still in the indexing queue.
The optimistic approach is entirely unsafe when it comes to importing or indexing very large files. For example, in Oracle 9i it takes up to several minutes to index a very large document, and it takes several seconds to put large files into the indexing queue. This makes the optimistic approach undesirable for EDiscovery. Under both approaches, a failure to index files causes incorrect search results.
None of the applications on the market implements a comprehensive processing status information solution that combines index-ability, indexing status, and container extraction status, e.g. the status of opening such files as zip files.
An EDiscovery Management Application (EMA) is a content management system responsible for managing collections and holds, which communicates collection and hold requests to data sources, and which collects content from data sources (see related U.S. patent application Ser. No. 11/963,383, filed on Dec. 21, 2007, the entirety of which is incorporated herein by this reference thereto). Some files collected into an EMA content repository during the EDiscovery process must undergo full text indexing to allow their contents to become searchable by the end user. However, the following limitations with this approach to indexing should be noted:
Extracting files from container files, such as ZIP, CAB, WAR, RAR, EAR file archives, PST, NSF email archives, email message MSG files, and others, collected into an EMA content repository during the collection process creates even more uncertainty when it comes to understanding the processing status of files in the content repository. For example, the following limitations should be noted:
In EDiscovery, failing to find and produce files may result in substantial litigation risks and penalties. This is why it is very important to understand the indexing and extraction status of content collection in EDiscovery precisely. For example, the failure of a defendant to locate an email message that was saved by the plaintiff may be treated by the court as negligent misconduct or an attempt to hide evidence, and may result in heavy penalties.
Users Need to Access Processing Status Information in a User-Friendly Form
Both file indexing and container extraction status information should be available to a user performing the file search to allow the user to understand the processing status of the collected content and make decisions on the completeness of file search results. Also, because the overall size of the collection may be huge, processing status information must be tailored to the subset of data the user tries to query when a search is performed. Finally, the user should know which files may contain the information specified in the query criteria even though the collection repository cannot search these files. There should also be a way for the user to browse and manually view the files that failed to index, are not indexable, or have not been indexed yet, and the containers that failed to explode or have not been exploded yet.
In this context, it would be advantageous to provide collection transparency information to an end user to achieve a guaranteed quality document search and production in electronic data.
An embodiment of the invention displays full text index-ability, indexing, and container extraction status of files in a collection repository in connection with content management in EDiscovery.
As opposed to the optimistic and pessimistic approaches, the inventive technique disclosed herein guarantees that the user knows which files failed to index or explode and which files are not indexable.
As opposed to the optimistic approach, the invention provides a technique that tells the user which files have not been indexed yet, so they are not omitted from the analysis.
As opposed to the pessimistic approach, the invention provides a technique that allows users to start working on the collected files without waiting for the maximum possible indexing period. Further, as opposed to the pessimistic approach, the invention provides a technique that allows users to start working immediately on the collected content, thus avoiding slowing down the work during frequent updates to the content repository.
An embodiment of the invention allows for displaying only the indexing and extraction status information that is relevant to the search query, thus minimizing the time needed to manually analyze the files that are not indexed or not exploded.
An embodiment of the invention also allows for automatic and manual update of a list of un-indexable file types based on historical information collected during collection repository operation, thus enhancing the user experience with time and adapting the EMA to new file types.
Finally, an embodiment of the invention allows for keeping both the person who performed the collection and the person who manages the EDiscovery effort informed about the processing status of the collection by sending notifications, displaying alerts, and by providing appropriate views.
Terms
For purposes of the discussion herein, the following terms have the meaning associated therewith:
Electronic Data Discovery (e-discovery or EDiscovery) is discovery of electronically stored evidence in civil litigation, as well as tax, government investigation, and criminal proceedings.
EDiscovery Management Application (EMA). A system responsible for managing the electronic discovery process and storing the collected content in the content repository.
Documents collected for EDiscovery undergo a certain transformation inside the collection repository. Namely:
Sometimes these transformations may fail, for example, when an indexing engine times out and fails to index a file, or when an archive is password protected. Also, these transformations cannot be performed instantly. As a result, after a file import, some of the files that are supposed to be indexed may stay un-indexed, and some containers that are supposed to be exploded may stay un-exploded for a certain period of time. This can create a situation in which a user fails to find or view a file that is supposed to be found for the purposes of litigation. For example, a file containing a certain word combination is not displayed in the full text search results because it has not been indexed properly, or even because it has not been extracted from a container. This may cause significant legal consequences, for example, in the situation when a defendant has an obligation to produce a document.
In an embodiment of the invention, the EMA displays the indexing status of files pertaining to a given matter or legal request in the content repository. This display is provided in a processing status area of the search results page. Files in the content repository can be classified, for example, the following way when it comes to full text index-ability and indexing state:
The EMA can extract indexing state and index-ability information from the content repository. For example, in Oracle, index-ability information is stored in a specially defined “IGNORE” field, and indexing status information can be extracted from Oracle Context Views. Note that each database product usually exposes some data that allows a programmer to derive full text indexing status. If some data are unavailable, there are ways to approximate this information.
In another embodiment, the EMA displays the container extraction status of container files pertaining to a given matter or legal request residing in the EMA content repository. This display is provided in a processing status area of the search results page. More generally, summary information is provided on a search results page. Thus, the EMA can read container extraction status from the content repository and display this information as error or warning messages in the processing status area of the content repository view page.
Extraction status can be stored in the EMA content repository in many ways. For example, the Atlas LCC has a status field that is originally set to “N” (not extracted). Once files are extracted from the container, the container extraction status is changed to “Y.” If the extraction failed, then the value is changed to “X.”
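The three-value status field described above can be sketched as follows. This is a minimal illustration only: the names `ExtractionStatus` and `status_message` are hypothetical and are not taken from the Atlas LCC, and the message wording is an assumption.

```python
from enum import Enum
from typing import Optional


class ExtractionStatus(Enum):
    """Container extraction states, using the "N"/"Y"/"X" codes described above."""
    NOT_EXTRACTED = "N"  # initial value: files not yet extracted
    EXTRACTED = "Y"      # files were extracted from the container
    FAILED = "X"         # extraction was attempted and failed


def status_message(container_name: str, code: str) -> Optional[str]:
    """Return an error or warning for the processing status area, or None.

    The message text is illustrative; only the three codes come from the text.
    """
    status = ExtractionStatus(code)  # raises ValueError for an unknown code
    if status is ExtractionStatus.FAILED:
        return f"Error: container '{container_name}' failed to extract."
    if status is ExtractionStatus.NOT_EXTRACTED:
        return f"Warning: container '{container_name}' has not been extracted yet."
    return None  # successfully extracted: nothing to report
```

Mapping a failed extraction to an error and a pending extraction to a warning mirrors the distinction the processing status area draws between the two.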
In another embodiment, the EMA displays the indexing status and/or container extraction status warnings and errors only for the files that may affect search results. This display is provided in the processing status area of the search results page. In the above described embodiments, the EMA displayed indexing and extraction information against all the files collected for a given legal matter or document discovery request. This information becomes overwhelming when there are many files collected for a given matter and request. Therefore, the EMA can display indexing and extraction status information only for the files that may have affected a current search query. These files include, for example, the files for which the EMA could not evaluate whether the file matches certain parts of the search criteria.
To produce guaranteed search results, the EMA must display the files that meet all the search parameters, except those parameters that the EMA cannot check because of the file's bad processing status. In the example above, the EMA displays indexing warnings and errors to the end user for all files that failed to index, that have not been indexed yet, and optionally that are known to be un-indexable, provided that they belong to case “John Smith vs. XYZ, Inc” and discovery request “Request 1,” and were created between Jan. 1, 2005 and Jan. 1, 2006, on the assumption that these files may contain the information about “John Smith.”
Displaying warnings on un-indexable files is optional because the user may understand that certain files are not subject to a full text search, e.g. JPEG files. Thus, such a warning may not be useful.
The system also displays warnings and errors to the end user for all containers that failed to explode or that have not been exploded yet for the case “John Smith vs. XYZ, Inc” and Discovery Request “Request 1” because these containers may contain files the user tries to search for, i.e. files modified between Jan. 1, 2005 and Jan. 1, 2006 and containing keywords “John Smith,” but there is no way for the EMA to figure that out.
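The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the EMA's implementation: the `StoredFile` record, the state labels ("INDEXED", "PENDING", "FAILED", "UNINDEXABLE"), and the `search` function are all hypothetical, and an in-memory substring match stands in for the repository's full text search.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class StoredFile:
    """Hypothetical in-memory stand-in for a record in the content repository."""
    name: str
    case: str
    request: str
    created: date
    index_state: str            # "INDEXED", "PENDING", "FAILED", or "UNINDEXABLE"
    text: str = ""              # indexed text; only meaningful when INDEXED
    is_container: bool = False
    extraction: str = "Y"       # "N" not extracted, "Y" extracted, "X" failed


def search(files: List[StoredFile], case: str, request: str,
           start: date, end: date, keyword: str,
           warn_unindexable: bool = False
           ) -> Tuple[List[StoredFile], List[StoredFile]]:
    """Return (hits, flagged).

    hits    -- files verified to meet every search parameter
    flagged -- files that meet every parameter that could be checked, but
               whose processing status prevents checking the rest
    """
    hits, flagged = [], []
    for f in files:
        if f.case != case or f.request != request:
            continue
        # An un-exploded container may hold matching files; neither the date
        # nor the keyword of its contents can be checked, so flag it.
        if f.is_container and f.extraction in ("N", "X"):
            flagged.append(f)
            continue
        if not (start <= f.created <= end):
            continue
        if f.index_state == "INDEXED":
            if keyword.lower() in f.text.lower():
                hits.append(f)
        elif f.index_state in ("PENDING", "FAILED"):
            flagged.append(f)   # keyword criterion could not be evaluated
        elif f.index_state == "UNINDEXABLE" and warn_unindexable:
            flagged.append(f)   # optional warning, e.g. for JPEG files
    return hits, flagged
```

The `warn_unindexable` switch reflects that warnings on un-indexable files are optional, while files that failed to index and containers that failed to explode are always flagged.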
To summarize this, the set of files returned to the end user comprises:
For purposes of the discussion herein, the term file metadata search criteria refers to file properties of the contained files, such as name, extension, size, location, modified date, created date, last accessed date, and the criteria that can be derived during file extraction, such as hash value and digital signature.
In another embodiment of the invention, the EMA displays the list of containers which failed to explode to the user who uploaded the containers into the EMA. Users who performed the collection, i.e. uploaded files to the EMA, can view the processing status of the files they have uploaded to the EMA content repository so they can promptly resolve the issues. For example, they can re-upload zip archives that failed to be exploded because of password protection, or they can provide the password as a note in the collection log.
In another embodiment of the invention, the EMA notifies the user who uploaded containers to the EMA and users responsible for coordinating the EDiscovery effort of the fact that some containers failed to explode.
In another embodiment of the invention, the EMA displays a processing status warning next to a file entry so the user can see what processing problems occurred with each file. This can be done both on a search results page and on a file detail information page.
In another embodiment of the invention, the EMA collects and presents information on what file types are not indexable by collecting the statistics of indexing failure per file type. Indexing failure may happen, for example, for any of the following reasons:
When indexing fails, it is hard to determine the reason. However, there is a need to improve indexing capabilities gradually and minimize false positives by making sure that the EMA does not attempt to index un-indexable files. Usually, the EMA maintains a list of file types that are not supposed to be indexed. However, new file types may arrive and the EMA administrator needs to receive information on whether these file types are indexable or not. This can be done by observing indexing failure statistics. Over time, the EMA collects, for example, the following information per file type:
Based on this information, the system calculates the ratio of indexing failure, which can be described by the following formula:
Ratio of failure of a given type=number of failed files of a given type/number of files of a given type uploaded and attempted to index
This information can be reported to the administrator so that file types having a high ratio can be added to a “do not index” list. For example, if the ratio is close to 1, this is definitely not an indexable file type. If the ratio is between 0.2 and 0.8 (these numbers are arbitrary), indexable and un-indexable file types may share the same file extension. If the ratio is low but not 0, a majority of files of this type are getting indexed, but there may be problems with indexing engine timeout or some files may be corrupt.
Another formula that may be used for this purpose is as follows:
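The first ratio and its interpretation can be sketched as follows. The function names and the returned labels are hypothetical; the 0.2 and 0.8 cut-offs are the arbitrary example values from the text, and treating every ratio above 0.8 as "likely un-indexable" is an assumption that approximates "close to 1."

```python
def failure_ratio(failed: int, attempted: int) -> float:
    """Failed files of a type divided by files of that type uploaded and attempted."""
    if attempted <= 0:
        raise ValueError("no indexing attempts recorded for this file type")
    if not 0 <= failed <= attempted:
        raise ValueError("failed count must be between 0 and attempted")
    return failed / attempted


def interpret(ratio: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map a failure ratio to a coarse, illustrative verdict for the administrator."""
    if ratio == 0:
        return "indexable"
    if ratio < low:
        # Most files index; failures may be timeouts or corrupt files.
        return "mostly indexable"
    if ratio <= high:
        # Indexable and un-indexable formats may share this extension.
        return "mixed"
    return "likely un-indexable"
```

For instance, a type with 95 failures out of 100 attempts has a ratio of 0.95 and would be reported as a candidate for the “do not index” list.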
Ratio of failure of a given type=number of failed files of a given type/number of files of a given type successfully indexed
This formula can be derived from the previous formula. The decision points equivalent to those described above are, for example:
The EMA can automatically (or semi-automatically, by presenting the information to the administrator and letting the administrator decide) update the “do not index” list with file types that proved to have a high ratio of failure. A high ratio of failure can be determined through comparison against a threshold value. The EMA may also postpone the decision until it achieves a representative sample, i.e. a large enough number of files of the same type being uploaded and attempted to index. This makes the statistics credible.
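A minimal sketch of the automatic update, assuming per-type counters of failed and attempted files, is shown below. The function name `update_do_not_index` and the default `threshold` and `min_sample` values are hypothetical choices for illustration; the text does not specify them.

```python
from typing import Dict, Set, Tuple


def update_do_not_index(stats: Dict[str, Tuple[int, int]],
                        do_not_index: Set[str],
                        threshold: float = 0.9,
                        min_sample: int = 50) -> Set[str]:
    """Return a new "do not index" set extended with high-failure file types.

    stats maps a file type to (failed, attempted) counts. A type is added
    only when enough files were attempted to make the ratio credible and
    the failure ratio reaches the threshold.
    """
    updated = set(do_not_index)
    for file_type, (failed, attempted) in stats.items():
        if attempted < min_sample:
            continue  # postpone the decision: sample not yet representative
        if failed / attempted >= threshold:
            updated.add(file_type)
    return updated
```

In a semi-automatic variant, the same candidates would be presented to the administrator for confirmation rather than added directly.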
In another embodiment of the invention, the EMA provides a separate view containing the list of files that have questionable status.
Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.
Number | Date | Country
---|---|---
20090187797 A1 | Jul 2009 | US