Embodiments of the present invention relate to the field of electronic item processing and review.
In the context of civil litigation, documents and other evidence are exchanged and reviewed by the parties in a process known as discovery. Similarly, in the context of other transactions between parties (e.g., mergers and acquisitions, investments, and significant loans), some parties may conduct due diligence on the records of other parties. The review phase typically consists of a team of examiners such as attorneys reading the text of a set of items such as email messages, source code, and documents (such as contracts and memos that may relate to a legal matter or a business transaction). These examiners may classify these items based on characteristics such as relevance, privilege, and confidentiality.
Electronic and computer-based systems are often used to manage the reviewing process by electronically storing items and allowing examiners to review the items on an item-by-item or a page-by-page basis and to tag the items or pages in accordance with tags as customized by the examiners.
Embodiments of the present invention are directed to systems and methods for processing, reviewing, and tagging electronic items.
According to one embodiment of the present invention, a method for processing a plurality of electronic items includes: for each item of the electronic items, each item being associated with an item identifier, segmenting, on a processing device, each item into a plurality of segments, for each segment of the plurality of segments: hashing the segment to produce a segment hash value; updating a first table with the segment and the segment hash value; and adding an entry to a second table, the entry including the item identifier and the segment hash value; and outputting, from the processing device, the first table and the second table.
The updating the first table may include: using the segment hash value to determine if the segment is already in the first table; and if the segment is not in the first table: computing an entropy of the segment; and creating a new entry for the segment in the first table, the entry including: the segment; the segment hash value; and the entropy of the segment.
The segmenting each item into a plurality of segments may include: canonizing the item, the canonizing including: detecting an alias in the item; and replacing the detected alias with a canonical name.
The alias may be one of a name, an address, a telephone number, an account name, an account number, a date, a credit card number, a social security number, an e-mail address, and a user defined pattern.
The item may include text, the text including a plurality of paragraphs and wherein each of the plurality of segments corresponds to one of the paragraphs.
The item may include an image and hashing the image may include: scaling the image to have a first dimension equal to a normalized image size; padding the image to have a second dimension equal to the normalized image size; and computing the segment hash value of the scaled and padded image.
The method may further include clustering similar segments of the items.
According to one embodiment of the present invention, a method for processing a plurality of items, each of the items including a plurality of segments, includes: receiving, on a processing device, a request to display a first item of the items; retrieving, from a second table in a database stored on a computer, a first list of segment entries associated with the first item; retrieving, from a first table stored in the database, a first plurality of segments corresponding to the first list of segment entries; and outputting the first plurality of segments.
The method may further include: receiving a request to tag a first segment of the first plurality of segments of the first item with a tag; storing the tag in a tag table entry associated with the first segment; and storing the entry in a tag table.
The tag stored in the tag table entry may be an indication that the first segment has been reviewed.
The method may further include: loading, from the database, a second plurality of segments associated with a second item of the items, the second plurality of segments including the first segment; loading, from the tag table, the tag table entry associated with the first segment; and displaying the second plurality of segments and the tag stored in the entry associated with the first segment.
The displaying the second plurality of segments may include displaying the first segment in a color different from a color of at least one of the other segments of the second plurality of segments.
The first plurality of segments may be displayed to a first user and the second plurality of segments may be displayed to a second user.
Each of the segments may have an associated timestamp and the request may further include a request to display a second item of the items, the method further including: retrieving the second item; aggregating the segments of the first item and the segments of the second item; sorting the aggregated segments by timestamp; removing duplicate segments to produce a reduced list of segments; and displaying the reduced list of segments, sorted by timestamp.
The first item is a first email and the second item is a second email.
The method may further include displaying a first segment of a first item adjacent to a second segment of a second item, the first segment differing from and having a same position as the second segment.
The method may further include: searching the plurality of items, the searching including: receiving a search query; searching the first table for entries matching the search query; and returning a plurality of matching entries, wherein the first item includes at least one segment associated with a corresponding one of the matching entries.
The method may further include: receiving a selection of a segment of the matching segments; and returning a plurality of items containing the selected segment.
The method may further include: displaying a list of items being a subset of the plurality of items, the list of items including the first item, and the first item having a first item identifier; displaying the first plurality of segments; receiving a request to display a second item; saving position information, the position information including the list of items, the first item identifier, and a segment hash; displaying the second item; loading the position information; and displaying the first item in accordance with the position information.
According to one embodiment of the present invention, a system for processing a plurality of electronic items includes: a database running on a computer, the database being configured to store a first table and a second table; a processing device configured to: segment each item into a plurality of segments, each item being associated with an item identifier; for each segment of the plurality of segments: hash the segment to produce a segment hash value; update the first table with the segment and the segment hash value; and add an entry to the second table, the entry including the item identifier and the segment hash value.
According to one embodiment of the present invention, a method for displaying segments of a plurality of items includes: segmenting each of the plurality of items into a plurality of segments; computing, on a processing device comprising a processor and memory, a plurality of similarities between segments of the plurality of segments; clustering, on the processing device, the plurality of segments into a plurality of clusters in accordance with the computed similarities, each of the clusters comprising a plurality of similar segments of the plurality of segments; and displaying a cluster of the plurality of clusters.
According to one embodiment of the present invention, a method for translating an item includes: segmenting, on a processing device comprising a processor and memory, the item into a plurality of segments; computing, on the processing device, a plurality of segment hash values, each of the segment hash values corresponding to one of the plurality of segments; identifying, on the processing device, a translated segment in a translation table in accordance with a segment hash value of the plurality of segment hash values, the identified translated segment corresponding to a segment of the plurality of segments; and displaying the identified translated segment.
According to one embodiment of the present invention, a method for displaying a plurality of items, each of the plurality of items being a different version of an item, includes: segmenting, on a processing device comprising a processor and memory, each of the items into a plurality of segments; hashing, on the processing device, each of the plurality of segments; identifying, on the processing device, a first differing segment of a first item of the plurality of items and a second differing segment of a second item of the plurality of items, the first differing segment having a segment hash value different from a segment hash value of the second differing segment and the first differing segment and the second differing segment having a same respective position within the first item and the second item; and displaying the first differing segment adjacent to the second differing segment.
The accompanying drawings, together with the specification, illustrate exemplary embodiments of the present invention, and, together with the description, serve to explain the principles of the present invention.
In the following detailed description, only certain exemplary embodiments of the present invention are shown and described, by way of illustration. As those skilled in the art would recognize, the invention may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Like reference numerals designate like elements throughout the specification.
In an item review process, a collection of items (such as documents, images, email correspondence, audio recordings, audio and video transcriptions, etc.) makes up a review set that may be reviewed by a group of examiners such as attorneys, paralegals, accountants, and other business personnel. Generally, each item contains one or more “segments” of text (also referred to as a “graf” or “grafs” in U.S. Provisional Application No. 61/729,310). According to one embodiment, each segment corresponds to one paragraph of text. In other embodiments, each segment may correspond to a single sentence, a single line of text, a block of computer code, an image, a table, or other logical subunits of a larger item. Other examples include cells in a spreadsheet or table, posts on a blogging or microblogging platform (e.g., Twitter® tweets, Tumblr® posts, Facebook® status updates, etc.), instant messages, text messages (e.g., SMS and MMS), images, portions of images, and metadata fields (such as a “subject” line and each email address in a list of email addresses, document creation and modification times, notes and comments associated with word processing documents, location information embedded in images, etc.). A set of items may contain a large number of segments that are repeated between items.
For example, repeated segments may include language that is common between standard contracts with a large number of different parties, boilerplate language added to emails, the text of prior emails in a reply to an email or a forwarding of the email to another party, and revised drafts of items (e.g., revised drafts of documents).
Embodiments of the present invention are directed to systems and methods for reducing duplication of effort in reviewing segments of items of a review set by tracking and marking the review of the items at a segment level, rather than on a per-item level. As such, embodiments of the present invention are directed to systems and methods for processing review sets of items to generate sets of segments, for allowing examiners to review and mark segments, for displaying collections of items in a reduced format, and for classifying items for characteristics such as whether they should be marked as privileged.
For example, a review set of items may be initially processed by the system to generate a collection of items, each item including one or more segments. When an examiner reviews an item such as an email, an examiner may mark a particular segment as being reviewed and irrelevant. Later, when the same examiner or another examiner reviews another item containing the same segment, the segment may be displayed in a way to indicate that it was previously reviewed and deemed irrelevant, such as by changing the color of the text to a light gray.
In addition, multiple emails that may be portions of a single conversation may be merged into a single view (or “chronograf view”) on the entire conversation with duplicated portions of the emails removed such that an examiner may review the entire thread without wasted effort in reviewing already-reviewed segments. This “single view” may include not just emails, but also other related items or any items chosen for this treatment, such as instant messaging logs, comment threads in online forums and social media, and updates to wiki entries.
As another example, all items containing the segment “ATTORNEY-CLIENT CONFIDENTIAL” may be automatically flagged as being privileged, thereby simplifying the process of classifying items.
A system and a method for initially processing a review set according to one embodiment of the present invention will be described in more detail below.
The format converter 202 converts the items it receives into a format that may be processed by later stages of the processing. For example, the format converter 202 may be used to extract plain text from various file types including, but not limited to, Microsoft® Word® documents, Microsoft® Excel® spreadsheets, Microsoft® Outlook® mailbox files, Microsoft® Exchange® database files, HTML documents, emails stored in Maildir or mbox formats, Adobe® Portable Document Format (PDF) files, Adobe PostScript files, device independent file format (DVI) documents, etc. The format converter may also identify particular types of fields within the document, such as item titles, subject lines, authors, timestamps, “to” and “from” fields, and body text.
Within the context of embodiments of the present invention, a “word” may be defined as one or more adjoining characters separated by white-space. For instance, “dog” and “cat-burglar” may be treated as “words.” In addition, “white-space” may be used to refer to spacing characters that are not visible or significant to the meaning of the text, such as: spaces (including non-breaking spaces), control characters, tabs, line-feeds, carriage returns, and paragraph markers.
The segment chunker/splitter 203 receives the extracted text from the format converter 202 and divides the text of each field into one or more segments. As shown in
The segment chunker/splitter 203 may also normalize the segment by: 1) Converting all characters into a consistent representation, such as by converting the characters to Unicode; 2) Replacing characters such as fancy quotes (“ ”) and fancy apostrophes (’) with their plain equivalents; 3) Simplifying word processor formatting, such as replacing superscripted text (e.g., the “st” in “1st” or the “TM” in “Product™”) with plain equivalents; 4) Removing leading and trailing white-space; 5) Removing redundant white-space that does not have any significance, such as extra spaces and tabs between words and sentences; 6) Removing formatting (e.g., bold, italics, underlining, etc.); 7) Removing leading and trailing decorative characters (e.g., leading “>” marks in email replies, bullets and list numbering, and end of line characters); 8) Converting text to a standard Unicode composition (e.g., Unicode precomposed characters); and 9) Reencoding text in a standard encoding (e.g., converting ASCII text to UTF-8 or UTF-16). In other embodiments of the present invention, the normalization may be performed by other components of the processing device 102 such as the input/item reader 201 or the format converter 202.
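By way of a non-limiting illustration, several of the normalization steps listed above may be sketched in Python as follows. The function name normalize_segment and the particular set of decorative characters are hypothetical choices made for this sketch, and only a subset of the listed steps is covered.

```python
import re
import unicodedata

# Hypothetical mapping of "fancy" punctuation to plain equivalents.
_PLAIN = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def normalize_segment(text: str) -> str:
    """Return a normalized form of a segment (illustrative subset of steps)."""
    # NFKC yields precomposed characters and folds compatibility forms,
    # e.g. the single-character trademark sign becomes the letters "TM".
    text = unicodedata.normalize("NFKC", text)
    for fancy, plain in _PLAIN.items():
        text = text.replace(fancy, plain)
    # Strip leading decorative characters such as ">" reply marks and bullets.
    text = re.sub(r"^[\s>*\u2022]+", "", text)
    # Collapse redundant white-space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Because the output is a plain Python string, it may be encoded to UTF-8 immediately before hashing so that identical text always produces identical bytes.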
In addition, the segment chunker/splitter 203 also supplies the normalized segment to the pattern canonizer (or “detector”) 207. The text of a segment may include a pattern such as a person's name, a date or a credit card number. If so, a temporary segment may be created that includes the original segment, with the patterns replaced by canonical pattern names. The pattern canonizer 207 analyzes the text of the segment to identify “patterns” such as a date, a web address, an email address, a credit card number, a phone number, a social security number, or a personal or corporate name. The pattern canonizer 207 may identify these patterns using, for example, regular expressions or other techniques well known in the field. Each pattern identified in the segment may be replaced by a canonical name. For example “Jan. 1, 2012” may be replaced by “[DATE]” and “(626) 867-5309” would be replaced by “[PHONE]”.
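As one possible sketch of the regular-expression approach mentioned above, the following Python example replaces a few pattern types with canonical names. The expressions are deliberately simplistic assumptions chosen to cover the two examples in the text and are not intended to recognize every date, phone number, or email format.

```python
import re

# Illustrative patterns only; real-world detection needs more robust rules.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[PHONE]", re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")),
    ("[DATE]", re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"\.? \d{1,2}, \d{4}\b")),
]


def canonize(segment: str) -> str:
    """Return a temporary segment with detected patterns replaced."""
    for name, pattern in PATTERNS:
        segment = pattern.sub(name, segment)
    return segment
```

Two segments that differ only in the matched text then canonize to the same string, so they would hash to the same value.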
In addition, the pattern canonizer 207 may be configured to detect aliases and replace those aliases with canonical names. An alias may be defined as a common name for a group of objects with unique names. For instance: USER-00123 might be an alias for the given names, account names or account numbers of a particular custodian; EMPLOYEE might be an alias for any name that may be recognized as an employee of a company; and PATENTNUM might be an alias for any of several patent numbers. Aliases may differ from patterns in that an alias may be generally a defined collection of particular strings, whereas a pattern uses, for example, a regular expression, to specify a class of strings.
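The distinction between aliases and patterns may be illustrated with the following minimal Python sketch, in which an alias is a defined collection of literal strings rather than a regular expression describing a class of strings. The member strings, the bracketed output form, and the function name replace_aliases are hypothetical.

```python
import re

# Hypothetical alias table: canonical name -> literal strings it stands for.
ALIASES = {
    "USER-00123": ["John Q. Smith", "jsmith", "ACCT-4471"],
}


def replace_aliases(segment: str, aliases=ALIASES) -> str:
    """Replace each literal alias member with its canonical name."""
    for canonical, members in aliases.items():
        # Longest members first, so a short member cannot split a longer one.
        for member in sorted(members, key=len, reverse=True):
            # Match whole tokens only (no word character on either side).
            pattern = r"(?<!\w)" + re.escape(member) + r"(?!\w)"
            segment = re.sub(pattern, "[" + canonical + "]", segment)
    return segment
```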
The normalized segments from the segment chunker/splitter 203 and the pattern canonizer 207 are then supplied to the measurer 204. The measurer 204 computes a hash of each segment. The hash may be computed using any of a variety of hash functions that are well known in the art, such as MD5, SHA-1, and SHA-3. The measurer 204 then uses the database connector 206 to check a glossary 410 stored in index database 103 to determine whether a segment corresponding to the hash already exists in the glossary 410. According to one embodiment, the glossary 410 is a table in the index database 103. If no match is found, the measurer 204 calculates the entropy of the segment. The entropy of a segment is a representation of an amount of randomness of the information contained in the segment and is described in more detail in “SYSTEM AND METHOD FOR ENTROPY-BASED NEAR-MATCH ANALYSIS,” application Ser. No. 12/722,482, filed in the USPTO on Mar. 11, 2010 and issued on Jul. 17, 2012 as U.S. Pat. No. 8,224,848. The calculated entropy, together with the hash, may be referred to as a “measure” of the hash. The measure and the segment text may then be stored together as an entry (which may be referred to as a “gloss”) in the glossary 410 via database connector 206. As such, according to one embodiment of the present invention, the glossary 410 may be a collection (or hash table) of entries sorted (or keyed or indexed) by hash. If a match is found in the glossary 410, then this particular segment has been seen before and there is no need to store another entry in the glossary 410.
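The check-then-store behavior of the measurer may be sketched as follows, using an in-memory dictionary as a stand-in for the glossary table. The character-level Shannon entropy used here is an assumption made for illustration; the entropy computation of the referenced patent may differ. The names shannon_entropy and update_glossary are hypothetical.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character (assumed metric)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def update_glossary(glossary: dict, segment: str) -> str:
    """Store the segment keyed by its hash unless it has been seen before."""
    seg_hash = hashlib.sha1(segment.encode("utf-8")).hexdigest()
    if seg_hash not in glossary:
        # The entropy is computed only for segments not already stored.
        glossary[seg_hash] = {"text": segment,
                              "entropy": shannon_entropy(segment)}
    return seg_hash
```

Keying the dictionary by hash mirrors the description of the glossary as a collection of entries indexed by hash, so a repeated segment costs a single lookup.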
If a segment contains no text or data or is made up entirely of white space, then it may be referred to as a null segment.
If a segment has no patterns, a single entry may be created or updated. If a segment has one or more patterns, then two entries are created (one for the original segment and one for the canonized segment), the canonized segment being linked to the original segment. The canonized segments allow two segments that differ only in their pattern contents to be equated. In this way, a reviewer may find a match for a segment where only the pattern text is changed. For example, common paragraphs in contracts in which only the party names are different could be detected using this “pattern segment” method, while the original segments would be considered completely different entities.
In some embodiments of the present invention, the measurer 204 is further configured to index the words in the segment, as described in more detail below with reference to
As seen in
In some embodiments of the present invention, raster images included in the items (e.g., images embedded in word processing documents, images within emails and web pages, etc.) are also normalized and hashed to be added to the glossary 410 and the table of contents 310.
Referring to
Methods and operations of various embodiments of the present invention described herein with respect to flowcharts (such as
Referring to
Embodiments of the present invention are not limited to specific details of the method disclosed herein and the normalization of images may be processed in a variety of other ways such as: omitting the padding of the images prior to returning the normalized images; scaling up images that are smaller than the normalized image size before padding; and cropping images to defined distances between salient points in the images (such as by detecting “maximally stable extremal regions”) and other techniques as would be known to one of skill in the art of image processing.
The normalized image 304 can then be hashed using any of the methods described above with respect to hashing text, such as applying the MD5, SHA-1, or SHA-3 algorithms to the normalized image and a segment can be generated for the normalized image in the glossary 410.
By normalizing images prior to computing the hash, differently scaled versions of images, or versions of the images differing in compression format or file format, will likely map to the same hash value, thereby increasing the likelihood that substantially identical copies will be detected.
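The scale, pad, and hash sequence may be sketched in pure Python as follows, assuming a grayscale image represented as a list of rows of integers from 0 to 255. Nearest-neighbor sampling, the 8-pixel normalized size, and the names normalize_image and hash_image are assumptions for this sketch. In the toy case of an exactly doubled image the hashes agree; lossy re-encoding of real images would generally call for more tolerant techniques.

```python
import hashlib


def normalize_image(pixels, size=8, pad_value=0):
    """Scale the longer side to `size`, then pad the shorter side to `size`."""
    height, width = len(pixels), len(pixels[0])
    scale = size / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    # Nearest-neighbor sampling (an assumed, simple scaling method).
    scaled = [
        [pixels[min(height - 1, int(r / scale))][min(width - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]
    # Pad on the right and bottom so the result is size x size.
    for row in scaled:
        row.extend([pad_value] * (size - new_w))
    scaled.extend([[pad_value] * size for _ in range(size - new_h)])
    return scaled


def hash_image(pixels, size=8):
    """Hash the normalized pixel values."""
    flat = bytes(v for row in normalize_image(pixels, size) for v in row)
    return hashlib.sha1(flat).hexdigest()
```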
In addition, in other embodiments, the images can be hashed without first normalizing the images, where the hash can be computed by: computing a normalized luminosity histogram of the image; computing a Radon transform of the image; computing a Haar wavelet of the image and discarding higher order terms in the computed wavelet, or other techniques that would be known to one of skill in the art of image processing.
In addition, in other embodiments, audio and video files could be transcribed into text using available software and the text segments could also be processed.
In addition, metadata automatically inserted into the body of an item by software can also be processed. This text is generally not created by a user, but is included alongside data supplied by a user. For example, when replying to or forwarding an email message, the header of the previous message is typically included in the body of the new message. This header metadata typically includes the names and email addresses of the sender and the recipients, the subject line, and a timestamp with the sending time of the previous message. Furthermore, metadata associated with older replies is also typically included in the body of the email messages.
In some embodiments, the portions of the items that are identified as being metadata are not processed as segments, thereby reducing the number of segments produced within the item. The metadata can be parsed and applied to particular segments (e.g., segments can be tagged by the metadata).
As such, in some embodiments of the present invention, segments identified between blocks of metadata within an email (or between a block of metadata and the end of the email) are associated with the timestamp of the metadata block above those segments. As such, segments within an email can be accurately associated with a timestamp corresponding to the creation time of the segment, rather than the creation time of the email that the segment appears in. In addition, the segments can be accurately associated with the other metadata such as the sender and recipient fields and the subject lines.
Furthermore, extraneous information located in the metadata blocks can be discarded during a field normalization procedure. For example, replying to or forwarding a message typically causes “RE:” or “FW:” (or variants thereof such as “Re:” and “Fwd:”) to be prepended to the subject line. In some embodiments of the present invention, the subject line is normalized by removing the string of “RE:”, “FW:”, and other additions made by email clients to leave the underlying subject line.
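A minimal sketch of the subject line normalization described above is given below. The set of prefixes handled (“RE:”, “FW:”, “Fwd:” and case variants) is limited to those named in the text, and the function name normalize_subject is hypothetical; other additions made by email clients would require further rules.

```python
import re

# One or more reply/forward prefixes at the start of the subject line.
_PREFIXES = re.compile(r"^\s*(?:(?:re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes to leave the underlying subject."""
    return _PREFIXES.sub("", subject).strip()
```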
Referring to
According to one embodiment, the metadata fields which have assigned field identifiers are stored along with the segment hash in the table of contents. Every segment in an item has a field code. Some field codes represent item metadata and other field codes represent item metadata found within the text of the body.
Whether or not an entry already exists for this particular segment in the glossary 410, as shown in
Referring to
Each item may be initially read from a data storage device and converted into a standardized format 501. The item may then be divided into a plurality of fields 502, such as title, subject, body, metadata, etc. Each field may then be divided into one or more segments 504, where the boundaries between segments may be defined, for example, by carriage returns or other markers. Each segment may then be measured 506 (as described in more detail below with reference to
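The overall flow of dividing an item into fields, dividing fields into segments, and recording each segment in both tables may be sketched as follows. Here a dictionary stands in for the glossary and a list of tuples stands in for the table of contents; the tuple layout (item identifier, field name, position, segment hash) and the name process_item are assumptions for illustration, and normalization and entropy measurement are omitted for brevity.

```python
import hashlib


def process_item(item_id, fields, glossary, toc):
    """Split each field into segments and record them in both tables."""
    for field_name, text in fields.items():
        # Segment boundaries are assumed to be line breaks; empty
        # (null) segments are skipped in this sketch.
        segments = [s.strip() for s in text.split("\n") if s.strip()]
        for position, segment in enumerate(segments):
            seg_hash = hashlib.sha1(segment.encode("utf-8")).hexdigest()
            # The glossary stores each distinct segment only once.
            glossary.setdefault(seg_hash, segment)
            # The table of contents gets one entry per occurrence.
            toc.append((item_id, field_name, position, seg_hash))
```

With this layout, a segment repeated across items adds entries to the table of contents but not to the glossary.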
Referring to
As such, the word index stores a relationship between words found in the processed items with segments containing those words and the positions of the words within the segments, thereby allowing a user to input a query to find segments containing a requested word and to receive a list of segments containing the requested word and the positions of the word within those segments.
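One way to sketch such a word index is as a mapping from each word to a list of (segment hash, position) pairs. White-space splitting follows the definition of a “word” given earlier; lowercasing the words and the name build_word_index are assumptions of this sketch.

```python
from collections import defaultdict


def build_word_index(glossary):
    """Map each word to the segments containing it and its positions there."""
    index = defaultdict(list)
    for seg_hash, text in glossary.items():
        for position, word in enumerate(text.split()):
            index[word.lower()].append((seg_hash, position))
    return index
```

A query for a word is then a single dictionary lookup that returns every segment containing the word along with the position of the word in that segment.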
In addition, in some embodiments of the present invention, segments can be identified as being “similar” if a human would consider the two segments to be essentially alike. Methods for computing the similarity of the segments include: comparing entropy values, counting the number of words in common between two different segments; and computing an edit distance (such as, but not limited to: a Hamming distance; a Levenshtein distance; a Damerau-Levenshtein distance; and a Jaro-Winkler distance) between the two segments.
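Two of the similarity measures named above may be sketched as follows: a Levenshtein edit distance computed by dynamic programming, and a word-overlap score. Normalizing the count of common words by the size of the combined vocabulary (a Jaccard ratio) is an assumption of this sketch; the text refers only to counting the words in common.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def word_overlap(a: str, b: str) -> float:
    """Fraction of distinct words shared by two segments (Jaccard ratio)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0
```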
Similar segments are grouped together into “segment clusters.” These clusters can form a “virtual item” that is disassociated from the item the segments were found in. The segments within a cluster can then be reviewed as a list of virtual items. Each cluster can be named so that a user can understand its content without having to look at the whole cluster.
According to various embodiments of the present invention, the segment clusters can be named by: naming the cluster with the first N words that are in common between the segments in the cluster; naming the cluster with the N most common words found in all of the segments in the cluster; and naming the cluster with the full text of the first segment created in time.
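Two of the naming strategies above may be sketched as follows. The first is read here as taking, in the order of the first segment, the first N words that occur in every segment of the cluster; that reading, the lowercasing, and both function names are assumptions of this sketch.

```python
from collections import Counter


def name_by_first_common_words(segments, n=3):
    """First n words (in first-segment order) present in every segment."""
    shared = set.intersection(*(set(s.lower().split()) for s in segments))
    name = []
    for word in segments[0].lower().split():
        if word in shared and word not in name:
            name.append(word)
    return " ".join(name[:n])


def name_by_most_common_words(segments, n=3):
    """The n most frequent words across all segments of the cluster."""
    counts = Counter(w for s in segments for w in s.lower().split())
    return " ".join(word for word, _ in counts.most_common(n))
```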
According to another embodiment of the present invention, items are clustered together based on having a certain percentage of segments in common.
If iA and iB do have at least one segment in common, then iA and iB are compared in size in operation 6712. If less than some threshold percentage (“X %” in
If at least the threshold percentage of segments in the smaller is also in the larger, then the larger is added to the cluster in operation 6718. In operation 6720, the collection of items is examined to determine if there are more items to be processed. If there are, then another item is selected in operation 6706. Otherwise, the cluster is named in operation 6722 and the named cluster is returned in operation 6724.
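The threshold test described above may be sketched as follows, with each item represented as a set of segment hashes. This sketch makes several assumptions: the cluster is grown around a single seed item, the candidate item is the one added when the test passes, and cluster naming is omitted. The name cluster_items and the default threshold are hypothetical.

```python
def cluster_items(seed_id, items, threshold=0.8):
    """Collect items whose segment overlap with the seed meets the threshold.

    `items` maps an item identifier to the set of its segment hashes.
    """
    seed = items[seed_id]
    cluster = [seed_id]
    for other_id, other in items.items():
        # Skip the seed itself and items sharing no segment with it.
        if other_id == seed_id or not (seed & other):
            continue
        smaller, larger = sorted((seed, other), key=len)
        # Fraction of the smaller item's segments also found in the larger.
        if len(smaller & larger) / len(smaller) >= threshold:
            cluster.append(other_id)
    return cluster
```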
Clusters can be named through a variety of techniques such as naming based on the item that was first added to the item cluster, examining the frequency of words in the subject line or file names of the items in the cluster, or based on the time frame of the items in the cluster or the author of the items in the cluster.
In the embodiment shown in
In addition, although only one client device 107 is shown in
Reviewing of the items in the review set often occurs in teams, with different examiners or reviewers concurrently reading different sets of items. To accelerate the reviewing process, duplication of work should be reduced or minimized. Embodiments of the present invention allow tracking of the review state of individual segments rather than individual items. As such, already-reviewed segments that reappear in other items may be automatically marked as having been reviewed. When an examiner reviews an item that contains segments that have already been reviewed, those previously reviewed segments are marked as such. This allows an examiner to avoid, when appropriate, unnecessarily reviewing previously reviewed segments while keeping all the content in its original order and context. This has value even when there is only one reviewer, as it allows the reviewer to keep track of content that they have personally previously reviewed.
In addition, while reviewing items, an examiner may read a segment that makes an item relevant. To indicate its relevance, the examiner may “tag” (or “flag”) the item with one or more tags. If a segment is found to be relevant, reviewers often want to examine other items containing the same segments. Using the table of contents 310, the system may easily retrieve a list of such items. As segments are marked as relevant, items that have not been reviewed may be considered to be relevant because they contain a segment that was considered to be relevant in a different item.
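The retrieval of items containing a given segment may be sketched by inverting the table of contents, here assumed to be a sequence of (item identifier, segment hash) pairs. The name build_reverse_index is hypothetical.

```python
from collections import defaultdict


def build_reverse_index(toc):
    """Map each segment hash to the set of items that contain it."""
    by_segment = defaultdict(set)
    for item_id, seg_hash in toc:
        by_segment[seg_hash].add(item_id)
    return by_segment
```

In a relational database, the same lookup would typically be served by an index on the segment hash column rather than by an in-memory mapping.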
According to one embodiment of the present invention, the review state and tags associated with each of the segments may be stored in the session database 105. Although the session database 105 is shown as a separate component in
Similarly, a tag may be removed from a segment by receiving a request to remove that tag and deleting that tag from the list or set of tags associated with that segment.
According to one embodiment, whether a segment has been reviewed or not (the “reviewed” state) may be tracked using a tag, such as a “reviewed” tag.
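A tag table keyed by segment hash, with the “reviewed” state tracked as an ordinary tag, may be sketched as follows. The class name TagTable and its method names are hypothetical, and an in-memory mapping stands in for the session database.

```python
from collections import defaultdict


class TagTable:
    """Per-segment sets of tags, keyed by segment hash."""

    def __init__(self):
        self._tags = defaultdict(set)

    def add(self, seg_hash, tag):
        self._tags[seg_hash].add(tag)

    def remove(self, seg_hash, tag):
        # discard() is a no-op when the tag is not present.
        self._tags[seg_hash].discard(tag)

    def tags(self, seg_hash):
        return set(self._tags.get(seg_hash, ()))

    def is_reviewed(self, seg_hash):
        return "reviewed" in self._tags.get(seg_hash, ())
```

Because tags attach to the segment hash rather than to an item, a tag set while viewing one item is visible wherever the same segment appears.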
In some embodiments of the present invention, the tag table may be stored in a separate session for a particular group of users. As such, independent groups of examiners or reviewers may tag items in the review set independently of one another, without encountering the reviewed status or tags set by the other groups. For example, this may be applicable when an in-house counsel performs a first review of the items before sending them for independent verification by another group. As another example, different groups may review the same set of items for responsiveness to different issues.
In addition, in some embodiments of the present invention, an existing session may be copied and used as a starting point for a new, separate session. For example, when an in-house counsel begins review and would like to hand off the review of items to outside counsel, the in-house counsel's session may be copied to provide a starting point for the outside counsel, who may continue tagging items while the in-house counsel continues an independent review.
In many instances, especially during item discovery in the context of litigation, many items are typically protected from discovery by, for example, the attorney-client privilege or the items' being attorney work product. As such, items in review sets are often tagged to indicate whether they are privileged in order to determine whether or not the items should be produced. According to one embodiment of the present invention, tags may be used to mark segments as being associated with attorney-client privileged information or attorney work product. For example, tagging the segment “ATTORNEY WORK PRODUCT” as privileged due to being attorney work product would tag all items containing that segment, thereby automatically applying the tag to all matching items. Similarly, a segment that included the name of an attorney or an attorney's email address in a “from” field could be used to tentatively set an “attorney-client communication” tag on all matching items. As such, embodiments of the present invention may simplify and accelerate the process of tagging items for privilege status.
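The effect of tagging one segment on all items containing it may be sketched as follows. The table of contents is assumed to be a sequence of (item identifier, segment hash) pairs, the segment tags a mapping from segment hash to a set of tags, and the name items_with_tag is hypothetical; the short strings used as hashes in the usage example are placeholders.

```python
def items_with_tag(toc, segment_tags, tag):
    """Return the items containing at least one segment carrying `tag`."""
    tagged = {h for h, tags in segment_tags.items() if tag in tags}
    return sorted({item_id for item_id, h in toc if h in tagged})


# Usage: "h-awp" stands for the hash of the segment "ATTORNEY WORK PRODUCT".
toc = [("memo-1", "h-awp"), ("memo-1", "h-body"), ("email-7", "h-awp"),
       ("email-9", "h-other")]
segment_tags = {"h-awp": {"privileged"}}
privileged_items = items_with_tag(toc, segment_tags, "privileged")
```

No per-item tag is written in this sketch; the item-level status is derived from the segment tags at query time, which is one of several possible designs.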
Embodiments of the present invention are also directed to a system and method for reviewing a group of items containing common segments, such as multiple emails in a conversation, in a “single view mode” or “chronograf.”
The segment hashes are then used in operation 2304 to load the segments associated with the segment hashes. The loaded segments may then be sorted by creation timestamp. For example, in one embodiment, the creation timestamp of all segments in an email would be the time at which the email was sent. Segments (grafs) that have the same timestamp would be sorted by word order, so that the oldest segments appear first and segments sharing a timestamp appear in the order in which they occur in their associated items.
As such, the sorted list of segments may be merged to provide a single view of all unique text within a collection of items. This allows an examiner to, for example, review all portions of an email conversation spanning several different copies in a single pass. In other embodiments, the collection of items may be a collection of instant messaging logs (e.g., logs from different users containing overlapping conversations), logs from social media comment threads (e.g., Facebook® comment threads, Yammer® messages and comments, forum postings, etc.), text messaging logs, etc. Timestamps are generally included in the metadata associated with these logs, thereby allowing sorting of the segments identified in the logs.
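One possible way to build such a single view is sketched below. It assumes each occurrence of a segment carries the timestamp of its containing item and its position within that item, keeps only the earliest occurrence of each segment hash, and then sorts by timestamp and position. The names `SegmentOccurrence` and `single_view`, and the sample thread, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentOccurrence:
    segment_hash: str
    text: str
    timestamp: int  # e.g., send time of the containing email (epoch seconds)
    position: int   # index of the segment within its item


def single_view(occurrences: list[SegmentOccurrence]) -> list[str]:
    """Return unique segment texts, oldest first, ties broken by position."""
    # Keep the earliest occurrence of each segment hash.
    earliest: dict[str, SegmentOccurrence] = {}
    for occ in occurrences:
        best = earliest.get(occ.segment_hash)
        if best is None or (occ.timestamp, occ.position) < (best.timestamp, best.position):
            earliest[occ.segment_hash] = occ
    ordered = sorted(earliest.values(), key=lambda o: (o.timestamp, o.position))
    return [o.text for o in ordered]


# Three emails in one thread; replies quote the earlier messages.
thread = [
    SegmentOccurrence("h1", "Can we meet?", 100, 0),
    SegmentOccurrence("h2", "Yes, Tuesday.", 200, 0),
    SegmentOccurrence("h1", "Can we meet?", 200, 1),
    SegmentOccurrence("h3", "See you then.", 300, 0),
    SegmentOccurrence("h2", "Yes, Tuesday.", 300, 1),
    SegmentOccurrence("h1", "Can we meet?", 300, 2),
]
view = single_view(thread)
```

In this example, the quoted copies collapse into one entry each, so the examiner reads every unique passage once, in conversation order.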
In various embodiments of the present invention, the order in which the operations shown in
Embodiments of the present invention are also capable of stacked or “browser” style navigation. For example, an examiner may initially choose to review items serially by item ID. After reviewing several items, the examiner may come across an email that contains an interesting lead, choose to explore the entire conversation thread associated with the email, and therefore initiate a single view mode on the set of items matching the subject line of the email. In single view mode, the examiner may tag a number of segments having interesting information. After reviewing all of the segments associated with the email conversation, the examiner may jump back up a level to return to the last item he or she had been reviewing serially and continue reviewing items serially. All of the tagging performed by the examiner while in single view mode may be preserved, so that the tags and “reviewed” status changes made while exploring the email conversation persist and affect the display of the items viewed serially.
As such, embodiments of the present invention track the user's viewing history and allow the user to explore various research pathways while allowing the user to easily return to earlier states.
According to one embodiment of the present invention, the system maintains a history log that stores item ID, view mode, and other information about the prior states of the examiner's view of the data. Each time a request is made to change the view, for example, by moving on to the next item, initiating single view mode, or performing a search for items containing segments matching a particular tag, the current state may be added to the history log. When a user chooses to return to an earlier viewing state, the state information may be read from the history log and used to reconstruct the earlier view.
According to one embodiment of the present invention, the history log may be implemented using a stack, as would be well understood in the art of web browsers and user interface design.
For example, according to one embodiment, each record in the stack of records may contain the entire list of items that the user was reviewing and the currently selected item and segment. The record may also contain all the details about the viewing state, in order that the user could be returned to the exact viewing state they were in before they branched their review.
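A stack-based history log of this kind might look like the following sketch. The record fields and the names `ViewState`, `Navigator`, `change_view`, and `back` are assumptions chosen for illustration; an actual record could carry additional viewing-state details.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ViewState:
    view_mode: str             # e.g., "serial" or "single_view"
    item_ids: tuple[str, ...]  # the list of items under review
    current_item: str
    current_segment: int = 0


@dataclass
class Navigator:
    current: ViewState
    history: list[ViewState] = field(default_factory=list)

    def change_view(self, new_state: ViewState) -> None:
        # Push the state being left so it can be restored later.
        self.history.append(self.current)
        self.current = new_state

    def back(self) -> ViewState:
        # Pop the most recently saved state; stay put if there is none.
        if self.history:
            self.current = self.history.pop()
        return self.current


serial = ViewState("serial", ("doc-1", "doc-2", "doc-3"), "doc-2")
nav = Navigator(serial)

# Branch into single view mode on the conversation containing doc-2.
nav.change_view(ViewState("single_view", ("doc-2", "doc-9"), "doc-2"))
restored = nav.back()
```

Only the view state is saved on the stack; tags would be stored separately, so tagging done while branched is unaffected by returning to an earlier state.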
Some embodiments of the present invention may be used to assist in the translation of items. Language translation, like item review, is normally done on an item-by-item basis, and it suffers from similar problems when many different copies of the same text appear in different items. Similar items containing only minor changes to some paragraphs may be translated more efficiently by identifying the changed portions and retranslating only those portions. The segment technique according to embodiments of the present invention can provide a solution to this problem. When a segment is translated, the translation can be stored and shown in lieu of the original wherever that segment appears. In this context, it may be more useful for each segment to correspond to a single sentence. Furthermore, even if a segment is only similar to a previously translated segment, the translation of the similar segment can be shown, which would save translation costs. Thus, the translation of even small segments can have a large effect for many different reviewers who are not native speakers.
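The reuse of stored translations can be sketched as a lookup keyed by segment hash, with a fallback to the most similar translated segment. The similarity measure (`difflib.SequenceMatcher` from the Python standard library), the 0.9 threshold, the SHA-256 hash, and the name `TranslationStore` are all assumptions made for this illustration.

```python
import difflib
import hashlib
from typing import Optional


def segment_hash(text: str) -> str:
    # SHA-256 is an assumption; any stable segment hash would serve.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TranslationStore:
    def __init__(self, similarity_threshold: float = 0.9):
        self.threshold = similarity_threshold
        # segment hash -> (source segment, stored translation)
        self.by_hash: dict[str, tuple[str, str]] = {}

    def add(self, source: str, translation: str) -> None:
        self.by_hash[segment_hash(source)] = (source, translation)

    def lookup(self, source: str) -> Optional[tuple[str, bool]]:
        """Return (translation, is_exact), or None if nothing is close enough."""
        exact = self.by_hash.get(segment_hash(source))
        if exact is not None:
            return exact[1], True
        # Fall back to the most similar previously translated segment.
        best_ratio, best_translation = 0.0, None
        for known_source, translation in self.by_hash.values():
            ratio = difflib.SequenceMatcher(None, source, known_source).ratio()
            if ratio > best_ratio:
                best_ratio, best_translation = ratio, translation
        if best_translation is not None and best_ratio >= self.threshold:
            return best_translation, False
        return None


store = TranslationStore()
store.add("Der Vertrag endet am 31. Dezember 2012.",
          "The contract ends on December 31, 2012.")
```

A result flagged as inexact could be displayed as a tentative translation so that a reviewer knows the underlying segment differs slightly.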
Embodiments of the present invention may be used to allow organizations to store, centralize, search, and receive business intelligence on archived items. According to one embodiment, such a system would build a segment table of contents 310 that included all segments found in any list of items. The table of contents 310 would include the segments themselves, their relationships to each other, the items in which they were found, their segment hashes, and other details.
According to various embodiments of the present invention, archiving may be accomplished by: storing the original item in a segment index; not storing the original item, but instead storing a list of entries with a segment database; or storing only the clustered items in a chronograf (or “single view”) along with a segment database.
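The second archiving option, in which the original item is not stored and is instead represented by a list of entries alongside a segment database, can be sketched as follows. Splitting on blank lines, the SHA-256 hash, and the names `segment_db`, `item_entries`, `archive`, and `restore` are assumptions for illustration.

```python
import hashlib


def segment_hash(text: str) -> str:
    # SHA-256 is an assumption; any stable segment hash would serve.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


segment_db: dict[str, str] = {}          # segment hash -> segment text, stored once
item_entries: dict[str, list[str]] = {}  # item id -> ordered list of segment hashes


def archive(item_id: str, text: str) -> None:
    # Paragraph-level segmentation on blank lines is an assumption.
    hashes = []
    for seg in text.split("\n\n"):
        h = segment_hash(seg)
        segment_db.setdefault(h, seg)  # shared segments are not stored twice
        hashes.append(h)
    item_entries[item_id] = hashes


def restore(item_id: str) -> str:
    # Rebuild the item text from its ordered segment hashes.
    return "\n\n".join(segment_db[h] for h in item_entries[item_id])


original_a = "CONFIDENTIAL\n\nPlease review the attached draft."
original_b = "CONFIDENTIAL\n\nThe draft looks fine to me."
archive("email-a", original_a)
archive("email-b", original_b)
```

Here the shared “CONFIDENTIAL” paragraph is stored once, yet both emails can be rebuilt from their entry lists.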
Some embodiments of the present invention are directed to providing version comparison and showing item evolution through versions, similar to a “Track Changes” view or a “diff” between two files. Embodiments of the present invention allow a user to view and track the changes to portions of documents (e.g., various provisions in contracts) across multiple versions. Embodiments of the present invention also allow a user to review changes in a document and how those changes persist over time.
When a group of documents is broken into segments, each segment may be associated with a date of first occurrence (or the version number in which the segment first appeared) in addition to a position indicating the segment's placement in an item (e.g., 5th segment, 10th segment, etc.). Thus, embodiments of the present invention can reconstruct the evolution and changes in an item by mapping these changes in a modified chronograf view (or “single view”). This chronograf view would allow analysis of versions of items (e.g., documents, contracts, etc.) to better understand how multiple versions have changed over time. This view would preserve the order of the root (earliest in time) item, and add any new segments found in later items in place so that a user can see how a segment was edited across multiple versions.
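A simplified sketch of such a version view follows. It keeps the root version's segments in order and inserts each segment that first appears in a later version just before the next segment of that version that is already in the view. This placement rule and the name `merge_versions` are assumptions; a production system might use a full diff algorithm instead. At least one version is assumed to be present.

```python
def merge_versions(versions: list[list[str]]) -> list[tuple[str, int]]:
    """Return (segment, index of the version it first appeared in) pairs."""
    # Start from the root (earliest) version, preserving its order.
    merged: list[tuple[str, int]] = [(seg, 0) for seg in versions[0]]
    for version_index, segments in enumerate(versions[1:], start=1):
        pending: list[tuple[str, int]] = []  # new segments awaiting placement
        for seg in segments:
            position = next(
                (i for i, (s, _) in enumerate(merged) if s == seg), None
            )
            if position is None:
                pending.append((seg, version_index))
            else:
                # Place new segments just before the next already-known segment.
                merged[position:position] = pending
                pending = []
        merged.extend(pending)  # new segments at the end of the version
    return merged


v0 = ["1. Term: one year.", "2. Fee: $100.", "3. Governing law: New York."]
v1 = ["1. Term: one year.", "2. Fee: $120.", "3. Governing law: New York.",
      "4. Notices by email."]
timeline = merge_versions([v0, v1])
```

In this example, the revised fee provision lands directly after the provision it replaced, and each entry records the version in which it first appeared.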
One example of a chronograf view for reviewing and tagging multiple versions of an item will be described in more detail with reference to
While the present invention has been described in connection with certain exemplary embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, and equivalents thereof.
This application claims the benefit of U.S. Provisional Application No. 61/729,310 filed in the United States Patent and Trademark Office on Nov. 21, 2012, the entire disclosure of which is incorporated herein by reference.
Number | Date | Country
---|---|---
61729310 | Nov 2012 | US