As the amount of digital content continues to grow in various fields, users are confronted with an increasing number of documents to analyze while performing tasks such as web searches, legal discovery, and scientific literature research, among others. In order to review the large number of documents for relevant information, users may rely on various techniques that can sort the documents. However, a user can still spend a considerable amount of time reviewing the sorted documents for relevant information.
The following presents a simplified summary in order to provide a basic understanding of some aspects described herein. This summary is not an extensive overview of the claimed subject matter. This summary is not intended to identify key or critical elements of the claimed subject matter nor delineate the scope of the claimed subject matter. This summary's sole purpose is to present some concepts of the claimed subject matter in a simplified form as a prelude to the more detailed description that is presented later.
An embodiment provides a method for providing organized content. The method can include identifying a spine document from a collection of documents, wherein the spine document comprises a plurality of sections. The method can also include splitting a related document into a plurality of subdocuments. In addition, the method can include mapping the subdocuments to corresponding sections of the spine document. Furthermore, the method can include displaying subdocuments based on a search of the collection of documents.
Another embodiment is a system for providing organized content comprising a display device to display a subdocument, a processor to execute processor executable code, and a storage device that stores processor executable code. In some embodiments, the processor executable code, when executed by the processor, causes the processor to identify a spine document from a collection of documents, wherein the spine document comprises a plurality of sections. The processor executable code can also cause the processor to split a related document into a plurality of subdocuments and map the subdocuments to corresponding sections of the spine document. Furthermore, the processor executable code can cause the processor to display subdocuments based on a search of the collection of documents.
Another embodiment provides one or more tangible computer-readable storage media comprising a plurality of instructions. The instructions can cause a processor to identify a spine document from a collection of documents, wherein the spine document comprises a plurality of sections. The instructions can also cause a processor to split a related document from the collection of documents into a plurality of subdocuments and map the subdocuments to corresponding sections of the spine document. Furthermore, the instructions can cause the processor to display subdocuments based on a search of the collection of documents and a relationship of the subdocuments and the spine document, wherein the relationship between the subdocuments and the spine document comprises one of a complementary relationship, a redundant relationship, and a matched relationship.
The following detailed description may be better understood by referencing the accompanying drawings, which contain specific examples of numerous features of the disclosed subject matter.
Several techniques for providing organized content have been developed, such as providing documents that are ranked based on a calculated relevance, providing documents that are ranked based on a personal relevance, providing documents identified with a clustered search, and providing documents organized with a faceted search, among others. However, these techniques do not assist a user in searching for content within a collection of documents based on the scope of each document. The scope of a document, as referred to herein, is an indication of the various topics included in the document and the amount of text included in each document for each of the various topics.
Various methods for providing organized content are described herein. Content, as referred to herein, can include documents and webpages, among others. In some embodiments, a spine document is identified from a collection of documents. A spine document, as referred to herein, is a document that can include any suitable number of sub-topics represented in a collection of documents. For example, a collection of documents may include a number of related documents, in which each related document includes a number of sub-topics related to a particular topic. In some embodiments, the spine document may be the document from the collection of documents that includes the largest number of sub-topics, or the longest document from the collection of documents, among others. In some embodiments, the related documents can be displayed based on a relationship with the spine document. For example, a related document may include a number of sub-topics discussed in the spine document. In some examples, a sub-topic in a related document may contain information that is included in the spine document (also referred to herein as redundant information), information that is neither a match nor a duplicate of information in a section of the spine document (also referred to herein as complementary information), or information matching the text of a section of the spine document.
As a preliminary matter, some of the figures describe concepts in the context of one or more structural components, referred to as functionalities, modules, features, elements, etc. The various components shown in the figures can be implemented in any manner, for example, by software, hardware (e.g., discrete logic components, etc.), firmware, and so on, or any combination of these implementations. In one embodiment, the various components may reflect the use of corresponding components in an actual implementation. In other embodiments, any single component illustrated in the figures may be implemented by a number of actual components. The depiction of any two or more separate components in the figures may reflect different functions performed by a single actual component.
Other figures describe the concepts in flowchart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are exemplary and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be broken apart into plural component blocks, and certain blocks can be performed in an order that differs from that which is illustrated herein, including a parallel manner of performing the blocks. The blocks shown in the flowcharts can be implemented by software, hardware, firmware, manual processing, and the like, or any combination of these implementations. As used herein, hardware may include computer systems, discrete logic components, such as application specific integrated circuits (ASICs), and the like, as well as any combinations thereof.
As for terminology, the phrase “configured to” encompasses any way that any kind of structural component can be constructed to perform an identified operation. The structural component can be configured to perform an operation using software, hardware, firmware and the like, or any combinations thereof.
The term “logic” encompasses any functionality for performing a task. For instance, each operation illustrated in the flowcharts corresponds to logic for performing that operation. An operation can be performed using software, hardware, firmware, etc., or any combinations thereof.
As utilized herein, the terms “component,” “system,” “client,” and the like are intended to refer to a computer-related entity, whether hardware, software (e.g., in execution), firmware, or a combination thereof. For example, a component can be a process running on a processor, an object, an executable, a program, a function, a library, a subroutine, a computer, or a combination of software and hardware. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process, and a component can be localized on one computer and/or distributed between two or more computers.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any tangible, computer-readable device, or media.
Computer-readable storage media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, and magnetic strips, among others), optical disks (e.g., compact disk (CD), and digital versatile disk (DVD), among others), smart cards, and flash memory devices (e.g., card, stick, and key drive, among others). In contrast, computer-readable media generally (i.e., not storage media) may additionally include communication media such as transmission media for wireless signals and the like.
The processor 102 may be connected through a system bus 106 (e.g., PCI, ISA, PCI-Express, HyperTransport®, NuBus, etc.) to an input/output (I/O) device interface 108 adapted to connect the computing system 100 to one or more I/O devices 110. The I/O devices 110 may include, for example, a keyboard, a gesture recognition input device, a voice recognition device, and a pointing device, wherein the pointing device may include a touchpad or a touchscreen, among others. The I/O devices 110 may be built-in components of the computing system 100, or may be devices that are externally connected to the computing system 100.
The processor 102 may also be linked through the system bus 106 to a display device interface 112 adapted to connect the computing system 100 to a display device 114. The display device 114 may include a display screen that is a built-in component of the computing system 100. The display device 114 may also include a computer monitor, television, or projector, among others, that is externally connected to the computing system 100. A network interface card (NIC) 116 may also be adapted to connect the computing system 100 through the system bus 106 to a cloud computing environment (also referred to herein as a service over network computing environment) 118. The cloud computing environment 118 can include any suitable number of servers, databases, and other infrastructure that can provide organized content in accordance with the embodiments described herein.
The storage 120 can include a hard drive, an optical drive, a USB flash drive, an array of drives, or any combinations thereof. The storage 120 may include an organizer module 122. The organizer module 122 can identify a spine document, identify subdocuments within a related document, and determine the relationship between each subdocument and the spine document. In some examples, the relationship between each subdocument and the spine document can include redundant subdocuments, duplicate subdocuments, complementary subdocuments, and matching subdocuments, among others. In some embodiments, the spine document can be identified from a collection of related documents. The remaining documents in the collection can be referred to as related documents. Each of the related documents can include any suitable number of subdocuments, which can be identified based on sections or paragraphs, among others. A subdocument, as referred to herein, includes any suitable portion of text, or other content within a document. The organizer module 122 can determine a relevance score for each subdocument in relation to the spine document. The relevance score, as referred to herein, can include the probability that the information of a subdocument matches the sub-topic of a section of a spine document. For example, the organizer module 122 can use any suitable data structure, such as vectors or arrays, among others, to store information related to each subdocument. In some embodiments, vectors can be used to store the number of occurrences of each word in a subdocument. Calculating a relevance score is discussed in greater detail below in relation to
In some embodiments, the organizer module 122 can also display the relationships between the subdocuments and a spine document. In some examples, the organizer module 122 can provide a highlighted related document in which the relationship between each subdocument and the spine document is presented with a different shading or color. In one example, a chart may be provided that indicates the relationship between each subdocument and a spine document. The various techniques for displaying the relationships between subdocuments and a spine document are discussed in greater detail below in relation to
It is to be understood that the block diagram of
At block 202, the organizer module 122 identifies a spine document from a collection of documents, wherein the spine document comprises a plurality of sections. In some embodiments, each section of the spine document may be related to a particular sub-topic. For example, each section of the spine document may include text related to a particular aspect of the general topic of the spine document. In some embodiments, the spine document is identified as an authoritative document on a subject, such as a WIKIPEDIA® page, among others, as the document that contains the most subdocuments, or as the document that contains at least one subdocument from the greatest number of documents. In one embodiment, the spine document is identified by selecting a document that has the highest relevance to a search query, selecting a document with the highest number of words, selecting an authoritative document, such as a WIKIPEDIA® page, or selecting the document with the highest search rank, among others. For example, the topic of the spine document may be identified from a search query such as a legal query or a medical query, among others.
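As an illustration only, one of the selection heuristics described for block 202 could be sketched as follows. The `Document` class, the `pick_spine` function, and the tie-breaking rule are hypothetical names and choices introduced for this sketch. The sketch assumes that the number of sections is an acceptable proxy for the number of sub-topics and that ties are broken by word count.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical container: a title plus one string per section."""
    title: str
    sections: List[str] = field(default_factory=list)

    def word_count(self) -> int:
        # Total number of whitespace-separated words across all sections.
        return sum(len(section.split()) for section in self.sections)


def pick_spine(docs: List[Document]) -> Document:
    """Pick the document with the most sections; break ties by word count."""
    return max(docs, key=lambda d: (len(d.sections), d.word_count()))


docs = [
    Document("Short note", ["intro text"]),
    Document("Survey", ["history of the topic", "methods used", "open problems remain"]),
    Document("Long essay", ["one very long section " * 50]),
]
spine = pick_spine(docs)
print(spine.title)
```

Other criteria named above, such as search rank or authoritativeness, could be added as further elements of the sort key.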
At block 204, the organizer module 122 splits a document into a plurality of subdocuments. In some embodiments, the subdocuments can relate to sub-topics that may be related to the topic of the spine document. For example, the sub-topics may relate to a chronological history of the topic of the spine document, or any other subject matter related to the topic of the spine document. In some embodiments, the subdocuments can be split from the related documents using any suitable granularity. For example, a document may have section headings that identify subdocuments. In some embodiments, any suitable type of formatting can be used to split a related document into subdocuments. For example, paragraph formatting, section formatting, subsection formatting, or sentence formatting, among others can be used to split a document into subdocuments.
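The paragraph-level granularity mentioned for block 204 can be sketched as below. This is a minimal sketch that assumes plain text in which blank lines separate paragraphs. The function name `split_into_subdocuments` is hypothetical, and splitting on headings, subsections, or sentences would need different rules.

```python
import re
from typing import List


def split_into_subdocuments(text: str) -> List[str]:
    """Treat each blank-line-separated paragraph as one subdocument."""
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


related = (
    "History\nThe topic began long ago.\n\n"
    "Methods\nSeveral methods exist.\n\n\n"
    "Outlook\nWork continues."
)
subdocs = split_into_subdocuments(related)
for index, subdoc in enumerate(subdocs, start=1):
    print(index, repr(subdoc))
```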
At block 206, the organizer module 122 maps the subdocuments to corresponding sections of the spine document. In some embodiments, the subdocuments are mapped to sections of the spine document based on a relevance score for each subdocument. In some examples, the relevance score can be based on a set of calculations. For example, the relevance score can be based on the cosine of the angle between a vector representation of the words in the section of the spine document and a vector representation of the words of the subdocument text. In some embodiments, each entry of a vector can correspond to a word in the subdocument or the spine document. The relevance score can also be based on the cosine between a vector representation of the words in the section title of the spine document and a vector representation of the words in the title of the subdocument. In some embodiments, the relevance score can also be based on the cosine between a vector representation of the nouns in a section of the spine document and a vector representation of the nouns in a corresponding subdocument. In some examples, the vector representation can be based on TFIDF algorithms. In one embodiment, the relevance score can also be based on a similarity determined by BM25 algorithms. A term frequency-inverse document frequency (also referred to herein as TFIDF) vector representation can store the number of occurrences of each word from a section or title of text. In some embodiments, techniques are used to account for common words such as “a” and “an”, among others. For example, the number of occurrences of a word in a subdocument may be scaled down according to the number of documents in a collection that contain the word to normalize the TFIDF vector representation of a subdocument. An Okapi BM25 algorithm (also referred to herein as BM25) can rank subdocuments according to the relevance of each subdocument to a particular query, where the query can be arbitrarily long, for example, the words from a particular section of the spine document. For example, the BM25 relevance score can indicate the relevance of a subdocument based on the number of occurrences of the words from such a search query within the subdocument.
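The two similarity measures described for block 206 can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace tokenization, a smoothed inverse document frequency, and the commonly used BM25 parameters `k1=1.5` and `b=0.75`. All function names are hypothetical, and none of these choices is specified in the description above.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def idf_table(corpus_tokens):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(corpus_tokens)
    df = Counter(w for tokens in corpus_tokens for w in set(tokens))
    return {w: math.log((1 + n) / (1 + count)) + 1.0 for w, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse vector: word -> term count weighted by IDF."""
    return {w: count * idf.get(w, 0.0) for w, count in Counter(tokens).items()}


def cosine(u, v):
    """Inner product divided by the product of the vector magnitudes."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bm25(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a (possibly long) query."""
    n = len(corpus_tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / n
    tf = Counter(doc_tokens)
    score = 0.0
    for w in set(query_tokens):
        df = sum(1 for tokens in corpus_tokens if w in tokens)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        denom = tf[w] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[w] * (k1 + 1) / denom
    return score


section = "early history of the printing press"
subdocs = [
    "the printing press has a long early history",
    "modern inks and paper supplies",
]
section_tokens = tokenize(section)
subdoc_tokens = [tokenize(s) for s in subdocs]

idf = idf_table([section_tokens] + subdoc_tokens)
section_vec = tfidf_vector(section_tokens, idf)
cos_scores = [cosine(section_vec, tfidf_vector(t, idf)) for t in subdoc_tokens]
# The words of the spine section serve as the BM25 query.
bm25_scores = [bm25(section_tokens, t, subdoc_tokens) for t in subdoc_tokens]
print(cos_scores, bm25_scores)
```

In this toy data the second subdocument shares no words with the section, so both of its scores are zero.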
In some embodiments, the relevance score can be based on a BM25 similarity score or a cosine of two TFIDF vectors. The cosine similarity of two vectors can be calculated as the inner product of the two vectors divided by the product of their magnitudes. In one embodiment, the cosine of two vectors can indicate the similarity of a subdocument and a section of a spine document. In some examples, the cosine similarity can be normalized. For example, the organizer module 122 may map the lowest cosine similarity value to a zero value and map the highest cosine similarity value to a one value. In some embodiments, both the cosine similarity value and the normalized value can be stored. In some examples, the organizer module 122 can also consider additional information when normalizing the cosine similarity value if the range of the cosine similarity values is small. In some embodiments, any suitable combination of TFIDF-based and BM25-based similarity scores and other appropriate features, such as subdocument length, can be used to determine a relevance score. For example, a similarity between a subdocument and a spine document can be calculated using any suitable technique or combination of techniques such as logistic regression, linear regression, decision trees, neural networks, and support vector machines, among others. The relevance score, as referred to herein, can include the probability that the information of a subdocument matches the sub-topic of a section of a spine document.
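The normalization described above, in which the lowest cosine similarity maps to zero and the highest maps to one, corresponds to min-max scaling. A minimal sketch follows. The function name `min_max_normalize` is hypothetical, and returning zeros when all values are equal is an assumption made for this sketch, since the description does not say how a degenerate range is handled.

```python
def min_max_normalize(scores):
    """Map the lowest score to 0.0 and the highest score to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with no spread, every value normalizes to 0.0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


raw = [0.12, 0.40, 0.26, 0.68]
normalized = min_max_normalize(raw)
# Keep both the raw and the normalized value, as the description allows.
pairs = list(zip(raw, normalized))
print(pairs)
```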
In some embodiments, the relevance scores and other metrics, such as subdocument length and domain reliability of a spine document, among others, are input into a classifier that can output a probability that a subdocument matches a section of a spine document. In some embodiments, the classifier can use logistic regression, linear regression, decision trees, neural networks, and support vector machines, among others, to produce the probability that a subdocument matches a section of the spine document. In some examples, the classifier can be trained on the relevance scores and other metrics by comparing the output of the classifier to predetermined results. For example, the output of the classifier can be compared to results from crowd sourced tasks in which judges decide whether a subdocument matches a section of a spine document, among others.
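A classifier of the kind described above can be sketched with logistic regression, one of the listed options. The sketch trains by plain stochastic gradient descent on invented toy data. The feature layout (TFIDF cosine, scaled BM25 score, scaled subdocument length), the labels standing in for judge decisions, and the hyperparameters are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(x, weights, bias):
    """Probability that a subdocument matches a spine section."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = match_probability(x, weights, bias) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Hypothetical training rows: [tfidf_cosine, bm25_scaled, length_scaled].
features = [
    [0.9, 0.8, 0.5],
    [0.8, 0.9, 0.3],
    [0.2, 0.1, 0.4],
    [0.1, 0.2, 0.6],
]
labels = [1, 1, 0, 0]  # 1 = judged a match, 0 = judged not a match

weights, bias = train_logistic(features, labels)
p_high = match_probability([0.85, 0.85, 0.4], weights, bias)
p_low = match_probability([0.15, 0.15, 0.5], weights, bias)
print(round(p_high, 3), round(p_low, 3))
```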
At block 208, the organizer module 122 displays subdocuments based on a search of the collection of documents. In some embodiments, the organizer module 122 can search a collection of documents for subdocuments with a relevance score above a threshold for a section of the spine document. In some embodiments, a document can be highlighted based on the relationship of text in the document to the spine document. As discussed above, a relationship between a related document and a spine document can indicate redundant information, complementary information, or matching information. In some examples, each relationship can be indicated with a different shade or color of highlighting to depict the relationship between text in a document and the spine document. For example, redundant information in a subdocument that is also discussed in the spine document may appear shaded or highlighted. Displaying relationships between subdocuments and the spine document is discussed below in greater detail in relation to
In some embodiments, a chart can also display the relationship of each section of a document to a spine document. For example, a chart can indicate if the document contains redundant information, complementary information, or matching information, among others. At block 210, the process flow ends.
The process flow diagram of
In some embodiments, the organizer module 122 can determine that a subdocument 308 or 310 is relevant to the topic of the spine document and that the subdocument 308 or 310 matches a section of the spine document. The organizer module 122 can also provide the text from the subdocuments 308 and 310, also referred to herein as matched subdocuments, that corresponds to a particular section of the spine document. A matched subdocument can be identified with various machine learning techniques, such as neural networks, among others. The machine learning techniques can determine if a matched subdocument augments a section of the spine document. In some examples, determining whether a matched subdocument augments a section of the spine document can include determining whether the information in the section of the spine document is a subset of the information in the subdocument, or whether the subdocument adds information that is not present in the section of the spine document.
In some embodiments, a matched subdocument can be identified using the relevance scores computed for each subdocument. In some embodiments, a relevance score over a suitable number or percent can indicate a subdocument is a match to a section of the spine document. In some examples, a user can adjust the value of the relevance score that indicates a subdocument is a match to a section of the spine document.
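The threshold test described above can be sketched as follows. The function name `matched_sections`, the nested-dictionary layout of the scores, the default threshold of 0.5, and the sample identifiers are hypothetical. The `threshold` parameter stands in for the user-adjustable value.

```python
from typing import Dict


def matched_sections(
    scores: Dict[str, Dict[str, float]], threshold: float = 0.5
) -> Dict[str, str]:
    """Map each subdocument to its best-scoring section if above threshold."""
    matches = {}
    for subdoc_id, by_section in scores.items():
        if not by_section:
            continue
        best_section = max(by_section, key=by_section.get)
        if by_section[best_section] > threshold:
            matches[subdoc_id] = best_section
    return matches


# Hypothetical relevance scores: subdocument -> section -> score.
scores = {
    "subdoc-a": {"History": 0.82, "Methods": 0.10},
    "subdoc-b": {"History": 0.35, "Methods": 0.55},
    "subdoc-c": {"History": 0.20, "Methods": 0.30},
}
default_matches = matched_sections(scores)
strict_matches = matched_sections(scores, threshold=0.6)
print(default_matches, strict_matches)
```

Raising the threshold from 0.5 to 0.6 drops the weaker match, which is the effect a user adjustment would have.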
The illustration of
The chart 400 displays six subdocuments of a related document. In some embodiments, the left axis of chart 400 includes values between zero and one, which indicate the probability that a subdocument has a particular relationship with the spine document. In the example illustrated in chart 400, each subdocument has a one-hundred percent probability of having a particular relationship with a section of the spine document. The shading of chart 400 indicates the relationship between each subdocument and a spine document. For example, the slanted lines through subdocument 1 402 and subdocument 2 404 of chart 400 may indicate that subdocument 1 and subdocument 2 match sections of a spine document. In this example, subdocuments 1 and 2 may include information relevant to a section of the spine document because the matching relationship indicates a high relevance score. In some examples, the subdocument 3 406 of chart 400 includes a dotted shading that may indicate that subdocument 3 includes complementary information to a spine document. For example, subdocument 3 may include information that does not match information in a section of the spine document and is not redundant information in relation to a section of the spine document. In some examples, the horizontal-line shading in subdocument 4 408, subdocument 5 410, and subdocument 6 412 of chart 400 may indicate that subdocuments 4, 5, and 6 include redundant information that is already included in a spine document. In some embodiments, a redundant relationship can be calculated based on whether a subdocument contains a superset or subset of concepts from a section of the spine document. In some examples, a redundant relationship can also be determined based on the amount of overlap in concepts between the subdocument and the section of the spine document, the length of the subdocument, or other features of the subdocument.
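The concept-overlap test for a redundant relationship can be sketched as below. This is a rough sketch that assumes concepts have already been extracted and are represented as sets of strings. The description does not specify how concepts are obtained, and the function name `is_redundant` and the `min_overlap` value of 0.8 are hypothetical.

```python
def is_redundant(sub_concepts, section_concepts, min_overlap=0.8):
    """Flag a subdocument as redundant with respect to a spine section."""
    sub, sec = set(sub_concepts), set(section_concepts)
    if not sub or not sec:
        return False  # nothing to compare
    if sub <= sec or sub >= sec:
        return True  # subset or superset of the section's concepts
    # Otherwise fall back to the fraction of shared concepts.
    overlap = len(sub & sec) / len(sub)
    return overlap >= min_overlap


section = {"press", "gutenberg", "type", "ink"}
print(is_redundant({"press", "ink"}, section))
print(is_redundant({"press", "paper", "binding", "glue"}, section))
```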
Some subdocuments may also be near-verbatim duplicates of sections of the spine document. In some embodiments, the organizer module 122 can detect duplicate subdocuments by calculating a TFIDF-based cosine similarity between each sentence of a subdocument and each sentence of a section of the spine document. In some examples, the maximum cosine similarity value of each sentence in the subdocument to any sentence in the section of the spine document can be stored in any suitable data structure such as a vector, among others. The organizer module 122 can calculate the mean of the stored maximum cosine similarity values and determine if the mean value is above a threshold. If the mean value is above the threshold, the subdocument can be considered a duplicate of the section of the spine document. In some embodiments, the threshold value for determining a duplicate can be predetermined, or periodically modified.
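The duplicate test can be sketched as follows: for each sentence of the subdocument, take its best cosine similarity against any sentence of the spine section, then average those maxima. To keep the sketch short, plain word counts are used in place of TFIDF weights, which is a simplification and not the weighting described above. The sentence splitter, the function names, and the threshold of 0.8 are likewise assumptions.

```python
import math
import re
from collections import Counter


def sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_vector(sentence):
    # Simplification: raw word counts stand in for TFIDF weights.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u, v):
    dot = sum(count * v[w] for w, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def duplicate_score(subdoc, section):
    """Mean over subdocument sentences of the best match in the section."""
    section_vecs = [count_vector(s) for s in sentences(section)]
    sub_sentences = sentences(subdoc)
    if not section_vecs or not sub_sentences:
        return 0.0
    best = [
        max(cosine(count_vector(s), sv) for sv in section_vecs)
        for s in sub_sentences
    ]
    return sum(best) / len(best)


def is_duplicate(subdoc, section, threshold=0.8):
    return duplicate_score(subdoc, section) > threshold


section = "The press was invented in 1440. It spread quickly across Europe."
near_copy = "The press was invented in 1440. It spread very quickly across Europe."
unrelated = "Modern printers use toner. Ink costs vary widely."
print(is_duplicate(near_copy, section), is_duplicate(unrelated, section))
```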
The illustration of
The various software components discussed herein may be stored on the tangible, computer-readable storage media 500, as indicated in
It is to be understood that any number of additional software components not shown in