There are currently an incredibly large number of documents available on the World Wide Web. Furthermore, web-based applications allow users to easily generate content and publish such content to the World Wide Web. Exemplary web-based applications include social networking applications, which are employed by users to post status messages, commentary, or the like; micro-blogging applications, wherein a user can generate and publish relatively short messages (e.g., up to 140 characters in length); and web log (blog) applications, which facilitate user creation of online-accessible journals; amongst other web-based applications. Additionally, as the public becomes increasingly proficient with computing technologies, more and more people are creating web pages that are accessible on the World Wide Web by way of a web browser.
As the number of web-accessible documents has increased, it has become increasingly challenging to identify, for each document, keywords that are descriptive of the content of such document. Identifying descriptive keywords of a web-based document can facilitate, for example, classifying web-based documents in accordance with certain topics, identifying subject matter trends that can be employed in connection with selecting advertisements to present to users, and improving information retrieval, such that when a user issues a query that includes one or more of the keywords that are known to be relevant to content of a web-based document, the web-based document that includes the keyword will be positioned relatively prominently in a search results list.
An exemplary approach for identifying keywords that are descriptive of content of documents is computing term frequency-inverse document frequency (TF-IDF). Described generally, this metric combines two different metrics (a first metric and a second metric) to ascertain a score for a keyword in a document. The first metric is the frequency of the keyword in the document being analyzed. For example, if the keyword occurs multiple times in the document, then such keyword may be highly descriptive of the content (the topic) of the document. The second metric is the inverse document frequency, which is based upon, for a corpus of documents that includes the document, a number of documents that include the keyword. For example, if every document in the document corpus includes the keyword, then such keyword is likely not descriptive of the content of any particular document (a keyword that occurs in most documents is not descriptive of the content of any one of them).
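As an illustration of the two metrics described above, the following Python sketch (the function name and toy corpus are merely illustrative, not part of the claimed subject matter) computes a TF-IDF score for every term in every document of a small corpus:

```python
import math
from collections import Counter

def tf_idf_scores(corpus):
    """Compute a TF-IDF score for every term in every document.

    corpus: list of documents, each document a list of terms.
    Returns a dict mapping (doc_index, term) -> score.
    """
    num_docs = len(corpus)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = {}
    for i, doc in enumerate(corpus):
        counts = Counter(doc)  # occurrences of each term in this document
        for term, count in counts.items():
            tf = count / len(doc)                     # term frequency
            idf = math.log(num_docs / doc_freq[term]) # inverse document frequency
            scores[(i, term)] = tf * idf
    return scores

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "the", "the"],
]
scores = tf_idf_scores(corpus)
# "the" appears in every document, so its IDF (and thus its score) is zero,
# reflecting that such a term is not descriptive of any one document.
```

Note that a term such as "cat", which occurs in only one document, receives a positive score, while the ubiquitous term "the" scores zero in every document.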
Computing TF-IDF for each term in each document of a relatively large corpus of documents is too large a task to be undertaken on a single computing device. Accordingly, algorithms have been developed that leverage parallel processing capabilities of distributed computing environments. Thus, the task of computing TF-IDF for each keyword/document combination in a relatively large corpus of documents is distributed across several computing nodes, wherein the several computing nodes perform certain operations in parallel. Conventional algorithms for execution in distributed computing environments, however, require multiple map-reduce operations (e.g., four map-reduce operations). As a result, the input/output overhead of computing nodes in the distributed computing environment is relatively high.
The following is a brief summary of subject matter that is described in greater detail herein. This summary is not intended to be limiting as to the scope of the claims.
Described herein are various technologies pertaining to computing a respective metric for each term in each document in a relatively large document corpus, wherein the respective metric is indicative of the descriptiveness of a respective term with respect to content of a document that includes the respective term. Pursuant to an example, the metric can be term frequency-inverse document frequency (TF-IDF). Moreover, the metric can be computed for each term that occurs in each document of the document corpus through employment of a distributed computing programming framework that is employed in a distributed computing environment, wherein the metric is computed for each term in each document utilizing a single input pass over the document corpus.
A document corpus includes a plurality of documents, wherein each document in the plurality of documents comprises a plurality of terms. A first subset of computing nodes receives the plurality of documents and executes the following acts over the plurality of documents substantially in parallel. First, a document is received at a first computing node in the first subset of computing nodes, and the first computing node generates a list of terms that are included in the document and stores such list of terms in a memory buffer of the first computing node. The first computing node generates a hash table, wherein the hash table is organized such that keys of the hash table are respective terms in the document and values corresponding to such keys are respective numbers of occurrences of the terms. Accordingly, the first computing node can sequentially analyze terms in the list of terms, and if a term is not already included in the hash table, can update the hash table to include the term and update a value of the hash table to indicate that the term has occurred once in the document. Moreover, when updating the hash table to include the term, the first computing node can output a data packet that indicates that the document includes the term.
If the term is already existent as a key in the hash table, then the first computing node can update a value corresponding to such term in the hash table by incrementing such value by one. Subsequent to generating the hash table, the first computing node can determine a number of terms in the document by summing values in the hash table (or counting terms in the list of terms). Based upon the number of terms in the document and the values in the hash table for corresponding terms, the first computing node can compute, for each term in the document, a respective term frequency. The term frequency is indicative of a number of occurrences of a respective term relative to the number of terms in the document. The first computing node can then output data packets that are indicative of the respective term frequencies for each term in the document. Other computing nodes in the first subset of computing nodes can perform similar operations with respect to other documents in the document corpus in parallel.
A second subset of computing nodes in the distributed computing environment can receive term frequencies for respective terms in respective documents of the document corpus. Additionally, a computing node in the second subset of computing nodes can receive, for each unique term, a respective value that is indicative of a number of documents in the document corpus that include a respective term. In other words, based upon data packets received from the computing nodes in the first subset of computing nodes (output when forming hash tables), computing nodes in the second subset of computing nodes can compute a respective inverse document frequency value for each unique term in documents in the document corpus. Again, the inverse document frequency is a value that is indicative of a number of documents in the document corpus that comprise the respective term.
Thereafter, utilizing respective term frequency values for terms in documents received from computing nodes in the first subset of computing nodes, computing nodes in the second subset of computing nodes can compute the metric that is indicative of descriptiveness of a respective term with respect to content of a document that includes the term (e.g., TF-IDF). In an exemplary embodiment, this metric can be employed in connection with information retrieval, such that when a query that includes a term is received, documents in the plurality of documents are respectively ranked based at least in part upon metrics for the term with respect to the documents. In another exemplary embodiment, the metric can be employed to automatically classify documents to surface terms that are descriptive of a topic of a document, to identify stop words in a document, or the like.
Other aspects will be appreciated upon reading and understanding the attached figures and description.
Various technologies pertaining to computing, for each term in each document of a document corpus, a respective metric that is indicative of descriptiveness of a respective term with respect to content of a document that includes the term will now be described with reference to the drawings, where like reference numerals represent like elements throughout. In addition, several functional block diagrams of exemplary systems are illustrated and described herein for purposes of explanation; however, it is to be understood that functionality that is described as being carried out by certain system components may be performed by multiple components. Similarly, for instance, a component may be configured to perform functionality that is described as being carried out by multiple components. Additionally, as used herein, the term “exemplary” is intended to mean serving as an illustration or example of something, and is not intended to indicate a preference.
As used herein, the terms “component” and “system” are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.
With reference now to
Distributed computing environments generally execute software programs (computer-executable algorithms) that are written in accordance with a distributed computing framework. An exemplary framework, in which aspects described herein can be practiced, is the map-reduce framework, although aspects described herein are not intended to be limited to such framework. The map-reduce framework supports map operations and reduce operations. Generally, a map operation refers to a master computing node receiving input, dividing such input into smaller sub-problems, and distributing such sub-problems to worker computing nodes. A worker node may undertake the task set forth by the master node and/or can further partition and distribute the received sub-problem to other worker nodes as several smaller sub-problems. In a reduce operation, the master node collects output of the worker nodes (answers to all the sub-problems generated by the worker nodes) and combines such data to form a desired output. The map and reduce operations can be distributed across multiple computing nodes and undertaken in parallel so long as the operations are independent of other operations. As data in the map-reduce framework is distributed between computing nodes, key/value pairs are employed to identify corresponding portions of data.
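The map-reduce pattern described above can be sketched, in simplified single-process form, as a word count (the canonical map-reduce example; the function names and toy documents here are illustrative only, and the shuffle step models the framework's key grouping rather than any particular implementation):

```python
from collections import defaultdict

# Map: each worker turns its portion of the input into (key, value) pairs.
def map_word_count(document):
    for word in document.split():
        yield (word, 1)

# Shuffle/sort: the framework groups values by key across all mapper outputs.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce: combine the grouped values for each key into a final answer.
def reduce_word_count(key, values):
    return (key, sum(values))

documents = ["a b a", "b c"]
pairs = [p for doc in documents for p in map_word_count(doc)]
result = dict(reduce_word_count(k, v) for k, v in shuffle(pairs).items())
# result == {"a": 2, "b": 2, "c": 1}
```

Because each mapper call and each reducer call depends only on its own input, the calls can be distributed across computing nodes and executed in parallel, as described above.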
The system 100 comprises a data store 102, which is a computer-readable data storage device that can be retained on a single computing device or distributed across several computing devices. The data store 102, therefore, may be a portion of memory, a hard drive, a flash drive, or other suitable computer-readable data storage device. The data store 102 comprises a document corpus 104 that includes a plurality of documents 106-108 (e.g., a first document 106 through a Zth document 108). In an exemplary embodiment, the document corpus 104 may be relatively large in size, such that the document corpus 104 may consume multiple terabytes, petabytes, or more of computer-readable data storage. In an exemplary embodiment, the plurality of documents 106-108 may be a respective plurality of web pages in a search engine index. In another exemplary embodiment, the plurality of documents 106-108 may be a respective plurality of micro-blogs generated by users of a web-based micro-blogging application. A micro-blogging application is one in which content generated by users is limited to some threshold number of characters, such as 140 characters. In yet another exemplary embodiment, the plurality of documents 106-108 can be status messages generated by users of a social networking application, wherein such status messages are made available to the public by generators of such messages. In still yet another exemplary embodiment, the plurality of documents 106-108 can be scholarly articles whose topics are desirably automatically identified. Other types of documents are also contemplated, as the aforementioned list is not intended to be exhaustive, but has been set forth for purposes of explanation.
The system 100 additionally comprises a frequency mapper component 110, a sorter component 112, a frequency reducer component 114, and a document labeler component 116. The components 110-116 may be executed cooperatively by a plurality of computing nodes that are in communication with one another, directly or indirectly. Accordingly, one or more of the components 110-116 may be executing on a single computing node or distributed across several computing nodes. Likewise, separate instances of the components 110-116 can be executing in parallel on different computing nodes in a distributed computing environment.
Each document in the plurality of documents 106-108 comprises a respective plurality of terms. As used herein, a term can be a word or an N-gram, wherein a value of N can be selected based upon a length of a phrase that is desirably considered. The frequency mapper component 110 receives the document corpus 104, and for each document in the document corpus 104 and for each term in each document of the document corpus 104, the frequency mapper component 110 can compute a respective term frequency value, wherein a term frequency value for a term in a document is indicative of a number of occurrences of the term in the document relative to a total number of terms in the document. In an example, if the first document 106 includes 100 terms (wherein the 100 terms may include duplicate terms) and the term “ABC” occurs 5 times in the first document 106, then the term frequency value for the term “ABC” in the first document 106 can be 5/100 (0.05).
Again, the frequency mapper component 110 can compute a respective term frequency value for each term in each document of the document corpus 104.
In an exemplary embodiment, the frequency mapper component 110 can compute a respective term frequency value for each term in each document of the document corpus 104 in a single pass over the plurality of documents 106-108 in the document corpus 104. Pursuant to an example, the system 100 can comprise a memory buffer 118 that resides on a computing node that executes an instance of the frequency mapper component 110. Upon receiving a document (document X) from the document corpus 104, the frequency mapper component 110 can cause content 120 of document X (an exhaustive list of terms, including duplicative terms, included in document X) to be stored in the memory buffer 118. As will be described in greater detail below, the frequency mapper component 110 can analyze each term in the content 120 of document X in the memory buffer 118, and can compute term frequency values for respective unique terms in the content 120 of document X. Additionally, the frequency mapper component 110 can output data packets that are indicative of such respective term frequency values, discard the content 120 of document X from the memory buffer 118, and load content of another document from the document corpus 104 into the memory buffer 118 for purposes of analysis. Using this approach, documents in the document corpus 104 need not be analyzed multiple times by the frequency mapper component 110 to compute term frequency values for terms included in documents of the document corpus 104.
For each unique term in each document loaded into the memory buffer 118 by the frequency mapper component 110, the frequency mapper component 110 can output a plurality of data packets. For instance, for each unique term in document X, the frequency mapper component 110 can output a respective first data packet and a respective second data packet. The respective first data packet indicates that a respective term occurs in document X, while the second data packet indicates a term frequency value for the respective term in document X.
The sorter component 112 receives pluralities of data packets output by multiple instances of the frequency mapper component 110 and sorts such data packets, such that the values corresponding to identical terms (without regard to documents that include such terms) are aggregated. As will be shown below, aggregation of values in this manner allows for a number of documents that include each respective unique term that occurs in at least one document in the document corpus 104 to be efficiently computed.
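The sort-with-aggregation operation performed by the sorter component 112 can be sketched as follows (a simplified illustration; the wildcard encoding as an empty string and the toy pairs are merely illustrative):

```python
from collections import defaultdict

def sort_and_aggregate(pairs):
    """Sort mapper output by key and sum the values of identical keys.

    After aggregation, the value for a (term, wildcard) key equals the
    number of documents in the corpus that contain the term."""
    aggregated = defaultdict(int)
    for key, value in pairs:
        aggregated[key] += value
    return sorted(aggregated.items())

# An empty string serves as the wildcard so it sorts ahead of document ids.
pairs = [(("a", ""), 1), (("a", "d1"), 0.5),
         (("a", ""), 1), (("a", "d2"), 1.0)]
sorted_pairs = sort_and_aggregate(pairs)
# The two wildcard pairs for "a" collapse to (("a", ""), 2), indicating that
# two documents contain the term "a".
```

Aggregating the wildcard pairs in this manner is what allows the number of documents that include each unique term to be computed without a further pass over the corpus.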
In more detail, the frequency reducer component 114 receives sorted data packets output by the sorter component 112 and, based upon such sorted data packets, computes the metric that is indicative of descriptiveness of each term in each document of the document corpus 104 relative to content of a respective document that includes a respective term. As discussed above, the sorter component 112 aggregates values corresponding to data packets pertaining to identical terms output by the frequency mapper component 110. The frequency reducer component 114 can sum the aggregated values, which facilitates computing, for each term included in any document of the document corpus 104, a number of documents that include such term. The frequency reducer component 114 can additionally receive or compute a total number of documents included in the document corpus 104. Based upon the total number of documents in the document corpus 104 and a number of documents that include a respective term, the frequency reducer component 114 can compute an inverse document frequency value for each unique term in any document of the document corpus 104. The frequency reducer component 114 can also receive, for each term in each document of the document corpus 104, a respective term frequency value computed by the frequency mapper component 110. Based upon such values, the frequency reducer component 114 can compute, for each term in each document of the document corpus 104, a respective metric that is indicative of descriptiveness of a respective term with respect to the document that includes the respective term (a TF-IDF value for the term in the document).
An exemplary algorithm that can be employed by the frequency reducer component 114 to compute TF-IDF values for respective terms in documents of the document corpus 104 is as follows:

w(t, d) = (|t|/|d|) × log(|D|/|{d: t ∈ d}|),

where w(t, d) is the metric (TF-IDF value), |t| is a number of times that term t occurs in document d, |d| is the number of terms contained in the document d, |D| is the number of documents included in the document corpus D, and |{d: t ∈ d}| is the number of documents in the corpus D that include the term t.
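The algorithm set forth above can be expressed as a short Python function (the function name and the example quantities are illustrative only):

```python
import math

def tfidf(term_count, doc_length, num_docs, docs_with_term):
    """w(t, d) = (|t| / |d|) * log(|D| / |{d : t in d}|)."""
    tf = term_count / doc_length          # |t| / |d|
    idf = math.log(num_docs / docs_with_term)  # log(|D| / |{d : t in d}|)
    return tf * idf

# A term occurring 5 times in a 100-term document, where 10 of the
# 1000 documents in the corpus contain the term:
score = tfidf(term_count=5, doc_length=100, num_docs=1000, docs_with_term=10)
```

Note that when every document contains the term (docs_with_term equal to num_docs), the logarithm is zero and the metric is zero, consistent with the observation above that such a term is not descriptive of any one document.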
The document labeler component 116 receives, for each term in each document of the document corpus 104, a respective value output by the frequency reducer component 114, and selectively assigns a label to a respective document based at least in part upon the value. For instance, if the value (for a particular term in a certain document) is relatively high, the document labeler component 116 can indicate that such term is highly descriptive of content of the document. Accordingly, for instance, if a query is issued that includes the term, the document can be placed relatively highly in a ranked list of search results. In still other embodiments, the document can be assigned a particular categorization or classification based upon the value for a term. Other labels are also contemplated and are intended to fall under the scope of the hereto-appended claims.
With reference now to
The frequency mapper component 110 includes a parser component 202 that receives the key/value pair and causes document content 204 to be retained in the memory buffer 118. Specifically, the parser component 202 can generate an array that comprises a list of terms 206 in the document (wherein such list of terms can include several occurrences of a single term).
A hash table generator component 208 generates a hash table 210 in the memory buffer 118, wherein the hash table is organized such that a key thereof is a respective term in the list of terms 206 and a corresponding value in the hash table is indicative of a number of occurrences of the respective term in the list of terms 206. In an exemplary embodiment, the hash table generator component 208 operates in the following manner. The hash table generator component 208 accesses the list of terms 206 and selects, in sequential order, a term in the list of terms 206. The hash table generator component 208 then accesses the hash table 210 to ascertain whether the hash table 210 already comprises the term as a key thereof. If the hash table 210 does not comprise the term, then the hash table generator component 208 updates the hash table 210 to include the term with a corresponding value of, for example, 1 to indicate that (up to this point in the analysis) the document includes the term once. Additionally, the hash table generator component 208 causes a key/value pair to be output when initially adding a term to the hash table 210. A key of such key/value pair is a compound key, wherein a first element of the compound key is the term and a second element of the compound key is a wildcard. For instance, the wildcard can be a negative value or an empty value. Effectively, this key/value pair indicates that the term is included in the document (even though the document is not identified in the key/value pair).
If the term analyzed by the hash table generator component 208 is already existent in the hash table 210, then the hash table generator component 208 increments the corresponding value for the term in the hash table 210. The resultant hash table 210, then, includes all unique terms in the document and corresponding numbers of occurrences of the terms in the document.
The frequency mapper component 110 also comprises a term frequency computer component 212 that, for each term in the hash table 210, computes a term frequency value for a respective term, wherein the term frequency value for the respective term is indicative of a number of occurrences of the term in the document relative to a total number of terms in the document. The term frequency computer component 212 computes such values based upon content of the hash table 210. For example, the term frequency computer component 212 can sum values in the hash table to ascertain a total number of terms included in the document. In an alternative embodiment, the frequency mapper component 110 can compute the total number of terms in the document by counting terms in the list of terms 206. The term frequency computer component 212 can, for each term in the hash table 210, divide the corresponding value in the hash table 210 (the total number of occurrences of a respective term in the document) by the total number of terms in the document. The term frequency computer component 212 can subsequently cause a respective second key/value pair to be output for each term in the hash table 210, wherein a key of such key/value pair is a compound key, wherein a first element of the compound key is the respective term and a second element of the compound key is a document identifier, and wherein a value of the key/value pair is the respective term frequency for the respective term in the document.
Thus, it is to be understood that the frequency mapper component 110 outputs two key/value pairs for each unique term in the document (for each term in the hash table 210): a first key/value pair, wherein a first key of the first key/value pair comprises the term and the wildcard, and wherein a first value of the first key/value pair is, for example, 1; and a second key/value pair, wherein a second key of the second key/value pair comprises the term and the document identifier, and wherein a second value of the second key/value pair is the term frequency for the respective term in the document.
Exemplary pseudo-code corresponding to the frequency mapper component 110 is set forth below for purposes of explanation.
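The operations of the frequency mapper component 110 described above can be sketched in Python as follows (a hedged illustration rather than the claimed implementation; the function name, the wildcard encoding as -1, and the key layout are merely illustrative):

```python
from collections import Counter

WILDCARD = -1  # second key element for "document contains term" pairs;
               # the patent text notes a negative or empty value may be used

def frequency_mapper(doc_id, terms):
    """Process one document's buffered term list in a single pass.

    For each unique term, emit two key/value pairs:
      ((term, WILDCARD), 1)              -- the document contains the term
      ((term, doc_id), term_frequency)   -- term frequency in this document
    """
    counts = {}          # the in-memory hash table: term -> occurrences
    for term in terms:   # single sequential pass over the term list
        if term not in counts:
            counts[term] = 1
            yield ((term, WILDCARD), 1)  # output upon first occurrence
        else:
            counts[term] += 1
    total = sum(counts.values())         # total number of terms in document
    for term, count in counts.items():
        yield ((term, doc_id), count / total)

pairs = list(frequency_mapper("doc1", ["a", "b", "a"]))
```

For the three-term example document above, the mapper emits wildcard pairs for "a" and "b" and term-frequency pairs of 2/3 and 1/3, respectively.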
Now referring to
Now referring to
Utilizing the algorithm defined above, the frequency reducer component 114 can compute a respective TF-IDF value for each term in each document of the document corpus 104. Therefore, for each term/document combination, the frequency reducer component 114 can output a respective TF-IDF value. This can be in the form of a key/value pair, wherein a key of the key/value pair is a compound key, wherein a first element of the compound key is a respective term and a second element of the compound key is an identifier of a respective document that includes the respective term, and a respective value of the key/value pair is the TF-IDF value for the term/document combination.
Exemplary pseudocode that can be executed by the frequency reducer component 114 is set forth below for purposes of explanation.
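The operations of the frequency reducer component 114 described above can be sketched in Python as follows (a hedged illustration, not the claimed implementation; the function name, the wildcard encoding as an empty string, and the example pairs are merely illustrative):

```python
import math

def frequency_reducer(sorted_pairs, num_docs, wildcard=""):
    """Consume sorted key/value pairs and emit a TF-IDF value per
    term/document combination.

    Pairs are assumed sorted so that, for each term, the aggregated
    wildcard pair(s) arrive before the per-document term-frequency pairs."""
    docs_with_term = {}
    results = {}
    for (term, second), value in sorted_pairs:
        if second == wildcard:
            # Wildcard values sum to the term's document frequency.
            docs_with_term[term] = docs_with_term.get(term, 0) + value
        else:
            tf = value  # term frequency computed by the frequency mapper
            idf = math.log(num_docs / docs_with_term[term])
            results[(term, second)] = tf * idf
    return results

# "a" occurs in 2 of 4 documents with term frequency 0.5 in d1;
# "b" occurs in 1 of 4 documents with term frequency 0.25 in d2.
pairs = [(("a", ""), 2), (("a", "d1"), 0.5),
         (("b", ""), 1), (("b", "d2"), 0.25)]
results = frequency_reducer(pairs, num_docs=4)
```

Because the sorted input delivers each term's document frequency immediately before that term's per-document term frequencies, the reducer produces every TF-IDF value without a second pass over the corpus.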
With reference now to
Moreover, the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methodologies may be stored in a computer-readable medium, displayed on a display device, and/or the like. The computer-readable medium may be any suitable computer-readable storage device, such as memory, hard drive, CD, DVD, flash drive, or the like. As used herein, the term “computer-readable medium” is not intended to encompass a propagated signal.
Now referring to
At 508, at the second subset of computing nodes in the plurality of computing nodes, a respective inverse document frequency value is computed for each unique term existent in any of the documents in the plurality of documents. At 510, a respective TF-IDF value is computed based at least in part upon the respective term frequency value computed at 506 and the respective IDF value computed at 508. TF-IDF values are computed without re-inputting the plurality of documents (e.g., TF-IDF values are computed in a single pass over the plurality of documents). The methodology 500 completes at 512.
Now referring to
At 606, for each document in the document corpus, an array in a memory buffer is generated, wherein the array comprises a list of terms in the respective document (including duplicative terms). At 608, a hash table is formed in the memory buffer to identify numbers of occurrences of respective unique terms in the list of terms. Specifically, the hash table is organized in accordance with a key and a respective value, the key of the hash table being a respective term from the list of terms, a respective value of the hash table being a number of occurrences of the respective term in the list of terms in the memory buffer. Accordingly, the hash table is populated with terms and respective values, wherein terms in the hash table are unique (no term is listed multiple times in the hash table).
At 610, a total number of terms in the respective document is counted by summing the values in the hash table. At 612, for each term in the hash table for the respective document, a respective term frequency value is computed.
At 614, for each term in the hash table, a respective first key/value pair and a respective second key/value pair are output. The respective first key/value pair comprises a first key and a first value. The first key comprises the respective term and a wildcard character. The first value indicates an occurrence of the respective term in the respective document. The respective second key/value pair comprises a second key and a second value, the second key comprising the respective term and an identifier of the respective document, the second value comprising the respective term frequency value for the respective term in the respective document.
At 616, key/value pairs are sorted based at least in part upon respective keys thereof. When such key/value pairs are sorted, values in key/value pairs with equivalent keys are aggregated. Values in key/value pairs with wildcard characters, subsequent to sorting, are indicative of a number of documents that include a respective term identified in the respective key of the key/value pair.
At 618, for each term in each document in the document corpus, a respective TF-IDF value is computed based at least in part upon the number of documents that include the respective term, a number of documents in the document corpus, and the respective term frequency value for the respective term in the respective document. The methodology 600 completes at 620.
Now referring to
The computing device 700 additionally includes a data store 708 that is accessible by the processor 702 by way of the system bus 706. The data store 708 may be or include any suitable computer-readable storage, including a hard disk, memory, etc. The data store 708 may include executable instructions, documents, etc. The computing device 700 also includes an input interface 710 that allows external devices to communicate with the computing device 700. For instance, the input interface 710 may be used to receive instructions from an external computer device, from a user, etc. The computing device 700 also includes an output interface 712 that interfaces the computing device 700 with one or more external devices. For example, the computing device 700 may display text, images, etc. by way of the output interface 712.
Additionally, while illustrated as a single system, it is to be understood that the computing device 700 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 700.
While the computing device 700 has been presented above as an exemplary operating environment in which features described herein may be implemented, it is to be understood that other environments are also contemplated. For example, hardware-only implementations are contemplated, wherein integrated circuits are configured to perform predefined tasks. Additionally, system-on-chip (SoC) and cluster-on-chip (CoC) implementations of the features described herein are also contemplated. Moreover, as discussed above, features described above are particularly well-suited for distributed computing environments, and such environments may include multiple computing devices (such as that shown in
It is noted that several examples have been provided for purposes of explanation. These examples are not to be construed as limiting the hereto-appended claims. Additionally, it may be recognized that the examples provided herein may be permuted while still falling under the scope of the claims.