The present disclosure relates to identifying phrases using break points, and more particularly to identifying phrases, or clauses or fragments, in a document using words in the document to identify break points, the phrases for use, for example, in the generation of a summary of the document.
There is a wealth of information available to users. On the web, for example, a user can search for information on virtually any topic. Typically, a web search returns a set of results containing a number of links to resources, such as documents or files containing content. In addition, the web search results typically include a brief summary of the content referenced by each link. The brief summary is intended to give the user enough information to determine whether or not to click on the link and open the content referenced by the link. Brevity in the summary is important, since there is a limited amount of space on a display screen that displays the web search results, and it is beneficial to show as many of the web search results as possible in the available space. For example, the available space may only allow approximately ten results, or items, with two lines per item.
In one conventional approach, a summary is generated by first determining the structure, e.g., identifying subject, verb, noun, adjective, adverb, object, etc., of a sentence in which a search term occurs. At the very least, this approach is time consuming. The approach must be adapted to suit a language's structure, which can vary based on the language that is being used. In addition, information available on the web is not always structurally, e.g., grammatically, correct, which can lead to a summary that is not useful to the user. For example, material contained in blogs, e.g., texting abbreviations/acronyms, is not always structurally and grammatically correct.
The present disclosure seeks to address failings in the art and to provide systems and methods for identifying phrases using break points. In accordance with one or more embodiments, content is broken up into phrases, or clauses or fragments, using stop words. In accordance with one or more embodiments, the identified phrases can be used to generate a summary of the content.
By way of a non-limiting example, embodiments of the present disclosure avoid the need to break a sentence down into its structural, e.g., grammatical parts, and/or to identify parts of speech used in the sentence. By virtue of this arrangement, advantageously, embodiments of the present disclosure can be efficiently used to identify phrases in any number of different languages, regardless of structure, e.g., language independence.
In accordance with one or more embodiments, a method is provided, which identifies word pairs in a sentence selected from a document, each word pair having consecutive first and second words, generates, for each of the identified word pairs, a word pair score, selects at least two of the identified word pairs based on the word pair score relative to word pair scores of other ones of the identified word pairs, and identifies at least one phrase from the document, each identified phrase being defined by two of the selected word pairs.
In accordance with one or more embodiments, a computer-readable medium is provided, which tangibly embodies program code stored thereon, the program code comprising code to identify word pairs in a sentence selected from a document, each word pair having consecutive first and second words, code to generate, for each of the identified word pairs, a word pair score, code to select at least two of the identified word pairs based on the word pair score relative to word pair scores of other ones of the identified word pairs, and code to identify at least one phrase from the document, each identified phrase being defined by two of the selected word pairs.
In accordance with one or more embodiments, a system is provided that comprises one or more computing devices configured to provide functionality in accordance with such embodiments. In accordance with one or more embodiments, functionality is embodied in steps of a method performed by at least one computing device. In accordance with one or more embodiments, program code to implement functionality in accordance with one or more such embodiments is embodied in, by and/or on a computer-readable medium.
The above-mentioned features and objects of the present disclosure will become more apparent with reference to the following description taken in conjunction with the accompanying drawings wherein like reference numerals denote like elements and in which:
In general, the present disclosure includes a phrase generation/identification system, method and architecture.
Certain embodiments of the present disclosure will now be discussed with reference to the aforementioned figures, wherein like reference numerals refer to like components.
In accordance with one or more embodiments, a system, method and architecture are provided for generating, or identifying, phrases, e.g., phrases extracted from sentences contained in a document. Phrase, fragment and clause are terms used interchangeably herein. In accordance with one or more embodiments the term document, as used herein, refers to any collection of words, text, characters, symbols, sounds, etc. of any language and represented in any form or format and/or stored by any means. By way of a non-limiting example, in a case that the document is a collection of words, a phrase, or fragment or clause, can comprise one or more of the words in the document.
In accordance with one or more embodiments, phrase identification/generation system 102 generates phrases from selected sentences contained in a document input to the system 102.
The ranked sentences are forwarded to a sentence selector 206, which selects a number of the identified sentences based on their scores and rankings. By way of a non-limiting example, sentence selector 206 ranks the identified sentences from highest score to lowest score, and selects a number of the top ranking, e.g., highest scoring, sentences. The selected sentences are forwarded to a word pair identifier 208, which identifies word pairs that occur in the selected sentences. In accordance with one or more embodiments, a word pair has two words occurring consecutively in a sentence. In accordance with one or more such embodiments, each word in a sentence is used in at least one word pair. By way of a non-limiting example, if a word is not the first or last word in a sentence, the word belongs to a word pair that includes the word's immediately-preceding word and a word pair that includes the word's immediately-succeeding word. By way of some further non-limiting examples, the first word in a sentence belongs to a word pair that includes the immediately-succeeding word, and the last word in a sentence belongs to a word pair that includes the immediately-preceding word.
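By way of a non-limiting illustration, the word pair identification described above might be sketched as follows. The function name `word_pairs` and the use of whitespace tokenization are assumptions made for the example only and are not taken from the disclosure.

```python
def word_pairs(sentence):
    """Return every pair of consecutive words in the sentence, in order.

    Whitespace tokenization is an assumption made for this sketch.
    """
    words = sentence.split()
    # zip(words, words[1:]) pairs each word with its immediate successor,
    # so every interior word appears in exactly two pairs, and the first
    # and last words each appear in exactly one pair.
    return list(zip(words, words[1:]))


pairs = word_pairs("the quick brown fox")
# pairs == [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

In this sketch a sentence of n words yields n - 1 word pairs, and a one-word sentence yields none.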
The word pairs identified by the word pair identifier 208 are forwarded to a word pair scorer and ranker 210, which scores the identified word pairs and ranks the word pairs based on the scores. In accordance with one or more embodiments, word pairs with zero scores are excluded from the ranking, and the remaining word pairs, e.g., those with non-zero scores, are ranked from lowest at the top to highest at the bottom of the ranking. The ranked word pairs are forwarded to a word pair selector 212, which selects a number of the word pairs based on the word pair ranking. In accordance with one or more embodiments, the word pair selector 212 selects a number of the top ranking, e.g., lowest scoring, word pairs. The selected word pairs are forwarded to phrase generator 214, which generates a phrase using two of the selected word pairs.
In accordance with one or more embodiments, sentence scorer and ranker 204 can score a sentence using one or more scoring techniques, which can be used alone or in any combination. One such technique, which can be used with documents that are part of a set of search results generated from a search using one or more query terms, involves determining a number of occurrences of each of the one or more query terms found in a sentence. In addition to the query terms, this technique can expand the query terms to include synonyms and stem words, e.g., run and shoe are stem words of running and shoes, respectively. In such a case, the score can include occurrences of synonyms and stem words of query terms. In accordance with one or more such embodiments, the score assigned to the sentence reflects the number of occurrences of the query terms in the sentence.
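By way of a non-limiting illustration, the occurrence-counting technique might be sketched as follows. The names `score_sentence`, `select_top_sentences` and `expansions`, and the regular-expression tokenization, are hypothetical; in particular, the mapping from a query term to its synonyms and stem words is assumed to be supplied by the caller rather than computed.

```python
import re


def score_sentence(sentence, query_terms, expansions=None):
    """Count occurrences of query terms, and any supplied expansions, in a sentence.

    `expansions` is a hypothetical mapping from a query term to its
    synonyms and stem words, e.g. {"running": ["run"]}.
    """
    expansions = expansions or {}
    targets = set()
    for term in query_terms:
        targets.add(term.lower())
        targets.update(word.lower() for word in expansions.get(term, ()))
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(1 for token in tokens if token in targets)


def select_top_sentences(sentences, query_terms, n, expansions=None):
    """Rank sentences from highest to lowest score and keep the top n."""
    scored = [(score_sentence(s, query_terms, expansions), s) for s in sentences]
    # sorted() is stable, so equally scored sentences keep document order.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [sentence for _, sentence in ranked[:n]]


example = "Running shoes help you run; one shoe is not enough."
stems = {"running": ["run"], "shoes": ["shoe"]}
plain_score = score_sentence(example, ["running", "shoes"])            # 2
expanded_score = score_sentence(example, ["running", "shoes"], stems)  # 4
```

With the stem words supplied, the example sentence scores 4 rather than 2, because "run" and "shoe" are counted alongside "running" and "shoes".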
Another technique, which can be used in accordance with one or more embodiments, involves determining a score based on the proximity and/or ordering of query terms in a sentence.
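The disclosure does not specify how proximity is measured. Purely as one possible reading, the sketch below scores a sentence by the smallest span of tokens that contains every query term, and ignores ordering altogether. The function name `proximity_score` and the formula 1 / (1 + span) are assumptions made for illustration.

```python
import re
from itertools import product


def proximity_score(sentence, query_terms):
    """Score a sentence by how closely together all query terms occur.

    Returns 0.0 if any query term is absent. The scoring formula is an
    assumption made for this sketch, not taken from the disclosure.
    """
    if not query_terms:
        return 0.0
    tokens = re.findall(r"\w+", sentence.lower())
    positions = []
    for term in query_terms:
        hits = [i for i, token in enumerate(tokens) if token == term.lower()]
        if not hits:
            return 0.0
        positions.append(hits)
    # Try one position per term and keep the tightest spread. This is
    # exhaustive, which is acceptable for short sentences and queries.
    best_span = min(max(combo) - min(combo) for combo in product(*positions))
    return 1.0 / (1 + best_span)


adjacent = proximity_score("cheap running shoes online", ["running", "shoes"])
spread = proximity_score("running is fun but shoes matter", ["running", "shoes"])
# adjacent == 0.5 and spread == 0.2
```

A score of this kind could be combined with the occurrence count, since the scoring techniques can be used alone or in any combination.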
In accordance with one or more embodiments, a sentence score can be determined in whole or in part based on an occurrence of words in the sentence determined to be “important” words. By way of a non-limiting example, words can be predetermined to be important words, and can include query terms determined to occur frequently in queries and/or include query terms that occur in high frequency queries. In accordance with one or more such embodiments, a set of important words can be predetermined, or pre-trained, e.g., by a review of query logs and/or other historical information, and sentence scorer and ranker 204 can determine whether or not the sentence includes one or more of the important words identified in the predetermined set. Data store 216 of
Another technique, which can be used in accordance with one or more embodiments, to score a sentence, involves determining the presence of types of words in the sentence, such as proper names, dates, place names, names of people, etc. In accordance with one or more such embodiments, a set of word types can be predetermined, or pre-trained, e.g., by review of query logs and/or other historical information, and sentence scorer and ranker 204 can determine whether or not the sentence includes one or more of the types identified in the predetermined set. Data store 216 of
In accordance with one or more embodiments, word pair scorer and ranker 210 scores each word pair identified by word pair identifier 208.
Referring to
In accordance with one or more embodiments, word pairs are ranked according to their scores, and word pair selector 212 selects word pairs according to their ranking. In accordance with one or more embodiments, the word pairs having a zero score are excluded from the ranking, word pairs with non-zero scores are ranked from lowest to highest scores, and word pair selector 212 selects a number of the lowest scoring word pairs. Using this approach, the word pairs having words that, e.g., based on the training data, are considered to not occur together that frequently are selected, so that a break involving the word pair would likely be at a logical point in the sentence, e.g., a logical or natural break in the sentence.
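By way of a non-limiting illustration, the ranking and selection described above might be sketched as follows. The sketch assumes that a word pair's score is available from a lookup table, here the hypothetical `pair_scores`, holding, for example, how often the two words occur together in training data; the function name `select_break_pairs` and the example scores are likewise invented for illustration.

```python
def select_break_pairs(pairs, pair_scores, n):
    """Select the n lowest-scoring word pairs, skipping zero scores.

    `pairs` is the ordered list of word pairs from one sentence, and
    `pair_scores` is a hypothetical mapping from a word pair to its
    score. Returns (position, pair) tuples, lowest score first.
    """
    scored = [(pair_scores.get(pair, 0), index, pair)
              for index, pair in enumerate(pairs)]
    # Zero-scoring pairs are excluded from the ranking altogether.
    nonzero = [item for item in scored if item[0] > 0]
    # Lowest scores rank first; sorted() is stable, so ties keep
    # their order of appearance in the sentence.
    ranked = sorted(nonzero, key=lambda item: item[0])
    return [(index, pair) for _, index, pair in ranked[:n]]


words = "new york is a big city".split()
pairs = list(zip(words, words[1:]))
example_scores = {  # invented values for illustration only
    ("new", "york"): 900,
    ("york", "is"): 12,
    ("is", "a"): 500,
    ("a", "big"): 40,
    ("big", "city"): 0,
}
selected = select_break_pairs(pairs, example_scores, 2)
# selected == [(1, ('york', 'is')), (3, ('a', 'big'))]
```

In this example the frequently co-occurring pair ("new", "york") is not selected, so no break would be made inside it, while the zero-scoring pair ("big", "city") is left out of the ranking entirely.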
Phrase generator 214 generates a phrase using the selected word pairs. In accordance with one or more embodiments, a break can occur between the two words of a word pair. This approach might be used, for example, in a case that the word pair includes a stop word and a non-stop word, so that the non-stop word can be included in the generated phrase. In accordance with one or more alternate embodiments, a break occurs after or before a word pair. This approach might be used, for example, in a case that the word pair includes two stop words, so that the stop words can be excluded from the generated phrase.
In accordance with one or more embodiments, one or more rules can be used for generating a phrase by breaking a sentence at word pairs. By way of a non-limiting example, one such rule is that a break is not made between two non-stop words. By way of another non-limiting example, a break can occur using a word pair that includes a stop word and a non-stop word, or using a word pair that includes two stop words. In the case of a word pair that includes at least one stop word, a break can occur between the words of the word pair, for example.
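By way of a non-limiting illustration, the break rules might be sketched as follows for a phrase bounded by two selected word pairs. The stop word list, the function names `break_offset` and `phrase_between`, and the choice to drop a pair of two stop words from the phrase are assumptions that follow one of the alternatives described above, not the only possible arrangement.

```python
# A small illustrative stop word list; a real list is an assumption here.
STOP_WORDS = frozenset({"a", "an", "the", "of", "on", "in", "is", "and", "to"})


def break_offset(tokens, i, side):
    """Return the token index at which to break for the pair (tokens[i], tokens[i + 1]).

    `side` is "left" when the pair opens the phrase and "right" when it
    closes it. Returns None if both words are non-stop words, since a
    break is not made between two non-stop words.
    """
    first_is_stop = tokens[i].lower() in STOP_WORDS
    second_is_stop = tokens[i + 1].lower() in STOP_WORDS
    if not first_is_stop and not second_is_stop:
        return None
    if first_is_stop and second_is_stop:
        # Break after the pair (left side) or before it (right side),
        # so that both stop words fall outside the phrase.
        return i + 2 if side == "left" else i
    # One stop word and one non-stop word: break between the two words.
    return i + 1


def phrase_between(tokens, left, right):
    """Build the phrase bounded by the pairs starting at `left` and `right`."""
    start = break_offset(tokens, left, "left")
    end = break_offset(tokens, right, "right")
    if start is None or end is None or start >= end:
        return None
    return " ".join(tokens[start:end])


tokens = "the cat sat on the mat in the hall".split()
first_phrase = phrase_between(tokens, 0, 3)   # "cat sat"
second_phrase = phrase_between(tokens, 2, 5)  # "on the mat"
```

In the first call the opening pair ("the", "cat") is broken between its words, and the closing pair ("on", "the") consists of two stop words, so the phrase ends before it. A call such as `phrase_between(tokens, 1, 3)` returns None, because the pair ("cat", "sat") contains no stop word.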
In accordance with one or more embodiments, one or more phrases generated by phrase generator 214 are used by summary generation system 104 to generate a summary of the document that contains the phrase(s) generated by phrase generator 214. In accordance with one or more such embodiments, summary generation system 104 can be a web search engine/system, such that a summary corresponding to each document selected by the web search engine/system is returned to the user in response to a query entered by the user. By way of a non-limiting example, each document in the search results has an entry in the search results, which includes a title, the summary, and a link, e.g., a Uniform Resource Locator (URL), to the document.
In accordance with one or more embodiments of the present disclosure, a phrase generation process flow is shown in
At step 602 of
Once all of the identified sentences have been scored, processing continues at step 612 of
Referring to
The user computer 704 can be any computing device, including without limitation a personal computer, personal digital assistant (PDA), wireless device, cell phone, internet appliance, media player, home theater system, and media center, or the like. For the purposes of this disclosure a computing device includes a processor and memory for storing and executing program code, data and software, and may be provided with an operating system that allows the execution of software applications in order to manipulate data. A computing device such as server 702 and the user computer 704 can include one or more processors, memory, a removable media reader, network interface, display and interface, and one or more input devices, e.g., keyboard, keypad, mouse, etc. and input device interface, for example. One skilled in the art will recognize that server 702 and user computer 704 may be configured in many different ways and implemented using many different combinations of hardware, software, or firmware.
In accordance with one or more embodiments, a computing device 702 can make a user interface available to a user computer 704 via the network 706. The user interface made available to the user computer 704 can include one or more summaries generated by summary generation system 104 using phrases generated by phrase identification/generation system 102. In accordance with one or more embodiments, computing device 702 makes a user interface available to a user computer 704 by communicating a definition of the user interface to the user computer 704 via the network 706. The user interface definition can be specified using any of a number of languages, including without limitation a markup language such as Hypertext Markup Language, scripts, applets and the like. The user interface definition can be processed by an application executing on the user computer 704, such as a browser application, to output the user interface on a display coupled, e.g., a display directly or indirectly connected, to the user computer 704.
In an embodiment the network 706 may be the Internet, an intranet (a private version of the Internet), or any other type of network. An intranet is a computer network allowing data transfer between computing devices on the network. Such a network may comprise personal computers, mainframes, servers, network-enabled hard drives, and any other computing device capable of connecting to other computing devices via an intranet. An intranet uses the same Internet protocol suite as the Internet. Two of the most important elements in the suite are the transmission control protocol (TCP) and the Internet protocol (IP).
It should be apparent that embodiments of the present disclosure can be implemented in a client-server environment such as that shown in
For the purposes of this disclosure a computer readable medium stores computer data, which data can include computer program code executable by a computer, in machine readable form. By way of example, and not limitation, a computer readable medium may comprise computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client or server or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible. Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.
While the system and method have been described in terms of one or more embodiments, it is to be understood that the disclosure need not be limited to the disclosed embodiments. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the claims, the scope of which should be accorded the broadest interpretation so as to encompass all such modifications and similar structures. The present disclosure includes any and all embodiments of the following claims.