The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
An embodiment of the invention will identify a set of web pages that contain transactional content, thereby allowing only such pages to be returned in response to a user-designated transactional search query. In an embodiment of the invention, information can be identified regarding the nature of the transaction supported by the page, and terms that are associated with the transaction.
Traditional information retrieval (IR) includes a preparatory phase, during which documents are inserted into a collection, and indices are created or updated. Traditional IR also includes an operational phase, during which search queries are efficiently evaluated. In an embodiment of the invention, additional work is performed in the preparatory phase for transactional queries. Specifically, web pages that are likely to be relevant to transactional queries are identified and annotated with the set of transactions and transactional features, such as the web page title, name of the software program to be downloaded, links to downloadable software, or other information on the web page, for example. Such web pages shall also be referred to herein as transactional pages. The set of all transactional pages is a subset of the complete document, or web page, collection. These transactional pages can then be processed in different ways (as will be described further below) to create a transactional collection for search by a user.
The recognition of transactional pages is performed by a transactional annotator, configured to identify all transactions supported by a given web page. In an embodiment, a templatized procedure, that is, a procedure that utilizes templates, is configured to increase the precision of the transactional annotator to identify web pages that act as gateways to forms and applications.
In an embodiment, the transactional annotator serves two purposes: first, to classify each web page as being either transactional or not; and second, to return those specific sections that support the transactions. As used herein, the term transactional feature shall represent those sections of the web page that support transactions. In an embodiment, a highly optimized, purpose-designed, rule-based classifier is used to provide the relevant portions of the web page. In an exemplary embodiment, the transactional annotator will focus on two common classes of transactions: software downloads (SD) and form-entry (FE).
Turning now to the drawings in greater detail, it will be seen that
While an embodiment has been depicted with a server connected to a processing unit, and data stored upon a program storage device at either the processing unit or the server, it will be appreciated that the scope of the invention is not so limited, and that the invention will also apply to alternate arrangements of processing units and servers, such as having many processing units in data communication with one server, many processing devices in data communication with many servers, and many processing devices in connection with many servers, which are also connected to other servers, for example. While an embodiment has been depicted with a processing unit in data communication with a server via a wired network, it will be appreciated that the scope of the invention is not so limited, and that the invention will also apply to other methods of data communication, such as wireless connection networks, for example.
Referring now to
The presence of the positive pattern is a finding by the regular expression of strings that match certain syntax rules, or specific strings, on the web page that are likely to indicate the presence of the transactional feature. Conversely, the presence of the negative pattern is a finding by the regular expression of strings that match certain syntax rules, or specific strings, on the web page that are likely to indicate the absence of the transactional feature. Accordingly, in an embodiment, web pages that have positive pattern matches and lack negative pattern matches are most likely to include transactional features.
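The positive and negative pattern test can be illustrated with a minimal sketch. The patterns below are hypothetical examples for the software download (SD) class and are not the regular expressions of the described embodiment; the function name `is_likely_transactional` is likewise an assumption made for illustration.

```python
import re

# Hypothetical positive patterns: strings that suggest a software download.
POSITIVE_PATTERNS = [
    re.compile(r"\bdownload\b", re.IGNORECASE),
    re.compile(r'href="[^"]+\.(?:exe|zip|msi)"', re.IGNORECASE),
]

# Hypothetical negative patterns: strings that suggest the page only talks
# about downloads rather than supporting one.
NEGATIVE_PATTERNS = [
    re.compile(r"\bdownload\s+(?:statistics|policy)\b", re.IGNORECASE),
]


def is_likely_transactional(page_text: str) -> bool:
    """Return True when a positive pattern matches and no negative one does."""
    has_positive = any(p.search(page_text) for p in POSITIVE_PATTERNS)
    has_negative = any(n.search(page_text) for n in NEGATIVE_PATTERNS)
    return has_positive and not has_negative
```

A page containing a link such as `<a href="client.exe">Download Remedy Client</a>` would pass this check, while a page that only describes a download policy would not.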
Referring now to
Referring now to
Referring back now to
In an embodiment, identifying transactional features (also known as feature engineering) and defining regular-expressions and gazetteers is accomplished using a manual iterative process, such as using intranet data, for example. There is an interaction between the choice of features and regular expressions/gazetteers. In an embodiment, the final set of features includes hyperlinks, anchor-texts and html tags along with more specific features such as a window of text around candidate objects and actions.
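As a rough illustration of one such feature, the following sketch extracts a window of text around candidate actions found through a gazetteer lookup. The gazetteer contents, the tokenization rule, and the window width are all assumptions chosen for the example, not values taken from the embodiment.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer of action terms.
ACTION_GAZETTEER = {"download", "install", "submit"}


def action_windows(text: str, width: int = 2) -> List[Tuple[str, List[str]]]:
    """Return each gazetteer action with up to `width` tokens on either side."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    windows = []
    for i, token in enumerate(tokens):
        if token.lower() in ACTION_GAZETTEER:
            start = max(0, i - width)
            windows.append((token.lower(), tokens[start:i + width + 1]))
    return windows
```

The window gives later rules some context for deciding whether a candidate action is attached to a plausible object, such as a software program name.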
Referring now to
While an embodiment of the invention has been described with simplified versions of example patterns of regular expressions and gazetteers used by the algorithm template 100 to identify transactional features for SD and FE, it will be appreciated that the scope of the invention is not so limited, and that the invention will also apply to regular expressions and gazetteers that are configured to identify transactional features associated with other classes of transactions, such as making a purchase, filing a property damage claim, and making travel reservations, for example.
The result of the algorithm template 100 for the transactional annotator described above is a set of transactional pages, each with an associated set of transactional features. Subsequent processing ultimately provides a transactional collection that is indexed by the search engine.
In an embodiment, at the collection level, document filtering can require that each transactional page include at least one transactional object. Accordingly, only pages meeting this requirement would be available to a query indicated by the user as a transactional query.
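Document filtering of this kind might be sketched as follows. The `AnnotatedPage` record and its field names are hypothetical; they stand in for whatever structure the transactional annotator produces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedPage:
    """Hypothetical annotator output for a single web page."""
    url: str
    transactional_objects: List[str] = field(default_factory=list)


def filter_transactional(pages: List[AnnotatedPage]) -> List[AnnotatedPage]:
    """Keep only pages that carry at least one transactional object."""
    return [page for page in pages if page.transactional_objects]
```

Only the pages returned by `filter_transactional` would be placed in the collection searched by a transactional query.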
In another embodiment, term filtering, within the web page, is utilized to retain only those portions of the web page that have been identified as containing transactional features. Each transactional page is likely to contain many terms, only a small number of which are actually associated with the transaction. In an embodiment of term filtering, only those terms that appear in the transactional features will be indexed, to be made readily available for a search engine in response to a subsequent, user-designated transactional query.
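One way to picture term filtering is as an inverted index built only from the text of the transactional features. The sketch below assumes the features are available as plain strings keyed by a page identifier; the function name and the simple word tokenizer are illustrative choices, not part of the embodiment.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_term_filtered_index(
    features_by_page: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Index only the terms that occur inside transactional features."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, features in features_by_page.items():
        for feature in features:
            for term in re.findall(r"\w+", feature.lower()):
                index[term].add(page_id)
    return dict(index)
```

Terms that appear elsewhere on the page, outside any transactional feature, never reach the index and therefore cannot match a transactional query.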
In an alternate embodiment, synonym expansion, with respect to each transactional term, is performed. Transactional queries typically have a general form of <action><object>, such as “download program”, for example. In many cases, the action has multiple synonyms, and there is the possibility of a mismatch between the term appearing in the user query and that appearing in the web page, such as “obtain” rather than “download” some software package, for example. The object, on the other hand, being associated with the name of an entity, such as a trademark for example, is less likely to be confused by the user. In an embodiment, this potential mismatch within the web pages that have been classified as transactional is addressed by expanding the annotation of the transactional features to include synonyms of the transactional features. Note that performing synonym expansion over the entire web page collection will dramatically increase the size of the index. In an embodiment, expanding only the transactional actions to include synonyms of the transactional actions in the transactional collection will mitigate this increase in index size, yet still enhance the performance of the transactional query.
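A minimal sketch of action-only synonym expansion follows. The small `ACTION_SYNONYMS` table is a hypothetical stand-in for a general thesaurus, and its entries are invented for illustration; object terms are deliberately left unexpanded.

```python
from typing import Dict, List

# Hypothetical thesaurus fragment covering transactional actions only.
ACTION_SYNONYMS: Dict[str, List[str]] = {
    "download": ["obtain", "get"],
    "submit": ["file", "send"],
}


def expand_actions(feature_terms: List[str]) -> List[str]:
    """Append synonyms after each action term; other terms pass through."""
    expanded: List[str] = []
    for term in feature_terms:
        expanded.append(term)
        expanded.extend(ACTION_SYNONYMS.get(term, []))
    return expanded
```

Because only a handful of action terms gain synonyms, the annotation for a page grows by a few terms rather than by a multiple of the page length.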
Following is a description of experimental results of an evaluation of the foregoing method. Textual intranet web pages with a small set of Multipurpose Internet Mail Extensions (MIME) types, such as html and php, for example, within a research university domain were recursively collected. The resulting web page collection included 434,211 web pages with a total size of 6.49 gigabytes (GB).
A set of 15 transactional search tasks was derived from an informal survey conducted among administrative staff and graduate students in the research university. Ten of the tasks are to find particular forms, and five are to download software. A total of 394 unique queries to perform these tasks were developed by a group of 26 students and recently graduated students.
Apache Lucene™, a high-performance, full-featured text search engine (available from http://lucene.apache.org/java/docs/), was used to index and search the following four data collections. The original data set, comprising 434,211 web pages as described above, is referred to as S-DOC. An embodiment of document filtering, as described above, based on the existence of transactional objects within the S-DOC data set, with each document classified as being a transactional page or not, will be referred to as S-TDC. A separate index was created for the collection of transactional pages within S-TDC, even though this collection is a strict subset of the pages in S-DOC. S-ANT-NE (defined as an embodiment of term filtering, as described above) is a collection created by writing all of the transactional features (for both SD and FE) on the same document into a single file. The identifier associated with each file is the original document. S-ANT is an embodiment of a collection generated similarly to S-ANT-NE, but also including a term-level synonym expansion. WordNet™ (available from http://www.wordnet.princeton.edu) was used as a general thesaurus to expand the verbs in the transactional features. While an embodiment of the invention has been described using the Apache Lucene™ text search engine and the WordNet™ thesaurus, it will be appreciated that they are for illustration only, and that the scope of the invention is not so limited, and will also include the use of other text search engines and thesauruses.
In the case of a transactional query, it is most often the case that the user is only interested in one way to perform the transaction. That is, the user is likely to care the most about the top ranked relevant match returned. Accordingly, results of most experiments are reported in terms of the mean reciprocal rank (MRR) measure. For each unique query of each task, the reciprocal value (1/n) of the rank (n) of the highest ranked correct result is obtained. This value is averaged over all the queries corresponding to the same task. The reciprocal rank of a query is set to 0 if no correct result is found in the first 100 pages returned.
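The reciprocal rank computation described above can be sketched as follows. The function names and the representation of a query as a pair of (ranked results, set of correct results) are assumptions made for the example; the cutoff of 100 follows the text.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(
    ranked_results: Sequence[str], correct: Set[str], cutoff: int = 100
) -> float:
    """Return 1/n for the highest ranked correct result, or 0 past the cutoff."""
    for rank, result in enumerate(ranked_results[:cutoff], start=1):
        if result in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over all queries belonging to one task."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, c) for r, c in queries) / len(queries)
```

For instance, three queries whose first correct results appear at rank 1, at rank 2, and not at all contribute 1, 0.5, and 0, giving a mean of 0.5 for the task.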
Correct answers are considered to be those web pages that can support the desired transaction task. For example, a correct answer for “download Remedy Client” must be a web page from which the software “Remedy Client” can be downloaded directly. As such, there is little subjectivity in determining relevance.
Referring now to
Referring now to
Referring now to
Referring now to
The method continues by annotating and indexing, according to the transactional features, the set of transactional web pages to increase an accuracy of a set of results of a user-designated transactional query, and in response to the user-designated transactional query, providing 825 to the user only the set of web pages that have been classified as transactional, and meet the appropriate query criteria. In an embodiment, the identifying 810 transactional features includes checking for the existence of positive patterns and verifying the absence of negative patterns with respect to a set of contents within each of the plurality of web pages. In an embodiment, the identifying 810 transactional features includes identifying transactional actions to be performed by the transactional feature, and additionally identifying transactional objects of the actions to be performed. In an embodiment, the annotating and indexing 820 the transactional features comprises annotating and indexing transactional actions and transactional objects.
In an embodiment, the identifying 810 the transactional features comprises identifying transactional objects associated with at least one of: software program names; and an actual form to be downloaded. In an embodiment, the identifying 810 the transactional features comprises identifying transactional actions associated with at least one of: making a property damage claim; downloading software; making travel reservations; and online form entry. The above examples are for illustration, and not limitation.
The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.