This application relates in general to digital information categorization and, in particular, to a system and method for implicit tagging of documents using search query data.
“Web 2.0” informally refers to Web-based services, including Web sites, developed to encourage communication and collaboration between users, in contrast to the first generation of the World Wide Web, referred to as “Web 1.0,” which focused on information access and retrieval. Web 2.0 services include social networking, such as Facebook (www.facebook.com); content sharing, such as YouTube (www.youtube.com); and Web logs, or “blogs.” These services feature, for example, active user participation through generation, categorization, and sharing of content.
Tagging is another key component of Web 2.0, which allows a user to associate selected Web content with one or more freely chosen tags, or keywords. Tagging allows a user to efficiently retrieve tagged Web content at a later time. For example, Delicious (www.delicious.com) allows a user to apply tags to Web page bookmarks. Subsequently, the user can search for and retrieve the Web page from the user's personal bookmarked collection using the previously applied tags. Additionally, the user's bookmarks and tags can be shared with other users, who can view, search, and add their own tags. Aggregation of the tags of many users creates a folksonomy, or social tagging, that makes the tagged content easier to search, browse, and navigate over time as more tags and users are added. Other examples include Flickr (www.flickr.com) and last.fm (www.last.fm), which allow tagging and sharing of photos and music, respectively.
Tags, therefore, provide a valuable data mining tool to individual users as well as to an entire community of users. The value of tags, and consequently, of the folksonomy of the Web services that provide tagging tools, is dependent on the quantity of tags and the topics covered by the tags. As more users utilize the tagging features, additional users are attracted to the service. Unfortunately, tagging exacts a user cost by requiring explicit effort to identify and manually tag content. User hesitancy or reluctance to undertake the effort necessary to tag content, especially at the early stages of deployment of a tagging service, can lead to a low adoption rate of the tagging service, which results in data sparsity in the number of tags and topics covered. Additionally, some sites, such as Flickr and YouTube, only allow the user who uploads content to tag that content, further reducing the amount of initial tagging data available.
Therefore, an approach is needed to introduce tagged content into a tagging system without sole reliance on explicit user effort. Preferably, such an approach would use implicit user actions to tag content and thereby facilitate social tagging of Web content, so users are more likely to collaborate and share tagged content.
According to aspects illustrated herein, there is provided a computer-implemented system and method for implicit tagging of documents using search query data. A corpus of documents including electronically-stored digital data is identified. A search query including one or more query terms from a user is received. The search query is executed against the document corpus. Search results including an identifier for each of the documents in the corpus that matches at least one of the query terms are obtained. A selection of one or more of the identifiers by the user is captured. A set of click-through tags that each includes the user, one of the selected identifiers, and the matching query terms is created.
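By way of illustration only, the summarized method can be sketched in Python as follows. The class name ClickThroughTag, the function names, the whitespace-based term matching, and the example.com documents are assumptions made for this sketch and are not drawn from the specification; an actual embodiment may match and store terms differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickThroughTag:
    """Hypothetical record: the user, a selected identifier, and the matching query terms."""
    user: str
    identifier: str
    terms: tuple


def execute_query(corpus, query_terms):
    """Return {identifier: matching terms} for each document matching at least one term."""
    results = {}
    for identifier, text in corpus.items():
        words = set(text.lower().split())
        matched = tuple(t for t in query_terms if t.lower() in words)
        if matched:
            results[identifier] = matched
    return results


def create_click_through_tags(user, results, selected):
    """Create one click-through tag per identifier that the user selected."""
    return [ClickThroughTag(user, ident, results[ident])
            for ident in selected if ident in results]


# Illustrative corpus keyed by identifier (here, made-up URLs).
corpus = {
    "http://example.com/jaguar-cars": "jaguar sports cars review",
    "http://example.com/jaguar-cats": "jaguar big cats habitat",
    "http://example.com/python": "python programming tutorial",
}

results = execute_query(corpus, ["jaguar", "habitat"])
tags = create_click_through_tags(
    "alice", results, ["http://example.com/jaguar-cats"])
```

In this sketch, only the identifier that the user actually selects yields a tag, so the unselected search results contribute nothing to the tag set.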
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
Context from search queries can be captured and dynamically utilized for implicit social tagging of documents.
The client-side operations are performed by general purpose computers 104a-b loaded with client-side application module 106, which includes click-through tag plug-in 108 and Web browser 110. In a further embodiment, the client-side application module 106 can further include annotation plug-in 122. The server-side operations are performed by general purpose computers 104c-g loaded with one or more server-side application modules 112, each of which includes one, or a combination of more than one, of social tag module 114, search query server 116, and Web page server 118. In a further embodiment, the server-side application module 112 can also include one or more of annotation module 114, Web page (or Web document) servers 118, and tag-based search server 120. Still further client-side or server-side modules are possible. In a further embodiment, specific purpose computers can be programmed to carry out the client-side or server-side operations.
Initially, the Web browser 110 is initialized with the click-through tag plug-in 108, which includes operations for communication with the server-side application module 112. The Web browser 110 receives input from a user requesting a search query, including one or more query terms, which the Web browser 110 communicates to the search query server 116. The search query server 116 maintains or has access to a document corpus 124 containing a collection of documents, as defined infra. The search query server 116 applies the search query against the document corpus 124 and returns search results containing a list of matching documents to the Web browser 110 for display to the user. The matching documents can match all or a subset of the query terms. Preferably, the matching documents are presented to the user as a list that includes hyperlinks to the documents, though other forms of presentation are possible, such as displaying thumbnail images of the matching documents. A user can then select a search result from the list to access the desired document using, for example, a uniform resource locator (URL) that identifies a location on the network 102 of a server, such as a Web page server 118, storing the document.
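As a non-limiting sketch, matching against all or a subset of the query terms could be implemented with a simple inverted index, as shown below in Python. The function names build_index and search, the require_all flag, and the sample documents are hypothetical and serve only to illustrate one possible behavior of a search query server.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query, require_all=False):
    """Return a sorted list of URLs matching all terms, or at least one term."""
    terms = query.lower().split()
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    if require_all:
        hits = set.intersection(*postings)
    else:
        hits = set.union(*postings)
    return sorted(hits)


# Illustrative document corpus keyed by made-up URLs.
corpus = {
    "http://example.com/a": "social tagging of web pages",
    "http://example.com/b": "tagging photos",
    "http://example.com/c": "web search engines",
}
index = build_index(corpus)

subset_matches = search(index, "web tagging")
full_matches = search(index, "web tagging", require_all=True)
```

Here, subset_matches lists every document containing either term, while full_matches is limited to the document containing both.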
A document is a collection of electronic data that may define a variable number of pages depending on how the collection of electronic data is formatted when viewed, such as documents that may be viewed using a Web browser, for example Web pages. The electronic data making up a document may consist of static content, dynamic content, or a combination thereof, as further discussed below with reference to
The click-through tag plug-in 108 parses out the query terms of the search request and communicates the query terms through Web server 126 to tag servlet 128, which stores the query terms in a structured data repository in the social tag corpus 130. In a further embodiment, only the query terms that are found in a matching document are stored. Additionally, the click-through tag plug-in 108 identifies the URL selected by the user and stores the URL in the social tag corpus 130. Moreover, user information, such as a user or login name, is identified by the click-through tag plug-in 108 and stored. The query term, URL, and user identification are stored as a data triple, or click-through tag. In a further embodiment, the query term, URL, and user identification can be stored separately and logically linked. In a further embodiment, the click-through tag can be used to seed a social tagging service, such as described infra. In a further embodiment, a proxy server (not shown) operating on the network 102 can carry out the functions of the click-through tag plug-in 108.
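For purposes of illustration, the parsing of query terms and the formation of data triples might resemble the following Python sketch. The query-string parameter name "q", the search URL, and the function names are assumptions of this sketch rather than details of the plug-in, and the optional document_text argument models the further embodiment in which only query terms found in the matching document are stored.

```python
from urllib.parse import urlparse, parse_qs


def parse_query_terms(search_url, param="q"):
    """Extract lowercase query terms from a search request URL.

    The parameter name "q" is an assumption; search services differ.
    """
    query = parse_qs(urlparse(search_url).query)
    raw = query.get(param, [""])[0]
    return raw.lower().split()


def make_triples(user, selected_url, terms, document_text=None):
    """Build (query term, URL, user) triples, one per stored query term.

    When document_text is given, keep only terms that occur in the document.
    """
    terms = [t.lower() for t in terms]
    if document_text is not None:
        words = set(document_text.lower().split())
        terms = [t for t in terms if t in words]
    return [(t, selected_url, user) for t in terms]


terms = parse_query_terms(
    "http://search.example.com/find?q=implicit+tagging+patent")
triples = make_triples(
    "alice", "http://example.com/doc", terms,
    document_text="Implicit tagging of documents")
```

In this example, the term "patent" is dropped because it does not occur in the selected document, leaving two triples for the user.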
In a further embodiment, the client-side application module 106 includes an annotation plug-in 122 and the server-side application module 112 includes an annotation server 132 that enables explicit manual user tagging of entire, or selected portions of, documents, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Searching Annotated Document Collections,” Ser. No. 11/837,942, filed Aug. 13, 2007, pending, the disclosure of which is incorporated by reference. Other ways of explicitly tagging documents are possible. The tag, the tagged document, and the identification of the user that tagged the document are stored in the social tag corpus as an annotated tag.
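One possible arrangement of a structured data repository holding both kinds of tags is sketched below using Python's built-in sqlite3 module. The table name social_tags, the column names, the source labels, and the uniqueness constraint are hypothetical choices made for this sketch; the specification does not prescribe a schema.

```python
import sqlite3

# In-memory database standing in for the social tag corpus (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE social_tags (
        tag    TEXT NOT NULL,
        url    TEXT NOT NULL,
        user   TEXT NOT NULL,
        source TEXT NOT NULL CHECK (source IN ('click-through', 'annotated')),
        UNIQUE (tag, url, user, source)
    )
""")


def store_tag(conn, tag, url, user, source):
    """Store one tag record; an exact duplicate is silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO social_tags VALUES (?, ?, ?, ?)",
        (tag, url, user, source),
    )
    conn.commit()


store_tag(conn, "jaguar", "http://example.com/cats", "alice", "click-through")
store_tag(conn, "wildlife", "http://example.com/cats", "bob", "annotated")
store_tag(conn, "wildlife", "http://example.com/cats", "bob", "annotated")

rows = conn.execute(
    "SELECT tag, user, source FROM social_tags ORDER BY tag").fetchall()
```

Recording the source of each tag in this manner would allow implicit and explicit tags to be distinguished or weighted differently at a later time, should an embodiment call for it.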
In a further embodiment, the click-through tags and annotated tags stored in social tag corpus 130 may be searched using tag-based search server 120 through a user interface running on the Web browser 110, such as described supra. Other approaches for searching tags are possible.
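A minimal sketch of a tag-based search over stored tags follows, in Python. The record layout, the function name search_by_tag, and the ranking of documents by the number of distinct users who applied the tag are assumptions of this sketch; the tag-based search server may rank or filter results in other ways.

```python
from collections import defaultdict

# Illustrative tag records: (tag, URL, user, source). All values are made up.
social_tag_corpus = [
    ("jaguar", "http://example.com/cats", "alice", "click-through"),
    ("jaguar", "http://example.com/cats", "bob", "annotated"),
    ("jaguar", "http://example.com/cars", "carol", "click-through"),
    ("habitat", "http://example.com/cats", "alice", "click-through"),
]


def search_by_tag(corpus, tag):
    """Return (URL, distinct user count) pairs for a tag, most-tagged first."""
    wanted = tag.lower()
    users_by_url = defaultdict(set)
    for stored_tag, url, user, _source in corpus:
        if stored_tag == wanted:
            users_by_url[url].add(user)
    pairs = ((url, len(users)) for url, users in users_by_url.items())
    # Sort by descending user count, then by URL for a stable order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


hits = search_by_tag(social_tag_corpus, "Jaguar")
```

Both click-through tags and annotated tags count toward a document's rank in this sketch, so the two kinds of tags reinforce one another.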
Click-through tags and annotated tags can provide unique value to the social tag corpus.
Click-through tags provide valuable social tagging data at little to no additional user cost.
A corpus of documents is identified (step 402). Documents are electronic data, such as a Web page, that can be viewed in a Web browser. Documents can consist of static or dynamic content, or a combination thereof, as further described below with reference to
Upon selection of a URL by the user, the selection is captured by the click-through tag plug-in (step 410). Additionally, the query terms are parsed and, along with the URL and user information, are used to create a set of click-through tags (step 412). The click-through tags are used to seed a social tag corpus (step 414). In a further embodiment, the click-through tags, upon creation, can be stored in a separate data repository and added to the social tag corpus 130 at a later time point. The social tag corpus 130 can be revised (step 416), as necessary, with annotated tags explicitly created by the user or one or more different users, as further described below with reference to
The social tag corpus can be supplemented with explicitly created annotated tags.
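The creation, deferred seeding, and later revision of tags can be illustrated by the following Python sketch, which assumes a simple in-memory staging list standing in for the separate data repository. The function names, the four-field record layout, and the sample values are hypothetical and are offered only as one way the steps might be arranged.

```python
def create_click_through_tags(user, url, query):
    """Create one click-through tag record per query term."""
    return [(term, url, user, "click-through") for term in query.lower().split()]


def seed(corpus, staged):
    """Move staged tags into the corpus; return how many were newly added."""
    before = len(corpus)
    corpus.update(staged)
    staged.clear()
    return len(corpus) - before


def revise(corpus, tag, url, user):
    """Add an explicitly created annotated tag to the corpus."""
    corpus.add((tag.lower(), url, user, "annotated"))


staging = []               # separate repository holding tags until seeding
social_tag_corpus = set()  # set membership prevents duplicate records

# Capture a selection and create click-through tags in the staging repository.
staging.extend(create_click_through_tags(
    "alice", "http://example.com/cats", "jaguar habitat"))

# Seed the social tag corpus at a later time point.
added = seed(social_tag_corpus, staging)

# Revise the corpus with an annotated tag created by a different user.
revise(social_tag_corpus, "Wildlife", "http://example.com/cats", "bob")
```

Because the corpus in this sketch is a set, seeding the same staged tags twice would not create duplicate records.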
In a further embodiment, the click-through tags and annotated tags stored in social tag corpus 130 can be searched, such as further described above with reference to
A range of documents can be tagged by users.
Using the foregoing specification, the embodiments disclosed herein may be implemented as a machine (or system), process (or method), or article of manufacture by using standard programming or engineering techniques to produce programming software, firmware, hardware, or any combination thereof. Those skilled in the art will appreciate that the flow diagrams described in the specification are meant to provide an understanding of different possible embodiments. As such, alternative ordering of the steps, performing one or more steps in parallel, or performing additional or fewer steps may be done in alternative embodiments.
Any resulting program or programs, having computer-readable program code, may be embodied within one or more computer-usable media such as memory devices or transmitting devices, thereby making a computer program product or article of manufacture according to the disclosed embodiments. As such, the terms “article of manufacture” and “computer program product” as used herein are intended to encompass a computer program existent (permanently, temporarily, or transitorily) on any computer-usable medium such as on any memory device or in any transmitting device.
A machine embodying the disclosed embodiments may involve one or more processing systems including, but not limited to, CPU, memory/storage devices, communication links, communication/transmitting devices, servers, I/O devices, or any subcomponents or individual parts of one or more processing systems, including software, firmware, hardware, or any combination or subcombination thereof, which embody the disclosed embodiments as set forth in the claims. Those skilled in the art will recognize that memory devices include, but are not limited to, fixed (hard) disk drives, floppy disks (or diskettes), optical disks, magnetic tape, semiconductor memories such as RAM, ROM, and PROM. Transmitting devices include, but are not limited to, the Internet, intranets, electronic bulletin board and message/note exchanges, telephone/modem based network communication, hard-wired/cabled communication network, cellular communication, radio wave communication, satellite communication, and other stationary or mobile network systems/communication links.
While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope.