The present invention relates to a computer program product, system, and method for determining tags to recommend for a document from multiple database sources.
To properly manage content and allow for searching of content in documents, a tag is associated with a document to provide metadata used to manage and search for the document. A tag is a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, digital image, or computer file). Many applications allow the user to add tags or labels for content, such as videos, documents, blogs, etc. There are also applications to classify web content more intelligently.
There is a need in the art for improved techniques for assigning and generating document tags in a computer operating environment.
Provided are a computer program product, system, and method for determining tags to recommend for a document. A natural language processing module determines a document keyword for a document. A tag database search module determines a tag in a tag database associated with the document keyword. A domain specific search module determines a domain specific tag in a domain specific knowledge base associated with the document keyword. A recommendation is made of at least one of the tag and the domain specific tag as a recommended tag for the document.
A traditional hierarchical system taxonomy uses a top-down system having rigid pre-defined structures. However, in a tagging system, there is more than one way to classify an item, and one item can be assigned multiple tags. In common cases where users can freely add any tags, a number of issues arise, including: homonyms, where the same tag/word has different meanings in different contexts, e.g., “apple” the fruit vs “Apple” the company; synonyms, where different tags relate to the same concept; duplicates, such as singular vs plural (e.g., “recipe” vs “recipes”) or the same term in different languages; typos, such as “recipe” vs “recepe” or “recipee”; and tag relationships, such as “recipe” vs “Texas recipes” vs “kids recipes”.
Described embodiments provide improved programming techniques for recommending tags for a document having a greater likelihood of being acceptable to the user providing the document to tag. Upon determining a document keyword based on content in the document, a tag database search module determines whether the document keyword is related to a tag in a tag database previously selected by the user for a related keyword in the tag database. A domain specific search module processes a domain specific knowledge base, which may implement an ontology related to the document keyword, to determine a tag related to the document keyword. At least one tag determined from one of the tag database and the domain specific knowledge base is transmitted as at least one recommended tag to the user computer, so that the user may select whether to use one of the at least one recommended tag for the document.
The tag database and domain specific search modules may comprise machine learning modules trained based on user acceptance or rejection of their tag recommendations to produce tag recommendations having a greater likelihood of acceptance by the user. For instance, the tag database and domain specific search modules may be trained to not output from their respective databases recommended tags for a document keyword that the user does not accept, to reduce the likelihood of outputting recommended tags unacceptable to the user. The search modules are further trained to output recommended tags the user accepts for a keyword to increase the likelihood of producing recommended tags that will be acceptable. In this way, the selection of recommended keywords by the search modules takes into account the user's subjective preferences as well as objective preferences based on the document keyword, user profile, etc.
Described embodiments provide improvements for selecting tags for a document by providing tag quality control, such as addressing typographical errors, singular versus plural forms, equivalents, etc., recommending tags based on a user profile and the tagging logs, and combining both the automatic objective recommendations and subjective decisions from the user. Further, with described embodiments, the search modules have a self-learning capability to track user tagging habits for continuous improvement over time. Machine learning algorithms can be used to build relationships between document keywords, selected tags, and user preferences.
Described embodiments may further utilize, with the tagging services, a natural language processing module that automatically extracts the keyword, content, and summary of a given unstructured document, and a language translation module that can understand multiple languages to avoid duplicate tags. Described embodiments may also consider a user profile and background, and may additionally exploit web ontology databases as references for tag recommendations.
The memory 104 further includes a quality control engine 120 to process a user supplied tag for the document 112 to correct typographical errors, spelling, grammar, translation issues, etc., and a validation engine 122 to validate a user supplied tag, or new user tag, with respect to tags indicated in the tag database 200. The tagging services 110 may generate a user interface page 124, such as a Hypertext Markup Language (HTML) page, including recommended tags determined from searching the tag database 200 or the domain specific knowledge base 118 to return to a user computer 126 that provided the document 112 so that a user at the user computer 126 may select a recommended tag through the user interface page 124 or offer a new user supplied tag to use for the document 112.
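As a minimal sketch of the kind of correction the quality control engine 120 might perform, the following Python example normalizes a user supplied tag and maps a likely misspelling to a known tag using the standard library's difflib module. The function name correct_user_tag, the list of known tags, and the 0.8 similarity cutoff are assumptions made for illustration and are not part of the described embodiments.

```python
import difflib


def correct_user_tag(user_tag, known_tags, cutoff=0.8):
    """Return a cleaned-up tag, replacing a likely typo with a known tag.

    Hypothetical helper: the normalization rules and the cutoff value
    are illustrative assumptions, not the engine's actual behavior.
    """
    # Normalize case and collapse leading, trailing, and repeated whitespace.
    cleaned = " ".join(user_tag.lower().split())
    if cleaned in known_tags:
        return cleaned
    # Look for the single closest known tag above the similarity cutoff.
    matches = difflib.get_close_matches(cleaned, known_tags, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


known = ["recipe", "travel"]
print(correct_user_tag("recepe", known))
print(correct_user_tag("gardening", known))
```

A tag with no sufficiently close match is returned unchanged (apart from normalization), which leaves the decision about genuinely new tags to a later validation step.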
The tagging system 100 may communicate with the tag database 200, the domain specific knowledge base 118, and the user computer 126 over a network 128. In an alternative embodiment, the tagging related program components in the tagging system 100 may be implemented in the user computer 126 to perform tagging operations locally.
In certain embodiments, the search modules 114 and 116 and the validation engine 122 may implement a machine learning technique, such as decision tree learning, association rule learning, neural network, inductive logic programming, support vector machines, Bayesian network, etc., to search the databases 200, 118 for recommended alternate tags. These modules learn how to search based on user acceptance or rejection of recommended tags to increase the likelihood that tag recommendations will be accepted by the user. In this way, the search modules 114, 116 are trained to recommend tags having a higher likelihood of acceptance by the user.
The tagging system 100 may store program components, such as 108, 110, 114, 116, 120, and 122, documents 112, tags applied to the documents, and user interface pages 124 in a non-volatile storage 130, which may comprise one or more storage devices known in the art, such as a solid state storage device (SSD) comprised of solid state electronics, NAND storage cells, EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, flash disk, Random Access Memory (RAM) drive, storage-class memory (SCM), Phase Change Memory (PCM), resistive random access memory (RRAM), spin transfer torque memory (STT-RAM), conductive bridging RAM (CBRAM), magnetic hard disk drive, optical disk, tape, etc. The storage devices may further be configured into an array of devices, such as Just a Bunch of Disks (JBOD), Direct Access Storage Device (DASD), Redundant Array of Independent Disks (RAID) array, virtualization device, etc. Further, the storage devices may comprise heterogeneous storage devices from different vendors or from the same vendor.
The memory 104 may comprise suitable volatile or non-volatile memory devices, including those described above.
Generally, program modules, such as the program components 108, 110, 114, 116, 120, and 122 may comprise routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. The program components and hardware devices of the tagging system 100 of
The program components 108, 110, 114, 116, 120, and 122 may be accessed by the processor 102 from the memory 104 to execute. Alternatively, some or all of the program components 108, 110, 114, 116, 120, and 122 may be implemented in separate hardware devices, such as Application Specific Integrated Circuit (ASIC) hardware devices.
The functions described as performed by the program components 108, 110, 114, 116, 120, and 122 may be implemented as program code in fewer program modules than shown or implemented as program code throughout a greater number of program modules than shown.
The network 128 may comprise a Storage Area Network (SAN), Local Area Network (LAN), Intranet, the Internet, Wide Area Network (WAN), peer-to-peer network, wireless network, arbitrated loop network, etc.
If (at block 306) the tag database search module 114 outputs determined tags, then the tagging services 110 generates (at block 308) a user interface page 124 with the outputted tags as recommended tags for user approval or to provide a new user tag, and sends the user interface page 124 to the user computer 126 (or displays it locally). If (at block 306) the tag database search module 114 does not output determined tags, then the tagging services 110 calls (at block 310) the domain specific search module 116 to determine tags related to the document keywords from the domain specific knowledge base 118. If (at block 312) the domain specific search module 116 outputs domain specific tags, then the outputted tags are added (at block 316) to the tag database 200 in tag entries 200i as recommended tags 204 for the document keywords 202, the user 210, and the document 208. Control proceeds to block 308 to generate a user interface page 124 with the determined domain specific tags as the recommended tags to send to the user computer 126 to accept or reject. If (at block 312) there are no domain specific tags outputted, then the tagging services 110 generates (at block 314) a user interface page 124 prompting a user at a user computer 126 to enter a new user tag and sends the user interface page 124 to the user computer 126 (or displays it locally).
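The fallback order of blocks 306 through 316 can be sketched in Python as follows. This is an illustration under stated assumptions only: the tag database 200 and the domain specific knowledge base 118 are modeled as plain dictionaries mapping a keyword to a list of tags, and the function name recommend_tags and the returned source labels are hypothetical.

```python
def recommend_tags(keywords, tag_db, domain_kb):
    """Try the tag database first, then the domain specific knowledge base.

    tag_db and domain_kb are hypothetical dicts of keyword -> list of tags.
    Returns (recommended_tags, source), where source names what answered.
    """
    def dedupe(tags):
        # Drop repeated tags while keeping first-seen order.
        return list(dict.fromkeys(tags))

    # Blocks 306/308: recommend tags already held in the tag database.
    tags = [t for k in keywords for t in tag_db.get(k, [])]
    if tags:
        return dedupe(tags), "tag_database"

    # Blocks 310/312: otherwise consult the domain specific knowledge base.
    found = {k: domain_kb.get(k, []) for k in keywords}
    tags = [t for ts in found.values() for t in ts]
    if tags:
        # Block 316: record the domain specific tags in the tag database.
        for k, ts in found.items():
            if ts:
                tag_db.setdefault(k, []).extend(ts)
        return dedupe(tags), "domain_knowledge_base"

    # Block 314: nothing found, so the user is prompted for a new tag.
    return [], "prompt_user"


tag_db = {"pasta": ["recipe"]}
domain_kb = {"soup": ["cooking"]}
print(recommend_tags(["soup"], tag_db, domain_kb))
```

Because block 316 writes the domain specific tags back into the tag database, a later request for the same keyword in this sketch is answered from the tag database without consulting the knowledge base again.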
The search modules 114, 116, which may comprise machine learning modules, may receive input parameters to assist in searching for tags, including the document keywords, user profile information, document metadata, and related entries or information in the databases 200, 118 to use to determine document keywords from the databases 200, 118.
With the embodiment of operations of
In the embodiment of
If (at block 402) the user did not select one of the recommended tags, as indicated in the user interface page 124, then the tagging services 110, or other component, trains (at block 410) the domain specific search module 116 and/or the tag database search module 114, which outputted the recommended tags, to not output the recommended tags from the domain specific knowledge base 118 and/or the tag database 200, respectively, as related to the determined document keywords, which are provided as input to train the modules 114, 116. The inputs to train the modules 114, 116 at blocks 406 and 410 may comprise the same inputs used to determine the recommended tags from the databases 118, 200, such as the document 112 keywords, user profile information, etc. If (at block 412) the user provided a new user tag for the document 112, then control proceeds to
With the embodiment of
In certain embodiments, initially (during a training phase), the tagging services 110 may observe user feedback and use that feedback to prepare labelled examples. After a significant number of iterations, the labelled data are used to train the search modules 114, 116. Subsequently, the tagging services 110 may enter a smart operation and continuous learning mode to determine whether or not to suggest corrections to user created tags as part of the validation operations of the validation engine 122.
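One way to picture the training phase is a feedback log that is converted into labelled examples only after enough observations have accumulated. The following Python sketch assumes a hypothetical Feedback record and a hypothetical min_examples threshold; neither the record layout nor the threshold value is specified in the described embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One observed user decision (hypothetical record layout)."""
    keyword: str
    recommended_tag: str
    accepted: bool


def build_labelled_examples(feedback_log, min_examples=3):
    """Turn observed feedback into ((keyword, tag), label) training pairs.

    Returns an empty list while fewer than min_examples observations
    exist, which models remaining in the observation phase.
    """
    if len(feedback_log) < min_examples:
        return []
    # Label 1 marks an accepted recommendation, 0 a rejected one.
    return [
        ((f.keyword, f.recommended_tag), 1 if f.accepted else 0)
        for f in feedback_log
    ]


log = [
    Feedback("pasta", "recipes", True),
    Feedback("pasta", "travel", False),
    Feedback("soup", "cooking", True),
]
print(build_labelled_examples(log))
```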
The search modules 114, 116 may be trained by modelling a relationship between potential classifications (recommended tags and tags not recommended) and a feature vector formed using a combination of tag metadata from the tag database 200 and, optionally, a text feature extraction, such as term frequency-inverse document frequency (TF-IDF), of associated document vectors from the document 112. By such training, the modules 114, 116 learn how to suggest existing tags that have a higher likelihood of acceptance by the user based on document keywords, such as text feature extraction, and other information.
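To make the feature vector concrete, the sketch below computes an unsmoothed TF-IDF weight for each term and appends tag metadata to form a single vector. It is written in plain Python under several assumptions: text is tokenized on whitespace, the TF-IDF variant is term frequency times log(N/df), and the tag metadata is a hypothetical list of numeric values such as a usage count. The described embodiments do not state which TF-IDF variant or metadata fields are used.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def feature_vector(tfidf, vocabulary, tag_metadata):
    """Concatenate TF-IDF weights (in vocabulary order) with tag metadata."""
    return [tfidf.get(term, 0.0) for term in vocabulary] + list(tag_metadata)


docs = ["pasta recipe", "pasta travel"]
vectors = tfidf_vectors(docs)
print(feature_vector(vectors[0], ["pasta", "recipe", "travel"], [3]))
```

A term that appears in every document, such as "pasta" here, receives a weight of zero, so the vector emphasizes the terms that distinguish one document from another.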
If (at block 506) the tag database 200 does not indicate a threshold number of documents for the corrected new user tag, then the validation engine 122 determines (at block 514) whether the tag database 200 has a tag 204, 206 related to the corrected new user tag, such as in singular or plural form or super or sub-category of the corrected new user tag, etc. If (at block 514) the tag database has a form of the corrected new user tag, then control proceeds to block 508 to provide determined tags as recommended modified user tags to consider. If (at block 514) the tag database 200 does not have a recommended tag for the user to consider, then the corrected new user tag is applied (at block 516) to the document 112 and the tag database entries 200i for each document keyword are updated (at block 518) to indicate the recommended tag (if any in user interface page 124) as recommended tag 204, the corrected new user tag as the used tag 206, and the documents 112 tagged and the user in fields 208 and 210, respectively, for the document keyword 202 in the tag entry 200i being updated.
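The branch described above, in which the threshold is not met, can be sketched as follows. The Python example is illustrative only: doc_counts is a hypothetical dictionary mapping each known tag to the number of documents using it, the threshold of 5 is an arbitrary assumption, and the singular/plural and category tests are deliberately naive string checks. How the threshold-met case is handled is not covered by this passage, so the sketch only reports that outcome.

```python
def related_tags(tag, known_tags):
    """Find known tags that look like singular/plural or category variants."""
    related = []
    for known in known_tags:
        if known == tag:
            continue
        # Naive singular/plural check based on a trailing "s".
        singular_plural = known == tag + "s" or tag == known + "s"
        # Treat a multi-word tag containing the other tag as a sub-category,
        # e.g. "texas recipes" relative to "recipes".
        category = tag in known.split() or known in tag.split()
        if singular_plural or category:
            related.append(known)
    return related


def validate_new_tag(tag, doc_counts, threshold=5):
    """Return (outcome, tags) for a corrected new user tag."""
    if doc_counts.get(tag, 0) >= threshold:
        # Block 506, threshold met: handling is outside this sketch.
        return "threshold_met", []
    candidates = related_tags(tag, list(doc_counts))
    if candidates:
        # Blocks 514/508: offer related tags as recommended modified tags.
        return "recommend", candidates
    # Blocks 516/518: apply the corrected new user tag as entered.
    return "apply", [tag]


doc_counts = {"recipes": 12, "texas recipes": 2, "travel": 7}
print(validate_new_tag("recipe", doc_counts))
```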
With the embodiment of
Further, the embodiment of
The tagging services 110, or other component, trains (at block 608) the validation engine 122 to output the selected recommended modified user tag from the tag database 200 as related to the document keywords and the new user tag with a high confidence level to increase the likelihood the validation engine 122 outputs recommended modified user tags that have a higher likelihood of user acceptance. If (at block 602) the user did not select one of the recommended modified user tags, i.e., did not like the validation engine 122 suggestions, then the tagging services 110 updates (at block 612) the tag database entries 200i for each document keyword to indicate the corrected new user tag as the used tag 206 for the document keyword 202 and user 210. The validation engine 122 is trained (at block 614) to not output the recommended modified user tags from the tag database 200 as related to the document keywords and the new user tag to avoid further recommendations of tags the user did not previously accept.
With the embodiment of
In further embodiments, a user may select recommended tags as well as suggest a new user tag when considering the recommended tags provided at blocks 308, 314, and 512.
The described embodiments may further apply to a folksonomy, which comprises a system where multiple users apply public tags to online items, such as in collaborative tagging or social tagging, where the tags other users apply to items are available for all to use. In such folksonomy environments, the tagging services 110 may look for used tags for keywords for the users participating in the folksonomy and train the machine learning modules 114, 116, and 122 to provide recommendations based on the preferences of all users in the folksonomy to reflect group preferences for tag recommendations for certain keywords.
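As a rough illustration of group preferences in a folksonomy, the following Python sketch ranks the tags used for each keyword by the number of distinct users who applied them. The (user, keyword, tag) triple format and the function name group_tag_preferences are assumptions for illustration; the described embodiments train machine learning modules rather than using a simple count, so this shows only the aggregation idea.

```python
from collections import defaultdict


def group_tag_preferences(used_tags):
    """Rank tags per keyword by how many distinct users applied them.

    used_tags is an iterable of hypothetical (user, keyword, tag) triples.
    Returns a dict of keyword -> tags, most widely used first.
    """
    users_by_pair = defaultdict(set)
    for user, keyword, tag in used_tags:
        # A set counts each user once, however often they reuse a tag.
        users_by_pair[(keyword, tag)].add(user)

    ranked = defaultdict(list)
    # Sort by descending user count, then alphabetically for stable ties.
    ordered = sorted(users_by_pair.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for (keyword, tag), _users in ordered:
        ranked[keyword].append(tag)
    return dict(ranked)


log = [
    ("ann", "pasta", "recipes"),
    ("bob", "pasta", "recipes"),
    ("cat", "pasta", "italian"),
    ("ann", "pasta", "recipes"),
]
print(group_tag_preferences(log))
```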
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The computational components of
As shown in
Computer system/server 702 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 702, and it includes both volatile and non-volatile media, removable and non-removable media.
System memory 706 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 710 and/or cache memory 712. Computer system/server 702 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 713 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 708 by one or more data media interfaces. As will be further depicted and described below, memory 706 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program/utility 714, having a set (at least one) of program modules 716, may be stored in memory 706 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. The components of the computer 702 may be implemented as program modules 716 which generally carry out the functions and/or methodologies of embodiments of the invention as described herein. The systems of
Computer system/server 702 may also communicate with one or more external devices 718 such as a keyboard, a pointing device, a display 720, etc.; one or more devices that enable a user to interact with computer system/server 702; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 702 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 722. Still yet, computer system/server 702 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 724. As depicted, network adapter 724 communicates with the other components of computer system/server 702 via bus 708. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 702. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
The letter designators, such as i, used to designate a number of instances of an element may indicate a variable number of instances of that element when used with the same or different elements.
The terms “an embodiment”, “embodiment”, “embodiments”, “the embodiment”, “the embodiments”, “one or more embodiments”, “some embodiments”, and “one embodiment” mean “one or more (but not all) embodiments of the present invention(s)” unless expressly specified otherwise.
The terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless expressly specified otherwise.
The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise.
The terms “a”, “an” and “the” mean “one or more”, unless expressly specified otherwise.
Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.
A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the present invention.
When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the present invention need not include the device itself.
The foregoing description of various embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.