Within any number of business, social, or academic enterprises, a given person may be a member of several project teams. In such cases, it may become difficult for the person to track which of their electronic content (e.g., electronic mail communications, electronic tasks, electronic meeting notations, calendaring items, instant messaging communication threads, etc.) belongs to each of the different project teams. For example, a given employee of a business enterprise may belong to a first project team associated with software development for a first software product line, and the person may belong to a second project team associated with software development for a second software product line. Tracking content may be particularly problematic when the volume of content is high, such as may be the case with large databases of files or busy electronic mail or instant messaging inboxes. In any given day, the person may receive tens or even hundreds of electronic mail messages, documents, instant messaging communication threads, tasks, meeting notices, and the like associated with each of the different project teams. In these cases, the person may become frustrated and may simply give up on attempting to keep content organized in association with the different project teams.
It is with respect to these and other considerations that the present invention has been made.
Embodiments of the present invention solve the above and other problems by automatically classifying content as associated with a given electronic workspace. New electronic mail items, documents, meeting requests, tasks, calendar items, and the like are automatically classified into a project space. Thus, a user is not required to engage in a time-consuming task of identifying, collecting, and associating such content with a given project workspace. In addition, feedback may be provided to the user on the quality of automatic assignments of content items to the desired workspace for editing content associated with the desired workspace and for improving the automatic classification process.
The details of one or more embodiments are set forth in the accompanying drawings and description below. Other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that the following detailed description is explanatory only and is not restrictive of the invention as claimed.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various embodiments of the present invention. In the drawings:
As briefly described above, embodiments of the present invention are directed to automatically classifying documents into one or more project workspaces. Newly created content (for example, documents, electronic mail messages, text messages, meeting requests, tasks, and the like) is analyzed, and a suggested project classification is provided to a user associated with the new content. The user is allowed, through a user interface component, to accept or reject the project classification or to propose a different project classification. Based on the user's feedback, the classification system learns, and the classification process is improved.
The following description refers to the accompanying drawings. Whenever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar elements. While embodiments of the invention may be described, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the elements illustrated in the drawings, and the methods described herein may be modified by substituting, reordering, or adding stages to the disclosed methods. Accordingly, the following detailed description does not limit the invention. Instead, the proper scope of the invention is defined by the appended claims.
Referring now to the drawings, in which like numerals represent like elements through the several figures, aspects of the present invention and the exemplary operating environment will be described. While the invention will be described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that the invention may also be implemented in combination with other program modules.
Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Referring to
The user interface component 100 includes an example header 105 of “Project Classification Notification” to indicate to the user that a content item just generated and stored has been classified as will follow in the user interface presentation. As should be appreciated, the classification of content may occur at various times in the life cycle of a particular content item. For example, the classification and subsequent classification notification to the user may occur when the user generates and saves a content item, or the classification and notification may occur when a content item is revised and saved, or when a user receives a new content item, for example, a meeting request, an electronic mail item, a text message item, and the like.
Referring still to
Referring still to
As should be appreciated, the user interface component illustrated in
Referring to
The tasks repository 205 may include tasks generated and stored by a user or tasks received by the user from other users that are subsequently stored in a task database for the user. When a task item is stored by the user, the task item may be classified into a given project workspace via the user interface component 100, described above. The calendar items and meeting requests repository 210 is illustrative of calendar items, received and sent meeting request items, and the like, and such calendar items may be recommended for a classification according to a given project workspace upon generation, sending, receiving, or accepting.
The documents repository 215 and the miscellaneous content repository 220 are illustrative of any content generated and stored, or received and stored, by a user that may be classified into a given project through user feedback, as described herein. The automatic content classification system 300 is operative to classify the content received from the various sources 200-220 and to recommend and cause classification of the various content items into one or more project workspaces 230, 235, 240, 245.
Referring still to
A second major component of the automatic content classification system 300 is the component of classification of a content item into a given project workspace, as described below with reference to
Referring still to
The text included in the content item and associated metadata may next be processed for use in classifying the content into a given workspace. A text processing application may be employed whereby the text is broken into one or more text components for determining whether the received/retrieved text contains terms that may be used in comparing to other classified content. Breaking the text into the one or more text components may include breaking the text into individual sentences followed by breaking the individual sentences into individual tokens, for example, words, numeric strings, etc.
Such text processing is well known to those skilled in the art and may include breaking text portions into individual sentences and individual tokens according to known parameters. For example, punctuation marks and capitalization contained in a text portion may be utilized for determining the beginning and ending of a sentence. Spaces contained between portions of text may be utilized for determining breaks between individual tokens, for example, individual words, contained in individual sentences. According to one embodiment, content may be tokenized in a way that keeps the lexicon from growing too large. For example, if a language allows compounds to be formed by joining two nouns with a hyphen, breaking each compound before and after the hyphen into three tokens avoids the need to add every possible compound to the lexicon, which may otherwise cause the lexicon to grow large enough to cause processing performance problems. That is, if a compound like “front-wheel” is broken into three tokens, “front”, “-”, “wheel”, then the lexicon only needs to store the three tokens instead of the three tokens plus the compound “front-wheel.” Thus, the lexicon may cover as many words as possible, and processing performance may be improved owing to fewer unknown words.
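By way of example, and not limitation, the sentence and token breaking described above may be sketched as follows. The regular expressions and the function names split_sentences and tokenize are assumptions made for illustration only; the disclosure does not prescribe particular patterns, and a practical sentence breaker would need to handle abbreviations and other languages.

```python
import re

# Hypothetical token pattern: a token is either a run of word characters or a
# single non-space, non-word character, so a hyphen becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Hypothetical sentence boundary: sentence-ending punctuation followed by
# whitespace and then a capital letter (punctuation plus capitalization).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Break a text portion into individual sentences."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Break a sentence into tokens, splitting hyphenated compounds in three."""
    return TOKEN_PATTERN.findall(sentence)


if __name__ == "__main__":
    for sentence in split_sentences("The front-wheel drive failed. Call the team."):
        print(tokenize(sentence))
```

With this sketch, a lexicon would only need entries for "front", "-", and "wheel" rather than a separate entry for every hyphenated compound.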
In addition, alphanumeric strings following known patterns, for example, five digit numbers associated with zip codes, may be utilized for identifying portions of text. In addition, initially identified sentences or sentence tokens may be passed to one or more recognizer programs for comparing initially identified sentences or tokens against databases of known sentences or tokens for further determining individual sentences or tokens. For example, a word contained in a given sentence may be passed to a database to determine whether the word is a person's name, the name of a city, the name of a company, or whether a particular token is a recognized acronym, trade name, or the like. As should be appreciated, a variety of means may be employed for comparing sentences or tokens of sentences against known words or other alphanumeric strings for further identifying those text items.
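A recognizer of the kind described above may be sketched, under stated assumptions, as a pattern check followed by lookups against known-term sets. The sets KNOWN_CITIES and KNOWN_ACRONYMS are hypothetical stand-ins for the databases of known tokens, and the function name recognize is likewise illustrative.

```python
import re

# Five-digit pattern of the kind associated with zip codes.
ZIP_CODE = re.compile(r"\d{5}")

# Hypothetical stand-ins for databases of known tokens.
KNOWN_CITIES = {"seattle", "redmond"}
KNOWN_ACRONYMS = {"RF", "DVD"}


def recognize(token):
    """Return a coarse label for a token, or "unknown" if nothing matches."""
    if ZIP_CODE.fullmatch(token):
        return "zip_code"
    if token.lower() in KNOWN_CITIES:
        return "city"
    if token in KNOWN_ACRONYMS:
        return "acronym"
    return "unknown"
```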
Referring still to
According to another embodiment, the received content item may be passed directly to the rules component/operation 304 or statistical classification model 311, described below, without passing the content item first through the LAD at operation 303. As should be appreciated, language identification for a given content item may be obtained through other means, for example, as a metadata item associated with the content item, such that the LAD is not necessary for determining one or more languages associated with the content item.
The content item is next passed to a rules component/operation 304. The rules component/operation 304 is comprised of a rule database 306, a rule parser 308 and a rule-based classification application 310. The rule database is a repository of rules that may be used to classify a given content item based on one or more specific criteria. For example, if the title of the content item contains the same name as a given project name, then a given rule in the rule database 306 may include automatically recommending the content item for the project bearing the same name. A second example rule might include recommending a content item generated by a particular user to a particular project workspace, when the particular user is associated only with that particular workspace and no others. A third example rule might include a rule based on timing associated with a content item. For example, if all content items generated on a certain day of a period, for example, the last day of a fiscal quarter, should be associated with a given project workspace, for example, quarter-end data, then all content items generated on that particular date may be automatically associated with that project workspace.
The rule parser 308 is an application operative to parse the rules contained in the rule database 306 for comparison of those rules to terms extracted from the content item via text processing and content analysis described above. The rule-based classification application 310 is an application operative to apply the aforementioned rules to processed text and metadata associated with the content item for determining whether a rule is met requiring the recommended classification of the content item for inclusion in a given project workspace.
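The three example rules described above (title match, sole-workspace author, and date-based assignment) may be sketched as follows. This is a minimal illustration only: the ContentItem fields, the project names, the author-to-workspace table, the dates, and the first-match-wins ordering are all assumptions and are not taken from the rule database 306 itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContentItem:
    title: str
    author: str
    created: date


# Hypothetical data standing in for a rule database.
PROJECTS = ["Apollo", "Zephyr"]
AUTHOR_WORKSPACES = {"alice": ["Apollo"], "bob": ["Apollo", "Zephyr"]}
QUARTER_END_DATES = {date(2024, 3, 31), date(2024, 6, 30)}


def title_rule(item):
    """Recommend a project whose name appears in the item title."""
    for project in PROJECTS:
        if project.lower() in item.title.lower():
            return project
    return None


def sole_workspace_rule(item):
    """Recommend the only workspace the author is associated with, if any."""
    workspaces = AUTHOR_WORKSPACES.get(item.author, [])
    return workspaces[0] if len(workspaces) == 1 else None


def quarter_end_rule(item):
    """Recommend the quarter-end workspace for items created on those dates."""
    return "Quarter-End Data" if item.created in QUARTER_END_DATES else None


# Assumed ordering: the first rule that fires determines the recommendation.
RULES = [title_rule, sole_workspace_rule, quarter_end_rule]


def classify_by_rules(item):
    for rule in RULES:
        project = rule(item)
        if project is not None:
            return project
    return None
```

A call to classify_by_rules returns None when no rule is met, in which case other components, such as the statistical model, may be consulted.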
According to an embodiment, in addition to the use of a rule-based classification system, as described above, a statistical term classification model 311 for identifying parts of a content item as belonging to a given classification may be used. For example, a statistical model known as part-of-speech tagging or grammatical tagging may be used where components of a text-based content item may be characterized based on a location and contextual association with other components of the text component. Thus, for example, according to part-of-speech (POS) tagging, a word normally operating as a noun may be classified as a verb owing to its location between two known nouns and owing to the context of the words. Such a POS system may be used as an alternative to the rule-based system described above, or the two systems may be combined to enhance classification efficiency. As illustrated in
Referring now to project metadata component/operation 312, metadata associated with the content item, for example, content title, content author, content location, date/time of content generation and storage, date/time of content item transmission or receipt, metadata associating the content item with other content items, metadata associating the content item with other project workspaces, and the like may be utilized for recommending classification of a given content item into a given project workspace. The project keywords component 314 and the project contacts component 316 may be utilized for associating metadata, keywords, terms, features and the like extracted from the content item and for associating or comparing those items through contact information or other identifying information associated with one or more project workspaces for recommending classification of a given content item into a particular project workspace. For example, if the content item includes an electronic mail item bearing a sender name, one or more receiver names, a title, and the like that may be matched to similar metadata associated with other electronic mail items previously classified into a particular workspace, that information may be used by the automatic content classification system 300 for recommending inclusion of the example electronic mail item with the particular project workspace.
At multiple projects data component/operation 318, content and metadata extracted from the content items may be utilized by the automatic content classification system 300 for proposing or recommending classification of a given content item into a particular project workspace. According to embodiments, the multiple projects data component/operation 318 is illustrative of an access point to project data/metadata 320, 324, and training data 322, 326 associated with content items previously classified into one or more other project workspaces, for example, the project workspaces 230, 235, 240, 245, illustrated in
For example, a document previously assigned to a given project workspace will have various data comprising the document including text, images, numeric data, and the like that was processed for analysis and classification when that document was previously classified into a given workspace. In addition, during the classification process, training data associated with the classification of that document may have been generated. For example, if a first proposed classification for that document was presented to a user, but the user rejected the proposed classification and proposed an alternate classification via the user interface 100, illustrated above in
The training set data component/operation 328 is illustrative of training data for the automatic classification system 300 in association with the content item presently being analyzed and classified. That is, information from one or more analyses/components, for example, the rules component 304, a POS tagging system, the project metadata component 312, the multiple projects data 318, or combinations thereof, may be assembled for use in causing the system 300 to associate the present content item with a given project workspace. Each of these systems may be used independently for classifying a piece of content, or combinations of these systems may be used for optimizing the classification process described herein. For example, if eight out of every ten electronic mails from a particular sender are ultimately classified into a particular project workspace, and the current content item is an electronic mail from the same sender, then the 80% chance that the electronic mail may be classified into that same project workspace may be utilized along with other data for assisting in the classification.
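The sender-based example above amounts to a relative-frequency estimate, which may be sketched as follows. The class name SenderHistory and its methods are hypothetical; the sketch only shows how the 80% figure could be derived from prior classifications, not how the system 300 combines it with other data.

```python
from collections import Counter, defaultdict


class SenderHistory:
    """Tracks how often mail from each sender was classified into each project."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record(self, sender, project):
        """Record one previously classified electronic mail."""
        self._counts[sender][project] += 1

    def project_probabilities(self, sender):
        """Return the fraction of the sender's mail classified into each project."""
        counts = self._counts.get(sender)
        if not counts:
            return {}
        total = sum(counts.values())
        return {project: n / total for project, n in counts.items()}


if __name__ == "__main__":
    history = SenderHistory()
    for _ in range(8):
        history.record("alice@example.com", "Apollo")
    for _ in range(2):
        history.record("alice@example.com", "Zephyr")
    print(history.project_probabilities("alice@example.com"))
```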
After training set data is generated for the current content item, the system proceeds to classification component/operation 329. The content type feature builder component 334 is utilized for initially classifying the information about the content according to a particular content type, for example, a word processing document, a spreadsheet document, an electronic mail item, a text message item, a meeting notice, a task item, and the like. The feature vectors component 332 is utilized for organizing the information extracted from the content item for comparing the information against similar information contained in other content items previously classified into one or more other project workspaces. For example, if the content type is associated with an electronic mail item, then feature vectors associated with the electronic mail item may include sending party, receiving party, subject line, transmission type, such as electronic mail versus text messaging, and the like.
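For an electronic mail item, a feature vector of the kind described above may be sketched as a sparse mapping from feature names to weights. The function name build_email_features, the "name=value" key format, and the use of simple counts as weights are assumptions made for illustration.

```python
def build_email_features(sender, recipients, subject):
    """Build a sparse feature vector (feature name -> weight) for an email."""
    features = {"type=email": 1.0}
    features["sender=" + sender.lower()] = 1.0
    for recipient in recipients:
        features["recipient=" + recipient.lower()] = 1.0
    # Subject-line words are counted so that repeated terms weigh more.
    for word in subject.lower().split():
        key = "subject=" + word
        features[key] = features.get(key, 0.0) + 1.0
    return features
```

A different content type, such as a spreadsheet document or a task item, would use a different set of feature names built by its own feature builder.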
After feature vectors are developed for the information extracted from or obtained in association with the current content item, similarity comparisons and computations component/operation 330 compares the information assembled for the content item with similar information contained in or associated with content items previously classified into one or more other project workspaces. Once the current content item is found to be similar to content items previously classified into one or more other project workspaces, the one or more other project workspaces may be proposed to a user as a suggested project 336.
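The similarity comparison may be sketched as follows. The disclosure does not name a particular similarity measure; cosine similarity over sparse feature vectors is used here purely as an assumption, and the function names and the best-match-per-project ranking are likewise illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (feature name -> weight)."""
    dot = sum(weight * b.get(name, 0.0) for name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_projects(item_features, classified_items):
    """Rank projects by their most similar previously classified item.

    classified_items is a list of (project_name, feature_vector) pairs.
    Projects with no overlap with the current item are omitted.
    """
    best = {}
    for project, features in classified_items:
        score = cosine_similarity(item_features, features)
        if score > best.get(project, 0.0):
            best[project] = score
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)
```

The first entry of the ranked list would correspond to the suggested project, and later entries provide fallback suggestions.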
As described above, the suggested project 336 may be proposed to the user via the user interface component 100 illustrated and described above with reference to
If the user rejects the proposed classification, then the system 300 may utilize the rejection to cause the system 300 to analyze the information again and to propose a different classification, for example, a second classification that ranks slightly lower than the first proposed classification. If the user proposes a new project workspace classification for the content item, then the system may parse the information contained in content items associated with the project workspace proposed by the user to compare with data extracted from and obtained in association with the current content item for enhancing its ability to make project workspace suggestions on future similar content items.
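The accept, reject, and propose interactions described above may be sketched as a small session object that walks a ranked list of suggestions and records the user's feedback as labeled examples. The class name SuggestionSession, its methods, and the (project, label) example format are hypothetical; how the system 300 actually consumes such feedback is not specified here.

```python
class SuggestionSession:
    """Walks a ranked list of suggested projects and records user feedback."""

    def __init__(self, ranked_projects):
        self._ranked = list(ranked_projects)
        self._index = 0
        # Each entry is (project, True) for a positive example or
        # (project, False) for a rejected suggestion.
        self.training_examples = []

    def current(self):
        """Return the suggestion currently shown, or None if none remain."""
        if self._index < len(self._ranked):
            return self._ranked[self._index]
        return None

    def accept(self):
        """User accepts the current suggestion."""
        project = self.current()
        if project is not None:
            self.training_examples.append((project, True))
        return project

    def reject(self):
        """User rejects the current suggestion; move to the next-ranked one."""
        project = self.current()
        if project is not None:
            self.training_examples.append((project, False))
            self._index += 1
        return self.current()

    def propose(self, project):
        """User proposes a different project workspace."""
        self.training_examples.append((project, True))
        return project
```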
Referring still to
As described above, embodiments of the invention may be implemented via local and remote computing and data storage systems, including the systems illustrated and described with reference to
With reference to
Computing device 500 may have additional features or functionality. For example, computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
As stated above, a number of program modules and data files may be stored in system memory 504, including operating system 505. While executing on processing unit 502, programming modules 506 may include the automatic content classification system 300, which may comprise program modules containing sufficient computer-executable instructions which, when executed, perform functionalities as described herein. The aforementioned process is an example, and processing unit 502 may perform other processes. Other programming modules that may be used in accordance with embodiments of the present invention may include electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
Generally, consistent with embodiments of the invention, program modules may include routines, programs, components, data structures, and other types of structures that may perform particular tasks or that may implement particular abstract data types. Moreover, embodiments of the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Furthermore, embodiments of the invention may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. Embodiments of the invention may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the invention may be practiced within a general purpose computer or in any other circuits or systems.
Embodiments of the invention, for example, may be implemented as a computer process (method), a computing system, or as an article of manufacture, such as a computer program product or computer readable media. The computer program product may be a computer storage media readable by a computer system and encoding a computer program of instructions for executing a computer process. Accordingly, the present invention may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). In other words, embodiments of the present invention may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. A computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. System memory 504, removable storage 509, and non-removable storage 510 are all computer storage media examples (i.e., memory storage). Computer storage media may include, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store information and which can be accessed by computing device 500. Any such computer storage media may be part of device 500. Computing device 500 may also have input device(s) 512 such as a keyboard, a mouse, a pen, a sound input device, a touch input device, etc. Output device(s) 514 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used.
The term computer readable media as used herein may also include communication media. Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
Embodiments of the present invention, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to embodiments of the invention. The functions/acts noted in the blocks may occur out of the order as shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
While certain embodiments of the invention have been described, other embodiments may exist. Furthermore, although embodiments of the present invention have been described as being associated with data stored in memory and other storage mediums, data can also be stored on or read from other types of computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or a CD-ROM, a carrier wave from the Internet, or other forms of RAM or ROM. Further, the disclosed methods' stages may be modified in any manner, including by reordering stages and/or inserting or deleting stages, without departing from the invention.
All rights including copyrights in the code included herein are vested in and the property of the Applicant. The Applicant retains and reserves all rights in the code included herein, and grants permission to reproduce the material only in connection with reproduction of the granted patent and for no other purpose.
While the specification includes examples, the invention's scope is indicated by the following claims. Furthermore, while the specification has been described in language specific to structural features and/or methodological acts, the claims are not limited to the features or acts described above. Rather, the specific features and acts described above are disclosed as examples of embodiments of the invention.