Various embodiments are directed generally to data analysis and specifically to methods and systems for analysis of unstructured data.
The current social media boom has made it easy for individuals to publicly describe, publish and disseminate their experiences of conducting business with specific organizations to a large audience. Today, a large number of organizations attempt to tap into such customer feedback to understand problem areas faced by their customers, and to use such feedback to make improvements and corrections.
In order to meaningfully analyze the potentially large volume of customer feedback that a business may collect, a typical approach may predefine topics/themes relevant to a specific business function and then develop an approach to map specific customer feedback to an appropriate theme. Typical approaches to mapping feedback to themes rely on rule-based patterns or machine learning techniques.
Once mapped to themes, feedback may then be quantified and analyzed based on the volume associated with a discussion theme. However, the use of a predefined set of themes to analyze unstructured data is inherently limiting: previously unseen problems may never be captured by a predefined template of themes, and a lot of valuable feedback could be overlooked.
The ability of a business to automatically detect topics of discussion amongst its customers would considerably accelerate its response to problems. Improved responsiveness would likely improve overall customer satisfaction, which in turn would drive greater customer retention and profitability.
Human language is very complex, and the authors of documents can choose to describe the same theme in many different ways, which makes automatic identification of significant themes a very hard task. However, previous attempts at employing unsupervised techniques have so far provided limited business value, as cluster groupings are generally unintuitive to human interpretation.
Various embodiments include systems and methods for automatic unsupervised detection of discussion topics from unstructured feedback text, wherein the resulting topic groupings are tagged with meaningful labels.
One embodiment may include a system comprising: a repository of unstructured documents; a theme detection component configured to: process the unstructured documents; discover themes; assign labels to each discovered theme; identify patterns that describe each theme; and organize the themes in a hierarchy; and a user interface configured to: allow an operator to initiate theme detection by the theme detection component; and allow an operator to view and interact with the results of the theme detection, wherein the results comprise at least one of the assigned labels, the patterns, and the hierarchy.
One embodiment may include a method of determining themes from a collection of unstructured text documents, the method comprising: receiving a set of unstructured text documents to process; determining, by a computing system, frequently occurring terms within the set of unstructured text documents; determining, by the computing system, a label for each term in the frequently occurring terms; determining, by the computing system, one or more text patterns, wherein the one or more text patterns are used to identify whether the term is contained within a document; and creating, by the computing system, a category model to organize the identified terms as themes or top level themes.
One embodiment may include a computer readable storage medium comprising instructions that, if executed, enable a computing system to: receive a set of unstructured text documents to process; determine frequently occurring terms within the set of unstructured text documents; determine a label for each term in the frequently occurring terms; determine one or more text patterns, wherein the one or more text patterns are used to identify whether the term is contained within a document; and create a category model to organize the identified terms as themes or top level themes.
Additional features, advantages, and embodiments of the invention are set forth or apparent from consideration of the following detailed description, drawings and claims. Moreover, it is to be understood that both the foregoing summary of the invention and the following detailed description are exemplary and intended to provide further explanation without limiting the scope of the invention as claimed.
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate preferred embodiments of the invention and together with the detailed description serve to explain the principles of the invention.
The present disclosure enables a mechanism to determine themes from unstructured data. Such a capability is not possible with existing unstructured and structured data analysis tools.
Certain embodiments provide an apparatus to process unstructured text to automatically determine, without supervision, relevant topics of discussion. From the perspective of unstructured text, a theme may be one of many central topics discussed in a document. An example of a theme is an issue raised in feedback submitted by a customer.
Theme Detection may refer to a mechanism to scan through a collection of documents, identify various central topics and establish topics that tend to recur across a broad set of independent documents.
Automatic Theme Detection may refer to unsupervised discovery of topics by a machine or computing device.
Unstructured text may refer to documents whose content includes written human language. This may include, but is not limited to, business documents such as word processing documents and spreadsheets, transcripts of audio conversations, survey comments, or social media posts (e.g. Twitter posts or Facebook posts).
Various embodiments provide a capability to process a set of previously uncategorized documents and discover significant themes. The output of such a process may be a machine developed category model that provides a set of rules to map sentences and/or documents to categories.
Various embodiments may also provide a capability wherein, for a set of documents categorized by a human developed category model, any sentences that were not mapped to a predefined category may be processed to propose additional categories that enhance the value of the human developed model.
Various embodiments may also provide a capability wherein, for a human or machine developed category model, newly seen documents which remain uncategorized by the model can be processed to discover any new topics of discussion that may have never been seen before.
For an automatically detected theme to be meaningful, one or more of the following characteristics may need to be determined: (a) what happened?—this can include experiences or observations, (b) which entities were affected?—this can include people, locations, products, (c) what was (or will be) the impact?
For an automatically detected theme to be insightful, the theme may need to be tagged with a meaningful descriptor.
For an automatically detected theme to be useful, a set of properties that identify the theme may need to be determined so that such properties can be applied to any future unstructured feedback to determine if the theme is relevant to that feedback as well.
Enterprise server 110, database server 120, one or more external sources 130, one or more internal sources 140, navigator device 150, administrator device 160, business intelligence server 170, and business intelligence report device 180 may be connected through one or more networks. The one or more networks may provide network access, data transport and other services to the devices coupled to it. In general, one or more networks may include and implement any commonly defined network architectures including those defined by standards bodies, such as the Global System for Mobile communication (GSM) Association, the Internet Engineering Task Force (IETF), and the Worldwide Interoperability for Microwave Access (WiMAX) forum. For example, one or more networks may implement one or more of a GSM architecture, a General Packet Radio Service (GPRS) architecture, a Universal Mobile Telecommunications System (UMTS) architecture, and an evolution of UMTS referred to as Long Term Evolution (LTE). The one or more networks may, again as an alternative or in conjunction with one or more of the above, implement a WiMAX architecture defined by the WiMAX forum. The one or more networks may also comprise, for instance, a local area network (LAN), a wide area network (WAN), the Internet, a virtual LAN (VLAN), an enterprise LAN, a virtual private network (VPN), an enterprise IP network, or any combination thereof.
Enterprise server 110, database server 120, and business intelligence server 170 may be any type of computing device, including but not limited to a personal computer, a server computer, a series of server computers, a mini computer, and a mainframe computer, or combinations thereof. Enterprise server 110, database server 120, and business intelligence server 170 may each be a web server (or a series of servers) running a network operating system, examples of which may include but are not limited to Microsoft Windows Server, Novell NetWare, or Linux.
Enterprise server 110 may include natural language processing engine 111, sentiment scoring engine 112, classification engine 113, reporting engine 114, and theme detection engine 115.
Natural language processing engine 111 may include subsystems to process unstructured text, including, but not limited to, language detection, sentence parsing, clause detection, tokenization, stemming, part of speech tagging, chunking, and named entity recognition. In some embodiments, natural language processing engine 111 may perform any or all portions of the exemplary process depicted in
Sentiment scoring engine 112 may identify the general feeling, attitude or opinion that the author of a section of unstructured text is expressing towards a situation or event. In some embodiments, the sentiment scoring engine may classify sentiment as either positive, negative or neutral. In some embodiments, the sentiment scoring engine may assign a numeric sentiment score on a numeric scale ranging from a minimum value representing the lowest possible sentiment to a maximum value representing the highest possible sentiment. In some embodiments, a dictionary of words is included, in which selected words are pre-assigned a sentiment tuning value. In some embodiments, the presence or absence of language features such as negation (e.g. NOT GOOD) or modifiers (e.g. VERY GOOD, SOMEWHAT GOOD etc.) when modifying certain words (e.g. GOOD) may influence the computation of sentiment for that sentence or clause.
Classification engine 113 may identify whether a particular classification category applies to a portion of unstructured text. A classification category may refer to a concept that either summarizes a block of unstructured text or may refer to a concept that is being described in a block of unstructured text. Classification categories may include physical objects (e.g. bathroom sink), people (e.g. waitstaff), locations (e.g. lobby), characteristics of objects (e.g. dirty, torn), characteristics of people (e.g. appearance, attitude), perceptions (e.g. unsafe), emotions (e.g. anger) etc. In some embodiments each classification category is represented by one or many rules. In some embodiments, the rules may be expressed in Boolean logic. In some embodiments, the rules may be represented by a trained machine learning model.
Reporting engine 114 may report against categories and sentiment expressed in a collection of documents. In some embodiments, the categories used in reporting may include theme detected topics. In some embodiments, reporting engine 114 may report any or all of the information of the exemplary applications depicted in
Theme detection engine 115 may include subsystems to determine categories/themes within a collection of documents. Theme detection engine 115 may determine categories/themes using unsupervised techniques. In some embodiments, theme detection engine 115 may organize themes in a hierarchical structure in which a child theme may belong to a parent theme. In some embodiments, theme detection engine 115 may suggest one or several categorization rules that represent the concept of the theme such that classification engine 113 may identify whether the theme applies to a portion of unstructured text. In some embodiments, theme detection engine 115 may suggest a name to identify each determined theme. In some embodiments, theme detection engine 115 may perform any or all portions of the exemplary process depicted in
In some embodiments, root cause analysis may be performed with or by any one or more of the embodiments disclosed in co-pending U.S. patent application Ser. No. 13/782,914 filed Mar. 1, 2013, entitled “Apparatus for Identifying Root Cause Using Unstructured Data,” which is hereby incorporated herein by reference.
A user of the system may instruct a module of the system such as theme detection engine 115 to process a collection of unstructured documents to determine significant themes that repeat across the collection.
Receiving collection of documents 210 may include collecting unstructured text documents from one or more internal sources and/or external sources. An internal source may refer to unstructured text that originates from within an organization. Internal sources may include, but not be limited to, e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases. External sources may refer to publicly available unstructured data or unstructured data that is published on a system external to an organization. External sources may include, but not be limited to, social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process.
Natural language processing 220 may include methods to process unstructured text, including, but not limited to, language detection, sentence parsing, clause detection, tokenization, stemming, part of speech tagging, chunking and named entity recognition. In some embodiments, natural language processing 220 may include any or all portions of the exemplary process depicted in
Sentiment scoring 230 may include methods to identify the general feeling, attitude or opinion that the author of a section of unstructured text is expressing towards a situation or event. In some embodiments, the sentiment scoring engine may classify sentiment as either positive, negative or neutral. In some embodiments, the sentiment scoring engine may assign a numeric sentiment score on a numeric scale ranging from a minimum value representing the lowest possible sentiment to a maximum value representing the highest possible sentiment.
In some embodiments, if a sentence has a single sentiment word with no negators or modifiers, the sentence sentiment score will be equal to the sentiment of that word (For example, a single word with a sentiment value of +3 will result in a sentence sentiment score of +3). In some embodiments, for sentences with multiple sentiment words, the following calculation is applied. Consider the sentence below as an example:
1. Find the highest sentiment word value in the sentence. This will be used as a base for the sentence sentiment. In the example sentence this is +3.
2. Add +0.5 for every additional word with the same sentiment. In the example, there is one more word with +3, so add +0.5 to +3, which equals +3.5.
3. Add +0.25 for every word one level lower in sentiment. In the example, there is just one token with +2, so (3.5+0.25)=+3.75
4. The same approach is applicable for each subsequent level. For sentiment level n−1, take the per-token increment used at level n, divide it by 2, and multiply by the number of tokens with sentiment (n−1). So, in the example, to calculate the effect of the +1 token, add +0.25/2 to the sentence sentiment: (3.75+0.25/2)=+3.875. The only exception is that word sentiment level 0.25 (multiple decreasing modifiers attached to a word with a +1 or −1 value) is handled the same way as 0.5: the net effect on the sentence sentiment is the same for both levels, as there is no meaningful difference between the two cases.
5. Total sentence sentiment=+3.875
The same calculation model is used for a sentence with negative words: adding a negative value equals subtraction of this value. When a sentence contains both positive and negative words, the calculations are done separately for positive and negative parts and then summed up.
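The cascading calculation in steps 1 through 5 may be sketched as follows. This is a minimal illustration assuming integer word sentiment levels; the fractional 0.25-level exception noted in step 4 and the negator/modifier adjustments are not modeled, and the function names are illustrative rather than taken from any actual implementation.

```python
def score_side(values):
    """Aggregate sentiment words of one polarity (positive magnitudes only).

    The highest value is the base; each additional word at the same level
    adds 0.5, and the per-word increment halves for each level below.
    """
    if not values:
        return 0.0
    top = max(values)
    score = float(top)
    increment = 0.5
    for level in range(top, 0, -1):
        count = values.count(level)
        if level == top:
            count -= 1  # the base word itself is already counted
        score += increment * count
        increment /= 2.0
    return score


def sentence_sentiment(word_values):
    """Score positive and negative words separately, then sum the parts."""
    positives = [v for v in word_values if v > 0]
    negatives = [-v for v in word_values if v < 0]
    return score_side(positives) - score_side(negatives)
```

Running the worked example, words with values +3, +3, +2, +1 yield 3 + 0.5 + 0.25 + 0.125 = +3.875, matching steps 1 through 5 above.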
Isolating target documents 240 may include steps to determine a collection of documents that theme identification 250 may process. In some embodiments, isolating target documents 240 may include steps to randomly sample a set of documents from the original document collection if the original document collection size is larger than a threshold. In some embodiments, isolating target documents 240 may include steps to reject duplicate documents from the sampled set of documents. In some embodiments, isolating target documents 240 may include steps to apply user specified document filter criteria to select the documents that theme identification 250 may process. In some embodiments, isolating target documents 240 may perform any or all portions of the exemplary processes depicted in
Theme identification 250 may include steps to statistically determine themes from isolated target documents 240 using natural language features identified by natural language processing 220, including but not limited to stemmed words, named entities, bigrams and parts of speech. In some embodiments, theme identification 250 may include steps to identify themes as a single or multi-level hierarchy with parent themes and child themes. In some embodiments, theme identification 250 may include any or all portions of the exemplary process depicted in
Theme naming 260 may name one or more themes produced by theme identification. Theme naming 260 may provide a theme with a label that allows a person viewing the label to understand what concept the theme describes.
Rule generation 270 may generate rules that are text patterns that can be used to categorize a block of unstructured text such that if a block of text matches the pattern, the theme can be said to describe a concept in the block of text.
Category model 280 may represent a hierarchical structure of themes.
Language detection 310 may include steps to identify the human language used to create the specific document being processed. In some embodiments, language detection may be performed by testing for high frequency words found in particular languages (e.g. “the”, “we”, “my” etc. in English). In some embodiments, language detection may be performed by identifying frequent N-grams, which are sequences of word patterns, from the document and searching against a corpus. In some embodiments, language detection may be performed based on the Unicode characters used in the document which may be unique to a particular language.
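The high-frequency-word approach described above may be sketched as follows. The word lists are illustrative samples, not a complete linguistic resource, and a production system would use far larger lists or the N-gram and Unicode techniques also mentioned.

```python
# Each candidate language is scored by how many of its common function
# words appear in the document; the highest-scoring language wins.
COMMON_WORDS = {
    "english": {"the", "we", "my", "and", "of", "to", "is"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los"},
    "german":  {"der", "die", "und", "das", "ist", "nicht", "ein"},
}

def detect_language(text):
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in COMMON_WORDS.items()}
    return max(scores, key=scores.get)
```

For example, `detect_language("the bathroom was dirty and we left")` matches three English function words and no Spanish or German ones.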
Sentence parsing 350 may include steps to identify sentences within the specific document being processed and to determine a parse structure that represents different grammatical relationships within each sentence identified. For example the sentence “The bathroom was dirty” may be parsed as:
where S refers to sentence, NP refers to noun phrase “the bathroom”, VP refers to verb phrase “was dirty”, ADJP refers to adjective phrase “dirty”.
Clause detection 320 may include steps to identify clauses within sentences identified in sentence parsing 350. For example the sentence “The bathroom was dirty and the carpet was torn” may be parsed as:
where two independent clauses “the bathroom was dirty” and “the carpet was torn” may be separated for analysis without losing any semantic meaning.
Tokenization 360 may include steps to identify each word within sentences identified in sentence parsing 350. For example the sentence “The bathroom was dirty” may be tokenized into the words THE, BATHROOM, WAS, DIRTY.
Stemming 330 may include steps to strip each word of any morphological suffixes or prefixes so that the word can be reduced to its root form. For example the token RUDELY may be stemmed to RUDE so that a single concept called RUDE can be identified.
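The suffix-stripping idea may be sketched as follows. The suffix list and minimum-stem length are illustrative assumptions; an established algorithm such as Porter stemming would be used in practice.

```python
# Strip the longest matching suffix, provided a reasonably long stem remains.
SUFFIXES = ("ness", "ing", "ly", "ed", "s")

def stem(token):
    token = token.lower()
    for suffix in SUFFIXES:  # ordered longest-first so "ness" wins over "s"
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```

With this sketch, the token RUDELY reduces to "rude", so the single concept RUDE can be identified as in the example above.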
Named entity recognition 380 may include steps to identify and relate proper nouns within a document collection. For example the token EL PASO may be identified as a named entity of type location.
Part of speech 370 may include steps to identify the grammatical function performed by each word within a sentence. Grammatical functions performed by words may include, but are not limited to, noun, verb, adjective, adverb, pronoun, and preposition. For example, in the sentence “the bathroom was dirty”, the parts of speech may be identified as THE=determiner, BATHROOM=noun, WAS=verb, DIRTY=adjective.
Chunking 340 may include steps to group sequential nouns together as a single consolidated noun token, if the multiple tokens are deemed more meaningful when grouped together as compared to when separated. In some embodiments, the determination that a group of tokens is more meaningful when grouped together may be accomplished using a lexicon dictionary or using a statistical calculation. An example would be neighboring tokens MICKEY and MOUSE which would be more meaningful as a combined token MICKEY MOUSE.
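The lexicon-based variant of chunking may be sketched as follows. The lexicon of known multi-word units is an illustrative assumption; the statistical-calculation variant mentioned above is not modeled here.

```python
# Merge adjacent tokens into one unit when the pair appears in a lexicon
# of known multi-word expressions.
LEXICON = {("mickey", "mouse"), ("water", "park"), ("new", "york")}

def chunk(tokens):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in LEXICON:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # consume both tokens of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out
```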
Receiving collection of documents 410 may include steps to collect unstructured text documents from one or more internal sources, such as but not limited to e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases, and/or one or more external sources, such as but not limited to social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process.
In block 420, the documents from 410 may be sampled. Sampling may include steps to sample from a large collection of documents. The sampling may be performed randomly.
Documents to process 430 may represent the result of the sampling process 420.
Receiving collection of documents 510 may include steps to collect unstructured text documents from one or more internal sources, such as but not limited to e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases, and/or one or more external sources, such as but not limited to social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process.
Filter criteria 530 may include filter criteria based on document metadata (i.e. structured data attributes associated with each unstructured document), filter criteria based on text patterns within unstructured text, or a combination of several filter criteria against document metadata or text patterns. Applying filter criteria may include using Boolean logic.
Filter 520 may include steps to capture documents out of collection of documents 510 such that the documents fit any criteria specified in filter criteria 530.
Sample 540 may include steps to sample from a large collection of documents. The sampling may be performed randomly.
Documents to process 550 may include the result of the sampling process 540.
Receiving collection of documents 610 may include steps to collect unstructured text documents from one or more internal sources, such as but not limited to e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases, and/or one or more external sources, such as but not limited to social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process.
Category model 630 may include a collection of categories such that for each category in the collection, one or many rules are specified that provide instructions for mapping a document to that category if the rules are satisfied by the document.
Classify 620 may include steps to apply category model 630 to collection of documents 610 so that documents are mapped to a category within model 630 if the document contains a text pattern defined by a rule for the category.
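Applying a category model to map documents may be sketched as follows. Representing each category's rules as a list of keyword patterns combined with OR semantics is an assumption for illustration; as noted above, rules may also be richer Boolean expressions or trained machine learning models.

```python
# A hypothetical category model: any matching pattern assigns the category.
CATEGORY_MODEL = {
    "cleanliness": ["dirty", "filthy", "stained"],
    "staff": ["rude", "waitstaff", "unhelpful"],
}

def classify(document, model):
    """Return every category whose rules match the document text."""
    text = document.lower()
    return [cat for cat, patterns in model.items()
            if any(p in text for p in patterns)]
```

A document matching no category's rules returns an empty list, which is what determine uncategorized documents 640 relies on.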
Determine uncategorized documents 640 may include steps to identify those documents from classify 620 that remain unclassified. Unclassified documents may refer to documents for which no category exists whose rules match the content of such uncategorized documents.
Sample 650 may include steps to sample from a large collection of documents. The sampling may be performed randomly.
Documents to process 660 may include the result of the sampling process 650.
Receiving collection of documents 710 may include steps to collect unstructured text documents from one or more internal sources, such as but not limited to e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases, and/or one or more external sources, such as but not limited to social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process. In some embodiments, the documents received represent only documents acquired in the current collection process. Category model 730 may represent a collection of categories such that for each category in the collection, one or many rules are specified that provide instructions for mapping a document to that category if the rules are satisfied by the document.
Classify 720 may include steps to apply category model 730 to collection of documents 710 so that documents are appropriately categorized.
Determine uncategorized documents 740 may include steps to identify those documents from classify 720 that remain unclassified. Unclassified documents may refer to documents for which no category exists whose rules match the content of such uncategorized documents.
Sample 750 may include steps to sample from a large collection of documents. The sampling may be performed randomly.
Documents to process 760 may represent the result of the sampling process 750.
Receiving target documents 810 may include steps to collect unstructured text documents from one or more internal sources, such as but not limited to e-mail, call center notes, transcripts of audio conversations, chat conversations, word documents, excel documents, and databases, and/or one or more external sources, such as but not limited to social media sources such as Facebook, Twitter, product review sites, service review sites, business review sites, and news sources. In some embodiments, unstructured text documents may be temporarily staged in a file-system or a database during the data collection process. In some embodiments, the document collection may be sampled as depicted in
Duplicate detection 815 may include steps to identify duplicate content amongst the document collection being processed. This may be to prevent spam content and other non-meaningful repetitive content, such as advertisements, reposts of popular topics, etc., from creating an unwanted bias in the results. In some embodiments, duplicate detection 815 may be performed by computing a mathematical hash of each document and selecting only one instance of cases where different documents return the same hash value.
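The hash-based approach may be sketched as follows. The choice of SHA-256 and the normalization (trimming and lowercasing before hashing) are illustrative assumptions; any stable hash of the document content would serve.

```python
import hashlib

def deduplicate(documents):
    """Keep only the first document seen for each content digest."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```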
Filter stop words 820 may include steps to remove high frequency words commonly found in language such as determiners, copula, auxiliary verbs etc. In some embodiments filter stop words 820 may be performed by using a predefined list of high frequency words. In some embodiments filter stop words 820 may be performed by mathematically computing a frequency threshold; words with a frequency above the threshold may be removed from further processing.
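The frequency-threshold variant may be sketched as follows. Using document frequency as the statistic and 0.5 as the threshold are illustrative choices, not values prescribed above.

```python
from collections import Counter

def filter_stop_words(tokenized_docs, threshold=0.5):
    """Drop words appearing in more than `threshold` fraction of documents."""
    n = len(tokenized_docs)
    # Count each word once per document (document frequency).
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    stop = {w for w, c in doc_freq.items() if c / n > threshold}
    return [[w for w in doc if w not in stop] for doc in tokenized_docs]
```

With three documents, words like "the" (present in all) and "was" (present in two) exceed the 0.5 threshold and are removed, leaving the content-bearing tokens.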
Noise word filtering 825 may include steps to remove noise words in the documents from consideration, such that the noise words may be ignored. Data sourced from social media is generally noisy. For example, if data is sourced using search criteria, the search term(s) tend to repeat very frequently and may be mistaken for a theme.
In some embodiments, noise word filtering 825 may determine a set of noise words N using the following technique. A word w may be considered to be a noise word if the following heuristic criteria are met:
where
All words in the set N may be ignored. An example of noise words would be when a search query such as “LCD TV” is issued against a source. Given the nature of the query, the most frequent terms are likely to be LCD and TV; in order to filter such terms out, an approach such as the one described above may be applied.

Bigram identification 830 may include steps to identify two sequential tokens as a single logical unit. In some embodiments, bigram identification 830 may identify bigrams using the following steps:
An example would be neighboring tokens WATER and PARK which would be more meaningful as a combined token WATER PARK.
Themes may be organized into a one-level or a two-level hierarchy. Identify top level themes 835 may include steps where top level themes of a two-level hierarchy are identified, or where themes of a one-level hierarchy are identified. In some embodiments, this step may be performed using the following steps:
where
In some embodiments, top-level themes may be identified from the pool of words and bigrams. An overall sentiment distribution of a group of sentences may be calculated, and an item sentiment distribution for sentences containing each item in the pool may be calculated. An item from the pool may be a candidate for a top-level theme if (1) it appears above a certain threshold in the set of documents and (2) it is:
(a) a noun whose word sentiment probability is greater than the overall sentiment distribution probability and whose sentiment is neutral;
(b) a verb whose word sentiment probability is greater than the overall sentiment distribution probability and whose sentiment is not neutral; or
(c) a named entity.
In some embodiments, the identified candidates may be culled by identifying the second most frequent candidate, setting a cutoff threshold as the count of sentences for the second most frequent candidate divided by a value, and ignoring any candidate which occurs less often than the cutoff threshold.
Identify second level themes 840 may include steps in which one or more second level themes of a two-level hierarchy may be identified. This step may not be performed for a one-level hierarchy theme model. In some embodiments, this step may be performed using the following:
In some embodiments, second-level themes may be identified from the pool of words and bigrams. The probability that a second level theme co-occurs with a top level theme given that the top level theme exists may be calculated. Similarly, the probability of a second level theme co-occurring with a top-level theme given that the second level theme exists may be calculated. An item from the pool may be a candidate for a second-level theme if it (1) appears above a certain threshold in the set of documents, (2) appears within a certain distance of a top-level item, and (3) is:
(a) a noun with neutral sentiment, whose probability of co-occurring with a top-level theme given that the top-level theme exists is greater than a first threshold, whose probability of co-occurring with a top-level theme given that the item exists is greater than a second threshold, and where the top-level theme is also a noun;
(b) an adjective with neutral sentiment, whose probability of co-occurring with a top-level theme given that the top-level theme exists is greater than a first threshold, whose probability of co-occurring with a top-level theme given that the item exists is greater than a second threshold, and where the top-level theme is a noun;
(c) a numeric item with neutral sentiment, whose probability of co-occurring with a top-level theme given that the top-level theme exists is greater than a first threshold, whose probability of co-occurring with a top-level theme given that the item exists is greater than a second threshold, that occurs within one item of the top-level theme, and where the top-level theme is a noun;
(d) a noun with neutral sentiment, whose probability of co-occurring with a top-level theme given that the top-level theme exists is greater than a first threshold, whose probability of co-occurring with a top-level theme given that the item exists is greater than a second threshold, and where the top-level theme is a verb; or
(e) any item with non-neutral sentiment whose probability of co-occurring with a top-level theme given that the top-level theme exists is greater than a first threshold and whose probability of co-occurring with a top-level theme given that the item exists is greater than a second threshold.
In some embodiments, the identified candidates may be culled by identifying the second most frequent top-level and second-level pair, setting a cutoff threshold as the count of sentences for that pair divided by a value (e.g., 5), and ignoring any top-level and second-level pair which occurs less often than the cutoff threshold.
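The two conditional co-occurrence probabilities and the pair-culling step can be sketched as below. This is an illustrative estimate over tokenized sentences, not the actual implementation; the function and parameter names are assumptions.

```python
def cooccurrence_probs(sentences, top_theme, item):
    """Estimate the two conditional co-occurrence probabilities used for
    second-level candidates. `sentences` is a list of token lists.

    Returns (P(item co-occurs | top-level theme present),
             P(item co-occurs | item present)).
    """
    top_n = sum(1 for s in sentences if top_theme in s)
    item_n = sum(1 for s in sentences if item in s)
    both_n = sum(1 for s in sentences if top_theme in s and item in s)
    p_given_top = both_n / top_n if top_n else 0.0
    p_given_item = both_n / item_n if item_n else 0.0
    return p_given_top, p_given_item


def cull_pairs(pair_counts, divisor=5):
    """Cull (top-level, second-level) pairs below the cutoff threshold:
    the sentence count of the second most frequent pair divided by
    `divisor` (e.g., 5)."""
    if len(pair_counts) < 2:
        return dict(pair_counts)
    second = sorted(pair_counts.values(), reverse=True)[1]
    cutoff = second / divisor
    return {p: c for p, c in pair_counts.items() if c >= cutoff}
```

For instance, if "size" appears in two of the three sentences containing "bed", then P(size co-occurs | bed exists) is 2/3, and a rarely seen pair such as ("bed", "old") would fall below the cutoff and be ignored.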
Grouping 850 may include steps in which related themes are grouped together, as described below.
Theme naming 865 may include steps in which identified top-level and second-level themes are assigned labels. In one embodiment, this may be performed using the steps described below.
Rule generation 870 may include steps in which text patterns that identify a theme are generated. In one embodiment, this may be performed using the steps described below.
Generated category model 875 represents a two-level hierarchy of themes or a one-level hierarchy of themes.
Category model 940 depicts a two level hierarchical model of themes discovered using unsupervised theme identification techniques.
At the highest level of the hierarchy in category model 940, there are parent level themes such as parent theme 910. Parent themes may automatically be assigned a label. For example, parent theme 910 is named “Bed”.
At the second level of the hierarchy in category model 940, there are child level themes, such as child theme 920 which is a child of parent theme 910. Child themes may be assigned a label. For example, child theme 920 is named “Size Bed”.
Every parent theme and child theme in category model 940 may be associated with one or more rules that describe a Boolean pattern for determining whether a document discusses the theme topic. Rules 930 are an example of text patterns that may be used to determine whether a document can be classified with child theme 920.
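Applying such a Boolean pattern to a document can be sketched as follows. The rule grammar used here (`all`/`any`/`none` over case-insensitive whole words) is purely illustrative and is not the actual rule language produced by rule generation 870.

```python
import re


def matches_rule(text, rule):
    """Return True if a document satisfies a simple Boolean text pattern.

    `rule` is a dict with optional (hypothetical) keys:
      'all'  - words that must all be present
      'any'  - words of which at least one must be present
      'none' - words that must be absent
    Matching is on case-insensitive whole words.
    """
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return (all(w in words for w in rule.get('all', []))
            and (not rule.get('any') or any(w in words for w in rule['any']))
            and not any(w in words for w in rule.get('none', [])))
```

For example, a rule for a child theme like “Size Bed” might require the word “bed” together with any of “size”, “king”, or “queen”, so that a review mentioning a king bed matches while one that merely mentions a bed does not.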
Categorization model 1010 may represent a human created hierarchical categorization model or may represent an automatically generated categorization model using theme detection.
Document collection 1020 may represent the documents that remain uncategorized after categorization model 1010 is applied to a document collection. In this illustration, the collection of uncategorized documents is labeled as “Global Other”.
A user may select an option to run theme detection 1030 against uncategorized documents 1020.
Display 1040 may represent the results of automatically discovered themes from the set of uncategorized documents.
Item 1110 may depict the results of theme detection against documents acquired at date time “10 Feb. 2012 13:37:10”.
Theme 1120 may depict a specific theme identified from newly determined themes 1110.
As will be appreciated by one of skill in the art, aspects of the present invention may be embodied as a method, data processing system, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects, all generally referred to herein as a “system”. Furthermore, elements of the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in the medium. Any suitable computer readable medium may be utilized, including hard disks, CD-ROMs, optical storage devices, flash RAM, transmission media such as those supporting the Internet or an intranet, or magnetic storage devices.
Computer program code for carrying out operations of the present invention may be written in an object oriented programming language such as JAVA, C#, Smalltalk or C++, or in conventional procedural programming languages, such as the Visual Basic or “C” programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer, or partially or entirely on a cloud environment. In the latter scenarios, the remote computer or cloud environments may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, server, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, server or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks, and may operate alone or in conjunction with additional hardware apparatus described herein.
Bus 1410 may include one or more interconnects that permit communication among the components of computing device 1400. Processor 1420 may include any type of processor, microprocessor, or processing logic that may interpret and execute instructions (e.g., a field programmable gate array (FPGA)). Processor 1420 may include a single device (e.g., a single core) and/or a group of devices (e.g., multi-core). Memory 1430 may include a random access memory (RAM) or another type of dynamic storage device that may store information and instructions for execution by processor 1420. Memory 1430 may also be used to store temporary variables or other intermediate information during execution of instructions by processor 1420.
ROM 1440 may include a ROM device and/or another type of static storage device that may store static information and instructions for processor 1420. Storage device 1450 may include a magnetic disk and/or optical disk and its corresponding drive for storing information and/or instructions. Storage device 1450 may include a single storage device or multiple storage devices, such as multiple storage devices operating in parallel. Moreover, storage device 1450 may reside locally on the computing device 1400 and/or may be remote with respect to a server and connected thereto via network and/or another type of connection, such as a dedicated link or channel.
Input device 1460 may include any mechanism or combination of mechanisms that permit an operator to input information to computing device 1400, such as a keyboard, a mouse, a touch sensitive display device, a microphone, a pen-based pointing device, and/or a biometric input device, such as a voice recognition device and/or a finger print scanning device. Output device 1470 may include any mechanism or combination of mechanisms that outputs information to the operator, including a display, a printer, a speaker, etc.
Communication interface 1480 may include any transceiver-like mechanism that enables computing device 1400 to communicate with other devices and/or systems, such as a client, a server, a license manager, a vendor, etc. For example, communication interface 1480 may include one or more interfaces, such as a first interface coupled to a network and/or a second interface coupled to a license manager. Alternatively, communication interface 1480 may include other mechanisms (e.g., a wireless interface) for communicating via a network, such as a wireless network. In one implementation, communication interface 1480 may include logic to send code to a destination device, such as a target device that can include general purpose hardware (e.g., a personal computer form factor), dedicated hardware (e.g., a digital signal processing (DSP) device adapted to execute a compiled version of a model or a part of a model), etc.
Computing device 1400 may perform certain functions in response to processor 1420 executing software instructions contained in a computer-readable medium, such as memory 1430. In alternative embodiments, hardwired circuitry may be used in place of or in combination with software instructions to implement features consistent with principles of the invention. Thus, implementations consistent with principles of the invention are not limited to any specific combination of hardware circuitry and software.
Exemplary embodiments may be embodied in many different ways as a software component. For example, it may be a stand-alone software package, a combination of software packages, or it may be a software package incorporated as a “tool” in a larger software product. It may be downloadable from a network, for example, a website, as a stand-alone product or as an add-in package for installation in an existing software application. It may also be available as a client-server software application, or as a web-enabled software application. It may also be embodied as a software package installed on a hardware device.
Numerous specific details have been set forth to provide a thorough understanding of the embodiments. It will be understood, however, that the embodiments may be practiced without these specific details. In other instances, well-known operations, components and circuits have not been described in detail so as not to obscure the embodiments. It can be appreciated that the specific structural and functional details are representative and do not necessarily limit the scope of the embodiments.
It is worthy to note that any reference to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in the specification are not necessarily all referring to the same embodiment.
Although some embodiments may be illustrated and described as comprising exemplary functional components or modules performing various operations, it can be appreciated that such components or modules may be implemented by one or more hardware components, software components, and/or combination thereof. The functional components and/or modules may be implemented, for example, by logic (e.g., instructions, data, and/or code) to be executed by a logic device (e.g., processor). Such logic may be stored internally or externally to a logic device on one or more types of computer-readable storage media.
Some embodiments may comprise an article of manufacture. An article of manufacture may comprise a storage medium to store logic. Examples of a storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of storage media include hard drives, disk drives, solid state drives, and any other tangible storage media.
It also is to be appreciated that the described embodiments illustrate exemplary implementations, and that the functional components and/or modules may be implemented in various other ways which are consistent with the described embodiments. Furthermore, the operations performed by such components or modules may be combined and/or separated for a given implementation and may be performed by a greater number or fewer number of components or modules.
Some of the figures may include a flow diagram. Although such figures may include a particular logic flow, it can be appreciated that the logic flow merely provides an exemplary implementation of the general functionality. Further, the logic flow does not necessarily have to be executed in the order presented unless otherwise indicated. In addition, the logic flow may be implemented by a hardware element, a software element executed by a processor, or any combination thereof.
Although the foregoing description is directed to the preferred embodiments of the invention, it is noted that other variations and modifications will be apparent to those skilled in the art, and may be made without departing from the spirit or scope of the invention. Moreover, features described in connection with one embodiment of the invention may be used in conjunction with other embodiments, even if not explicitly stated above.
This application claims the benefit of U.S. Provisional Patent Application No. 61/606,025, filed Mar. 2, 2012, and also claims the benefit of U.S. Provisional Patent Application No. 61/606,021, filed Mar. 2, 2012, the contents of each of which are hereby incorporated by reference herein in their entirety.
Publication: US 2013/0268534 A1, Oct. 2013, US.