This invention pertains generally to a system, method, and computer program product for information analysis, discovery, classification and retrieval; and more particularly to a system, method, and computer program product for gathering, classifying, categorizing, and analyzing unstructured information and for tracking trends and exceptions in unstructured information objects.
Structured data or objects generally refers to data existing in an organized form, such as a relational database, that can be accessed and analyzed by conventional techniques (e.g., Structured Query Language, SQL). By contrast, so-called unstructured data or objects refer to objects in a textual format (e.g., faxes, e-mails, documents, voice converted to text, or a collection of these formats) that do not necessarily share a common organization. Unstructured information often remains hidden and un-leveraged by an organization primarily because it is hard to access the right information at the right time or to integrate, analyze, or compare multiple items of information as a result of their unstructured nature. There exists a need for a system and method to provide structure for unstructured information such that the unstructured objects can be accessed with powerful conventional tools (such as, for example, SQL, or other information query and/or analysis tools) and analyzed for hidden trends and patterns across a corpus of unstructured objects.
Conventional systems and methods for accessing unstructured objects have focused on tactical searches that seek to match keywords, an approach that has several shortcomings. For example, as illustrated in
One conventional method of addressing this problem allows a user to enter detailed searches utilizing phrases or Boolean logic, but successful detailed tactical searches can be extremely difficult to formulate. The user must be sophisticated enough to express their search criteria in terms of Boolean logic. Furthermore, the user needs to know precisely what he or she is searching for, in the exact language that they expect to find it. Thus, there is a need for a search mechanism to more easily locate documents or other objects of interest, preferably searching with the user's own vocabulary. Further, such mechanism should desirably enable automatically searching related words and phrases, without knowledge of advanced searching techniques.
In another conventional method, the search is done based on meaning, where each of the words or phrases typed is semantically analyzed, as if second-guessing the user (for example, use of the term “juvenile” also picks up “teenager”). This increases the size of the result set, however, making analysis of search results even more important. This technique is also inadequate and quite inaccurate when the user is looking for a concept like “definition of terrorism” or “definition of knowledge management”, where the “concept” of the phrase is more important than the meaning of the individual words in the search term.
Even when tactical searches succeed in searching or finding information, the problem of analyzing unstructured information still remains. Analyzing unstructured information goes beyond the ability to locate information of interest. Analysis of unstructured information would allow a user to identify trends in unstructured objects as well as quickly identify the meaning of an unstructured object, without first having to read or review the entire document. Thus, there further exists a need to provide a system and methodology for analyzing unstructured information. In one situation, this need extends to a system and method for tracking and optionally reporting the changing presence of words or phrases in a set of documents over time.
Prior art classification systems exist that can organize unstructured objects in a hierarchical manner. However, utilizing these classification systems to locate an object of interest requires knowing what the high-level interest would be, and following one path of inquiry often precludes looking at other options. Thus, there is also a need for a system and method that can recognize relevant relationships between words and concepts, and can categorize an object under more than one high-level interest. Such a system and method should desirably scan objects for words or phrases and determine the presence of certain patterns that suggest the meaning, or theme, of a document, allowing for more accurate classification and retrieval.
Some prior art technologies store data and information utilizing proprietary methods and/or data structures, which prevents widespread or open access or analysis by keeping objects in a native non-standard proprietary format. Thus, there is a need to store information about unstructured objects in an open architecture and preferably in a readily accessible standard storage format, one embodiment being a relational database of which many types are known. Storage in a relational database keeps the information readily available for analysis by common tools. Where access protection is desired various known security measures may be employed as are known in the art.
The focus of many currently available prior art technologies has been to help find the one document or a known piece of information of interest from a huge source of unstructured data. However, there remains a need to analyze this untapped and largely ignored collection of unstructured data to find new trends, patterns and relationships to create actionable intelligence. This need requires the ability to look at the unstructured data in many possible ways and view points, drill down and look at details of various levels of summarizations and enable visualization of unstructured data in many ways. In short, there remains a need for a theme or concept-based method and system to analyze, categorize and query unstructured information. The present invention provides such a high precision system and method.
The present invention provides a system, method and computer program and computer program product for categorizing and analyzing unstructured information. The present invention includes an analysis and categorization engine that scans available unstructured objects. The analysis and categorization engine generates structured information in the form of relational database tables, and can accept user-specific input to personalize this process further. Once these relational database data structures have been generated, conventional techniques (such as SQL) can be utilized on the structured information to access the unstructured objects.
The analysis and categorization engine preferably builds a set of categories into which it will classify the unstructured objects. By scanning the categories or further training, the analysis and categorization engine captures a list of relevant concepts, where preferably each relevant concept comprises at least one word. Utilizing language relationships, thesaurus, other industry/language thesaurus and/or dictionary-lookup, the analysis and categorization engine expands the concepts into concept groupings. Each concept grouping preferably comprises at least one word and is named by a representative seed concept of at least one word. The concept groupings may be further augmented by user input and modification, allowing the analysis and categorization engine to capture language relationships and usage unique to individual users.
The analysis and categorization engine can bubble up or otherwise identify ideas and concepts embedded in a given set of unstructured data objects and present them in a structured or organized form, such as for example like a “table of contents for a magazine”. One difference being that in this case, the table of contents provides a dynamically organized collection of concepts embedded in the objects. The collection can be dynamically sorted in multiple ways for the user to access the right set of concepts and view their distribution in the targeted objects.
The analysis and categorization engine receives and filters unstructured objects, and indexes objects utilizing the concept groupings and a variation of the term frequency-inverse document frequency (Tf-Idf) technique. Indexing results in a representation of the object as a selection of weighted concepts. The analysis and categorization engine preferably generates a Gaussian distribution curve for the object to assign probabilities to concepts within the object. Concepts having probabilities within a certain range are selected as key concepts to represent the theme, or meaning, of an object. By setting the range, it is possible to dramatically increase precision and recall for object classification. The analysis and categorization engine utilizes the key concepts and their probabilities to determine an object's score for each category, and associates an object with every category achieving a specified score.
Output generated by the analysis and categorization engine such as concept groupings, object scores, and the users to whom they pertain may be stored in an open architecture format, such as a relational database table. Such storage enables conventional analysis techniques to be employed over unstructured data.
Aspects of the invention also provide an object concept based search engine. The search engine accepts search text, analyzes the text for concepts and retrieves objects represented by those concepts. User preferences are learned by the search engine through passing previously unknown concepts extracted from the query text to the analysis and categorization engine. The analysis and categorization engine incorporates the new concepts into the concept groupings and updates its object scoring based on the new concept groupings.
A novel graphical user interface is also optionally but advantageously provided to assist the user in viewing, organizing, and analyzing unstructured objects, and performing the object concept search and subsequent analysis. The structured information generated by the analysis and categorization engine facilitates integrated views of unstructured objects by concept as well as analysis—for example, capturing trends over time.
Other features and advantages of the invention will appear from the following description in which the preferred embodiments have been set forth in detail, in conjunction with the accompanying drawings.
Exemplary embodiments are described with reference to specific structural and methodological embodiments and configurations. Those workers having ordinary skill in the art in light of the description provided here will appreciate that various changes and modifications can be made while remaining within the scope of the claims. For example, the categorization process is presented in a preferred order utilizing preferred (Gaussian) statistics; however, ordering the steps differently or utilizing a different statistical methodology could achieve the same or analogous end. Examples of relational database tables are given, but those skilled in the art will appreciate that these tables could be structured differently and remain within the scope of the claims. Other variations, changes, and/or modifications may be made without departing from the scope of the invention.
The inventive system, method, data structure, and computer program software and computer program software product have particular applicability to information and intelligence gathering and analysis. Such information and intelligence identification, gathering, and analysis may be applied in economic, financial, technological, sociological, informatics, educational and learning, and security contexts, as well as in many other disciplines.
With reference to
An embodiment of the analysis and categorization engine 200 operates as outlined in
User input 300 and/or training objects 310 are utilized by the analysis and categorization engine to build (step 320) categories. The analysis and categorization engine 200 uses the built categories to capture concepts (step 330). A ‘concept’ as used herein denotes a word or phrase. With further user input 300 and a dictionary or thesaurus look-up (step 340), the analysis and categorization engine generates concept groupings (step 360). A ‘concept grouping’ as used herein denotes a group of concepts related in one or more predefined ways—such as synonyms or meaning words and phrases discovered in a dictionary look-up or set up by the user using a concept customization interface. Each concept grouping is headed, or named, by one concept—referred to herein as a seed concept.
The analysis and categorization engine 200 accepts an unstructured object as input (step 370), filters the object (step 380) and utilizes the concept groupings to index the object (step 390). Indexing, as generally known in information retrieval, refers to representing an object as a function of the parameters that will be utilized to search, analyze, or retrieve the object. In a preferred embodiment of the present invention, the indexing step 390 comprises generating a vector representation of the object, having a number of dimensions where each dimension has a weight. Each dimension corresponds to a seed concept, and the weight given to each seed concept depends in part on the frequency of occurrence of that concept within the object.
The index is utilized by the analysis and categorization engine 200 to generate a Gaussian distribution (step 400) of weights for each object and select a set of concepts to represent each object (step 410), herein referred to as key concepts. The objects are scored (step 420) and assigned to categories (step 430). Recall as described relative to
Embodiments of the present invention illustrating a more detailed description of the steps outlined in
With reference to
Advantageously, each object is available for access using the Universal Naming Convention (UNC) or via some other procedure for providing a unique (globally or locally unique) identifier or ID. The UNC is a way to identify a shared file in a computer without having to specify (or know) the storage device it is on. In the Microsoft Windows operating system, the naming format is \\servername\sharename\path\filename. Analogous naming formats are known for other operating systems. Each unstructured object is stored on one or more computer storage media accessible to the analysis and categorization engine through the UNC. A pointer 30 to the object's physical storage location is generated, for example, by the engine as an integer between −2,147,483,648 to 2,147,483,647. Other methods of generating a physical pointer may be utilized. The pointer 30 is advantageous in that an object can be viewed or analyzed by more than one user without the need to physically copy the object and consume additional space on the computer storage media. Object properties may also be stored in a relational database table. Object properties may include, for example, a string of text representing an object description 34 such as a name or file type, an object created date 36 comprising a numeric string representing the day, month, and year the object was created, and an object modified date 38 comprising a numeric string representing the day, month, and year the object was last modified. A variety of object properties could be stored utilizing a variety of storing methodologies or naming protocols.
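The object record described above can be sketched as a relational table. The following is a minimal illustration only, assuming SQLite as the relational store; the table name `objects`, the column names, and the helper functions are hypothetical and are not taken from the specification. SQLite integers are wider than the 32-bit range mentioned above, so the range is not enforced here.

```python
import sqlite3


def create_object_table(conn):
    """Create a hypothetical table holding one row per unstructured object."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS objects (
            object_id     INTEGER PRIMARY KEY,  -- pointer / global object ID 30
            unc_path      TEXT NOT NULL UNIQUE, -- UNC location of the object
            description   TEXT,                 -- object description 34
            created_date  TEXT,                 -- object created date 36
            modified_date TEXT,                 -- object modified date 38
            size_bytes    INTEGER               -- object size in bytes
        )
        """
    )


def register_object(conn, unc_path, description, created, modified, size_bytes):
    """Store the object's properties and return its generated integer ID."""
    cur = conn.execute(
        "INSERT INTO objects "
        "(unc_path, description, created_date, modified_date, size_bytes) "
        "VALUES (?, ?, ?, ?, ?)",
        (unc_path, description, created, modified, size_bytes),
    )
    return cur.lastrowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_object_table(conn)
    oid = register_object(
        conn, r"\\servername\sharename\path\filename",
        "memo", "2001-01-15", "2001-02-01", 2048,
    )
    print(oid)
```

Because only the identifier and path are stored, several users can refer to the same row without the object itself being copied.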
In one exemplary object relational database table, shown here as Table 1, the global object IDs 30 and object properties, such as object description 34, object created date 36, object modified date 38, and the object size 40 in bytes are stored as structured information 230 in an open architecture format, a relational database table. Other object properties, attributes, and the like may also be stored in the object relational database table and tracked,
As illustrated in the embodiment of
In a first exemplary embodiment, (1) a user 300 inputs both category name 313a and description 313b. In this case, the user provides the category name or other category identification and a description of the category, where these are desirably provided in natural language. A natural language description is, generally, a descriptive sentence or paragraph further refining what the category name is meant to signify for the user. One illustrative example is:
In a second exemplary embodiment, (2) user 300 inputs category name 313a and the analysis and categorization engine 200 generates the corresponding category description 313b. In this case, the user provides the name of the category and a number of training objects 310 forming or belonging to the category. The analysis and categorization engine 200 scans the training objects 310 to generate a set of descriptive words and/or phrases to use as the category description 313b. One illustrative example is:
The manner in which the analysis and categorization engine generates the category description from the uploaded or otherwise identified documents or information items is described in further detail hereinafter.
As described in greater detail elsewhere in this specification, once the object has been converted into a relevant reduced dimensionality vector, where the primary dimensions of the vector space are seed concepts occurring in that document, the analysis and categorization engine 200 selects a set of these dimensions, or seed concepts, that are or correspond to key concepts that are most representative of the object (
After step 410 (See
From these two ratios, the composite ratio of key concepts under a given category may be determined. This composite ratio R3 is R1*R2. If this composite ratio R3 falls within the high-bound 29 or low-bound 27 criteria, then this key concept becomes a concept defining the category as well. It should be noted that this training can occur at any time based on user input and can be controlled by the user through an interface.
In a third exemplary embodiment, (3) the analysis and categorization engine 200 creates both category name and description. The user 300 provides training objects 310 pertaining to Golf, such as, for example, U.S. open, British open, Australian open and TPC tour. The system, specifically the analysis and categorization engine 200, generates both the category name 313a and the category description 313b. In the example, the system generates category name 313a and category description 313b as follows.
The category building procedure 320 for generating the category name and category description from the uploaded objects is described in greater detail hereinafter. It is noted that the examples are illustrative only, and that a variety of methodologies could be utilized to build categories for use in sorting or analyzing objects. For example, a category may simply consist of a list of words or phrases, it need not have a ‘name’ or ‘description’ as utilized in the example.
The generated category name will generally be a subset of the category description. The creation of the category description was described in the previous section. The top Nk highest-ranked key concepts from the category description (for example, Nk=5, although any other number may be chosen) are selected as the category name. The quality of the generated name and description depends on the number of training objects: generally, the more objects in the training set, the better the generated category name and description. The user can group a set of objects and instruct the analysis and categorization engine to create the category description and category name.
With further reference to the embodiment in
It is noted that in one embodiment, the analysis and categorization engine captures or identifies concepts from category names and descriptions during classification, but in one embodiment, the relationships between different words and phrases are created during the thesaurus look-up and are continuously maintained and refined by user interaction.
A seed concept is a concept that will serve as a basis for a concept grouping and is a sub-type of concept. As described, this is either generated by the system when words get extracted (refer word extraction step) or when user provides category name and description. Thus the seed concept id is assigned from the same pool of concept identifiers. Three examples of capturing or generating seed concepts are given below.
In one embodiment, the analysis and categorization engine 200 accepts a set of training objects 450 that define a category. The engine extracts seed concepts based on Category description. In this case, the category description is parsed to get individual words by removing the stop and noise words. The resulting set of words become seed concepts.
In another embodiment, the analysis and categorization engine 200 scans all available documents (such as those stored in a defined directory or a list) and extracts a list of the most frequent keywords and their related words. The analysis and categorization engine 200 utilizes categories 312 and training objects 450 to extract a list of concepts 460.
Seed concepts 480 are refined by a dictionary and thesaurus look-up 470, or according to any other procedure for generating seed concepts. The thesaurus can be augmented with additional thesauri. For example, for the legal industry, a legal thesaurus may be included in addition to the English thesaurus and accessed first during the look-up. This word extraction or generation procedure may, for example, utilize semantic analysis rules or policies and take into account word or phrase occurrence frequency, synonymy, and/or polysemy, grammatical part of speech as well as other optional attributes and/or rules. In some instances, the rules may vary depending upon the number and size of documents or other information items available. An electronic dictionary and thesaurus 470 in the form of a database stored in a memory or storage device are used to generate additional words and/or phrases. Based on the set of extracted words, seed concepts are generated.
The procedure for extraction uses a variation of Latent Semantic Indexing, a well-known information retrieval technique. The idea is to extract the best possible words out of every document and build a superset of words or phrases and their relationships that would then be able to describe the object collection. The first step in this process is extracting the most frequently occurring words from every document. Documents can be sampled in arithmetic or geometric progression and the sampling selection can be based on several criteria such as time, size, author, and the like. The type and frequency of sampling can be modified by the user. The number of words to be extracted from a document is limited by a constant that can be set by the user. Also, in order for smaller documents to contribute in the same proportion as bigger documents, the word extraction process has to be normalized. According to one embodiment, the steps for extracting words from an individual object are as follows:
An assumption is made that every kilobyte of text has approximately W words (in one implementation, W is set to be 150 but a different number may be selected). Then the number of words (nw) that can be extracted from a document is given by the formula nw=Ds/W where Ds is the document size. The user can cap nw by setting an upper limit. In the first step, the system and method will extract up to nw*10 frequently occurring words from the document. In the next step, for every word extracted, the part of speech will be determined based on a grammar look-up. In one embodiment, a proper noun will be given the highest weightage W(word_i), a verb will be given the lowest weightage, and a polysemy word will be given medium weightage. (Other weightage rules or policies may alternatively be implemented.) The overall weightage for every selected word is then W(word_i)*f(word_i), where f(word_i) is the number of occurrences of the word. The nw words are then chosen in descending order of W(word_i)*f(word_i). If the word collection nw from object O1 is called nw_O1, then the superset {nw_O1, nw_O2, . . . nw_Om} becomes a collection of seed concepts for m objects, where {O1 . . . Om} is a collection of individual objects.
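The extraction steps above can be sketched as follows. This is an illustrative sketch only: it assumes the document size Ds is measured in bytes, uses a caller-supplied dictionary `pos_lookup` as a stand-in for the grammar look-up, and the numeric weights in `POS_WEIGHT` and `DEFAULT_WEIGHT` are hypothetical values chosen only to respect the stated ordering (proper noun highest, polysemy word medium, verb lowest).

```python
import re
from collections import Counter

WORDS_PER_KB = 150  # the constant W from the text
# Hypothetical weights; only their relative order comes from the text.
POS_WEIGHT = {"proper_noun": 3.0, "polysemy": 2.0, "verb": 1.0}
DEFAULT_WEIGHT = 1.5  # assumed weight for words with no known part of speech


def extract_words(text, pos_lookup, stop_words=frozenset(), max_words=None):
    """Return up to nw words ranked by part-of-speech weight times frequency."""
    doc_size = len(text.encode("utf-8"))       # Ds, assumed to be in bytes
    nw = max(1, doc_size // WORDS_PER_KB)      # nw = Ds / W
    if max_words is not None:                  # user-set upper limit on nw
        nw = min(nw, max_words)
    tokens = [t for t in re.findall(r"[A-Za-z']+", text)
              if t.lower() not in stop_words]
    # Step 1: keep up to nw*10 of the most frequently occurring words.
    candidates = Counter(tokens).most_common(nw * 10)
    # Step 2: weight each candidate by part of speech and frequency.
    scored = sorted(
        ((POS_WEIGHT.get(pos_lookup.get(word), DEFAULT_WEIGHT) * freq, word)
         for word, freq in candidates),
        key=lambda pair: (-pair[0], pair[1]),
    )
    # Step 3: choose the top nw words in descending order of weight.
    return [word for _, word in scored[:nw]]


def collect_seed_concepts(documents, pos_lookup, stop_words=frozenset()):
    """Union of the per-object word collections for objects O1 .. Om."""
    seeds = set()
    for text in documents:
        seeds.update(extract_words(text, pos_lookup, stop_words))
    return seeds
```

Dividing the size by W ties the number of extracted words to document length, which is one way to read the normalization requirement stated above.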
In yet another embodiment, a list of words and phrases is generated from a user provided description for a category. For at least some applications, this is a preferred way of generating seed concepts as user-specific information is directly input to the system and algorithm or method. The user can input one or more phrases each within double quotes (or other identifiers) and the engine will capture and store each of them as a multi-word concept. In one embodiment, multi-word concepts are given as much weight or weightage as a proper noun for part-of-speech.
Once seed concepts 480 have been generated (see
Concept groupings 600 are advantageously stored in a seed relationship relational database table as exemplified by Table 3. Since concept groupings are generally user-specific, the user ID 56 is stored along with a global seed concept ID 42, a related concept id 50, and the type of relationship 52. A status flag or indicator 54 also may be stored, allowing the user to activate or deactivate specific relationships. Providing this relational database table advantageously allows the system to utilize these concept groupings for multiple users while maintaining the ability of individual users to modify and customize the groupings.
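A seed relationship table of this kind might be laid out as sketched below. This is only one possible arrangement, assuming SQLite; the table name `seed_relationships`, the column names, the status values, and the helper functions are hypothetical. The example shows how one user can deactivate a relationship without affecting another user's grouping.

```python
import sqlite3


def create_grouping_table(conn):
    """Create a hypothetical per-user seed relationship table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS seed_relationships (
            user_id            INTEGER NOT NULL,  -- user ID 56
            seed_concept_id    INTEGER NOT NULL,  -- global seed concept ID 42
            related_concept_id INTEGER NOT NULL,  -- related concept id 50
            relationship_type  TEXT NOT NULL,     -- type of relationship 52
            status             TEXT NOT NULL DEFAULT 'active',  -- status flag 54
            PRIMARY KEY (user_id, seed_concept_id, related_concept_id)
        )
        """
    )


def add_relationship(conn, user_id, seed_id, related_id, rel_type):
    """Link a related concept to a seed concept for one user."""
    conn.execute(
        "INSERT INTO seed_relationships "
        "(user_id, seed_concept_id, related_concept_id, relationship_type) "
        "VALUES (?, ?, ?, ?)",
        (user_id, seed_id, related_id, rel_type),
    )


def set_active(conn, user_id, seed_id, related_id, active):
    """Activate or deactivate one relationship for one user."""
    conn.execute(
        "UPDATE seed_relationships SET status = ? "
        "WHERE user_id = ? AND seed_concept_id = ? AND related_concept_id = ?",
        ("active" if active else "inactive", user_id, seed_id, related_id),
    )


def active_related(conn, user_id, seed_id):
    """Return the related concept IDs currently active for this user."""
    rows = conn.execute(
        "SELECT related_concept_id FROM seed_relationships "
        "WHERE user_id = ? AND seed_concept_id = ? AND status = 'active' "
        "ORDER BY related_concept_id",
        (user_id, seed_id),
    ).fetchall()
    return [r[0] for r in rows]
```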
It should be noted that the seed concepts themselves can be interrelated. For example, there may be two seed concepts “bug” and “insect” and they have the same meaning. The engine scans the database looking for relationships among individual seed concepts. This is done by taking an individual seed concept and looking for the existence of related concepts in Table 2. The relationship is established again using thesaurus look-up. For example, in this case, bug has meaning of insect and when insect appears in Table 2, a concept grouping entry will be created by linking “bug” at level 1 with “insect” in Table 3. Thus concepts having similar meanings, synonyms, inflections and related words would be linked.
In the embodiment illustrated in
The indexing procedure 390 is described further in
The total number of occurrences of a concept within an object or some measure or metric derived from such total is stored in a cross-reference relational database table exemplified by Table 4 below. This table preferably includes the global object ID 56 (as indexing is desirably independent of user), the concept ID 50, number of hits 58, and location of the concept 60 within the object. Additionally, an index start time 62 and cross-reference time 64 are included to keep a block of cross-references for an object together and to enable later search capabilities. Advantageously, a cross-reference entry is made for each concept.
The Term-Frequency Inverse Document Frequency or Tf-Idf technique is well-known in the art, and is a technique which represents an object as a vector of weighted terms. TF denotes term-frequency and IDF denotes inverse-document-frequency. Terms that appear frequently in one document, but rarely in other documents are more likely to be relevant to the topic of the document. Therefore, the TF-IDF weight of a term in one document is the product of its term-frequency (TF) and the inverse of its document frequency (IDF). In addition the weighted term vectors are used and are normalized to unit length to prevent lengthier documents from having a better chance of retrieval due only or primarily to their length. A standard information retrieval weighting mechanism is:
w = Hc * Tf * idf_k
where w is a weight of a word or phrase in a document, Hc is a header constant, Tf is a frequency of the word or phrase in the current document and idf_k is defined as:
idf_k = log(N / df_k)
where N is the total number of documents already retrieved by the system, and df_k is the document frequency of any given term, for example, the k-th term. The header constant is utilized in the present invention differently from its standard usage in that the inventive system and method use the term to reflect the position of the concept in the object and its part of speech.
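The weighting formula above can be written out directly. The sketch below assumes a natural logarithm (the text does not state the base) and treats the header constant as a plain number supplied by the caller; the function names are illustrative only. Unit-length normalization of the resulting vector is included because the text mentions it.

```python
import math


def idf(total_docs, doc_freq):
    """Inverse document frequency: log(N / df_k), natural log assumed."""
    return math.log(total_docs / doc_freq)


def concept_weight(header_const, term_freq, total_docs, doc_freq):
    """Weight w = Hc * Tf * idf_k for one word or phrase in one document."""
    return header_const * term_freq * idf(total_docs, doc_freq)


def unit_normalize(weights):
    """Scale a {term: weight} vector to unit length so long documents gain no edge."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}
```

A term that appears in every document (df_k equal to N) receives a weight of zero, which matches the intuition that such a term says little about any one document's topic.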
In addition, the inventive system and method differ from the standard Tf-Idf technique in that they look beyond synonyms, related words, and definition words by using the concept groupings that have already been built and which are described in greater detail elsewhere in this description. The concept groupings advantageously have four levels, spanning synonyms (Level I), related words (Level II), meaning words (Level III), and user specific input (Level IV) that are utilized to reduce the dimensionality of the vector representation. Embodiments of the system and method may provide for only a subset of these levels or may provide additional levels. Reduction of the vector dimensionality is an advantage of the invention for several reasons, including but not limited to providing a more accurate and user-specific representation of the object.
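One way to picture the dimensionality reduction is to fold every term into the seed concept that heads its grouping. The sketch below is an assumption-laden illustration: the grouping data, the choice to sum the weights of folded terms, and the choice to drop terms that belong to no grouping are all hypothetical and are not stated in the text.

```python
from collections import defaultdict


def build_lookup(groupings, active_levels=(1, 2, 3, 4)):
    """Map each concept to its seed concept, honouring only the chosen levels.

    groupings: {seed concept: {related concept: level number 1..4}}
    """
    lookup = {}
    for seed, related in groupings.items():
        lookup[seed] = seed
        for concept, level in related.items():
            if level in active_levels:
                lookup.setdefault(concept, seed)
    return lookup


def reduce_vector(term_weights, lookup):
    """Collapse a {term: weight} vector onto seed-concept dimensions."""
    reduced = defaultdict(float)
    for term, weight in term_weights.items():
        seed = lookup.get(term)
        if seed is not None:          # terms outside every grouping are dropped
            reduced[seed] += weight   # assumed: weights of folded terms add up
    return dict(reduced)
```

With a grouping that places "bug" and "beetle" under the seed concept "insect", a vector with three separate term dimensions collapses to a single "insect" dimension.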
Once the object has been converted into a relevant reduced dimensionality vector, where the primary dimensions of the vector space are seed concepts occurring in that document, the analysis and categorization engine 200 selects a set of these dimensions, or seed concepts, that are or correspond to key concepts that are most representative of the object (
Assuming that the number of words or phrases in a given object is, on average, a large integer, the central limit theorem implies that the total number of occurrences of concepts derived from the object can be approximated by a standard normal distribution.
As shown in
Key concepts are seed concepts that are selected to represent the object. In a symmetrical normal distribution, Key concepts have a weight closer to the mean 22 than some distribution lower limit (or low-bound) 27, and further from the mean 22 than some upper limit (or high-bound) 29. A concept whose weight falls further from the mean than low-bound is deemed to make an insignificant contribution to the concept of an object. A concept whose weight falls closer to the mean than high-bound occurs very frequently and thus contributes little to inherent meaning of the object. These criteria are broadly based on Claude Shannon's information theory, which states in general terms that the more frequently an information pattern occurs, the less its intrinsic value. Low- and high-limits can be modified by the user, and are advantageously expressed as some multiple of the standard deviation.
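The selection rule can be sketched as a band around the mean, with both limits expressed as multiples of the standard deviation. The sketch below assumes the band is applied symmetrically on both sides of the mean, and the default limits of 0.5 and 2.0 standard deviations are hypothetical values, not values from the text.

```python
import statistics


def select_key_concepts(weights, high_bound=0.5, low_bound=2.0):
    """Keep seed concepts whose weight lies between the two limits.

    weights:    {seed concept: weight} for one object
    high_bound: concepts closer to the mean than this many standard
                deviations are treated as too common and dropped
    low_bound:  concepts further from the mean than this many standard
                deviations are treated as insignificant and dropped
    """
    if not weights:
        return {}
    values = list(weights.values())
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:                     # all weights equal: no band to apply
        return {}
    selected = {}
    for concept, weight in weights.items():
        distance = abs(weight - mean) / sigma
        if high_bound <= distance <= low_bound:
            selected[concept] = weight
    return selected
```

Narrowing the band (raising `high_bound` or lowering `low_bound`) keeps fewer concepts per object, which is how a user-adjustable range could trade recall against precision.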
Key concepts are advantageously stored as structured information in an open architecture format, such as a relational database table. As the same object can be used by multiple users in different ways, in order to provide a way for an object to be classified in a user specific way, objects are given a user object ID 66 or identification. This ensures that the same object can be categorized in multiple ways without duplicating the object and its contents every time it needs to be categorized for a user and for a view (a view may be defined by the user or the system, but may typically be a logical grouping of objects as specified by the user). User object IDs 66 are preferably a number between 0 and 2,147,483,647 but may be in different ranges. Utilizing a user object ID 66, as opposed to a global object ID 30 in this captured concept relational database table allows different users to store different vector representations of the same object. The key concept ID 42 for each key concept identified for the object is stored. The probability 68 associated with each key concept id 42, as determined from the Gaussian distribution, is stored. The probability 68 is preferably stored as a floating point number between 0 and 1 but may be scaled to other number-ranges, formats, or representations, such as an integer representation between 0 and 9,999,999,999 or any other convenient range providing sufficient resolution or precision for the task or query. The rank 70 of each key concept is stored. A rank of one preferably indicates that key concept had the highest probability of representing that object, while a rank of 3, for example, indicates the key concept had a lower probability than two other concepts, and so on. An exemplary embodiment of such a captured concepts relational database table is shown as Table 5.
In one embodiment, a conditional probabilistic method is advantageously used for classification to determine whether an object belongs to a given category. Referring back to
The combined ratio R=Ri*Rc is multiplied by the object score (the probability of the key concept) 68 for final classification into categories. As each individual factor of the product is less than or equal to one, the combined score will always be greater than or equal to zero and less than or equal to one. In one embodiment, the probabilities are combined by averaging. For each category, the combined scores R of all key concepts appearing in both the category and the object are summed, and the total is divided by the total number of key concepts appearing in the object to give Rs. In order to give equal weight to categories with fewer descriptive concepts vis-à-vis those with more descriptive concepts, we define a category normalization ratio (Rn). This category normalization ratio is defined as the ratio of the total number of concepts that occur in both the category and the object to the total number of concepts in the category. The final object score 74 is then Rs*Rn. Note that the object score as used here will always be greater than or equal to zero and less than or equal to one. Thus it can be represented as a percentage for convenience. Other mathematical objects or processes may be used to assign a score to the categories, particularly modifications to straight averaging.
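The averaging embodiment described above can be sketched as follows. This is a minimal sketch under stated assumptions: the combined ratio R=Ri*Rc is taken as already computed for each concept in the category, and the function name and dictionary inputs are hypothetical.

```python
def category_score(object_probs, category_ratios):
    """Return the object score (Rs * Rn) of one object for one category.

    object_probs:    dict key concept -> probability 68 (0..1) for the object.
    category_ratios: dict concept -> combined ratio R = Ri * Rc (0..1),
                     assumed to be precomputed for the category.
    """
    if not object_probs or not category_ratios:
        return 0.0
    # Key concepts that appear in both the object and the category.
    common = [c for c in object_probs if c in category_ratios]
    # Rs: sum of combined scores over the common concepts, divided by the
    # total number of key concepts appearing in the object.
    rs = sum(category_ratios[c] * object_probs[c] for c in common) / len(object_probs)
    # Rn: common concepts over the total number of concepts in the category.
    rn = len(common) / len(category_ratios)
    return rs * rn
```

Since every factor lies between zero and one, the returned value also lies between zero and one and can be shown as a percentage.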
The use of standard normal distribution to capture central theme or idea helps in the manner described as follows:
Objects are assigned to categories having a score greater than a threshold value, such as 25% (step 430). The threshold value is a percentage and can have a value between 0 and 100. It is determined or set by the user based on several characteristics of the corpus of objects. These characteristics include features such as whether the corpus has objects with similar contents, whether a single object can have multiple themes (for example, as in a news feed), and the like. In general, it is observed that for objects with multiple themes a lower threshold value such as 25% (or an equivalent fraction) is needed, whereas for objects with a single theme the threshold can be higher, such as 40%. The more objects are input to the engine, the more accurate the engine becomes, and thus a large volume of input objects implies a lower threshold value as well. For example, threshold values in the range of 25% to 35% may typically be encountered, but are not limited to this range. More particularly, the threshold value range may have any upper and lower bound. It is noted that each user may have different categories, concepts, and/or concept groupings, as is true also for groups or organizations. Thus, the category to which an object is assigned may be different for different users (or groups or organizations).
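As a brief illustration of the assignment step, the following sketch keeps the categories whose score exceeds a user-set percentage threshold. The function name and the representation of scores as fractions between 0 and 1 are assumptions of this sketch.

```python
def assign_categories(scores, threshold_pct=25.0):
    """Return the names of categories whose score exceeds the threshold.

    scores:        dict category name -> object score as a fraction (0..1).
    threshold_pct: user-set threshold, expressed as a percentage (0..100).
    """
    return sorted(
        name for name, score in scores.items() if score * 100.0 > threshold_pct
    )
```

An object may be returned under several categories, or under none, depending on the threshold chosen for the corpus.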
Output from the analysis and categorization engine is advantageously stored in a user object relational database table, such as, for example, a relational database table illustrated in Table 6. Table 6 includes the user ID 56, user object ID 66, and global object ID 30 as well as user object hierarchy pointer 72. The user object hierarchy pointers 72 indicate the parent, or category, ID to which the object belongs and the relative location of the object pointer which indicates an ordering of objects as provided to the analysis and categorization engine. The score 74 for the object under that category is also stored. A status 76 is also provided to enable the display of the objects in a manner desirable to a user, for example, the categories may appear in a user interface as a folder and these folders may appear open or shut. Status 76 may also indicate that the object has been deleted or is active. One object can belong to more than one category, and thus can have more than one entry in this table.
The above remarks have focused on the analysis and categorization engine 200 provided by the present invention to deduce the theme, or meaning, of unstructured information and store output as structured information 230 in an open architecture format. We now turn to aspects of the present invention that further provide interface tools for viewing and analyzing unstructured information based on the categorization data collected and stored via the analysis and categorization engine. These tools enable intelligent views of unstructured information, the ability to view trends in a group of unstructured objects, and the ability to execute object concept based searches.
The inventive system and method advantageously provide and utilize an object concept based search utilizing the structured information 230 generated by the analysis and categorization engine 200. An embodiment of this object concept based search process 700 is outlined in
The objects returned as results for the object concept based search are then scored according to the following algorithm. The scores for the individual key concepts that contributed to the search are averaged for each object returned. If the search was performed using a combination of key concepts and seed concepts, the number of hits for the searched seed concepts is then divided by the total number of hits for all seed concepts in the document to determine how much the seed concepts actually contributed to the concept of the document. This figure is then averaged with the average score for the key concepts to arrive at a relevancy score for the object as it pertains to this particular search.
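One possible reading of the scoring algorithm above is sketched below. The function and parameter names are hypothetical, and the interpretation of "added and averaged" as the mean of the two figures is an assumption of this sketch.

```python
def relevancy_score(key_scores, matched_seed_hits=None, total_seed_hits=0):
    """Return a relevancy score for one object returned by a search.

    key_scores:        scores of the key concepts that contributed to the search.
    matched_seed_hits: hits in the object for the seed concepts used in the
                       search, or None if no seed concepts were used.
    total_seed_hits:   hits in the object for all of its seed concepts.
    """
    key_avg = sum(key_scores) / len(key_scores) if key_scores else 0.0
    if matched_seed_hits is None or total_seed_hits <= 0:
        return key_avg  # key-concept-only search
    # Share of the object's seed-concept hits attributable to the search.
    seed_share = matched_seed_hits / total_seed_hits
    # Assumption: the two figures are combined as a simple mean.
    return (key_avg + seed_share) / 2.0
```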
If the captured concept is not contained in the relational database tables, the search engine optionally performs a keyword search and phrase matching, directly accessing the unstructured information (step 730). In addition, the search text is passed to the analysis and categorization engine (step 740). The engine can re-capture the object concepts and update the relational database tables (step 750).
The process then comprises capturing search text 220, and parsing the search text as individual words and phrases. The words within double quotes are considered as phrases, even though this definition of phrase can vary. It then uses the seed concepts extrapolation procedure to produce concept groupings 530 as depicted in
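The parsing step described above, in which words within double quotes are treated as phrases, might be sketched as follows. The function name and return shape are assumptions of this sketch, and the seed concepts extrapolation procedure itself is not shown.

```python
import re


def parse_search_text(text):
    """Split search text into individual words and double-quoted phrases.

    Returns a (words, phrases) tuple. Text inside a balanced pair of double
    quotes is treated as one phrase; everything else is split on whitespace.
    """
    phrases = [p.strip() for p in re.findall(r'"([^"]*)"', text)]
    phrases = [p for p in phrases if p]
    # Remove the quoted spans, then split what is left into single words.
    remainder = re.sub(r'"[^"]*"', " ", text)
    words = [w.strip('"') for w in remainder.split()]
    words = [w for w in words if w]
    return words, phrases
```

An unbalanced quotation mark is simply dropped in this sketch, so the affected word is treated as an ordinary word.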
As any user search continuously refines Table 4 and Table 5, the captured object concepts continue to get more accurate and thus can anticipate user search needs. Thus over time, the system can meet the user concept search needs with accuracy in step 720. The next time a user enters a similar phrase, the concepts look-up would contain the relevant information.
A graphical user interface advantageously provided by the inventive system provides a dynamic matrix view of concepts and their occurrence within unstructured objects. Concepts (42) are advantageously displayed versus object description 34 in a matrix, or spreadsheet, format. This assists a user in quickly determining an object or objects of interest. A user can choose concepts 42 to add or remove from this view and can compare concepts within the view. The provided view is personalized, that is, the view provided for a first user viewing a first set of unstructured objects may be different from the view provided for a second user viewing the first set of unstructured objects. Additionally, the provided view is integrated, that is, unstructured and structured objects may be advantageously displayed together in the same format. The user interface may advantageously display categories as graphical folders that can be displayed as open or shut, as dictated by their status 76, to reveal or hide the unstructured objects assigned to the category.
In another aspect of the inventive system, analysis tools are advantageously provided that present concepts captured by the analysis and categorization engine in a variety of ways: based on number of hits 58, score 74, alphabetical order, strength or weakness of concept scores 68 or other ways based on the structured information generated by the analysis and categorization engine. Additionally, the presence of concepts in objects over time is advantageously tracked, analyzed, and optionally displayed in table or graph formats. This tracking is facilitated by the cross-reference time stamp 64 and index start time 62. As the analysis and categorization engine updates the structured information, changing scores 74 of concepts in a category, for example, can be advantageously displayed.
The system advantageously provides for object sharing between two or more users, allowing multiple users to view or analyze a set of shared objects. This functionality is advantageously provided by an automatic share agent that is configured to publish information to one or more users. In order for the sharing to occur, there need to be multiple “views” of the same information. A view is a logical grouping of objects tailored to fit a user's specific access needs. Every view has a user id and can have all the functionality associated with a user. Each of the multiple views accessing the same object has its own user object identifier that links an object id to a specific user/view. Thus it is possible in this embodiment or design for multiple users or views to have access to the same object. As captured and refined concepts and categories can vary by user and/or view, it is possible for the same user object to be categorized and analyzed differently in every view.
The user/view has to specify through an interface which categories need to be shared with other users/views. This has to be done once for all the categories that need to be shared. Then, as soon as an object is classified under a user or view, the category under which the object is classified is examined to see whether it is to be shared and, if so, with which targeted user or view. The user object is then reclassified for the targeted user or view. If the object (pointed to by the user object id) already exists under a category, then the object will not be classified again. If the category (or categories) under which a user object gets classified for a targeted user or view is itself shared, then the object will be shared in turn based on the targeted user's or view's sharing setup. This process thus creates a dynamic flow of objects in the network of users or views without duplication of objects, as only user object ids that point to the object id are created each time.
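The flow of a shared object through a network of views might be sketched as follows. This is a simplified illustration rather than the share agent itself: the `classify` callback stands in for the analysis and categorization engine, the rule and placement structures are hypothetical, and the check against reclassification is applied per view rather than per category.

```python
from collections import deque


def propagate_share(object_id, start_view, classify, share_rules, placements):
    """Classify an object in a view and pass it on to the views it is shared with.

    classify:    callable (view, object_id) -> iterable of category names
                 (hypothetical stand-in for the categorization engine).
    share_rules: dict (view, category) -> list of target views.
    placements:  dict view -> dict object_id -> set of categories; updated in
                 place. Only references to the object id are stored, so the
                 object itself is never duplicated.
    """
    queue = deque([start_view])
    while queue:
        view = queue.popleft()
        view_objects = placements.setdefault(view, {})
        if object_id in view_objects:
            continue  # already classified for this view; do not classify again
        categories = set(classify(view, object_id))
        view_objects[object_id] = categories
        for category in categories:
            # A shared category forwards the object to each targeted view.
            for target in share_rules.get((view, category), []):
                queue.append(target)
    return placements
```

Skipping views that already hold the object also keeps the propagation from looping when two views share with each other.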
Views are advantageously useful for more than just sharing. Views facilitate multidimensional analysis of unstructured information. For example, we can share a view on Golf (View I) with another view created on Tiger Woods (View II). Now the contents of View II will cover Golf and Tiger Woods. We can take that information and share it with another view (View III) on Vijay Singh. Then that view will have information only on Golf, Tiger Woods, and Vijay Singh. We can then share the contents of View III with View IV on Chip Shots. Then the contents of View IV would be Golf, Tiger Woods, Vijay Singh and Chip Shots. This way we can drill down on unstructured data along multiple dimensions. Once the views are set up, the information will continue to flow and be updated.
In addition to this, the analysis and classification engine allows the users to analyze concepts in objects, track trends and exceptions. Analysis of concepts is useful when we
Look to discover trends in unstructured information
We have collected Innovation magazine articles in text format from 1996. The subscription is weekly and we are browsing through articles in 1996 looking for articles of interest. As we are not sure of what we are looking for, it is very difficult to represent a condition like this in a text box for search or through a complex set of user specifications captured from a user interface.
Prior art approaches currently used to address these problems include:
(1) Keyword and concept based searches are currently used to partially solve the problem. These technologies work best when the user knows exactly what words to try to find. These technologies are not very helpful if the user does not know the subject matter well and is unable to ask the right questions or search for the right information.
(2) A Boolean search phrased correctly might solve the problem, if the user knows exactly what they are looking for. However, the disadvantage of this method is that it is very difficult to perform a Boolean concept based search, where we would be looking for concepts and not just keywords.
(3) A hierarchical classification of objects would be another solution such as a taxonomy hierarchy. But in this approach once we choose a particular hierarchical path, we preclude choosing other hierarchical paths. This is because of the inherently implied “one-to-many” structure in hierarchical approach. This might result in us not getting at the right information. On the other hand, a relational approach will allow not only “one-to-many” but also “many-to-many” relationships to be resolved and seen as well. We can liken this to the difference between the hierarchical and relational database technologies.
The real solution to the above problems lies in analysis of unstructured data, which takes off from where search and classification leave off. This analysis based approach mimics and improves upon the ability of cognitive processes to analyze multiple viewpoints simultaneously. What if we were able to extract key ideas and concepts automatically and bubble them up, like a magazine's table of contents, in an analytical format to help the user see at a glance what topics are covered? In much the same way, the analysis and categorization engine is able to provide an extract of concepts representing a collection of unstructured data. Without any pre-specified condition or criteria, the user is able to understand the contents and narrow down to articles that have a combination of ideas/concepts. The system also uses prior learning (extracted and user defined concepts) in identifying ideas/concepts. Thus the extracted concepts are user/view specific.
In
Suppose we are looking for an article of interest and are not sure about its exact nature. It may have something to do with high-tech and education, but if these do not appear it may have something to do with laser and/or surgery. Requests of this kind are complex to represent in a search engine. Sometimes it is as important to know when a concept or idea is absent as to know when it is present. There is value to analyzing large volumes of data as shown in
Whereas the search technologies are hypothesis driven, i.e., fetch result objects based on certain input criteria, the technology here is data driven i.e., letting the data provide its content back in an analytical relational format and without a pre-imposed hierarchy.
The analysis and classification engine generates a cross reference of objects and the embedded concepts automatically. Access, modification and tracking of this cross reference is possible through analysis system components and procedures or function (
In the above referenced patent application U.S. patent application Ser. No. 10/087,053 entitled Concept-Based Method and System For Dynamically Analyzing Unstructured Information, there were described various ways of extracting concepts from objects based on user access and profile. The extracted concepts from the objects are available by view and/or user through dynamic viewer 1010. Whenever the user needs to review the distribution of concepts in the selected objects in different orders of preference, they can. In one embodiment, the concepts can be displayed in alphabetical order, number of hits or strength of concepts. The order can be descending or ascending. In this embodiment, the user can choose and view a combination of concepts that have been identified to represent the collection of unstructured objects. Screen shot in
Another important aspect of representing unstructured information in a relational format is capturing trends and discovering new patterns and alerting exceptions that are not apparent. The exceptions monitor 1020 can monitor the changes to concepts based on exception conditions. Through a user interface 1015, the user can set thresholds to capture exceptions. Concepts that exceed a pre-specified threshold within a category will be marked for exception reporting. The threshold can for example be:
Advantageously, in one embodiment, a relational table has been defined to capture the exceptions settings. The Table (Table 7) includes the User id 56, whether the exception being tracked 75 is a percentage or count, a real number to quantify the exception (76) and a status flag (77) to specify whether the request is active or inactive.
For exceptions and history tracking to occur, the analysis and categorization engine must retain the history of indexing changes. We discussed earlier the vector representation of the object 902, which has a number of dimensions, where each dimension represents a concept and has a weight. When indexing the target object (step 390), the process monitors whether dimensions have been added or deleted or whether their weights have changed. The original dimensions and weights of the vector before the changes are saved in a relational database history table, Table 8. The structure of Table 8 is as given below:
The exceptions monitor 1020 (
At a pre-specified frequency set by the user, the exceptions monitor 1020 goes through every entry in Table 7 and determines if there are corresponding concepts in Table 8 for that user that exceed the threshold set by the user. If so, those concepts will be marked as exceptions by the monitor 1020. The details of the comparison process are as follows:
By joining Table 8 with Table 6 based on user id 56 and object id 30, we can get a cross reference of object concept combinations that have changed. For every entry that has changed in the cross reference set, we look up the corresponding entry in Table 4. To consider changes to a concept id in its entirety, we need to consider the concept and all its active concept groupings 600 as in Table 3. This can be done by setting the object id 30, the concept id 50 and its groupings in Table 3, and the cross reference type 60 for every entry in the retrieved cross reference set. If the difference between the total of the hits for all the concepts in Table 4 and the corresponding hits for all the concepts in Table 8 is greater than the threshold exception number 76, then the corresponding concept is flagged as an exception for the user. This is captured in Table 9. Table 9 has the user id 56, the concept id 50, and an exception time stamp 78 indicating the time the exception was spotted.
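The comparison described above might be sketched as follows, with in-memory dictionaries standing in for Tables 3, 4, and 8. The function and parameter names are hypothetical, and the handling of a percentage-type exception (the change relative to the historical total) is an assumption of this sketch rather than something the text specifies.

```python
def find_exceptions(current_hits, previous_hits, groupings, threshold,
                    as_percentage=False):
    """Return the concept ids whose change in hits exceeds the threshold.

    current_hits:  dict concept id -> hits now (stand-in for Table 4).
    previous_hits: dict concept id -> hits before the change (stand-in for Table 8).
    groupings:     dict concept id -> list of active grouped concept ids
                   (stand-in for Table 3).
    threshold:     the exception number 76 set by the user.
    as_percentage: whether the threshold is a percentage rather than a count.
    """
    flagged = []
    for concept, members in groupings.items():
        ids = [concept, *members]  # the concept together with its groupings
        now = sum(current_hits.get(i, 0) for i in ids)
        before = sum(previous_hits.get(i, 0) for i in ids)
        change = now - before
        if as_percentage:
            if before == 0:
                continue  # assumption: skip when a percentage is undefined
            change = 100.0 * change / before
        if change > threshold:
            flagged.append(concept)
    return flagged
```

Each flagged concept would then be recorded, with the user id and a time stamp, in a table such as Table 9.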
In one embodiment, the User Alert function 1030 highlights in red the categories in which the exception concepts occur. This visual cue helps the user to analyze the contents of the category looking for exceptions. The user can sort the concepts available in the category by exceptions. This allows them to act on exceptions first.
This historical data collection can be viewed through one of the available analysis view screens such as that depicted in
In one embodiment, first the user analyzes a category by viewing the concepts through a user interface as in
At the intervals specified by the analyst conditions 1040, the process selects the tracking ids that satisfy the timing criteria. For every entry that satisfies the criteria as of time ti, the process determines the distribution of concept ci as hit count hi. The sum over all concepts ci within a category (note that a category is a special user object that is a parent of a collection of objects) as indicated by the user object id oi is hoi=Sum(hi) for all ci under oi. This yields the 3-tuple <ti, oi, hoi>, which can vary for every observation ti and is stored in snapshot container 1060. In one embodiment, the snapshot container 1060 is implemented in tabular format as in Table 11. For every tracking id and user id combination, the concept id that exists in the user objects that belong to the category (user object id 66), the total number of hits (58), and a time stamp (85) indicating the time of access are stored. For the category (66) itself, the concept id (50) is set to zero and the hit count (58) is hoi.
Table 11 will also contain summary information for the participating concepts within the category. This would be the sum of all hit counts (58) of all user objects that have one or more participating concepts within that category.
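A snapshot of this kind might be sketched as follows. The row layout loosely mirrors the fields described for Table 11, but the field names, the tuple type, and the function name are assumptions of this sketch.

```python
from collections import namedtuple

# One row of the snapshot container; concept_id 0 marks the category summary.
SnapshotRow = namedtuple(
    "SnapshotRow",
    ["tracking_id", "user_id", "category_id", "concept_id", "hit_count", "time_stamp"],
)


def take_snapshot(tracking_id, user_id, category_id, concept_hits, time_stamp):
    """Build the snapshot rows for one category at one observation time.

    concept_hits: dict concept id -> hit count hi within the category.
    Returns one row per concept plus a summary row whose concept id is zero
    and whose hit count is hoi, the sum of hi over all concepts.
    """
    rows = [
        SnapshotRow(tracking_id, user_id, category_id, concept_id, hits, time_stamp)
        for concept_id, hits in sorted(concept_hits.items())
    ]
    total = sum(concept_hits.values())  # hoi = Sum(hi)
    rows.append(SnapshotRow(tracking_id, user_id, category_id, 0, total, time_stamp))
    return rows
```

Taking such a snapshot at each observation time ti yields the series of <ti, oi, hoi> tuples from which trends can be read.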
For illustration, we once again look at a subscription to Innovation magazine. We are looking for trends and exceptions in electronically gathered articles of the magazine for the past five years. Among the many concepts that have been extracted from objects automatically and are available to track, we choose education, laser, high-tech, search engines and surgery. We decide to track them in categories 1996, 1997, 1998, 1999, and 2000. The concept education comprises variations of school, teaching, learning, university, etc.; high-tech comprises hi-tech, technology, know-how, etc.; laser and surgery stand by themselves. We now look at the analysis view for these concepts by year (see the chart below). In this case the years are 1996, 1997, 1998, 1999, 2000. In order to account for the varying number of articles by year, we further normalize the sum of normalized scores. This is done by dividing the total normalized score by the total number of documents and multiplying by a large number (in this case 100) to get an integer. The material in
If we take total normalized score for concepts and plot them in a graphical format, then we get the trend information as in the chart (
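The per-year normalization described above might be sketched as follows. The function name and dictionary inputs are hypothetical, and rounding to the nearest integer is an assumption, since the text says only that the result is an integer.

```python
def normalized_trend(total_scores, doc_counts, scale=100):
    """Normalize each year's total concept score by that year's document count.

    total_scores: dict year -> total normalized score for a concept.
    doc_counts:   dict year -> number of documents in that year's category.
    scale:        multiplier applied before rounding (100 in the text).
    """
    trend = {}
    for year, score in total_scores.items():
        docs = doc_counts.get(year, 0)
        # Years with no documents are reported as zero rather than dividing by zero.
        trend[year] = round(score * scale / docs) if docs else 0
    return trend
```

The resulting values are comparable across years with different numbers of articles and can be plotted directly.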
As presented by Susan M. Grotevant in the paper The Power of Multidimensional Analysis (OLAP) in Higher Education Enterprise Reporting Strategies, at the College and University Information Services Conference in 1999, a system supporting multidimensional analysis (Online Analytical Processing, or OLAP) satisfies the following three requirements:
The analysis and categorization engine extracts data and makes it available in such a way that the above three requirements can be met. Through Table 11, the user has access to the profile of concepts in targeted categories over time. The details of concept distribution within individual objects are kept as well. Thus, we provide structured access to concept data at the object level. Also, as explained before, using views and classification techniques by view, the same information can be accessed in multiple ways. Together, these three features offer the drill-down capability that is important for an OLAP system.
The user/view can track concepts and their distributions in objects over time or as exceptions occur. Using categories that can be nested within each other, the user can refine the level of information access. With the concept id and object id, the user can get to the available details of concepts and objects in a structured format. These three features give us the ability to slice and dice the captured information based on object and concept characteristics.
The output data is in a structured format that can be easily imported to commercially available graphical tools in the market place. As explained earlier, the dynamic viewer 1010 provides additional ability to analyze multiple concepts simultaneously and helps the user identify combination of ideas in unique ways. Thus we facilitate graphical and visual analysis. In summary, the analysis and categorization engine can provide sufficient information for multidimensional (OLAP) analysis of unstructured objects.
Workers skilled in the art will appreciate that, in light of the description, a variety of interfaces can be provided for a user to view, and understand the meaning of, unstructured objects based on the structured information generated by the analysis and categorization engine.
Having now described a variety of structures and techniques from a theoretical and user interface perspective, attention is now directed to some of the underlying methods, procedures, algorithms, and structural aspects of the inventive features. While the invention has aspects too numerous to repeat again here, the following paragraphs focus attention on a specific set of inventive embodiments and features that have particular utility and advantage. Merely for the purpose of organization and the reader's convenience of reference, the following sections contain headings that identify the general subject matter of the paragraphs that fall under that section, but such headers are not and should not be intended to limit the applicability of the description to any particular inventive feature, as many aspects of many inventive features are described throughout these sections and throughout the specification generally. It is also noted in this context that various features that are described in the context of one feature, such as a procedural step in an overall method, may also or alternatively be used with a different method or independently. In addition, not all combinations of features are described, though it is clear that many of the features and other aspects of the methods, procedures, objects, and other structures may be used in other permutations and combinations.
Embodiments of a Method for the Analytical Processing of an Unstructured Object
It will therefore be apparent that the invention provides a method for the analytical processing of at least one unstructured object in a dimensioned space such as a unidimensional or multi-dimensional space. In one embodiment the method comprises selecting a dimension and exploring or querying the unstructured object or information item, such as a document including unstructured information (such as unstructured text fields), at at least one level of detail or abstraction, but more typically at multiple levels of detail, granularity, or abstraction, to determine a first query result. The method then comprises selecting at least one concept in the form of a view or angle, or as the term is more generally used in the context of conventional OLAP, a perspective, among a set of concepts at a predefined level of detail or granularity. The unstructured object or objects are then explored or processed analytically using the selected combination of dimension or dimensions and at least one angle or perspective to determine a second query result. This second query result may generally be different from the first query result, and identifies where the corpus has changed to reflect the chosen angle or perspective, given that the angle or perspective is a set of concepts and granular detail.
Another embodiment of this method for analytical processing of at least one unstructured object in multi-dimensional space, includes: for at least a selected dimension of the multi-dimensional space, hierarchically exploring and querying the unstructured object at a plurality of levels of detail or abstraction to determine a first query result; selecting a combination of at least two dimensions of the multi-dimensional analytical processing space; selecting at least one particular concept in the form of a view or perspective among a determined set of concepts at a predefined level of granular detail; and analytically exploring and processing the unstructured objects using the selected combination of dimensions and at least one angle or perspective to determine a second query result different from the first query result that identifies where the corpus has changed to reflect the angle or perspective, wherein the angle or perspective is a set of concepts and optionally one or more relationships with each other, and granular detail.
In another embodiment, this multi-dimensional space may be a one dimensional space or include a plurality of dimensions. That is to say that the dimension of the multi-dimensional space may have a dimension n, where n is an integer in the range {1, 2, 3, . . . N−1, N}.
It will be appreciated that the unstructured object may take many different forms, and may for example be a document created, generated, retrieved, stored, or displayed on any display media, and that further the document may contain text or other character or symbolic information. Such text, character, or other symbolic information is commonly used for communicating thoughts and ideas and is therefore a particularly prevalent form of unstructured object. Even when such text or symbolic information is stored in a database or other data structure, such text, symbols, or characters may be stored in fields or records within the database or data structure in an unstructured form. The unstructured object may also or alternatively be a document stored or displayed on any media containing text other than numbers and without structural codes.
This method for analytical processing of at least one unstructured object in a multi-dimensional space may be applied to a broad category of unstructured or partially unstructured objects, including but not limited to situations where the unstructured objects exist within a structured, partially-structured, or non-structured body of information, but wherein the structure or partial-structure of the body of information does not provide the structural relationship between and among at least a selected one of the plurality of the elements comprising the body of information to satisfy an information acquisition task.
It will be appreciated that the unstructured object may comprise a text field within a database, wherein the database may be a flat database or a relational database, or a combination of flat, relational, or other database or data structure types. More particularly, the unstructured object may comprise a text, character, or symbolic field or record within a relational database, within an Oracle™ relational database, within a Microsoft™ Access relational database, or within any other database. In some embodiments the text or symbols will be letter characters in any language (or their equivalents in Japanese, Chinese, and other non-character representations) forming words, phrases, sentences or otherwise expressing facts, opinions, thoughts, concepts, and the like.
In embodiments there may be some or first structured data or object within an otherwise structured database that is an unstructured object because the first structured data does not provide the necessary complexity of relationships to second structured data or unstructured data in the same structured database or in a different structured database or non-structured database to satisfy the information acquisition task. It will therefore be appreciated that the inventive method is applicable to situations and objects where the unstructured object comprises an object within a structured or partially structured data or object.
There are no limits to what the unstructured object may be and, by way of example but not limitation, the unstructured object may comprise or consist of text, character, or symbolic elements organized (or unorganized) in any manner, such as in a computer readable field or record, printed on paper, accessible from the Internet, or in any other manner. The unstructured object may also or alternatively comprise a sound recording in analog, digital, or printed form; faxes, e-mails, voice converted to text; a graphical depiction of a printed document in printed or in electronic form; a radio frequency signal; a telemetry signal; a land-line, optical, cellular, mobile, or satellite telephone signal at any frequency or an audio, computer display, printed, digitized, or other representation of such telephone signal; a content of an Internet web site or the Uniform Resource Locator (URL) of a website; as well as other items that do not necessarily share a common organization. It will also be appreciated that the unstructured object or objects may comprise a combination of structured objects and unstructured objects. It should also be appreciated that where the unstructured object is an indirect reference to a location storing or otherwise identifying information, data, or information objects, such as a reference to an Internet or world-wide-web URL, the contents of the unstructured object (here the URL) may change over time so that comparisons made over time to the unstructured object will or may change even though the unstructured object (perhaps represented in whole or part by a vector) does not change. From this it follows that where the unstructured object is the Internet, the content of the Internet and its various web sites, links to other web sites, postings, articles, pictures, graphics and all other manner of content changes over time.
In at least one embodiment, what may by some be characterized as structured objects may in fact include objects or data in a flat database or in a relational database that include unstructured object or data fields that do not provide a required level of relationship information to other structured objects or data to satisfy a determined information acquisition need or query. These are as a whole unstructured objects. In other embodiments, the unstructured objects may include objects or data in structured or partially structured databases that do not include at least one structural relationship desired or required to satisfy an information acquisition task or query. These are also as a whole unstructured objects, for which the inventive method and procedures provide advantages for query results, discovery, and exploration.
Having described what an unstructured object may be in general terms, we now provide a list of some examples of information items, data, facts, opinions, predictions, or other items that by way of illustration but not limitation give some sense of the broad spectrum and great variety of things that may be or constitute unstructured objects. The unstructured objects may be selected from the set of information objects consisting of: scientific, agricultural, epidemiological, medical, genetic, chemical, biological, pharmaceutical, corporate, athletic, archeological, bioinformatic, proteomics, geological, physical, intelligence, military, meteorological, oceanographic, genealogical, photographic, fingerprints, electronic signal, astronomical, bibliographical, racial, religious, national origin, political, geo-political, ethnicity, energy generation or consumption, Internet, world wide web, tomographic, photo-reconnaissance, satellite data, geographic information systems (GIS), environmental, economic, currency market, stock market, futures market, international currency market, bond market, any other data or information appropriate for making intelligence, terrorist, national or domestic security or military decisions, and combinations thereof. The inventive method and system clearly have applicability for analyzing vast amounts of unstructured information or unstructured objects that are generated, published, communicated, or that are in any other way identifiable so that they may be analyzed to assess any international, national, domestic, or local security or terrorist threats. More particularly the inventive system and method may be of use to such organizations as the Department of Defense, the Central Intelligence Agency, the Federal Bureau of Investigation, the Department of Homeland Security, and state and local law enforcement organizations for threat assessment applications.
It also has applicability for technology, financial, corporate and business intelligence gathering and analysis.
At least some embodiments of the invention provide for a display both for interaction by the analyst with the system and method and for displaying intermediate results and final results or output. Where display or display media is a component of the inventive system or method, the display or display media may include at least one graphical display type capable of displaying at least two dimensional representations of information objects. In other embodiments, the display media may include at least one graphical display type capable of displaying at least three dimensional representations of information objects. The three dimensional display may be accomplished using an inherently two-dimensional display, such as a printer, CRT display screen, or LCD display screen that simulates a three-dimensional representation using shading or some distance perspective rendering technique. Alternatively or in addition, stereographic techniques with some form of left-right vision or eye selection may be used to provide an actual three-dimensional stereo display. Such display may be interactive. Holographic and heads-up type displays may also or alternatively be used by a user or analyst interacting with the inventive system or using the inventive method or computer program or computer program product implementing the inventive methods or its component procedures.
In at least one of the embodiments of the method for analytical processing of at least one unstructured object in a multi-dimensional space, the determined set of concepts may be predetermined; or may be determined or established by a rule, policy, or algorithm; or may be dynamically determined.
The at least one view, angle or perspective may be at a predefined, dynamically defined, or user defined, level of granular detail. The at least one angle or perspective, at a predefined or user defined level of granular detail, may be automatically chosen based on the most frequent collocation occurrences of all the concepts or on the appearance of new concepts for exception tracking in the multidimensional space defined by a set of concepts and their granular detail. Alternatively or in addition, a level of granular detail may be determined by at least one of timestamp indexed, number of hits, cross reference type, object type, and combinations thereof.
A higher level of detail may generally be a superset of all the lower levels and may include normalized summarizations; and a lower level of detail may generally provide a more specific definition in meaning, or a more narrowed delineation in size, time, or physical distance on a document or object, that is not provided in a higher level of detail. Furthermore, in some embodiments, the summarizations are selected from the set of sums and counts comprising: sum of number of objects in which a concept occurs, sum of number of times a concept occurs, count of number of times a concept occurs, count of total number of objects, and sum of size of all the objects; and wherein these sums and summarizations are used in varied combinations that are not included in a lower level of detail.
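By way of illustration only, and not limitation, the summarizations listed above might be computed over a small corpus as in the following Python sketch. The function name `summarize`, the dictionary layout of each object (a `size` and a `hits` mapping), and the sample data are assumptions introduced for this example and are not taken from the described system.

```python
from collections import Counter

def summarize(objects):
    """Compute corpus-level summarizations from per-object concept hit counts.

    Each object is assumed (for illustration) to be a dict with a "size"
    and a "hits" mapping of concept -> number of occurrences.
    """
    objects_per_concept = Counter()  # number of objects in which a concept occurs
    hits_per_concept = Counter()     # number of times a concept occurs overall
    for obj in objects:
        for concept, n in obj["hits"].items():
            if n > 0:
                objects_per_concept[concept] += 1
                hits_per_concept[concept] += n
    return {
        "objects_per_concept": dict(objects_per_concept),
        "hits_per_concept": dict(hits_per_concept),
        "total_objects": len(objects),
        "total_size": sum(obj["size"] for obj in objects),
    }

# Hypothetical two-document corpus.
corpus = [
    {"size": 1200, "hits": {"engine": 3, "brake": 1}},
    {"size": 800, "hits": {"engine": 2}},
]
summary = summarize(corpus)
```

A normalized summarization could then be derived, for example, by dividing a value in `hits_per_concept` by `total_objects` or by `total_size`.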
An unstructured object may include an attribute or variable and the attribute or variable is selected from the set consisting of a size, time, physical distance or combination thereof in an unstructured object.
In some embodiments, the levels of granular details may be abstracted from a corpus using an abstraction procedure to generate a concept or concepts. The abstraction procedure may include but is not limited to utilizing any one or combination of: a single language dictionary, a language translation dictionary, a thesaurus or a plurality of thesauri, codes, documents, letters, books, and combinations thereof in any printed, digital, facsimile, computer, or electronic form, or the like, or combinations thereof.
In at least one embodiment, the abstraction procedure may utilize an initial concept seed or seeds to generate the concept.
The method for analytical processing of at least one unstructured object in a multi-dimensional space may provide that each dimension represents a different level of granular detail, from a high level of granular detail to a low level of granular detail or from a general level of granular detail to a specific level of granular detail, and a particular point of view or perspective, or multiple points of view, multiple perspectives, or multiple angles at different levels of granular detail for each dimension.
Embodiments of the inventive system and method for analytical processing of at least one unstructured object in a multi-dimensional space may provide that each dimension represents a different level of granular detail from a high level of granular detail to a low level of granular detail or from a general level of granular detail to a specific level of granular detail and a particular point of view or perspective. Alternatively, it may provide that the multi-dimensions are selected as a plurality of dimensions selected from the set of dimensions comprising multiple points of view or multiple perspectives or multiple angles at different levels of granular detail for each dimension.
The inventive method supports and is operable in an automated mode, batch processing mode, unsupervised mode, or a like largely non-interactive mode for processing and analysis. However, the system and method support user interaction and yield benefits from an interactive mode. In particular the user or analyst may apply their own intelligence and knowledge to interactively select a perspective and to alter parameters of the query such as the dimension, number of dimensions, seed concepts, or any of the other inputs, variables, vector or vector portions described herein, to interactively refine the interactive session to achieve a satisfactory search result, such as identifying a trend or exception. In some instances, the search or query result will be to verify that there is no significant trend or exception condition occurring.
In this aspect, the method may provide for user or analyst interactive selection of one or more of: selecting a combination of concepts and objects, selecting at least one timestamp, and exploring its associated concepts and objects cross-reference. In other embodiments, the user interaction includes the user selecting a plurality of combinations of concepts. The combination may be automatically or user interactively identified on the basis of identification criteria. Such identification criteria may be selected from the set of identification criteria consisting of: at least one trend in concepts over time, at least one exception to absolute occurrence or a percentage differential from previous occurrence, a plurality of trends, a plurality of exceptions within a concept or a collection of concepts, and combinations thereof.
Either user or analyst interactive processing or automatic processing may result in the display or printing of query results or results derived from the query results, such as trends, exceptions, or the like. The query results or derived results generated by processing the query results may for example, be displayed on at least one display having a display type, and the at least one graphical display type may be selected from the set of graphical display types consisting of cross tabs, pie charts, bar charts, column, row, line, x-y scatter, x-y-z scatter, multi-dimensional scatter, distance metric, area, volume, Venn-diagrams, animations, doughnut, radar, surface, bubble, stock, cylinder, cone, pyramid, and combinations thereof. The media on which the graphical display is presented to a user or used for further user or automated processing may be any display media or device as is known in the art.
Embodiments of a Method for Trend Tracking and Exception Tracking and Monitoring
The inventive system and method provide a method for analytical processing of at least one unstructured object and the results of such processing may be to generate one or more valuable query results. Advantageously, other aspects of the inventive system and method take these query results, or facts, data, or results generated by other systems and methods, to generate and track trends and exceptions. Alarms or other indications may be generated in a variety of ways and forms and communicated to interested parties as programmed into or otherwise provided by the system, method, and computer program.
Some embodiments of the inventive system or method may provide only for trend result generation and tracking of such trend results; other embodiments may provide only for exception result generation and exception monitoring and tracking; still other embodiments of the invention advantageously provide for both or a combination of trend and exception tracking.
The trend results (and/or exception results) and the trends that are tracked or exceptions that are identified, may be or comprise trends in temporal (time), spatial (location or geographic), events occurring or not occurring, and activities that occur or do not occur, and combinations thereof. The temporal trends (or exceptions) or trends in time may for example, be measured or tracked or monitored in elements of seconds, minutes, hours, days, weeks, months, years, decades, or any other duration or increment of time where time or time duration is measured or quantified in either absolute time or relative time units. The spatial trends (or exceptions) may be tracked or monitored in elements of physical location, cities, states, country, continents, physical or spatial coordinates or proximity or distance to such coordinates, global position coordinates or proximity or distance to such coordinates. The trends or exceptions may be tracked for example, in elements of changes in the number of concepts in total or at a given level of granularity, changes in the types of concepts in total or at a given level of granularity, changes in the number and types of concepts, changes in the temporal distribution of concepts, changes in the spatial distribution of concepts, and combinations thereof.
Trend and/or exception tracking and monitoring procedures may advantageously be used for the trend and exception tracking. Advantageously, either a single common procedure may be used that can be modified by appropriate selection of parameters to suit the tracking or exception monitoring task, or dedicated procedures may be utilized for each or a set of trend tracking or exception monitoring tasks. Advantageously the system and method learn and can be tuned or adapted over time and the parameters and techniques stored for later retrieval (without additional tuning or modification) and reused or applied to similar situations and problems.
By way of example, but not limitation, a tracking procedure may be selected from a set of tracking procedures consisting of: creating a history of the number of hits or categories that are defined by a collection of concepts, generating statistical information pertaining to the history of the number of hits of concepts or a collection of concepts, creating a history of the number of occurrences of concepts, and combinations thereof.
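As a hedged illustration of generating statistical information from a history of hits, the following Python sketch summarizes a short history for a single concept. The representation of the history as (period, hit count) pairs, the function name `hit_statistics`, and the particular statistics chosen are assumptions made for this example only.

```python
from statistics import mean, pstdev

def hit_statistics(history):
    """Return simple statistics for a history of (period, hit_count) pairs."""
    counts = [count for _, count in history]
    return {
        "periods": len(counts),
        "total": sum(counts),
        "mean": mean(counts),
        "stdev": pstdev(counts),  # population standard deviation
        "min": min(counts),
        "max": max(counts),
    }

# Hypothetical monthly hit history for one concept.
history = [("2001-01", 4), ("2001-02", 6), ("2001-03", 8)]
stats = hit_statistics(history)
```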
In one particular embodiment of a method for tracking trends (which may be used with or separate from the afore described method for analytical processing of an unstructured object), the trends tracking procedure includes: selecting a vector having a vector dimension, identifying a change in the vector, and storing the change as a trend result, which trend result may itself be a trend but is more typically an intermediate result where a more definite trend is established, identified, or characterized by repeating the selecting, identifying, and storing to identify the trend, or alternatively where the vector is to remain fixed, to repeat only the identification and storing to produce a plurality of trend results to establish a trend. The storage of the trend result may only be transitory and need not be permanent.
In another embodiment of the method or procedure for tracking trends, the method comprises selecting at least one vector in a predetermined vector dimension that includes at least one concept variable having a value (unless the vector has been previously selected or identified to the tracking problem and need not be selected); identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension; and storing as a trend result, the identified change in the predetermined concept variable. As earlier described for another embodiment, the trend may be more readily identified or characterized by repeating the identification and storage for a plurality of the same or different concept variables to generate a plurality of trend results, so as to establish a trend or possibly to show that no particular trend exists.
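The selecting, identifying, and storing just described may be sketched, purely by way of example, as follows. In this Python sketch the vector is assumed to be a mapping from a point on the selected dimension (a date string) to concept hit counts, a missing value is treated as zero to stand in for the null value, and the names `identify_change` and `track_trend` are hypothetical.

```python
def identify_change(vector, concept, first_key, second_key):
    """Return the change in a concept variable between two points on a dimension.

    `vector` maps a point on the selected dimension (here a date string)
    to a mapping of concept -> concept variable value (here a hit count).
    A missing value is treated as 0 (an assumption standing in for null).
    """
    first = vector.get(first_key, {}).get(concept, 0)
    second = vector.get(second_key, {}).get(concept, 0)
    return {"concept": concept, "from": first_key, "to": second_key,
            "first": first, "second": second, "change": second - first}

def track_trend(vector, concept):
    """Repeat identify-and-store over consecutive points to build trend results."""
    keys = sorted(vector)
    return [identify_change(vector, concept, a, b) for a, b in zip(keys, keys[1:])]

# Hypothetical vector over a time dimension.
vector = {
    "2001-01": {"recall": 2},
    "2001-02": {"recall": 5},
    "2001-03": {"recall": 9},
}
trend_results = track_trend(vector, "recall")
```

Each entry of `trend_results` corresponds to one stored trend result; here the storage is a transitory in-memory list, and a plurality of such entries is what would be examined to establish a trend or to show that none exists.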
It will be noted that in some embodiments, the at least one concept variable may have a value that is the null value. It will also be noted that the selected vector dimension may be a dimension selected from the set of dimensions consisting of time, concept, and combinations thereof. Alternatively, the selected dimension is a dimension defined based upon any word, concept, idea, event, symbol, set of symbols, and combinations thereof. Furthermore for any of these embodiments, the vector may be a multi-dimensional vector having dimensionality between 1, . . . , N; and it clearly may therefore be a one-dimensional vector or scalar. The selected dimension may be a dimension selected from the set of dimensions consisting of: a time dimension, another concept, and combinations thereof, though in reality permitting the dimension to be a concept and a multi-dimensional vector to be characterized by a plurality of concepts effectively means that the possible dimensions are practically unlimited.
It will be appreciated in light of the description and details provided elsewhere herein that embodiments may provide that the at least one vector, or any of a plurality of vectors, comprises an unstructured object. Furthermore, the at least one vector comprises a document containing unstructured text. The at least one vector, or any of a plurality of vectors, may include a document containing at least some unstructured text or a combination of structured text and partially structured or unstructured text. The selecting at least one vector (or any one of a plurality of vectors) comprises creating the at least one vector (or any one of the plurality of vectors) in the predetermined vector dimension (or plurality of dimensions) having the at least one concept variable (or variables) and concept variable value (or values).
Therefore described embodiments of the trend tracking procedure may be utilized in a situation when the at least one vector comprises at least a document containing at least some unstructured text or a combination of structured text and partially structured or unstructured text. In this situation, the selecting and identification refer to components of the document.
The inventive trends tracking procedure may be utilized with a variety of different vectors and unstructured or partially unstructured objects. The comparisons, selections, and identifications may be made within the same vector or vector portion or across different vectors or vector portions. These variations are now described.
In one embodiment of the method for tracking trends, the selecting at least one vector comprises selecting a plurality of vectors including a first vector and a second vector; and the identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension, comprises identifying a change in a predetermined concept variable from a first value of the predetermined concept variable in the first vector, to a second value of the predetermined concept variable from the second vector.
In another embodiment of the method for tracking trends, the selecting at least one vector comprises selecting a first portion of the at least one vector and selecting a second portion of the at least one vector; and identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension, comprises identifying a change in a predetermined concept variable from a first value of the predetermined concept variable in the first portion of the first vector, to a second value of the predetermined concept variable from the second portion of the first vector.
In another embodiment of the method for tracking trends, the selecting at least one vector comprises selecting a plurality of vectors including a first vector created at a first time and the first vector at a second time, wherein there may be change or no change in the vector between the first time and the second time; and the identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension, comprises identifying a change in a predetermined concept variable from a first value of the predetermined concept variable in the first vector at the first time, to a second value of the predetermined concept variable from the first vector at the second time.
In any of the afore described methods for tracking trends, the selecting, identifying, and storing the change for at least two (or any plurality) of the vectors comprises identifying a trend. Of course it will be appreciated that the trend may be established when the same vectors or unstructured objects, such as documents, are compared over different time intervals.
In another embodiment of the invention, the method for tracking trends may be modified or altered to provide for exception tracking. In this embodiment, the selecting at least one vector comprises selecting a plurality of vectors including a first vector and a second vector, wherein the second vector comprises a reference vector for use in generating an exception; and the identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension, comprises identifying a change in a predetermined concept variable from a first value of the predetermined concept variable in the first vector, to a reference value of the predetermined concept variable from the second vector; and the method further comprising generating an exception when the first value is within a predetermined magnitude relationship of the reference value. When this occurs an exception condition has been identified and may be accompanied by the issuance of an alarm, email, telephone message, or other communication or action.
There are various different magnitude relationships that may be implemented in connection with this procedure and it will be understood that mathematical and/or Boolean logic or other operations may be performed to manipulate such magnitude relations or reference value yet still provide the desired performance. Such magnitude relationship may therefore include greater than, less than, equal to, less than or equal to, greater than or equal to, the negation of any of these, as well as any Boolean logic operations that may be applied to these relationships, and combinations thereof. Of particular utility is to make the comparison based on a predetermined magnitude relationship of the reference value (or threshold value) as being greater than or equal to the reference value, or simply greater than the reference value. Of course the reference value may be adjusted to reflect a change in the mathematical nature of the comparison. Any of these trend or exception tracking or monitoring procedures may provide that the at least one vector includes at least one concept and an associated concept variable, and the concept variable is susceptible to change over time.
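One possible way to realize the comparison of a first value against a reference value under a selectable magnitude relationship is sketched below in Python, by way of example and not limitation. The relation names in `RELATIONS`, the function `check_exception`, and the sample concept values are assumptions for illustration; only the standard `operator` module functions are relied upon.

```python
import operator

# Magnitude relationships named in the text, mapped to standard comparisons.
RELATIONS = {
    "gt": operator.gt, "lt": operator.lt, "eq": operator.eq,
    "ge": operator.ge, "le": operator.le, "ne": operator.ne,
}

def check_exception(first_value, reference_value, relation="ge"):
    """Return an exception record when first_value stands in the given
    relation to reference_value, otherwise None."""
    if RELATIONS[relation](first_value, reference_value):
        return {"value": first_value, "reference": reference_value,
                "relation": relation}
    return None

observed = {"recall": 12, "refund": 3}    # first vector: concept -> hit count
reference = {"recall": 10, "refund": 10}  # second (reference) vector: thresholds

exceptions = {}
for concept, value in observed.items():
    exc = check_exception(value, reference[concept])
    if exc is not None:
        exceptions[concept] = exc  # an alarm or message could be issued here
```

Changing the relation from greater than or equal to, to strictly greater than, is equivalent to adjusting the reference value by one for integer hit counts, which is one sense in which the reference value may be adjusted to reflect the nature of the comparison.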
In yet another alternative trends tracking procedure, the procedure comprises: comparing first and second portions of at least one vector, wherein the first and second portions pertain to the same concept; and identifying a change result in the concept if there is a difference between the first and second portions; and optionally repeating the comparing and identifying over a dimension to generate a plurality of change results and using the plurality of change results to represent a trend. Frequently, the selected dimension is time as there is great utility in establishing trends over time so that predictions and forecasts, or other decisions on action may be made respective of the future.
This trends tracking procedure or any of the other described tracking or exception monitoring procedures may more specifically be implemented by choosing as the vector a multidimensional vector comprising a date, and the first list of dimensional information objects or other properties comprises at least one of: second, minute, hour, day, month, season, quarter, year, time-zone (with respect to Greenwich mean time or other reference). The dimensional information may be relative to any calendar or calendar system.
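For illustration only, the dimensional information objects of a date vector might be derived from a single timestamp as in the following Python sketch. The function name `date_dimensions`, the use of the Gregorian calendar, the northern-hemisphere meteorological season mapping, and the expression of time zone as an offset in hours from UTC are all assumptions of this example.

```python
from datetime import datetime, timezone

def date_dimensions(ts):
    """Break a timezone-aware datetime into the dimensional elements listed above."""
    # Season mapping assumes northern-hemisphere meteorological seasons.
    seasons = {12: "winter", 1: "winter", 2: "winter",
               3: "spring", 4: "spring", 5: "spring",
               6: "summer", 7: "summer", 8: "summer",
               9: "autumn", 10: "autumn", 11: "autumn"}
    return {
        "second": ts.second, "minute": ts.minute, "hour": ts.hour,
        "day": ts.day, "month": ts.month,
        "season": seasons[ts.month],
        "quarter": (ts.month - 1) // 3 + 1,
        "year": ts.year,
        "utc_offset_hours": ts.utcoffset().total_seconds() / 3600,
    }

dims = date_dimensions(datetime(2002, 8, 15, 9, 30, 0, tzinfo=timezone.utc))
```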
Within the trends tracking procedure, the date or date vector may occur a plurality of times; for example, it may occur each time there is a shift or change in concepts or groupings of concepts and their combinations; optionally, the occurrence of a date vector for each shift or change in the number of occurrences or hit count may be used for trending analysis.
In some embodiments, the multidimensional vector contains: a global object identification data (such as a user independent index of the unstructured object), a concept identification data, a reference time stamp, a cross reference type (such as to locate the concept within the object, for example an indication that the concept is in the title or body of a document), a history time (such as the time when the object was indexed), and a concept hit occurrence data.
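A minimal sketch of such a multidimensional vector as a record type is given below in Python, by way of example only. The class name `ConceptVector`, the field names, and the chosen field types are hypothetical; the description does not prescribe any particular layout.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ConceptVector:
    """One cross-reference entry between an object and a concept (hypothetical layout)."""
    object_id: int            # global, user-independent index of the unstructured object
    concept_id: int           # concept identification data
    reference_time: datetime  # reference time stamp
    xref_type: str            # where the concept was found, e.g. "title" or "body"
    history_time: datetime    # when the object was indexed
    hits: int                 # concept hit occurrence data

entry = ConceptVector(
    object_id=101,
    concept_id=7,
    reference_time=datetime(2002, 8, 15),
    xref_type="title",
    history_time=datetime(2002, 8, 16),
    hits=4,
)
```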
The method may provide that the multidimensional vector includes a vector identifier, which may itself optionally include a vector sequence identifier called an object identifier.
Referring back to the embodiment of the inventive method for trends tracking comprising selecting at least one vector in a predetermined vector dimension that includes at least one concept variable having a value; identifying a change in a predetermined concept variable from a first value of the predetermined concept variable to a second value of the predetermined concept variable for at least one selected concept over a selected dimension; and storing as a trend result, the identified change in the predetermined concept variable; where the at least one vector comprises a document containing at least some unstructured text or a combination of structured text and partially structured or unstructured text; as well as to other embodiments of the invention, it will be appreciated that the first vector data may be selected from the set of first vector data consisting of: a date data, a temporal data, a spatial data, a location data, a number of hits, a cross reference with object data, and combinations thereof.
When a vector consists of vector elements, such elements may be selected from the set of vector elements consisting of object identifier, concept identifier, cross reference time stamp, other dimensional information, cross reference type, history time, total number of hits, and combinations thereof. Alternatively, when a vector consists of vector elements such elements may be selected from the set of vector elements consisting of object identifier, concept identifier, a timestamp, cross reference time stamp, other dimensional information, cross reference type, history time, total number of hits, and combinations thereof; and wherein when the timestamp occurs it may occur a single time, two times, or any plurality of times.
In this alternative trends tracking procedure as well as in the other variants of the trends tracking procedure and exception monitoring procedure, the identifying a change may include a comparison and the comparison is a vector to vector comparison; or the identifying a change includes a comparison and the comparison is a comparison made between two portions within the same vector; or the identifying a change includes a comparison made between any plurality of vectors or portions of vectors, and the comparison is made by performing such comparison on a pair-by-pair basis.
These trends tracking and exception monitoring procedures may be implemented or executed such that the identifying a change in a concept variable from at least a first date to at least a second date for at least one selected concept, comprises identifying a history of change over time for a plurality of times and optionally for a plurality of concept variables for the selected concept. The concept variable is optionally selected from the set of concept variables consisting of key concepts, seed concepts, and combinations thereof.
Clearly, the trends tracking procedure may also include some pattern recognition features and procedural steps so that particular patterns, signatures, or the like may automatically be recognized, or recognized interactively through the user or analyst with a set of analysis tools. For example, the identifying a change may comprise identifying a pre-defined or dynamically determined pattern. Such a pattern may be any pattern that can be defined, and may by way of example but not of limitation, be selected from the set of patterns consisting of: a pattern for the number of hits of high-low-high-low-high, a pattern for high-low-high, a pattern for low-high-low, a pattern for an increase, a pattern for a decrease, a rare event pattern, a pattern for gradually increasing to a peak and then gradually declining, a pattern for any sequence of increasing or decreasing, a pattern for any peaking, and combinations thereof; wherein the pattern may serve to identify a trend of interest for any number of more complex concepts and their combinations.
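As one hedged example of recognizing such patterns, the following Python sketch labels a sequence of hit counts by its successive changes of direction. The function name `classify_pattern`, the merging of consecutive moves in the same direction, and the label strings are assumptions of this sketch; a practical implementation would likely add thresholds so that small fluctuations are not treated as highs and lows.

```python
def classify_pattern(hits):
    """Label a hit-count sequence by its sequence of direction changes.

    Returns a string such as "high-low-high", "low-high-low", "increase",
    "decrease", or "flat". Consecutive moves in the same direction are merged.
    """
    directions = []
    for prev, cur in zip(hits, hits[1:]):
        if cur == prev:
            continue
        d = "up" if cur > prev else "down"
        if not directions or directions[-1] != d:
            directions.append(d)
    if not directions:
        return "flat"
    if directions == ["up"]:
        return "increase"
    if directions == ["down"]:
        return "decrease"
    # Turn the direction sequence into alternating high/low levels.
    levels = ["low" if directions[0] == "up" else "high"]
    for d in directions:
        levels.append("high" if d == "up" else "low")
    return "-".join(levels)

label = classify_pattern([9, 2, 8, 1, 7])
```

Under this labeling, a sequence that gradually increases to a peak and then gradually declines is reported as low-high-low.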
In one alternative embodiment of the trends tracking procedure, the identifying a change comprises selecting a first portion of at least one first vector storing data for the selected concept at the first date and a second portion of at least one second vector storing data for the selected concept at the second date, and comparing the selected first and second vector portions for a change in at least one concept variable for the selected concept. Optionally, the first vector and the second vector are the same vector; or the first vector and the second vector may be different vectors. At least one concept variable for the selected concept may optionally be the same concept variable for the comparison, or the at least one concept variable for the selected concept is a different concept variable for the comparison. In some embodiments, the vector portion comprises a subsection of the entire vector.
As described elsewhere herein relative to other embodiments, the comparison may be a comparison between the first date and the second date of a hit count change, a cross reference type change, and combinations thereof. The comparison is not limited to two vectors, and may comprise a comparison among three or more vectors. In some instances such multi-way comparisons are conveniently performed on a pair-wise comparison basis.
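A multi-way comparison performed on a pair-wise basis may be sketched, by way of example only, as follows. The vector layout (a `hits` count and an `xref_type` string per date) and the function names `compare_pair` and `pairwise_compare` are assumptions introduced for this Python illustration.

```python
from itertools import combinations

def compare_pair(a, b):
    """Compare two vectors for a hit count change and a cross reference type change."""
    return {
        "hit_change": b["hits"] - a["hits"],
        "xref_changed": a["xref_type"] != b["xref_type"],
    }

def pairwise_compare(vectors):
    """Compare every pair of named vectors; keys are (name_a, name_b) tuples."""
    return {
        (name_a, name_b): compare_pair(vectors[name_a], vectors[name_b])
        for name_a, name_b in combinations(vectors, 2)
    }

# Hypothetical vectors for one concept at three dates.
vectors = {
    "2001-01": {"hits": 2, "xref_type": "body"},
    "2001-02": {"hits": 5, "xref_type": "body"},
    "2001-03": {"hits": 4, "xref_type": "title"},
}
comparisons = pairwise_compare(vectors)
```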
When trends results (whether initial, intermediate or final) are generated, they may be stored in whatever form is convenient, such as locally on an analyst computer hard disk drive, or on local or remote database storage. The results may be the trend comparisons, the tracked trends differences, graphical depictions of such trends, and documents describing the results in the form of reports, slide presentations, annotated maps, or in any other form. In some instances, it may be convenient to store the tracked trends in a database as a column, as a row, or as parsable free text ordered appropriately. In other instances it may be convenient to store the tracked trends in a relational structured or partially structured database or other data structure.
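By way of illustration, and not limitation, tracked trends might be stored in a relational database and then queried with ordinary SQL, as in the following Python sketch using the standard `sqlite3` module. The table name `trend_result`, its columns, and the sample rows are hypothetical, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """CREATE TABLE trend_result (
           concept TEXT NOT NULL,
           from_date TEXT NOT NULL,
           to_date TEXT NOT NULL,
           hit_change INTEGER NOT NULL
       )"""
)
rows = [
    ("recall", "2001-01", "2001-02", 3),
    ("recall", "2001-02", "2001-03", 4),
]
conn.executemany("INSERT INTO trend_result VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Once stored, the results can be analyzed with ordinary SQL.
total_change = conn.execute(
    "SELECT SUM(hit_change) FROM trend_result WHERE concept = ?", ("recall",)
).fetchone()[0]
```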
Having described the manner in which query results and trends tracking data and results may be stored, it should be appreciated that any or all of the input or output, including the inputs and outputs or query results from the afore described method for analytical processing, may be stored in a data structure. For example, any of the concept or concept identifier and a number of occurrences in the unstructured objects may be so stored.
The unstructured object may comprise at least one document having unstructured or partially structured content.
In some instances the data structure is stored in a display device storage buffer, in other instances it may be stored in a random access memory, and in other instances it may be stored in a non-volatile storage medium, such as a storage system having one or a plurality of rotating magnetic hard disk drives, optical drives, or other forms of non-volatile storage as are known in the art. Encryption or other security measures may be applied to safeguard the information.
After such information has been stored, it may be retrieved by the same user or analyst or by a different user or analyst, or used by automated processors for any other purpose. In one embodiment, the stored information includes stored concept or concept identifier information, and the method includes accessing the stored concept or concept identifier and displaying the concept or concept identifier and a number of occurrences in the object or document for each of the concepts or concept identifiers.
In another alternative embodiment, the method for analytical processing of at least one unstructured object in a multi-dimensional space further includes monitoring or tracking exceptions. In a particular embodiment, the monitoring or tracking of exceptions is applied to changes in hits, concept type, and number. In another embodiment, the exceptions are selected from the set of exceptions consisting of hit count, normalized hit count, new concepts, absent or deleted concepts, and combinations thereof.
While embodiments for tracking trends have been described, it will be appreciated that there are multiple variations for achieving such tracking over unstructured objects. In another alternative embodiment, the method for tracking a trend in an unstructured object over a predetermined dimension or variable comprises: selecting first and second vectors or vector portions, each having at least one vector element; comparing the first and second vector or vector portions along relevant elements within the vectors or vector portions to determine a difference or non-difference; and optionally storing the resulting difference or non-difference for trending analysis. The selecting first and second vectors or vector portions may also be optional where the selection has previously been made. Also, in some cases it may be necessary to generate the desired first and second vectors according to the needs and required results from the analysis.
Each vector or vector portion, or selected vectors or vector portions, may optionally have one or a plurality of vector elements selected from the set consisting of: object id, concept id, cross reference time stamp, other dimensional information, cross reference type, history time, and total number of hits or occurrences.
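As a rough sketch of the selecting, comparing, and storing steps, the vector elements listed above might be held in a small record and compared element by element. The class name, field names, and field types below are assumptions made for illustration; the "other dimensional information" element is omitted for brevity.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ConceptVector:
    """Hypothetical vector built from the elements listed in the text."""
    object_id: int
    concept_id: int
    xref_timestamp: str   # cross reference time stamp
    xref_type: str        # cross reference type
    history_time: str
    total_hits: int       # total number of hits or occurrences


def diff(first: ConceptVector, second: ConceptVector,
         elements: tuple = ("xref_type", "total_hits")) -> dict:
    """Compare two vectors along the chosen elements.

    Returns {element: (old, new)} for each element that differs.
    An empty dict represents a non-difference.
    """
    a, b = asdict(first), asdict(second)
    return {name: (a[name], b[name]) for name in elements if a[name] != b[name]}
```

The returned dictionary can then be stored as one data point for later trending analysis.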
It will be appreciated that for any of the numbers, parameters, results, or the like, either the raw number or some scaled or normalized version may be stored or used. Such normalization may promote comparisons over some variable such as time and make comparisons more meaningful. In one embodiment, for example, the methods further comprise normalizing captured concept scores and/or hits or occurrences in the unstructured objects. In analogous manner summaries or statistical metrics may be generated from raw results or even from processed results.
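One simple normalization, offered here only as an illustration, scales the raw hit count by the size of the object so that objects of different lengths can be compared. The function name, the use of a word count as the size measure, and the per-1000 scale are all assumptions and are not specified in the text.

```python
def normalize_hits(hits: int, object_size: int, scale: int = 1000) -> float:
    """Return hits per `scale` units of object size (for example, per 1000 words)."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    return hits * scale / object_size
```

Under this scheme, 5 hits in a 2000-word document and 10 hits in a 4000-word document normalize to the same value, so neither registers as a change relative to the other.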
In connection with the trends tracking and exceptions monitoring and tracking, as well as for the method for analytical processing of an unstructured object, the inventive system and method permit interactive or dynamic browsing of concepts and other variables or parameters, as well as historical results. In at least one embodiment, the inventive system and method permit and facilitate a user to dynamically or interactively browse concepts for changes in one or more unstructured objects, over a predetermined set of concepts or dimensions, and combinations thereof. For example, the dimension or dimensions over which a user may browse may comprise a time or calendar dimension selected from the set of dimensions consisting of: time, second, minute, hour, day, month, year, time-zone, and combinations thereof.
Embodiment of Alternative Method for Analytical Processing of An Unstructured Object
Recalling the afore described embodiments of the method for analytical processing of at least one unstructured object in multi-dimensional space, that includes: for at least a selected dimension of the multi-dimensional space, hierarchically exploring and querying the unstructured object at a plurality of levels of detail or abstraction to determine a first query result; selecting a combination of at least two dimensions of the multi-dimensional analytical processing space; selecting at least one particular concept in the form of a view or perspective among a determined set of concepts at a predefined level of granular detail; and analytically exploring and processing the unstructured objects using the selected combination of dimensions and at least one angle or perspective to determine a second query result different from the first query result that identifies where the corpus has changed to reflect the angle or perspective, wherein the angle or perspective is a set of concepts and optionally one or more relationships with each other, and granular detail.
This method (or other of the alternative methods for analytical processing of an unstructured object) may further include tracking trends, monitoring or tracking exceptions, or both. In one particular embodiment, it provides for tracking trends by a tracking procedure or method that includes comparing first and second portions of at least one vector, wherein the first and second portions pertain to the same concept; and identifying a change result in the concept if there is a difference between the first and second portions; and optionally repeating the comparing and identifying over a dimension to generate a plurality of change results and using the plurality of change results to represent a trend. This procedure may itself further include generating summarizations at a categories level, and optionally the generating of summarizations at a categories level comprises generating pre-defined summarizations or pre-built summarizations.
The trends tracking procedure may further include monitoring a corpus at an individual user level or individual view or perspective level to detect a change in concepts or in categories that are defined by groupings of concepts, by count or number of hits or occurrences, as an absolute number or as a percentage from the previous occurrence.
In one embodiment, the monitored corpus may comprise a collection of objects where the inherent content of these objects can be translated to a common set of symbols.
The inventive tracking procedure may further include determining a change without user intervention or action, and the change detected can be any detectable change or a change satisfying predetermined or dynamically determined detection or change parameters, rules, or policies. For example, the change can be a predetermined or dynamically determined degree, type, or threshold of change. The determining of a change without user intervention may also or alternatively be performed according to some predetermined or dynamically determined schedule, whether periodic or aperiodic. For example, initiating a sequence of events for looking for or determining a change may be done in response to some action or event that occurs or fails to occur. Furthermore, the identification of one or more trends may initiate determination of changes for other concepts or like actions. Any of these may occur either automatically or under user or analyst control or direction. The determining of a change without user intervention may thus be performed in response to the occurrence or non-occurrence of a predetermined or dynamically determined event.
This or other embodiments of the trends tracking procedure may further include automatically altering a user analytical format without a pre-imposed hierarchy in response to user defined changes in concepts, or number of hits or occurrences of a given subset of concepts. This or other embodiments of the trends tracking procedure may also optionally but advantageously include: generating summarizations at a categories level; monitoring a corpus at an individual user level or individual view or perspective level; and determining a change without user intervention based on the monitoring.
Embodiment of Method for Providing Dynamic Matrix-Based Graphical Interface
The invention further provides a method for providing a graphical interface having a dynamic matrix tabular form view of concepts and their occurrence within unstructured objects as well as the structure of the view produced by the method. The method includes: displaying concepts versus object descriptions in a matrix format view to assist a user in quickly determining an object or objects of interest; providing a view tool to permit the user to choose concepts to add or remove from this view for comparing concepts within the view; the view tool permitting view personalization so that the view provided for a first user viewing a first set of unstructured objects may be different from the view provided for a second user viewing the first set of unstructured objects; and the view being an integrated view such that both unstructured and structured objects may be advantageously displayed together in the same format.
The view tool may provide display categories as graphical folders that can be displayed as open or shut, as dictated by their status to reveal or hide the unstructured objects assigned to the category.
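The matrix format view of concepts versus object descriptions can be pictured with a short sketch. The input layout (a mapping from object description to per-concept hit counts) and the function name are hypothetical; personalization is modeled simply by letting each user pass a different list of concepts.

```python
def matrix_view(hits: dict, concepts: list) -> list:
    """Build table rows of concepts (columns) versus objects (rows).

    hits: {object_description: {concept: hit_count}}
    concepts: the concepts this user has chosen to show in the view
    """
    rows = [["object"] + list(concepts)]
    for obj in sorted(hits):
        # A concept absent from an object is shown with a count of zero.
        rows.append([obj] + [hits[obj].get(c, 0) for c in concepts])
    return rows
```

Adding or removing a concept from the `concepts` list adds or removes a column, which corresponds to the view tool letting a user choose the concepts to compare.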
Advantageously, the method for generating the matrix format tabular data structure may comprise drilling-down to an arbitrarily specified level of detail and optionally popping up to an arbitrary higher level of detail. The drill-down may be performed at the command of a user on a command basis or interactively. The drilling-down may be performed automatically under algorithmic control to support the higher level of summarization, which can be a sum of the number of hits or occurrences of a concept or a count of the number of objects with the presence or absence of a concept, and which can be normalized by the sum of the sizes of individual objects, by an arbitrary number, or by a combination thereof. The drill-down may also or alternatively be performed under automatic algorithmic control with user interaction and override.
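The higher level of summarization mentioned above might be computed along the following lines. This is a sketch under assumed data shapes: each object is represented as a dictionary with a hypothetical `size` field and a `hits` mapping, neither of which is defined in the text.

```python
def summarize(objects: list, concept: str) -> dict:
    """Roll up one concept across a group of objects.

    objects: list of {"size": int, "hits": {concept: count}}
    """
    total_hits = sum(o["hits"].get(concept, 0) for o in objects)
    objects_with = sum(1 for o in objects if o["hits"].get(concept, 0) > 0)
    total_size = sum(o["size"] for o in objects)
    return {
        "total_hits": total_hits,                # sum of hits or occurrences
        "objects_with_concept": objects_with,    # count of objects where present
        # normalized by the sum of the sizes of the individual objects
        "hits_per_unit_size": total_hits / total_size if total_size else 0.0,
    }
```

Drilling down then amounts to calling the same function on a smaller group of objects, such as the members of a single category.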
Alternative Embodiments of Method for Tracking and Monitoring Exceptions
Methods and systems for tracking or monitoring an exception to a change have been described herein elsewhere, but as with tracking trends, more than one alternative procedure may be used either alone or in conjunction with the method for analytical processing of an unstructured object and/or the method or procedure for tracking trends. The method is particularly advantageous for tracking exceptions to one or more changes over a dimension that is hidden or not plainly identified.
In one embodiment of a method for tracking an exception to a change in the state of the corpus over a dimension that is hidden or not plainly identified in a corpus, the method includes: identifying and recording all (or selected) concepts and their number of hits or occurrences; selecting or constructing at least two vectors or at least two portions of a single vector; comparing these vectors over all defined dimensions; and, if data or an object cannot be explained by current dimensions, then generating a new dimension using the exception as a basis or prompting the user to suggest a new basis; and otherwise reporting the exception and whether a new basis was generated.
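The recording and comparing steps of this method can be illustrated with a minimal sketch that compares two recorded snapshots of concept hit counts and reports new concepts, absent or deleted concepts, and hit count changes. The function name and the mapping-based representation are assumptions; the step of generating a new dimension from an unexplained exception is not modeled here.

```python
def find_exceptions(before: dict, after: dict) -> dict:
    """Compare two {concept: hit_count} snapshots and list the exceptions."""
    shared = before.keys() & after.keys()
    return {
        "new": sorted(after.keys() - before.keys()),
        "deleted": sorted(before.keys() - after.keys()),
        "changed": {c: after[c] - before[c]
                    for c in sorted(shared) if after[c] != before[c]},
    }
```

An empty result in all three fields indicates that no exception was detected between the two snapshots.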
In at least one embodiment of the exceptions monitoring or tracking procedure, the vector has the form of a vector having at least one and generally a plurality of vector elements selected from the set of elements consisting of object id, concept id, cross reference time stamp, other dimensional information, cross reference type, history time, and total number of hits. Frequently, the dimension includes a temporal (time) dimension.
The method for tracking an exception includes some exception trigger, such as a threshold, that is used as a reference for comparison as described. The procedure includes setting at least one threshold for at least one of a computed value of number of hits of a concept or other exception condition where tracking or monitoring is desired. In one particular embodiment, the setting of at least one threshold for at least one of: (i) a computed value of number of hits of a concept, and (ii) an addition or deletion of a concept over a given dimension, comprises setting a plurality of thresholds for a plurality of concepts or categories. Optionally, the threshold comprises a predetermined or dynamically determined percentage change of the numeric sum of occurrence of concepts in a corpus, or the threshold comprises a predetermined or dynamically determined change from an absolute value from a previous observation of a concept in a corpus.
The threshold may alternatively be set as a percentage change from a previously recorded number of occurrences of a particular symbol within an unstructured object or plurality of unstructured objects, or the threshold may be set as a numerical count change from previously recorded number of occurrences of a particular symbol within an unstructured object or a plurality of unstructured objects.
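The two threshold styles described above, a percentage change and a numerical count change from a previously recorded number of occurrences, can be sketched as a single check. The function and parameter names are hypothetical, and the choice to skip the percentage test when the previous count is zero is an assumption made to avoid division by zero.

```python
from typing import Optional


def exceeds_threshold(previous: int, current: int, *,
                      percent: Optional[float] = None,
                      count: Optional[int] = None) -> bool:
    """Return True if the change from `previous` to `current` meets a threshold.

    percent: minimum percentage change relative to the previous count
    count:   minimum absolute change in the number of occurrences
    """
    delta = abs(current - previous)
    if count is not None and delta >= count:
        return True
    # A percentage change is undefined when the previous count is zero.
    if percent is not None and previous > 0 and delta / previous * 100 >= percent:
        return True
    return False
```

For instance, a move from 40 to 50 occurrences is a 25 percent change and would trip a 20 percent threshold, while a move from 40 to 44 would not.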
The exception tracking method may use at least one threshold and further or alternatively include: monitoring a corpus at an individual user level or individual view, perspective, or angle level to detect changes in an absolute, relative, or percentage number of occurrences or hits of concepts or categories; determining a change without user intervention; and automatically alerting the user in response to changes in concepts, or number of occurrences or hits of a given subset of concepts. In this embodiment of the method, the change may be any detectable change; a change of a predetermined magnitude or polarity; a predetermined or dynamically determined degree, type, or threshold of change; any predetermined pattern of change along a given dimension; or any combination of the above.
The base method for exception tracking, or the method with any of the optional components, may provide that the determining of a change without user intervention is performed according to some predetermined or dynamically determined schedule; is performed periodically; or is performed in response to the occurrence or non-occurrence, or a fixed pattern of occurrence and non-occurrence, of a predetermined or dynamically determined event or the failure of an event to occur; or any combination of these or according to other criteria.
The method for tracking an exception may also provide that the corpus comprises a collection of objects, and optionally where the inherent content of these objects can be translated to a common set of symbols.
Optionally but advantageously, the method for tracking an exception may further include browsing or permitting browsing or other examination of a corpus for exceptions to changes in concepts, or number of hits or occurrences of a given subset of concepts, or for other conditions. This browsing may advantageously be interactive and with a graphical, tabular, textual, or with a combination of graphical, tabular, or textual interface.
Any of the various alternative embodiments of the exception monitoring and tracking procedure may further include alerting a user, system administrator, management, or other identified person, organization, or other entity upon detecting the occurrence of a predetermined exception condition. In at least one embodiment, the exception condition comprises the occurrence of a predetermined number of exceptions or events. In another embodiment, the predetermined number of exceptions is a single exception or event. Email alerts sent from a client machine or from a user's or analyst's computer using an email system such as Microsoft Outlook, to the same Microsoft Outlook or to any other email system using standard or secure email protocols, may be used in order to notify the user of predefined changes. Printed reports may alternatively or additionally be generated.
In embodiments of the invention where a visual display in whatever form is utilized to display results, such as to present exception alerts or portions of alerts, color coding may advantageously be used in the visualization display to distinguish among various alert types as can be defined by the user. The use of text or symbolic font coding to identify particular changes may also or alternatively be used. The font coding may include font coding selected from the set of font codings consisting of font size, bold, italics, shading, color, underline, double-underline, blinking, complementary color, strike-through, and combinations thereof. The font coding may optionally include font coding in a separate report.
The afore described method for tracking an exception (or a plurality of exceptions) to a change in the state of the corpus may further include or be directed specifically to tracking exceptions as a function of a time dimension, tracking exceptions as a function of a concept dimension, tracking exceptions as a function of a predetermined or dynamically determined dimension, tracking exceptions as a function of a predetermined or dynamically determined plurality of dimensions, or a combination of these. The method may also optionally include setting an exception reporting threshold that defines the number of exceptions that must occur before an exception report is generated. In other embodiments, the threshold is set as a percentage change from a previously recorded number of occurrences of a particular symbol within an unstructured object or plurality of unstructured objects, or is set as a numerical count change from previously recorded number of occurrences of a particular symbol within an unstructured object or a plurality of unstructured objects.
Alternative Embodiments of a Method for an Object Concept Based Search
In another aspect, the invention provides a method for object concept based search, the method comprising in one embodiment, parsing a user-entered search text to capture at least one seed concept from the entered text; determining whether at least one of the captured concepts is available as a key concept associated with an object in a relational database table; repeating the parsing and the determining for all seed concepts entered; generating a resulting list of objects; determining if the seed concepts and their user customizations exist, optionally including ones that have not been picked up as key concepts; narrowing the resulting object list to accommodate the existence of all entered seed concepts with their special user customizations; identifying objects as results whose concepts match; and scoring objects identified as matching according to an object scoring algorithm. In one embodiment, this scoring algorithm includes averaging scores for the individual key concepts that contributed to the search for each object returned using a tf-idf and a Gaussian distribution to remove child concepts that do not significantly contribute to the meaning of the parent concept with respect to the corpus. In another embodiment, the search is performed using a combination of key concepts and seed concepts, and the number of hits for the seed concepts is then divided by the total number of hits picked up for all seed concepts in a document being analyzed to determine how much the seed concept actually contributed to the concept of the document; adding this result; and averaging this result with the average score for the key concepts to arrive at a relevancy score for the object as pertains to a particular search.
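One possible reading of the combined key concept and seed concept scoring is sketched below. The text leaves several details open, so the following are assumptions for illustration only: key concept scores are taken to lie between 0 and 1, the seed contribution is taken as the hits for the searched seed concepts divided by the total seed concept hits in the document, and the relevancy score is the simple average of the two quantities. The function and parameter names are hypothetical.

```python
from statistics import mean


def relevancy(key_scores: list, searched_seed_hits: list,
              all_seed_hits_total: int) -> float:
    """Average the mean key-concept score with the seed-concept contribution.

    key_scores:          scores of the key concepts that contributed to the search
    searched_seed_hits:  hit counts in the document for the searched seed concepts
    all_seed_hits_total: total hits for all seed concepts found in the document
    """
    key_avg = mean(key_scores) if key_scores else 0.0
    if all_seed_hits_total:
        seed_share = sum(searched_seed_hits) / all_seed_hits_total
    else:
        seed_share = 0.0
    return (key_avg + seed_share) / 2
```

Other readings of the passage are possible, for example weighting the two components differently, and would change only the final line of the function.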
In some embodiments, the method further includes reading elements of the data structure and displaying the read elements in at least a two-dimensional row and column format on a display device. In another embodiment, the method further includes reading elements of the data structure and displaying the read elements in a three-dimensional volumetric format on a display device. In yet another embodiment, the count change comprises a positive integer greater than zero.
Embodiments of Computer Programs and Computer Program Products
Having described numerous methods, it may be appreciated that the methods may advantageously be implemented as computer programs for execution within a general purpose computer or a network of general or special purpose computers. Such computers include a processor or CPU and a memory coupled with the processor for storing commands or instructions and data in whatever form. In general such computer programs also have a corresponding computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a program module that directs the computer system to function in a specified manner according to the method and the intended inputs and outputs and including instructions for accomplishing the intended processing.
Embodiments of a System for Analytical Processing of An Unstructured Object
The invention also provides a system for analytical processing of at least one unstructured object in a multi-dimensional space, the system comprising: at least one computer having a processor and a memory coupled with the processor and input/output devices and interface to permit an external user to interact with the computer; a storage for at least one unstructured object coupled or intermittently coupled with the computer; a hierarchical query tool operating in conjunction with the memory and the processor for hierarchically exploring, for at least a selected dimension of a multi-dimensional space, the unstructured object at a plurality of levels of detail or abstraction to determine a first result; and a dimension selection tool operating in conjunction with the memory and the processor for selecting a combination of at least two dimensions of the multi-dimensional analytical processing space.
The system may optionally be provided such that the at least one computer comprises a plurality of networked or intermittently networked computers coupled or coupleable by a communication link. Optionally, the communication link comprises a secure or encrypted communication link. The system may provide storage where the storage comprises at least one of a local data storage, a remote database, the content of a portion of the Internet, the Internet, and combinations thereof. In a further embodiment, the system includes a plurality of program modules, the program modules including instructions for: analytical processing of at least one unstructured object in a multi-dimensional space; tracking a trend in the unstructured object over a predetermined dimension or variable; tracking an exception to a change in the state of the corpus over a dimension that is hidden or not plainly identified in a corpus; and generating a tabular data structure having any arbitrary concepts and their number of hits or occurrences or normalized occurrences.
Embodiments of Business and Operating Models
The invention also provides several embodiments of operating models or business models. In one embodiment, the invention provides an operating model for concept-based dynamic analysis of unstructured information objects, the operating model comprising: providing access to a computational server and at least one source of unstructured objects for performing concept-based dynamic analysis of unstructured information objects; and collecting remuneration of value by an operator for the access on a pay-per-usage, pay-per-information item, pay-per-time basis, pay-per-database basis, pay-per-query result basis, or other result or time or quantity of access basis, and combinations thereof. The access may be provided on an application service provider basis. The access is optionally provided over the Internet or on any other plurality of networked computers or information appliances. The remuneration of value may be in the form of or include a monetary payment.
Additional Embodiments
Although several embodiments of the invention have been described, it should be understood that the invention is not intended to be limited to the specifics of these embodiments. For example, specific information extracted by the analysis and categorization engine could be stored at different stages in relational database tables having a slightly different organization. Further, other data storing mechanisms could be utilized for making available the output of the analysis and categorization engine's analysis.
It will be appreciated that the algorithms, procedures, and methods described herein may be implemented as computer program software and/or firmware to be executed on a general or special purpose computer or information appliance having a processor for executing instructions and memory associated with the processor for storing data and instructions. The computer program may be stored on a tangible medium such as a magnetic storage device, optical storage device, or other tangible media customarily used to store data and/or computer programs. It will also be appreciated that the computer program product may be stored at one location and transmitted electronically, such as over the Internet or other network of connected computers, for receipt and storage at another location.
The inventive system and method further provide a data structure, such as a data structure defined in electronic memory of a computer or stored in other tangible media. Embodiments of the data structures have been described with reference to the tables herein above.
The inventive system and method also provide a business or operating model or method for concept-based dynamic analysis of unstructured information. Such operating model or method may for example provide access to a server that implements the inventive techniques on a pay-per-usage, pay-per-information item, pay-per-time, or other quantity or time basis. The inventive method may also or alternatively be provided in an application service provider context.
Workers skilled in the art will appreciate that, in light of the description, a variety of interfaces can be provided for a user to view, and understand the meaning of, unstructured objects based on the structured information generated by the analysis and categorization engine.
Benefit of priority under 35 U.S.C. §119(e) and/or 35 U.S.C. §120 is claimed to U.S. Provisional Patent Application Ser. No. 60/302,971, entitled SYSTEM FOR AND METHOD OF INTELLIGENT CATEGORIZATION OF INFORMATION, filed May 7, 2001, which application is a related application and is hereby incorporated by reference. Benefit of priority under 35 U.S.C. §119(e) and/or 35 U.S.C. §120 is claimed to United States Utility patent application Ser. No. 10/087,053, entitled CONCEPT-BASED METHOD AND SYSTEM FOR DYNAMICALLY ANALYZING UNSTRUCTURED INFORMATION, filed Mar. 1, 2002, which application is a related application and is hereby incorporated by reference. Benefit of priority under 35 U.S.C. §119(e) and/or 35 U.S.C. §120 is claimed to U.S. Provisional Patent Application Ser. No. 60/366,045, entitled CONCEPT-BASED METHOD AND SYSTEM FOR DYNAMICALLY ANALYZING UNSTRUCTURED INFORMATION, filed Mar. 19, 2002, which application is a related application and is hereby incorporated by reference. This application claims the benefit of U.S. Provisional Application No. 60/366,045, filed Mar. 19, 2002, and this application is a continuation-in-part application of U.S. patent application Ser. No. 10/087,053, filed Mar. 1, 2002, now U.S. Pat. No. 6,970,881, which claims the benefit of U.S. Provisional Application No. 60/302,971, filed May 7, 2001. All of the above referenced applications are herein incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
4839853 | Deerwester et al. | Jun 1989 | A |
5278980 | Pedersen et al. | Jan 1994 | A |
5295256 | Bapat | Mar 1994 | A |
5418951 | Damashek | May 1995 | A |
5488725 | Turtle et al. | Jan 1996 | A |
5519865 | Kondo et al. | May 1996 | A |
5555408 | Fujisawa et al. | Sep 1996 | A |
5592667 | Bugajski | Jan 1997 | A |
5619709 | Caid et al. | Apr 1997 | A |
5634051 | Thomson | May 1997 | A |
5727950 | Cook et al. | Mar 1998 | A |
5761496 | Hattori | Jun 1998 | A |
5768578 | Kirk et al. | Jun 1998 | A |
5794178 | Caid et al. | Aug 1998 | A |
5819260 | Lu et al. | Oct 1998 | A |
5832182 | Zhang et al. | Nov 1998 | A |
5832470 | Morita et al. | Nov 1998 | A |
5867799 | Lang et al. | Feb 1999 | A |
5884305 | Kleinberg et al. | Mar 1999 | A |
5920864 | Zhao | Jul 1999 | A |
5924090 | Krellenstein | Jul 1999 | A |
5926811 | Miller et al. | Jul 1999 | A |
5933822 | Braden-Harder et al. | Aug 1999 | A |
5953726 | Carter et al. | Sep 1999 | A |
5963940 | Liddy et al. | Oct 1999 | A |
5983214 | Lang et al. | Nov 1999 | A |
5987446 | Corey et al. | Nov 1999 | A |
5987447 | Chang et al. | Nov 1999 | A |
6006221 | Liddy et al. | Dec 1999 | A |
6026388 | Liddy et al. | Feb 2000 | A |
6028605 | Conrad et al. | Feb 2000 | A |
6029161 | Lang et al. | Feb 2000 | A |
6038561 | Snyder et al. | Mar 2000 | A |
6055526 | Ambroziak | Apr 2000 | A |
6078924 | Ainsbury et al. | Jun 2000 | A |
6101491 | Woods | Aug 2000 | A |
6101492 | Jacquemin et al. | Aug 2000 | A |
6101515 | Wical et al. | Aug 2000 | A |
6122628 | Castelli et al. | Sep 2000 | A |
6182063 | Woods | Jan 2001 | B1 |
6189002 | Roitblat | Feb 2001 | B1 |
6192364 | Baclawski | Feb 2001 | B1 |
6195657 | Rucker et al. | Feb 2001 | B1 |
6199034 | Wical | Mar 2001 | B1 |
6199067 | Geller | Mar 2001 | B1 |
6233575 | Agrawal et al. | May 2001 | B1 |
6236987 | Horowitz et al. | May 2001 | B1 |
6240407 | Chang | May 2001 | B1 |
6256633 | Dharap | Jul 2001 | B1 |
6263335 | Paik et al. | Jul 2001 | B1 |
6289382 | Bowman-Amuah | Sep 2001 | B1 |
6300957 | Rao et al. | Oct 2001 | B1 |
6308175 | Lang et al. | Oct 2001 | B1 |
6314420 | Lang et al. | Nov 2001 | B1 |
6321228 | Crandall et al. | Nov 2001 | B1 |
6327593 | Gioffon | Dec 2001 | B1 |
6330563 | Heckerman et al. | Dec 2001 | B1 |
6366908 | Chong et al. | Apr 2002 | B1 |
6377259 | Tenev et al. | Apr 2002 | B1 |
6385600 | McGuinness et al. | May 2002 | B1 |
6385602 | Tso et al. | May 2002 | B1 |
6411924 | de Hita et al. | Jun 2002 | B1 |
6424973 | Baclawski | Jul 2002 | B1 |
6442545 | Feldman et al. | Aug 2002 | B1 |
6453312 | Goiffon et al. | Sep 2002 | B1 |
6453339 | Schultz et al. | Sep 2002 | B1 |
6477524 | Taskiran et al. | Nov 2002 | B1 |
6492989 | Wilkinson | Dec 2002 | B1 |
6496818 | Ponte | Dec 2002 | B1 |
6502045 | Biagiotti | Dec 2002 | B1 |
6510406 | Marchisio | Jan 2003 | B1 |
6513036 | Fruensgaard et al. | Jan 2003 | B2 |
6523028 | DiDomizio et al. | Feb 2003 | B1 |
6532469 | Feldman et al. | Mar 2003 | B1 |
6539376 | Sundaresan et al. | Mar 2003 | B1 |
6564210 | Korda et al. | May 2003 | B1 |
6567797 | Schuetze et al. | May 2003 | B1 |
6567804 | Ramasamy et al. | May 2003 | B1 |
6581068 | Bensoussan et al. | Jun 2003 | B1 |
6606620 | Sundaresan | Aug 2003 | B1 |
6606659 | Hegli et al. | Aug 2003 | B1 |
6611825 | Billheimer et al. | Aug 2003 | B1 |
6615199 | Bowman-Amuah | Sep 2003 | B1 |
6615253 | Bowman-Amuah | Sep 2003 | B1 |
6628312 | Rao et al. | Sep 2003 | B1 |
6629095 | Wagstaff et al. | Sep 2003 | B1 |
6629097 | Keith | Sep 2003 | B1 |
6633868 | Min et al. | Oct 2003 | B1 |
6636242 | Bowman-Amuah | Oct 2003 | B2 |
6636848 | Aridor et al. | Oct 2003 | B1 |
6640238 | Bowman-Amuah | Oct 2003 | B1 |
6640244 | Bowman-Amuah | Oct 2003 | B1 |
6640249 | Bowman-Amuah | Oct 2003 | B1 |
6643661 | Polizzi et al. | Nov 2003 | B2 |
6651058 | Sundaresan et al. | Nov 2003 | B1 |
6654761 | Tenev et al. | Nov 2003 | B2 |
6665662 | Kirkwood et al. | Dec 2003 | B1 |
6675159 | Lin et al. | Jan 2004 | B1 |
6678700 | Moore et al. | Jan 2004 | B1 |
6684207 | Greenfield et al. | Jan 2004 | B1 |
6701305 | Holt et al. | Mar 2004 | B1 |
6711585 | Copperman et al. | Mar 2004 | B1 |
6715145 | Bowman-Amuah | Mar 2004 | B1 |
6728707 | Wakefield et al. | Apr 2004 | B1 |
6732097 | Wakefield et al. | May 2004 | B1 |
6732098 | Wakefield et al. | May 2004 | B1 |
6738765 | Wakefield et al. | May 2004 | B1 |
6741988 | Wakefield et al. | May 2004 | B1 |
6742015 | Bowman-Amuah | May 2004 | B1 |
6766316 | Caudill et al. | Jul 2004 | B2 |
6778979 | Grefenstette et al. | Aug 2004 | B2 |
6778995 | Gallivan | Aug 2004 | B1 |
6810394 | Coutts et al. | Oct 2004 | B2 |
6839699 | Chan | Jan 2005 | B2 |
6842906 | Bowman-Amuah | Jan 2005 | B1 |
6862713 | Kraft et al. | Mar 2005 | B1 |
6892189 | Quass et al. | May 2005 | B2 |
6895406 | Fables et al. | May 2005 | B2 |
6901555 | Hida et al. | May 2005 | B2 |
6928398 | Fang et al. | Aug 2005 | B1 |
6941321 | Schuetze et al. | Sep 2005 | B2 |
6944626 | Cameron et al. | Sep 2005 | B2 |
6961731 | Holbrook | Nov 2005 | B2 |
6970860 | Liu et al. | Nov 2005 | B1 |
6970881 | Mohan et al. | Nov 2005 | B1 |
6976016 | Chang et al. | Dec 2005 | B2 |
6976017 | Getchius | Dec 2005 | B1 |
6978274 | Gallivan | Dec 2005 | B1 |
7007034 | Hartman, Jr. et al. | Feb 2006 | B1 |
7010752 | Ly | Mar 2006 | B2 |
7019744 | Anderson | Mar 2006 | B2 |
7023453 | Wilkinson | Apr 2006 | B2 |
7194460 | Komamura | Mar 2007 | B2 |
7194483 | Mohan et al. | Mar 2007 | B1 |
7272594 | Lynch | Sep 2007 | B1 |
7349895 | Liu et al. | Mar 2008 | B2 |
7505989 | Gardner et al. | Mar 2009 | B2 |
7533344 | Motoyama et al. | May 2009 | B1 |
7536413 | Mohan et al. | May 2009 | B1 |
7627588 | Mohan et al. | Dec 2009 | B1 |
7650405 | Hood et al. | Jan 2010 | B2 |
7676485 | Elfayoumy et al. | Mar 2010 | B2 |
7788251 | Carlson et al. | Aug 2010 | B2 |
7788367 | Motoyama et al. | Aug 2010 | B2 |
7831559 | Mohan et al. | Nov 2010 | B1 |
7890514 | Mohan et al. | Feb 2011 | B1 |
8589413 | Mohan et al. | Nov 2013 | B1 |
20010011266 | Baba | Aug 2001 | A1 |
20020007373 | Blair et al. | Jan 2002 | A1 |
20020016800 | Spivak et al. | Feb 2002 | A1 |
20020069203 | Dar et al. | Jun 2002 | A1 |
20020091696 | Craft et al. | Jul 2002 | A1 |
20020107844 | Cha et al. | Aug 2002 | A1 |
20020120609 | Lang et al. | Aug 2002 | A1 |
20020129015 | Caudill et al. | Sep 2002 | A1 |
20030014403 | Chandrasekar et al. | Jan 2003 | A1 |
20030018659 | Fuks et al. | Jan 2003 | A1 |
20030041056 | Bossemeyer et al. | Feb 2003 | A1 |
20030069882 | Nieswand et al. | Apr 2003 | A1 |
20030074177 | Bowen | Apr 2003 | A1 |
20030149586 | Chen et al. | Aug 2003 | A1 |
20030163454 | Jacobsen et al. | Aug 2003 | A1 |
20030187632 | Menich | Oct 2003 | A1 |
20040002959 | Alpert et al. | Jan 2004 | A1 |
20040019588 | Doganata et al. | Jan 2004 | A1 |
20040049478 | Jasper et al. | Mar 2004 | A1 |
20040083206 | Wu et al. | Apr 2004 | A1 |
20040167870 | Wakefield et al. | Aug 2004 | A1 |
20040167883 | Wakefield et al. | Aug 2004 | A1 |
20040167884 | Wakefield et al. | Aug 2004 | A1 |
20040167885 | Wakefield et al. | Aug 2004 | A1 |
20040167886 | Wakefield et al. | Aug 2004 | A1 |
20040167887 | Wakefield et al. | Aug 2004 | A1 |
20040167907 | Wakefield et al. | Aug 2004 | A1 |
20040167908 | Wakefield et al. | Aug 2004 | A1 |
20040167909 | Wakefield et al. | Aug 2004 | A1 |
20040167910 | Wakefield et al. | Aug 2004 | A1 |
20040167911 | Wakefield et al. | Aug 2004 | A1 |
20040199498 | Kapur et al. | Oct 2004 | A1 |
20040215634 | Wakefield et al. | Oct 2004 | A1 |
20040254916 | Dettinger et al. | Dec 2004 | A1 |
20050021290 | Velipasaoglu et al. | Jan 2005 | A1 |
20050021357 | Schuetze et al. | Jan 2005 | A1 |
20050021512 | Koenig | Jan 2005 | A1 |
20050060340 | Sommerfield et al. | Mar 2005 | A1 |
20050065957 | Jones | Mar 2005 | A1 |
20050065967 | Schuetze et al. | Mar 2005 | A1 |
20050091197 | Dettinger et al. | Apr 2005 | A1 |
20050091198 | Dettinger et al. | Apr 2005 | A1 |
20050108256 | Wakefield et al. | May 2005 | A1 |
20050149496 | Mukherjee et al. | Jul 2005 | A1 |
20050154692 | Jacobsen et al. | Jul 2005 | A1 |
20050154711 | McConnell | Jul 2005 | A1 |
20050160080 | Dawson | Jul 2005 | A1 |
20050160082 | Dawson | Jul 2005 | A1 |
20050160107 | Liang | Jul 2005 | A1 |
20050166152 | Hida et al. | Jul 2005 | A1 |
20050192824 | Schuetze et al. | Sep 2005 | A1 |
20050192926 | Liu et al. | Sep 2005 | A1 |
20050193055 | Angel et al. | Sep 2005 | A1 |
20050234879 | Zeng et al. | Oct 2005 | A1 |
20050246320 | Benysh et al. | Nov 2005 | A1 |
20050278323 | Horvitz et al. | Dec 2005 | A1 |
20060015486 | Nomiyama et al. | Jan 2006 | A1 |
20060047649 | Liang | Mar 2006 | A1 |
20060074836 | Gardner et al. | Apr 2006 | A1 |
20060100854 | Ance et al. | May 2006 | A1 |
20060106793 | Liang | May 2006 | A1 |
20060161423 | Scott et al. | Jul 2006 | A1 |
20070033221 | Copperman et al. | Feb 2007 | A1 |
20070113292 | Kao et al. | May 2007 | A1 |
20070192272 | Elfayoumy et al. | Aug 2007 | A1 |
20080065603 | Carlson et al. | Mar 2008 | A1 |
20080288431 | Colson et al. | Nov 2008 | A1 |
Number | Date | Country |
---|---|---|
2399666 | Sep 2004 | GB |
WO-9918825 | Apr 1999 | WO |
WO-00063841 | Oct 2000 | WO |
WO-0122280 | Mar 2001 | WO |
WO-0239254 | May 2002 | WO |
WO-03040892 | May 2003 | WO |
WO-2004053645 | Jun 2004 | WO |
WO-2005069881 | Aug 2005 | WO |
WO-2007047252 | Apr 2007 | WO |
Entry |
---|
Nasukawa, T. and T. Nagano, “Text Analysis and knowledge mining system”, IBM Systems Journal, vol. 40, No. 4, 2001, pp. 967-984. |
Ankerst et al., “DataJewel: Tightly Integrating Visualization with Temporal Data Mining,” ICDM Workshop on Visual Data Mining, Melbourne, FL, p. 19, Nov. 19-23, 2003. |
Arents et al., “Concept-Based Retrieval of Hypermedia Information: From Term Indexing to Semantic Hyperindexing,” Information Processing & Management, vol. 29, No. 3, pp. 373-386, May-Jun. 1993. |
Belew, R., “A Connectionist Approach to Conceptual Information Retrieval,” ICAIL, May 27-29, 1987. |
Bhatia et al., “Conceptual Clustering in Information Retrieval,” Systems, Man, and Cybernetics, Part B, IEEE Transactions, v. 28, issue 3, pp. 427-436. |
Boeing Frontiers, “Text Mining, for Golden Results,” Jul. 2005, p. 50. |
Botev et al. “Context-sensitive keyword search and ranking for XML” Eighth International Workshop on the Web and Databases, Jun. 16-17, 2005. |
Chinese Office Action for Application No. 200680045412.0 dated Jul. 8, 2010. |
Chinese Office Action, CN 200680045412.0 dated Feb. 5, 2010. |
Cross, Fuzzy semantic distance measures between ontological concepts, IEEE, vol. 2, Jun. 27-30, 2004, pp. 635-640. |
Crouch, C. et al., Experiments in automatic statistical thesaurus construction, Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and development in Information Retrieval, pp. 77-88, Jun. 21-24, 1992. |
Deerwester et al., “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, No. 6, pp. 391-407, Sep. 1990. |
Dumais et al., “Inductive Learning Algorithms and Representations for Text Categorization”, Proceedings of the seventh international conference on information and knowledge management, pp. 148-155, ACM Press, 1998. |
Dumais et al., “Optimizing Search by Showing Results in Context,” SIGCHI '01, Mar. 31-Apr. 5, 2001. |
European Examination Report for Application No. 06816600.8, dated Nov. 15, 2011. |
European Search Report issued in EP Application No. 06816600.8, dated Jun. 15, 2009, 6 pgs. |
Fagan, J. L., “Automatic Phrase Indexing for Document Retrieval,” Proc. of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 91-101, Jun. 3-5, 1987. |
Feldman, “Tutorial 4, Mining Unstructured Data”, KDD Tutorial Notes, pp. 182-236, 1999. |
Finkelstein et al., “Placing search in context: The concept revisited,” Tenth International Conference on World Wide Web, pp. 406-414, May 1-5, 2001. |
Fishwick et al., Ontologies for modeling and simulation: issues and approaches, Dec. 5-8, 2004, IEEE, vol. 1, 6 pages. |
Giger, H. P., “Concept Based Retrieval in Classical IR Systems,” Proceedings of the Eleventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, May 1988, pp. 275-289. |
Grotevant, S.M. and Foth, D., The Power of Multidimensional Analysis (OLAP) in Higher Education Enterprise Reporting Strategies, presented at CUMREC 1999, printed Feb. 27, 2006 from http://www.educause.edu/ir/library/html/cmr9917/cmr9917.html, 8 pages. |
International Search Report and Written Opinion, PCT/US10/30801, filed Apr. 13, 2010, ISR/WO dated Jun. 10, 2010. |
International Search Report for International Application No. PCT/US06/39511, dated Apr. 25, 2007, 3 pages. |
Jain et al., Data Clustering: A Review, ACM Computing Surveys, vol. 31, No. 3, Sep. 1999. |
Jang, H. W. and Park, S. Y., “Keyfact Concept for an Information Retrieval System,” Proc. of Natural Language Processing Pacific Rim Symposium, pp. 510-513, Dec. 4-7, 1995. |
Japanese Office Action for Application No. 2008-535618, dated Jan. 19, 2012. |
Jun, M.S. & Park, S. Y., “Keyfact-Based Information Retrieval System”, International Symposium on Digital Library, pp. 521-554 (1997). |
Kolda et al., A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval, ACM Transactions on Information Systems (TOIS), vol. 16, Issue 4, pp. 322-346, Oct. 1998. |
Lawrence, S., “Context in Web Search,” Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, vol. 23, No. 3, pp. 25-32, Sep. 2000. |
Leake et al. “Exploiting rich context: An incremental approach to context-based web search,” Proceedings of Fifth International and Interdisciplinary Conference on Modeling and Using Context, Paris, France, vol. 3554, pp. 254-267, Jul. 5-8, 2005. |
Leake et al., “Towards context-based search engine selection,” Sixth International Conference on Intelligent User Interfaces, Santa Fe, New Mexico, pp. 109-112, Jan. 14-17, 2001. |
Notice of Allowance, U.S. Appl. No. 10/087,053, filed Mar. 1, 2002, dated Mar. 28, 2005. |
Notice of Allowance, U.S. Appl. No. 10/393,677, filed Mar. 19, 2003, dated Dec. 29, 2006. |
Notice of Allowance, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, dated Jul. 10, 2009. |
Notice of Allowance, U.S. Appl. No. 11/275,048, filed Dec. 5, 2005, dated Apr. 27, 2010. |
Notice of Allowance, U.S. Appl. No. 11/275,050, filed Dec. 5, 2005, dated Dec. 2, 2008. |
Notice of Allowance, U.S. Appl. No. 11/548,456, filed Oct. 11, 2006, dated Apr. 1, 2010. |
Notice of Allowance, U.S. Appl. No. 11/656,017, filed Jan. 22, 2007, dated Oct. 5, 2009. |
Notice of Allowance issued in U.S. Appl. No. 10/695,426, dated Jun. 11, 2013. |
Notice of Allowance issued in U.S. Appl. No. 11/275,046, dated Oct. 6, 2010. |
Notice of Allowance issued in U.S. Appl. No. 10/393,677, dated Jul. 13, 2006. |
Notice of Allowance issued in U.S. Appl. No. 10/087,053, dated May 12, 2005. |
Office Action U.S. Appl. No. 11/275,050, filed Dec. 5, 2005, Office Action dated Jul. 29, 2008. |
Office Action U.S. Appl. No. 11/656,017, filed Jan. 22, 2007, Office Action dated Jan. 26, 2009. |
Office Action, U.S. Appl. No. 10/087,053, filed Mar. 1, 2002, Office Action dated Aug. 20, 2004. |
Office Action, U.S. Appl. No. 10/393,677, filed Mar. 19, 2003, Office Action dated Dec. 7, 2005. |
Office Action, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, Office Action dated Apr. 21, 2008. |
Office Action, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, Office Action dated Aug. 27, 2007. |
Office Action, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, Office Action dated Dec. 23, 2008. |
Office Action, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, Office Action dated Feb. 27, 2007. |
Office Action, U.S. Appl. No. 10/677,492, filed Oct. 3, 2003, Office Action dated May 2, 2006. |
Office Action, U.S. Appl. No. 10/695,426, filed Oct. 3, 2003, Office Action dated Nov. 9, 2009. |
Office Action, U.S. Appl. No. 10/695,426, filed Oct. 3, 2003, Office Action dated Aug. 23, 2007. |
Office Action, U.S. Appl. No. 10/695,426, filed Oct. 3, 2003, Office Action dated Jun. 17, 2008. |
Office Action, U.S. Appl. No. 10/695,426, filed Oct. 3, 2003, Office Action dated Mar. 5, 2009. |
Office Action, U.S. Appl. No. 11/275,046, filed Dec. 5, 2005, Office Action dated Jul. 25, 2008. |
Office Action, U.S. Appl. No. 11/275,046, filed Dec. 5, 2005, Office Action dated Mar. 18, 2009. |
Office Action, U.S. Appl. No. 11/275,046, filed Dec. 5, 2005, Office Action dated Oct. 13, 2009. |
Office Action, U.S. Appl. No. 11/275,046, filed Dec. 5, 2005, Office Action dated Oct. 31, 2007. |
Office Action, U.S. Appl. No. 11/275,048, filed Dec. 5, 2005, Office Action dated Aug. 29, 2008. |
Office Action, U.S. Appl. No. 11/275,048, filed Dec. 5, 2005, Office Action dated Dec. 11, 2007. |
Office Action, U.S. Appl. No. 11/275,048, filed Dec. 5, 2005, Office Action dated Jun. 11, 2009. |
Office Action, U.S. Appl. No. 11/275,050, filed Dec. 5, 2005, Office Action dated Nov. 14, 2007. |
Office Action issued in U.S. Appl. No. 10/695,426, dated Jun. 15, 2012. |
Office Action issued in U.S. Appl. No. 10/695,426, dated Nov. 22, 2011. |
Office Action issued in U.S. Appl. No. 10/695,426, dated May 13, 2011. |
Office Action issued in U.S. Appl. No. 10/695,426, dated Nov. 1, 2010. |
PCT/US06/39511, filed Oct. 11, 2006, Written Opinion dated Apr. 25, 2007. |
Salton, G. et al., “A Vector Space Model for Automatic Indexing,” Communications of the ACM, Nov. 1975, pp. 613-620, vol. 18, No. 11. |
Singh et al., Generating Association Rules From Semi-Structured Documents Using an Extended Concept Hierarchy, Proceedings of the Sixth International Conference on Information and Knowledge Management, pp. 193-200, Nov. 10-14, 1997. |
U.S. Appl. No. 10/695,426, filed Oct. 3, 2003, Office Action dated May 24, 2010. |
U.S. Appl. No. 11/275,046, filed Dec. 5, 2005, Office Action dated Apr. 23, 2010. |
U.S. Appl. No. 11/275,048, filed Dec. 5, 2005, Office Action dated Jan. 7, 2010. |
U.S. Appl. No. 11/548,456, filed Oct. 11, 2006, Office Action dated Apr. 14, 2009. |
U.S. Appl. No. 11/548,456, filed Oct. 11, 2006, Office Action dated Aug. 20, 2008. |
U.S. Appl. No. 11/548,456, filed Oct. 11, 2006, Office Action dated Oct. 13, 2009. |
U.S. Appl. No. 11/656,017, filed Jan. 22, 2007, Office Action dated Jul. 8, 2009. |
U.S. Appl. No. 12/423,024, filed Apr. 14, 2009, Office Action dated Mar. 9, 2011. |
U.S. Appl. No. 12/423,024, filed Apr. 14, 2009, Office Action dated Oct. 18, 2011. |
H.C. Arents et al., “Concept-Based Retrieval of Hypermedia Information: From Term Indexing to Semantic Hyperindexing,” Information Processing & Management, vol. 29, No. 3, pp. 373-386, 1993. |
Belew, Richard, “A Connectionist Approach to Conceptual Information Retrieval,” ICAIL '87, pp. 116-126 (1987). |
Kritzestein, B., “Starlight: Knowledge Management on a Whole New Plane,” Chemical and Biological Defense Info Analysis Center Newsletter, vol. 5, No. 4, Fall 2004, pp. 10-12. |
“Text mining, for golden results,” Boeing Frontiers, Jul. 2005, p. 50. |
U.S. Appl. No. 09/393,677, filed Sep. 10, 1999, Motoyama et al. |
U.S. Appl. No. 10/684,434, filed Oct. 15, 2003, Motoyama et al. |
Codd, E.F. et al., Providing OLAP to User-Analysts: An IT Mandate, Hyperion Solutions Corporation, CA, 1993, pp. 1-20. |
Haveliwala, T.H., “Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search,” IEEE, Jul. 15, 2002, 22 pages. |
Deerwester, S. et al., “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, vol. 41, No. 6, pp. 391-407, Sep. 1990. |
Nigam, K. et al., “Learning to Classify Text from Labeled and Unlabeled Documents,” Proceedings of the 15th National Conference of the American Association for Artificial Intelligence, pp. 792-799, Jul. 1998. |
Yang, Y., “An Evaluation of Statistical Approaches to Text Categorization,” Information Retrieval, vol. 1, Issue 1-2, pp. 69-90, 1999. |
Number | Date | Country |
---|---|---|
60366045 | Mar 2002 | US |
60302971 | May 2001 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 10087053 | Mar 2002 | US |
Child | 10393677 | | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 10393677 | Mar 2003 | US |
Child | 15097698 | | US |