The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
Most traditional business intelligence operates by reading large bodies of information and applying statistical techniques to identify “interesting” patterns and/or trends within that information. Applied as general statistical patterns, techniques, such as multivariate regression, are used to test possible relationships in the data. When tuned to spot events early in their life cycle, this approach can yield a large number of unimportant trends and/or patterns and other information that might be characterized as noise. Far too many issues to review manually turn up and the most interesting new developments may be buried because the information of greatest interest represents only a very tiny fraction of the total information available. Thus, as mentioned above, there is need in the art for a system and method of efficiently and rapidly examining high-volume text streams to retrieve data that indicates changes in conditions (e.g., changes in events, activities, associations, affiliations, financial status of competitors, etc.). There is also a need for an associated system and method for analyzing data retrieved from such high volume text streams.
In order to have computer systems read massive amounts of text and still efficiently and accurately identify useful changes in conditions several challenges must be over come. These challenges include, novelty, importance, actionability, transient issues, undetected trends/issues, timeliness, etc., and are explained in detail below.
Regarding novelty, many changes and new events in the environment are important, but not novel or unexpected. For example, the yearly change of seasons has significant effects, but it is not novel or unexpected. An automated system for discovering changes in the external environment does not know this a priori, so an object of the techniques disclosed herein is to intelligently differentiate novel information from common knowledge. The problem is to identify useful previously unknown information.
Regarding importance, not all novel changes in the environment are also important for any given company. For example, the development of electronic file sharing of music has been significant for the music industry, but has had little visible impact on the petroleum industry. Therefore, an object of the techniques disclosed herein is to consider, for each industry or category, only the relevant changes that are most critical to track.
Regarding actionability, often a change is important, but early detection by a few months will not lead to significantly different actions. For example, the development of a new algorithm for optimizing truck deliveries might have significant long-term impacts for the retail industry, but the immediate management response for a retailer may not exist. Thus, spotting this change is not advantageous.
Regarding transient issues, there is significant difficulty in identifying so many possible issues and patterns that an organization can not evaluate all of them. Thus, instead of resolving information overload, systems that identify too many trends simply add to it or worse mislead a company to spend resources on issues that appear important, but are actually transient and irrelevant. With 1.5 trillion new pages of information generated a month, there are almost an unlimited number of possible new patterns or trends. Therefore, another object of the mining technique disclosed herein is to efficiently sort through possible items of interest and focus on those most likely to persist and require action.
Regarding undetected trends/issues, equally problematic are cases in which systems are too finely tuned to recognize patterns in their earliest stages so that important trends are not identified early. As stated earlier, an object of the techniques disclosed herein is to identify meaningful, actionable trends early in their life cycle, without either identifying too many “non-issues” or missing real issues.
Regarding timeliness and currency, in order for a pattern to be identified from a collection of documents, one or more time stamps are associated with each relevant document. These time stamps indicate the timeframe which the document was written by the author, posted to a public online medium, or made available to the public such as the street date. At a minimum, the time stamp should include the date (year, month and date); in a few instances, the time of day may also be relevant, especially for those events that happen over a short period. Since oftentimes informally written content does not have time and date stamps, another object of the techniques disclosed herein is to assign a time dimension if none exists.
Taking into consideration the challenges described above, disclosed herein are embodiments of a method of mining data from a high-volume text stream, a method of evaluating that data and a system for implementing these methods in order to efficiently and accurately discover significant changes in the conditions early so that a firm or organization can take advantage of emerging beneficial changes (e.g., emerging market opportunities) and proactively address emerging disadvantageous changes.
The text mining technique disclosed is used to identify data and evidence from high-volume text streams that would suggest specific changes in conditions (e.g., changes in events, activities, associations, financial status of competitors, etc.). The approach used reverses the usual approach of starting with raw data and moving toward suggested events. Specifically, this text mining technique incorporates the use of predetermined scenarios that characterize one or more specific changes as well as identified variables that are relevant to those predetermined scenarios. The identified variables are input as mining parameters into a data mining tool. Retrieved data is analyzed statistically and the results of the analysis are evaluated for statistical significance and/or evaluated visually in order to rapidly and efficiently discover relevant changes. Data mined according to this mining technique (or any other suitable mining technique) can be evaluated according to the data evaluation technique that is also disclosed herein. The data evaluation technique visually displays results of a statistical analysis as well as additional information in order to allow a user, in conjunction with his knowledge of the industry and organization, to identify data that should be filtered, to identify trends and patterns in visually displayed data, to assess the trends and patterns and to prioritize the trends and patterns, based on the assessment. Once the trends and patterns are assessed and prioritized, a user can develop an appropriate action plan. Also, disclosed is an embodiment of a system suitable for implementing such methods.
More particularly, referring to
Once the scenarios are determined, templates for each scenario are defined. Each template includes key variables which are relevant to a given scenario (104). The template is a data structure that is filled in by a combination of manual entry and automatic methods applied to a large, continuous text stream (such as a web crawl, RSS feed, or document stream). The objective of this step is to define variables comprehensively and specifically so that high-precision identification of occurrences that meet the scenario criteria can be found. For example, the variables can comprise at least one of subjects, topics (e.g., person, place, thing, event, etc.) associated with each of the subjects, sentiment related to each subject or topic, geographic locations associated with each of the subjects, source categories, authors, and date ranges. These variables are input by a user (102) into (i.e., received by) the system and, in particular, a data mining tool within the system.
More specifically, a template is a set of variables that are defined which allow the scenario to be tracked and that can be extracted with accuracy by automated text mining techniques. As mentioned above, the key variables are comprehensive and specific and can include a set of subjects, topics of interest, geographic location, websites or sources that contain the mentions and date.
Subjects can include the specific subjects that are relevant to a given scenario. Examples of subjects that can be defined include company names and subsidiaries, company names in the same industry (oftentimes one company's issues have ripple effects through the entire industry), executive and board member names, special interest group names, nick names of companies, and names of campaigns or products. Subjects may also be, for example, product names, film or album titles, or any primary set of object of interest. Synonyms of subjects can also be included and names can be disambiguated names as appropriate.
Topics can include the interesting associations for each subject. Examples of topics that can be defined for a company can include product names, new technologies or practices that may be sensitive or debated publicly, manufacturing facility names, past environmental issues and associated campaign names. Topics can also, for example, include people associated with a company or product, or a key advertising phrase, or any distinguished topic of interest relevant to the set of subjects.
Geographic locations can include the city, county/province/state, country or other location or locations that are relevant for each subject. Examples of locations that can be defined for a company can include the locations of the corporate headquarters, manufacturing facilities or operations. Such locations can be defined narrowly or broadly.
Source (i.e., sources of the electronic text documents) categories can include the groups of digital sources that may be reporting on an event or communicating the important change. For example, one source category may be defined as “local/regional environmental special interest groups” and include websites for specific groups. Another source category may be “national media” and include digital data sources, such as, websites for on-line newspapers or other news media. Another source category may be “personal publications” which include personal blog sites, bulletin boards, etc. Other source categories may indicate if a document is from a pre-defined list of newspapers, non-governmental organizations (NGO) websites, influential blog sites, electronic newsletters, etc. Additionally, a combined source category may reflect that the a document has multiple sources.
Dates can include dates which are captured from each document for data analysis (e.g., the date the document was created, the date the document was modified, dates contained within the text of the document, etc.). These dates can be used to specify the date range for selecting the documents in the analysis. For example, a date range of three to four weeks from the present time may be desirable for getting emerging issues or a date range of more than three months can be selected for trending analysis. Older documents that mention then-controversial issues may not be relevant for current analysis.
Other variables can also be included in the template, e.g., readership or circulation of the sources, demographic information of sources, etc.
In response to user supplied execution instructions (106), the data mining tool applies a mining algorithm to a high-volume text stream and, using the input variables of the templates as the mining parameters, retrieves data that is suggestive of a change or changes indicated by the identified scenarios (116). Specifically, mining algorithms are applied to automatically identify references to the variables (e.g., subjects, topics, etc.) in the text stream. All the variations of the variables (e.g., subject names, topic terms and phrases, etc.) can also be identified in the documents. Other characteristics, such as geographic information and sentiment evaluation can be computed for each identified subject reference. This computation is done around the vicinity of the spotted entity which is within a sentence, paragraph or a given number of words or tokens before and after the subject spot. Additional topics can also optionally be obtained by clustering the words that are interesting and occur frequently around an entity or a number of entities. Consequently, by identifying instances of the key variables and related statistics in the input stream, the mining algorithm can accurately and efficiently identify the evidence of each scenario.
An exemplary data mining algorithm suitable for implementing this technique can comprise a set of unstructured text analytics mining algorithms (e.g., See U.S. patent application Ser. No. 11/160,943, filed on Jul. 15, 2005 and incorporated herein by reference, which discloses an exemplary mining algorithm suitable for use herein). The high-volume text stream from which the data is mined can comprise one or more text-based electronic documents (e.g., an unstructured text document (UTD)). The documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc and may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to the mining algorithm.
Once the data is retrieved by the mining algorithm at process 116, it may be stored, for example, in a common data structure such as a data base (117). More specifically, the retrieved data can be inserted in appropriate fields in the templates. An exemplary structure for storing the templates and the data retrieved from the mining operations is a relational database.
A statistical analysis can then be performed on the retrieved data (118). The statistical analysis can be user-selected and modified, on demand (108). The results of the statistical analysis can be evaluated to identify trends and/or patterns that are suggestive of a change or changes characterized by the predetermined scenarios (119). Specifically, the results of the statistical analysis can be evaluated (e.g., by a processor) for statistical significance within a predetermined threshold and/or evaluated visually in order to identify such trends and/or patterns (120-121). Known techniques may be used to determine whether the results are statistically significant and/or to visually evaluate the results.
Additionally, a novel technique for visually evaluating the results of the analysis at process 121 is also disclosed. This technique can comprise displaying the results in one or more visual formats (e.g., graphs, tables with numerical values and/or text, charts, maps, tabulations, diagrams, etc.) (122). The number and types of visual formats to be displayed as well as the dimensions thereof (e.g., subjects, topics associated with each of the subjects, geographic locations associated with each of the subjects, source categories, date ranges, etc.) can also be user-selected and modified, on demand (110). These graphs, tables, etc. can be visually evaluated by the user in order to identify trends and/or patterns in the data suggestive of important, novel, actionable, timely, etc., changes in conditions.
Sample visual formats that would facilitate discovery of potentially significant market changes can include, for example: (a) the “number of mentions,” where the y-axis is the number of mentions and the x-axis are topics for each subject (e.g., comparing the number of mentions of one brand of automobile and safety with the mentions of another brand of automobile and safety); (b) “term associations,” where the y-axis is the number of mentions and the x-axis are the emerging topics associated with companies or products over time (e.g., increase in mentions of “genetically modified foods,” “aquaculture,” etc., over time; (c) “term associations,” where the y-axis is the number of mentions and the x-axis are the most frequently mentioned topics by source category (e.g., “downloading music” in regional newspapers vs. “downloading music” in personal web logs), etc.
In addition to visually displaying the results of the analysis (at process 122), other information may also be simultaneously displayed for evaluation by the user (124). For example, on demand, a user may choose to display a list of documents containing data relevant to the displayed results (e.g., relevant to a particular graph or chart), portions (i.e., a “snippet” or text fragment) of documents containing data relevant to the displayed results, or full documents containing data relevant to the displayed results. Thus, a user can drill down from a list of documents through a list of snippets to the actual document from which the snippet comes, thereby, allowing the item to be viewed in its original context. This technique allows the raw data associated with a particular visualization to be displayed in order to illustrate the rationale for including data from a particular document in the evaluation.
Thus, for example, referring to
By evaluating the displayed information, including both the displayed results (e.g., graphs, charts, etc.) at process 122 and the displayed documents or portions thereof at process 124, a user can also make data filter selections in order to discard duplicate, near-duplicate, known and uninteresting data (128). That is, the retrieved data can be filtered so that data contained in a duplicate document, data contained in a near-duplicate document, previously known data, data meeting a specified criteria (e.g., related to a given subject or topic) and uninteresting data can be discarded. Filtered data can be dynamically re-analyzed and re-displayed in the same manner, as described above at process 117-124, thereby allowing a user to evaluate more accurate results (i.e., results that include less noise) in order identify significant trends and/or patterns in the data that are suggestive of important, novel, actionable, timely, etc., changes characterized by the predetermined scenarios.
More specifically, this exemplary visualization technique works by accessing fields from the templates based on three factors, the user's selection and filtering input, the template design and the type of comparison desired. Consequently, because this visualization technique supports comparisons of values across multiple dimensions (e.g., showing how subjects differ by source category, etc.) and because a user can see how a subject (or subjects) is being discussed in certain sources or in certain timeframes, the “noise” of having all the data treated as a single conglomeration is eliminated and it is easier for a user to discover significant events or trends.
It is further anticipated that the data evaluation technique, disclosed above, for evaluating data mined at process 116 of
More particularly, as mentioned above, the evaluation process 119, includes displaying information related to the mined data (see processes 122-124). The displayed information includes the results of a statistical analysis of the mined data in at least one visual format (e.g., in one or more graphs, tables with numerical values and/or text, charts, maps, diagrams, tables, tabulations, etc.) (122). The type and number of formats, as well as the dimensions (e.g., subjects, topics associated with each of the subjects, sentiments associated with subjects or topics, geographic locations associated with each of the subjects, source categories, authors, date ranges, etc.) can be user-selected and modified, on demand (110). These graphs, tables, etc. can be created, for example, using external tools such as spreadsheets and not necessarily the system tool of the invention. That is, the system tool functions can be designed to complement the capabilities of known tools graphing tools. The displayed information can also include a list of documents containing data relevant to the displayed results, portions of documents containing data relevant to the displayed results or documents containing data relevant to the displayed results (124).
Referring to
For example, at process 402 an analyst may take advantage of the filtering capabilities of the visualization tool to identify the scenarios or template definitions that yielded results that beg further investigation. Referring to the graph 500 of
At process 404, in order to filter the data, the user can visually identify data, such as data contained in a duplicate document, data contained in a near-duplicate document, data matching specific criteria (e.g., data related to certain subjects, topics, geographies, sources, date ranges, etc.), previously known data or uninteresting data, and filter-out (i.e., discard) the identified data. Specifically, some of the evidence returned by the data mining algorithm may not be unique, e.g., one news agency story by one writer may be carried by several on-line newspapers. For some analyses, it may make sense to discard duplicate or near-duplicate documents if the breadth or coverage of a particular article is not of primary interest. Additionally, known or uninteresting evidence, based on a user's knowledge of the industry and/or knowledge a particular company, can also be discarded. For example, if two documents are returned that match the template definition and of those documents one is a web page containing a simple table of contents of an edition of a newspaper and another page contains a full-length article that is listed on the table of contents, the user can elect to discard (i.e., filter-out) data contained in the table of contents page. The remaining data (i.e., filtered data) can then be re-analyzed and re-displayed in the same manner, as described above, thereby allowing a user to evaluate more accurate results (i.e., results that include less noise) in order identify significant trends and/or patterns in the data that are suggestive of important, novel, actionable, timely, etc., changes at process 406, discussed below.
Following prioritizing and filtering of the data at processes 402-404, potentially significant patterns and/or trends can be identified by visually examining the displayed information (406). That is, for each evidence collection returned for a template definition, a user can identify trends and/or patterns, based on the source of the claim, the number of mentions of the claim, the authoritativeness of the claimant, etc. For example, if the scenario variables define a particular brand of soda as a subject, “adolescents” as the topic, personal web sites and blogs as sources for postings in the last three months, then there may be a handful of patterns in the evidence that can be grouped together, such as “reaction to that brand of soda in vending machines in public schools,” “childhood obesity,” and “alternatives to drinking that brand of soda.”
The identified trends and/or patterns can then be assessed (408). Multiple different assessments (410) can be incorporated into the overall assessment of the trends and/or patterns. For example, determinations can be made regarding the significance of the trends/patterns (e.g., Are the trends and patterns statistically significant within a predetermined threshold?) or the type of trends/patterns (e.g., is the change indicated by the trend/pattern beneficial or adverse?). Assessments can also be made to determined the likelihood that the change will occur, the potential impacts of the change (e.g., including impacts on one organization as compared to impacts on competitors) and a time frame of the change. Additional assessments can be made to verify the veracity (e.g., How reliable are the specific sources and/or the source categories for the data upon which the trends and patterns are based?), the validity (Are the trends and patterns contradictory?) and the timeliness (i.e., currency) of the data upon which the trends and patterns are based.
More specifically, various aspects of each trend and/or pattern identified can be assessed. One aspect could be the significance of the trend or pattern. That is, additional statistical analyses can be performed to determine if a change suggested by a trend or pattern is statistically significant within a given threshold. For example, the graph 600 of
Another aspect could be the type of trends or patterns identified. That is, a determination can be made by a user as to whether or not the change suggested by the retrieved data is beneficial or adverse (i.e., represents an opportunity or a threat to a company, firm, etc.). For example, potentially disastrous crisis, if managed well and promptly by the company's executives, could end up reflecting positively on the company. Additionally, organizations will often handle imminent threats differently (perhaps with more immediacy and with more executive involvement) than opportunities.
The urgency of the identified trends and patterns could also be assessed. That is, by considering the operational mode of an organization, the urgency of a response to the change indicated by the trend or pattern can be assessed. For example, if the organization is undertaking a critical project in which the milestones call for a specific operation to take place within a certain timeframe but activists are planning to thwart the employees from completing that task, the issue may need to be managed swiftly as an exception. In contrast, if an organization is planning to launch a new product in six to nine months and there was negative public reaction towards a similar competitor's product launched three months prior, the organization would incorporate the findings using existing business processes.
The potential impact on a company, organization, etc. of a change indicated by a trend or pattern can be assessed, as well as the relative impact to the competitors of that company. That is, the impact of each identified trend or pattern can be assessed by considering worst-case and best-case situations. Consideration would be given to the audience of the document, the short-term and long-term actions for the range of possibilities, etc. Furthermore, the relative impact to competitors (i.e., the reach of the change) may also affect a company's response to the change. For example, graph 700 of
The risk associated with the change can also be assessed. That is, what is the likelihood of the risk, how widespread will the evidence be known, is the risk acceptable, etc.
Other aspects of the identified trend and pattern that should be assessed include the authority (i.e., veracity or feasibility), validity and timeliness of the data upon which the trend/pattern is based. Specifically, the veracity or feasibility of each trend or pattern should be verified, using historical or other references. For example, an environmental organization may post damaging, critical remarks about the company's handling of the spill on their web site, but a user may recognize that the particular organization often has extreme views that are not well regarded or supported by others, and that they will not have much influence. Additionally, the categories of sources can be further broken into more granularly considering the readers and reach of the publication; for instance, news feeds can be differentiated by international and national sources, as a company may have different reputations and operations overseas and nationally. Non-governmental organizations may also be differentiated by their membership profile and activities in the spectrum of high activity and pro-activeness to low, broad recognition and opinion-leader vs. serving local interests, etc. Furthermore, each pattern or trend should be compared with the organization's current understanding of the marketplace. For example, does the trend contradict this understanding?, is the pattern an indication of a mismatch of company's messaging and positioning?, etc. Finally, the timeliness or currency of the evidence and the consequences from the results should be considered. For example, one environmental organizations web site might state that they are considering a letter-writing campaign for the next two weeks against all manufacturing plants in the local region. If relevant information is obtained early, corrective action can be less costly, both in terms of publicity and resources.
Once the overall assessment of the trends and/or patterns is complete, the assessments can be used to prioritize the trends and/or patterns (412). Based on the assessment and priority assigned to the various trends and patterns, responsive action plans both short and long-term can be developed (414). For example, short-term actionable business actions that effectively respond to each pattern or trend can be developed. In developing these short-term plans both corporate functions (e.g., strategic functions, executive management, communications, corporate attorneys, etc.) as well as line of business functions (e.g., product groups, marketing messages, etc.) that may need to act should be considered. Additionally, business processes, management processes within the organization, and longer-term actions that need to modified to effectively respond to each pattern or trend can be developed.
The processes, described above, can be repeated for additional trends and/or patterns or additional data (416-422), as necessary, and action plans that are developed at process 414 can be prioritized (423) based on urgency, potential harm, etc.
Referring to
The controller 804 can be configured so that it is in communication with and can provide communication between each of the other listed features of the invention and can further be adapted to provide overall control of the system 800 based on user input (e.g., via the user-interface 802).
Specifically, the user-interface 802 can be adapted to allow a user to input variables that are relevant to at least one user-identified scenario that characterizes the change (e.g., subjects, topics (i.e., persons, places, things, events, etc.) associated with each of the subjects, sentiments associated with each topic or subject, geographic locations associated with each of the subjects, source categories, authors, date ranges, etc.). The user-interface 802 is further adapted to allow a user to input additional instructions to be implemented within the system 800 via the controller 804 (e.g., execution instructions for the data mining algorithm, selections for the statistical analysis to be applied by the analyzer, display selections, data filter selections, etc.).
The data mining tool 808 can be configured to apply a data mining algorithm to an input text stream (e.g., unstructured text documents 806) in order to retrieve data. The mining algorithm can comprise a set of unstructured text analytics mining algorithms. The parameters for the data mining algorithm can comprise variables that are input by a user and that are relevant to one or more user-identified scenarios that characterize a change or changes. As mentioned above, an exemplary data mining algorithm can comprise a set of unstructured text analytics mining algorithms (e.g., See U.S. patent application Ser. No. 11/160,943, filed on Jul. 15, 2005 and incorporated herein by reference, which discloses an exemplary mining algorithm suitable for use herein).
The high-volume text stream 806 from which the data is mined can comprise one or more text-based electronic documents (e.g., an unstructured text document (UTD)) and the system 800 can further be configured such that the documents 806 are accessible, for example, via the world wide web (WWW), via a wide area network (WAN), via a local area network, etc. Prior to processing by the data mining tool 808 these electronic documents 806 can, optionally, be preprocessed by a preprocessor in order to provide “noise free” text to the mining algorithm.
Once the data is retrieved by the data mining tool 808, it may be stored, for example, in a common data structure 810 such as a data base. More specifically, the retrieved data can be inserted in appropriate fields in the templates. Referring to
The analyzer 814 can be adapted to perform a statistical analysis of the stored data. Additionally, the analyzer can be adapted to identify the results of the statistical analysis that are statistically significant within a predetermined threshold. Additionally, the processor 816 can be adapted to convert results of the analysis into one or more visual formats (e.g., one or more graphs, charts, tables with numerical values and/or text, maps, diagram, tabulations, etc.). The number and types of these visual formats as well as the dimensions thereof can be selected by the user (e.g., via the user interface) and modified, on demand. The user-selected dimensions can comprise one or more of the scenario variables, including, subjects, topics associated with each of the subjects, sentiments associated with topics and subjects, geographic locations associated with each of the subjects, source categories, authors, date ranges, etc. It is anticipated that commercially available visualization and graphical analysis software may be used to implement the analyzer 814 and processor 816 features of the system 800.
The display 818 can be adapted to display the results of the analysis so that a user can visually identify trends and patterns therein that are suggestive of the change. Specifically, the display can be adapted to allow multiple visual formats (e.g., one or more different graphs, charts, tables, maps, diagram, tabulations, etc.) to be displayed simultaneously. The display 818 can further be adapted to simultaneously display additional information, such as, a list of documents containing the data that is relevant to the displayed results, portions of documents containing data that is relevant to the displayed results, or the full text of the documents containing data that is relevant to the displayed results. As mentioned above, the display information can be selected and modified by a user, on demand, to optimize the usefulness of the visual evaluation tool.
Thus, for example, referring to
The data filters 812 can be adapted to discard user-specified data. Specifically, following a visual evaluation of displayed information, including the displayed results of the statistical analysis (e.g., graphs, charts, etc.) and the displayed documents or portions thereof from which the data represented in the displayed results was retrieved, a user may determine that certain retrieved data should be filtered-out (i.e., discarded). For example, a user may request filtering out of specific data that is contained in a duplicate document, that is contained in a near-duplicate document, that matches a specified criteria (e.g., is related to certain subjects, topics, geographies, sources, date ranges, etc.) that was previously known data or that is considered by the user to be uninteresting.
Filtered data can be dynamically re-analyzed by analyzer 814 and re-displayed by display 818 in the same manner, as described above, thereby, allowing a user to more easily and accurately evaluate the data and, specifically, the displayed information in order to identify significant trends and patterns that are suggestive of important, novel, actionable, timely, etc., changes and to recommend actions plans in response to those changes.
The embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the embodiments of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
A representative hardware environment for practicing the embodiments of the invention is depicted in
Therefore, disclosed above, are embodiments of a method of mining data from a high-volume text stream and a method of evaluating that data in order to efficiently and accurately discover significant changes in conditions (e.g., changes in activities, events, associations, affiliations, market preferences, etc.). The data mining technique uses predetermined scenarios that characterize specific changes as well as key variables that are relevant to those scenarios. These variables are input as mining parameters into a data mining tool. Retrieved data is analyzed and the results are evaluated. One technique of evaluating the results includes displaying them in a visual format (e.g., graphs, tables) along with additional information (e.g., lists of documents or portions of documents containing data relevant to the displayed results). A user evaluates the displayed results and additional information in order to identify data that should be filtered, to identify trends and/or patterns in the data, and to assess the trends and/or patterns. Once the trends and/or patterns are assessed, a user can develop and prioritize appropriate action plans. Also, disclosed is an embodiment of a system suitable for implementing such methods.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.