The present invention relates generally to creation of queries for structured and unstructured data repositories.
Unstructured data is typically very voluminous and can overwhelm existing computer systems, a situation commonly called the Big Data problem. Data in Big Data Repositories may be unstructured and not amenable to traditional database query techniques alone. Furthermore, those requiring results from a Big Data Repository may lack the database query creation skills needed to produce desired results. What is needed is a system that allows users with knowledge regarding desirable results, but without specific knowledge of database query techniques, to cause the creation of queries appropriate for their tasks.
A system and methods are provided for interactive construction of data queries. One method comprises: generating a query based upon a plurality of user-identified data items, wherein the user-identified data items are data items representing desired results from a query, and wherein information related to the user-identified data items is included in a “given” clause of the query; assigning received input data to a hierarchical set of categories; presenting to a user a plurality of new query results, wherein the plurality of new query results are determined by scanning the received input data to find data elements in the same hierarchical categories as those in the “given” query clause and not in the same hierarchical categories as those of an “unlike” clause of the query; receiving from the user an indication as to whether each query result of the presented plurality of new query results is a desirable query result; adding query results indicated by the user as desirable to the “given” clause of the query; adding query results indicated by the user as undesirable to the “unlike” clause of the query; evaluating a metric indicative of the accuracy of the query; and, responsive to a determination that the query achieves a predetermined threshold level of accuracy, storing the query.
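The method steps summarized above can be sketched as a short Python loop; every name here (refine_query, categorize, ask_user, accuracy) is a hypothetical stand-in, since the summary does not prescribe a concrete interface.

```python
def refine_query(seed_items, input_data, categorize, ask_user,
                 accuracy, threshold=0.95):
    """Iteratively grow the "given" and "unlike" clauses of a query
    from user feedback until the query reaches a target accuracy."""
    query = {"given": list(seed_items), "unlike": []}
    while accuracy(query, input_data) < threshold:
        # Scan the input data for elements in the same hierarchical
        # categories as the "given" items and not in the categories
        # of the "unlike" items.
        given_cats = {categorize(item) for item in query["given"]}
        unlike_cats = {categorize(item) for item in query["unlike"]}
        candidates = [d for d in input_data
                      if categorize(d) in given_cats
                      and categorize(d) not in unlike_cats
                      and d not in query["given"]
                      and d not in query["unlike"]]
        if not candidates:            # nothing left to ask about
            break
        for result in candidates:
            if ask_user(result):      # user marks result desirable?
                query["given"].append(result)
            else:
                query["unlike"].append(result)
    return query                      # stored once accurate enough
```

The accuracy metric and the categorization function are deliberately left as parameters, since the summary treats both as configurable parts of the method.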
The foregoing summary, as well as the following detailed description of preferred embodiments of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The invention provides a method by which traditional database queries can be run on unstructured data such as Tweets, audio, and video data. In many cases the unstructured data has some meta-information such as the data's time-of-creation, author, or geographic location; but most if not all of the desired signal is hidden inside the unstructured portion. For example, one may desire to know the mood of a tweet's text, such as whether it is angry or happy, but this information is not available unless the text is labeled as such, either by a human or by a special mood-detecting computer program. The novel architecture provides a means for users to create computer subroutines, which may themselves integrate, build, and/or configure other subroutines. The primary capability of the novel architecture is the creation of subroutines that extract signal from the unstructured portion of a data stream (a series of records). Once this signal is detected for a particular piece of data, the data can be categorized by this signal and labeled with its category such that it can be processed by downstream systems that require structured data. In this way, the category of the data, once extracted, represents structure to these downstream systems. In one preferred embodiment, the system extracts hypothesized structures it is not certain of, and downstream systems determine whether they are useful for desired purposes, such as predicting the future value of a particular trend (e.g. stock prices, purchase order volume, etc.). Subroutines extracting hypothesized structures that do not prove useful may eventually be retired and replaced by new hypotheses, and the structures that have proven utility may influence and guide the subroutine building process.
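As an illustration of the kind of subroutine described above, the following sketch labels a record with a hypothesized mood category derived from its unstructured text. The keyword lists and field names are assumptions made for the example, not part of the invention.

```python
# Toy "mood-detecting" subroutine: extracts a hypothesized category
# from a record's unstructured text so that downstream systems can
# treat the category as structure.
ANGRY_WORDS = {"terrible", "awful", "hate"}
HAPPY_WORDS = {"great", "love", "wonderful"}

def label_mood(record):
    """Attach a hypothesized 'mood' category to a record."""
    words = set(record["text"].lower().split())
    if words & ANGRY_WORDS:
        record["mood"] = "angry"
    elif words & HAPPY_WORDS:
        record["mood"] = "happy"
    else:
        record["mood"] = "unknown"   # no hypothesis for this record
    return record

tweet = {"author": "user1", "text": "I love this great product"}
label_mood(tweet)                    # tweet now carries mood="happy"
```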
The novel architecture provides a means for scaling the computer system to accommodate the required processing of a given data stream so that it can be processed in real time. Users may make their subroutines or subroutine builders available through a subroutine repository database. A user may provide guidance during the configuration of a sub-routine so that the configuration is educated to the extent that a user has the time and/or resources to educate the subroutine during configuration. The novel system provides a process through which a user may educate a subroutine for improved categorization accuracy. This education process has been designed in a novel way so as to maximize the subroutine's improvement per second of user time spent providing said education. The novel system also learns which subroutines are successful at different tasks by observing prior user experiences. Thus, over time the system improves its ability to help new users build more accurate subroutines, and to build these subroutines with less user interaction.
The structured data extracted by the subroutines can be made available to traditional database technologies such as SQL database clients. The novel system creates candidate queries in these traditional query languages for insertion into user systems. Cognitive Query Language (CQL) queries are SQL queries that operate on structural information (e.g., a hypothesized category) that has been extracted by the novel system from an unstructured component.
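To illustrate why an extracted category behaves like structure, the sketch below runs an ordinary SQL query over a "mood" column of the kind such a subroutine might populate. The table layout and category names are assumptions for the example; Python's built-in sqlite3 module stands in for the SQL database clients mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'mood' holds the hypothesized category extracted from the text.
conn.execute("CREATE TABLE tweets (text TEXT, mood TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [("love this phone", "happy"),
                  ("worst service ever", "angry"),
                  ("shipping update", "neutral")])

# Once the category exists, a traditional SQL query can use it.
rows = conn.execute(
    "SELECT text FROM tweets WHERE mood = 'angry'").fetchall()
```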
In FIG. 1,
Business Directors (135) and Business Development Analysts (138) desire to find insights into the data stored in the SQL Databases (119). Business Directors (135) interface with data in the SQL Databases (119) through multiple methods. If a Business Director (135) has learned the skills required to form SQL Queries that match the questions they would like to ask about the data, then they may form these SQL Queries and communicate them via link 132 to the SQL Databases (119). The results from these queries may then be communicated back to the Business Director (135) via the Results link 131. Another method by which Business Directors (135) may gain understanding of data stored in the SQL Databases (119) is by reviewing the Daily Purchase Graphs (134) presented to them by the Query Result Presenter (133). Queries are used to generate these graphs. These queries are input to the Query Result Presenter (133) by the Software Engineer (126) via link 128.
The Business Development Analyst (138) may not want, or be able, to form SQL Queries (120) from the questions they want answered. In this case they may request assistance from a Database Administrator (122) through dialog link 141. The Database Administrator (122) may communicate with the Software Engineer (126) via dialog link 125 and may configure the SQL Databases (119) via link 123 such that the databases (119) are more suitable for query by the Business Development Analyst (138). The Database Administrator (122) may advise the Business Development Analyst (138) of what queries they might communicate to the Databases (119) via link 120 in order to retrieve Results (121) that answer the questions they have about the data. Upon successful analysis of the data with respect to these questions, the Business Development Analyst (138) may communicate with the Software Engineer (126) through dialog (127) in order to load the Query Result Presenter (133) with Queries via link 128 such that the Results (129) from those queries (130) are presentable to the Business Directors (135) in graphs such as the Daily Purchase Graphs (134). Alternatively, the Business Development Analyst (138) may require the Software Engineer's (126) help in designing queries for the SQL Databases (119), which the Software Engineer will develop by utilizing the SQL Databases via link 124 and communicating with the Database Administrator (122) via link 125. Upon performing successful analysis, the Business Development Analyst (138) may observe the results presented by the Query Result Presenter (133) through link 136. The Business Development Analyst (138) may then act on this analysis by offering coupons to customers, which are sent via the “Coupon offers” link (139) to Customer Messaging (140), which sends messages offering coupons via link 158 to Customers (100).
Once the Business Directors have answers to their questions presented to them via link 131 or link 134 they can make decisions based on that information and either advise the Business Development Analysts (138) on further investigation (via link 137), output new strategies (151) to the Investment Strategy Department (150), send advice (153) to Product Development (152), send advertising ideas (155) to Advertising (154), or convey supply chain concerns (157) to Supply Chain Management (156). The Business Directors (135) may advise the Business Development Analyst (138) on possible interactions with the customer that should be initiated such as Coupon offers (139).
In the case that a Business Development Analyst (230) has questions about data in the Big Data Repository (210), either their own questions or questions (265) received from Business Directors about the data, they communicate these questions through dialog (224) with a Software Engineer (220). Because the data in the Big Data Repository (210) is unstructured, it does not need a schema designed by a database administrator in this example. In actuality it may be the case that most of the data in a data record is unstructured but some of it is structured and represented in SQL databases (119) as in FIG. 1.
Upon receiving answers to previously asked questions, the Business Development Analyst (230) may then send Coupon offers (235) to Customer Messaging (295). Similarly, upon receiving answers (255) to previously asked questions, the Business Directors may then send Coupon offers (296) to Customer Messaging (295), output new strategies (276) to the Investment Strategy department (275), send advice (281) to Product Development (280), advertising ideas (286) to Advertising (285), or supply chain concerns (291) to Supply Chain Management (290).
Once the Software Engineer (320) has sufficiently developed a set of one or more MapReduce Queries, they may be selected to be run perpetually and in this case they are sent as the “Selected Perpetual MapReduce Programs” (326) to the In Memory Data Grid (370). The In Memory Data Grid (370) then executes these MapReduce programs (326) perpetually on all of the data stored in the In Memory Data Grid. It may be the case that the MapReduce programs (326) update values stored in the In Memory Data Grid (370) and therefore subsequent executions of the MapReduce programs (326) on previously processed data have new results which necessitate the repeated processing. If it is known that an execution of a MapReduce program on previously processed data will not have different results, such as in the case that the data and query configuration have not changed, then a cache of the previous result or a reference to these results can be output from the MapReduce program (possibly for further processing) without requiring re-execution of the MapReduce program on the same data. Such a caching system may be left disabled until it is detected that cached results would have been used, in which case the caching system may be enabled for future processing. The enablement of the result caching system may also have a condition such that enablement only occurs if cached results appear to be of sufficient utility, such as obviating a sufficient amount of MapReduce query re-execution per amount of memory used by the cache.
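The caching behavior described above, including leaving the cache disabled until a would-be hit is detected, can be sketched as follows. The class name, the keying scheme, and the utility threshold are all assumptions made for illustration.

```python
import hashlib
import json

class ResultCache:
    """Caches a MapReduce program's output for a (data, config) pair
    so re-execution on unchanged inputs can be avoided."""
    def __init__(self):
        self.store = {}
        self.enabled = False     # left disabled until a hit is detected
        self.would_have_hit = 0  # counts detections of would-be cache use

    def key(self, data, config):
        # Identical data and query configuration yield an identical key.
        blob = json.dumps([data, config], sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def run(self, program, data, config):
        k = self.key(data, config)
        if k in self.store:
            if self.enabled:
                return self.store[k]          # reuse cached result
            self.would_have_hit += 1
            if self.would_have_hit >= 1:      # utility threshold (assumed)
                self.enabled = True           # enable for future processing
        result = program(data, config)
        self.store[k] = result
        return result
```

In a fuller implementation the enablement condition would weigh re-execution saved against cache memory used, as the text suggests; the threshold of one detected hit is purely illustrative.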
Business Directors (360) raise questions (365) to Business Development Analysts (330) in order to influence Investment Strategies (378), Product Development (380), Advertising (385), Supply Chain Management (390), and Customer Messaging (395) in response to real-time events via link 377 rather than manually via links 379, 381, 386, 391, and 396 respectively. The Business Development Analyst (330) in turn presents the ideas behind those questions to the Software Engineer (320) via dialog 324. The Software Engineer (320) develops MapReduce Queries (325) which act on Big Data in the Repository (310). Results of these queries are sent via link 315 to the NO SQL Database (340) and are presented back to the Software Engineer (320), possibly through an interactive interface. The results may also be sent to the Query Result Presenter (350) via link 345, through which they may be sent onward to the Business Directors (360) and Business Development Analysts (330) via links 355 and 357 respectively. The Query Result Presenter (350) may present Trend Graphs (355, 357) or another illustration of the Results (315, 375). These Trend Graphs or other illustrations (355, 357) may be recognized by the Business Development Analyst (330) as actionable in certain cases. The Identified Real-Time Events that are actionable (377) are sent to the various acting units (378, 380, 385, 390, 395) so that these units can respond to the current trends in real time. The Business Development Analyst (330) and Software Engineer (320) may work together through dialog (324) to refine the Selected Perpetual MapReduce Programs (326) and integrate suggested actions for the Identified Real-Time Events into the Selected Perpetual MapReduce Programs (326) so that these suggested actions are integrated into the message sent via link 377 to the units receiving these messages (378, 380, 385, 390, 395).
The Software Engineer (320) and Business Development Analyst (330) also maintain the set of Selected Perpetual MapReduce Programs (326) such that those queries that are no longer useful are removed from the In Memory Data Grid (370) so that they no longer run.
Business Directors (460) and Business Development Analysts (430) have questions about their business data and desire that Investment Strategy (478), Product Development (480), Advertising (485), Supply Chain Management (490), and Customer Messaging (495) react instantaneously to important Real-Time Events (477). The Business Directors (460) and/or Business Development Analyst (430) may have an idea of what these events are, but they may not know what aspects of the Real-time unstructured data (400) signal these events or anticipate them into the future, nor do they know how to program computers in a functional programming language. Business Directors (460) may build these programs by interacting with the Interactive CQL Query Builder (420), or may ask the Business Development Analyst (430) questions communicated via link 465. The Business Development Analyst (430) generates questions and receives questions (465) from the Business Directors (460), and attempts to answer these questions through interaction with the Interactive CQL Query Builder (420) via link 424.
The Interactive CQL Query Builder (420) creates CQL Queries based on interactions with Business Directors (460) and/or Business Development Analysts (430) via links 464 and 424 respectively. These interactions provide the Business Development Analyst (430) and/or Business Directors (460) with opportunities to guide the query building process, such as selection of an input data stream, selection of trends for prediction, or submission of example data that represent desired query results. The Interactive CQL Query Builder (420) constructs queries during this process and tests them on the Big Data Repository (410) to estimate what subsequent interaction with the Business Development Analyst (430) will be the most useful, or whether such interactions are no longer necessary. The results (421) of the CQL Queries (425) are received by the Interactive CQL Query Builder (420). Some or all of these results (421) are presented to the user in an effort to refine or fix the CQL Queries under development. CQL Queries (425) may alternatively return results via the Results link (415) so that they are input to the NO SQL Database (440). The Interactive CQL Query Builder (420) may then perform further analyses on the results (415) through repeated interaction with the NO SQL Database (440) via link 422.
In this preferred embodiment the user is either a Business Development Analyst (430) or a Business Director (460). Once the user is satisfied with the results (421) returned by the CQL Queries (425), the Interactive Query Builder (420) communicates these queries to the In Memory Data Grid (470) through the Selected Perpetual CQL Queries link (426) for perpetual processing within the In Memory Data Grid (470), thereby processing (and reprocessing as necessary) all new real-time unstructured data (400). The Interactive CQL Query Builder (420) may also configure SQL Databases (450) via link 423 such that data in the NO SQL Database (440) is sent to the SQL Databases (450) via link 445. Data in the SQL Databases (450) may then be queried using traditional SQL Queries by the Business Directors (460) via link 453, and by Business Development Analysts (430) via link 457. The SQL Databases (450) also receive the results of Selected Perpetual CQL Queries (426) running within the In Memory Data Grid (470) as Tagged Data (471). Spreadsheets (452) also receive this data either directly via link 471 or indirectly from the SQL Databases (450) via link 451. The Spreadsheets (452) are configured with Formulas received via link 454 from Business Directors (460) or from Business Development Analysts (430). Script code such as VBScript may also be sent so that the Spreadsheets (452) may be endowed with the ability to perform built-in actions in response to newly arriving data (451, 471). The Spreadsheets (452) act as a simplified interface for visualizing the results of CQL Queries executing on Real Time Data. Business Directors (460) and Business Development Analysts (430) can each create different visualizations of the data, since they have experience working with spreadsheets. For example, the Business Directors (460) may submit Formulas (454) to the Spreadsheets (452) that produce visualized Trend Graphs (455).
Business Development Analysts (430) may perform similar interactions with the Spreadsheets (452) via link 454.
The Selected Perpetual CQL Queries (426), which are run on the In Memory Data Grid (470), identify Real-Time Events (477), and these are sent to Investment Strategy (478), Product Development (480), Advertising (485), Supply Chain Management (490), and Customer Messaging (495) so that these systems can respond to the results of the real-time data analysis performed within the In Memory Data Grid (470). Investment Strategy (478), Product Development (480), Advertising (485), Supply Chain Management (490), and Customer Messaging (495) may be further configured via links 479, 481, 486, 491, and 496 respectively so that they perform certain actions upon being notified of certain Identified Real-Time Events (477). In another preferred embodiment, the Identified Real-Time Events (477) may be configured to suggest certain actions to the Investment Strategy (478), Product Development (480), Advertising (485), Supply Chain Management (490), and Customer Messaging (495) units through configuration of the Selected Perpetual CQL Queries (426). This configuration may be performed either by the Business Directors (460) via link 464, or by the Business Development Analysts (430) via link 424.
Data element #1 (521) is graphed at coordinate “(4,10)” because it has 4 occurrences of the word “ball” and 10 occurrences of the word “sports”. Data element #2 (522) is graphed at coordinate “(5,10)” because it has 5 occurrences of the word “ball” and 10 occurrences of the word “sports”. Data element #3 (523) is graphed at coordinate “(2,8)” because it has 2 occurrences of the word “ball” and 8 occurrences of the word “sports”. Data element #4 (524) is graphed at coordinate “(2,7)” because it has 2 occurrences of the word “ball” and 7 occurrences of the word “sports”. Data element #5 (525) is graphed at coordinate “(6,3)” because it has 6 occurrences of the word “ball” and 3 occurrences of the word “sports”. Data element #6 (526) is graphed at coordinate “(10,3)” because it has 10 occurrences of the word “ball” and 3 occurrences of the word “sports”. Data element #7 (527) is graphed at coordinate “(6,2)” because it has 6 occurrences of the word “ball” and 2 occurrences of the word “sports”. Data element #8 (528) is graphed at coordinate “(9,2)” because it has 9 occurrences of the word “ball” and 2 occurrences of the word “sports”.
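The coordinates above are simply keyword counts, so the mapping from a document to its graphed point can be expressed directly. The function below is an illustrative sketch; the dimension names follow the example.

```python
def to_point(text, dims=("ball", "sports")):
    """Map a document to its keyword-count coordinate."""
    words = text.lower().split()
    return tuple(words.count(d) for d in dims)

# A document with 4 "ball"s and 10 "sports" maps to (4, 10),
# matching Data element #1 above.
doc = "ball " * 4 + "sports " * 10
```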
A primary form of data analysis that does not require the data to be labeled and augmented under human supervision is clustering. Many clustering algorithms exist, and they generally share the goal of achieving a description of the data that organizes it into groups such that data elements in the same group are very similar to one another (e.g. containing the same keywords, or the same frequency of keywords) and data elements that are not in the same group are less similar to each other. The means by which data is assigned to groups differs for each clustering algorithm. The size and number of the clusters is in some sense arbitrary, although some algorithms try to self-configure these variables. One means of compensating for some of the inherent arbitrariness of creating a predetermined number of clusters is to initially create many small clusters (with each group having relatively few data elements associated with it), and then to create a hierarchy of clusters of clusters, and clusters of clusters of clusters, etc., until all of the data is in one big cluster. When these clusters are configured as a hierarchy, as in FIG. 6, each data element belongs not only to its smallest cluster but also to every ancestor of that cluster.
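A minimal sketch of such bottom-up clustering, assuming single-linkage Euclidean distance (one of many possible choices), repeatedly merges the two closest clusters until all of the data is in one big cluster:

```python
def distance(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def hierarchy(points):
    """Agglomeratively cluster points; return the merge history,
    where each cluster is a tuple of point indices."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest
        # (single linkage).
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda pair: min(distance(points[i], points[j])
                                        for i in pair[0] for j in pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges
```

On the eight keyword-count points of the example data (document #N corresponding to index N-1), this happens to reproduce the two-document clusters and their two larger parent clusters, though other linkage choices could group the data differently.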
In FIG. 6, the documents (621-628) are organized into such a hierarchy of clusters.
Although the clusters depicted in FIG. 6 each contain exactly two documents, a cluster may in general contain any number of data elements.
Documents #1 and #2 (621, 622) are in Tier 2 Cluster D (660) since Cluster D (660) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster D (660) is also the smallest cluster with these two documents (621, 622) in it. Documents #3 and #4 (623, 624) are in Tier 2 Cluster E (670) since Cluster E (670) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster E (670) is also the smallest cluster with these two documents (623, 624) in it. Documents #5 and #7 (625, 627) are in Tier 2 Cluster F (680) since Cluster F (680) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster F (680) is also the smallest cluster with these two documents (625, 627) in it. Documents #6 and #8 (626, 628) are in Tier 2 Cluster G (690) since Cluster G (690) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster G (690) is also the smallest cluster with these two documents (626, 628) in it.
It is noteworthy that we refer to the top tier of the hierarchy either as Tier-0 or as Level 1. Tier-1 is the next tier down, comprising Cluster B (640) and Cluster C (650), and Tier-1 is also called Level 2. Tier-2 is the next tier down, comprising Clusters D, E, F, and G (660, 670, 680, 690), and Tier-2 is also called Level 3. These terms for Tiers (top to bottom labeled Tier-0 through Tier-2) and Levels (top to bottom labeled Level 1 through Level 3) will be used throughout the document.
Data element #1 (721) is graphed at coordinate “(10,7)” because it has 10 occurrences of the word “points” and 7 occurrences of the word “win”. Data element #2 (722) is graphed at coordinate “(5,2)” because it has 5 occurrences of the word “points” and 2 occurrences of the word “win”. Data element #3 (723) is graphed at coordinate “(6,9)” because it has 6 occurrences of the word “points” and 9 occurrences of the word “win”. Data element #4 (724) is graphed at coordinate “(6,3)” because it has 6 occurrences of the word “points” and 3 occurrences of the word “win”. Data element #5 (725) is graphed at coordinate “(6,10)” because it has 6 occurrences of the word “points” and 10 occurrences of the word “win”. Data element #6 (726) is graphed at coordinate “(3,5)” because it has 3 occurrences of the word “points” and 5 occurrences of the word “win”. Data element #7 (727) is graphed at coordinate “(9,8)” because it has 9 occurrences of the word “points” and 8 occurrences of the word “win”. Data element #8 (728) is graphed at coordinate “(2,4)” because it has 2 occurrences of the word “points” and 4 occurrences of the word “win”.
The clustering algorithms that were options for clustering the documents along the dimensions in FIG. 5 are equally applicable to clustering the documents along the dimensions in FIG. 7.
In FIG. 8, the same documents (821-828) are organized into a second cluster hierarchy, Hierarchy 2, based on the “points” and “win” dimensions.
Documents #1 and #7 (821, 827) are in Hierarchy 2 Tier 2 Cluster K (860) since Cluster K (860) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster K (860) is also the smallest cluster with these two documents (821, 827) in it. Documents #3 and #5 (823, 825) are in Hierarchy 2 Tier 2 Cluster L (870) since Cluster L (870) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster L (870) is also the smallest cluster with these two documents (823, 825) in it. Documents #4 and #2 (824, 822) are in Hierarchy 2 Tier 2 Cluster M (880) since Cluster M (880) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster M (880) is also the smallest cluster with these two documents (824, 822) in it. Documents #6 and #8 (826, 828) are in Hierarchy 2 Tier 2 Cluster N (890) since Cluster N (890) is an ancestor of these documents, and is more specifically their parent in the hierarchy, which signifies that Cluster N (890) is also the smallest cluster with these two documents (826, 828) in it.
In the bottom row we find documents 1-8 (921-928) arranged in increasing order from left to right. These represent the same documents from the preceding figures.
The dashed lines connecting Document #1 (921) and Document #2 (922) to Cluster D Prototype (961) indicate that these two documents are in this cluster. The left hierarchy, Hierarchy 1, uses the “ball” dimension and “sports” dimension of the input documents to cluster them. Each cluster prototype of this hierarchy (931, 941, 951, 961, 971, 981, 991) has a value for “ball” that is the average “ball” value of the documents of which it is an ancestor. Each cluster prototype of this hierarchy (931, 941, 951, 961, 971, 981, 991) also has a value for “sports” that is the average “sports” value of the documents of which it is an ancestor. Thus, Cluster D Prototype (961) has a “ball” value of 4.5 since its two document descendants, #1 & #2 (921, 922), have “ball” values of 4 and 5 respectively, and (4+5)/2=4.5. Cluster D Prototype (961) has a “sports” value of 10 since its two document descendants, #1 & #2 (921, 922), have “sports” values of 10 and 10 respectively, and (10+10)/2=10.
Cluster E Prototype (971) has a “ball” value of 2 since its two document descendants, #3 & #4 (923, 924), have “ball” values of 2 and 2 respectively, and (2+2)/2=2. Cluster E Prototype (971) has a “sports” value of 7.5 since its two document descendants, #3 & #4 (923, 924), have “sports” values of 8 and 7 respectively, and (8+7)/2=7.5.
Cluster F Prototype (981) has a “ball” value of 6 since its two document descendants, #5 & #7 (925, 927), have “ball” values of 6 and 6 respectively, and (6+6)/2=6. Cluster F Prototype (981) has a “sports” value of 2.5 since its two document descendants, #5 & #7 (925, 927), have “sports” values of 3 and 2 respectively, and (3+2)/2=2.5.
Cluster G Prototype (991) has a “ball” value of 9.5 since its two document descendants, #6 & #8 (926, 928), have “ball” values of 10 and 9 respectively, and (10+9)/2=9.5. Cluster G Prototype (991) has a “sports” value of 2.5 since its two document descendants, #6 & #8 (926, 928), have “sports” values of 3 and 2 respectively, and (3+2)/2=2.5.
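The prototype arithmetic above reduces to per-dimension averaging over a cluster's descendant documents, which can be verified directly:

```python
def prototype(docs):
    """Average each dimension over the cluster's descendant documents."""
    n = len(docs)
    return tuple(sum(d[i] for d in docs) / n for i in range(len(docs[0])))

# ("ball", "sports") values for documents #1, #2, #5, and #7 from the text.
doc1, doc2 = (4, 10), (5, 10)
doc5, doc7 = (6, 3), (6, 2)
cluster_d = prototype([doc1, doc2])   # Cluster D Prototype: (4.5, 10)
cluster_f = prototype([doc5, doc7])   # Cluster F Prototype: (6, 2.5)
```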
The thick lines connecting Document #1 (921) and Document #7 (927) to Cluster K Prototype (962) indicate that these two documents are in this cluster. The right hierarchy, Hierarchy 2, uses the “points” dimension and “win” dimension of the input documents to cluster them. Each cluster prototype of this hierarchy (932, 942, 952, 962, 972, 982, 992) has a value for “points” that is the average “points” value of the documents of which it is an ancestor. Each cluster prototype of this hierarchy (932, 942, 952, 962, 972, 982, 992) also has a value for “win” that is the average “win” value of the documents of which it is an ancestor. Thus, Cluster K Prototype (962) has a “points” value of 9.5 since its two document descendants, #1 & #7 (921, 927), have “points” values of 10 and 9 respectively, and (10+9)/2=9.5. Cluster K Prototype (962) has a “win” value of 7.5 since its two document descendants, #1 & #7 (921, 927), have “win” values of 7 and 8 respectively, and (7+8)/2=7.5.
Cluster L Prototype (972) has a “points” value of 6 since its two document descendants, #3 & #5 (923, 925), have “points” values of 6 and 6 respectively, and (6+6)/2=6. Cluster L Prototype (972) has a “win” value of 9.5 since its two document descendants, #3 & #5 (923, 925), have “win” values of 9 and 10 respectively, and (9+10)/2=9.5.
Cluster M Prototype (982) has a “points” value of 5.5 since its two document descendants, #2 & #4 (922, 924), have “points” values of 5 and 6 respectively, and (5+6)/2=5.5. Cluster M Prototype (982) has a “win” value of 2.5 since its two document descendants, #2 & #4 (922, 924), have “win” values of 2 and 3 respectively, and (2+3)/2=2.5.
Cluster N Prototype (992) has a “points” value of 2.5 since its two document descendants, #6 & #8 (926, 928), have “points” values of 3 and 2 respectively, and (3+2)/2=2.5. Cluster N Prototype (992) has a “win” value of 4.5 since its two document descendants, #6 & #8 (926, 928), have “win” values of 5 and 4 respectively, and (5+4)/2=4.5.
Cluster B (941) is the ancestor of Cluster D (961) and Cluster E (971). Cluster B (941) is also the ancestor of those documents that are descendants of the clusters that are its descendants. This means that Cluster B (941) is an ancestor of documents #1 and #2 (921, 922) because these documents are descendants of Cluster D (961) and Cluster D (961) is a descendant of Cluster B (941). This also means that Cluster B (941) is an ancestor of documents #3 and #4 (923, 924) because these documents are descendants of Cluster E (971) and Cluster E (971) is a descendant of Cluster B (941). A cluster that is the parent of other clusters uses the same dimensions as those of its children. In the case of Cluster B Prototype (941) these dimensions are the same as those used by Cluster D Prototype (961) and Cluster E Prototype (971), namely dimensions “ball” and “sports”. The value for these dimensions can be calculated either as the average of all the documents for which it is an ancestor, or as the average of the values of all the descendant clusters in the same Tier. In the case of Cluster B Prototype (941), the Tier 2 clusters that are its descendants comprise Cluster D Prototype (961) and Cluster E Prototype (971), and therefore their values can be averaged to more easily calculate the prototype values for Cluster B Prototype (941). Thus, Cluster B Prototype's (941) “ball” value is 3.25 since Cluster Prototypes D and E (961, 971) have “ball” values 4.5 and 2 respectively, and (4.5+2)/2=3.25. Cluster B Prototype's (941) “sports” value is 8.75 since Cluster Prototypes D and E (961, 971) have “sports” values 10 and 7.5 respectively, and (10+7.5)/2=8.75.
In the case of Cluster C Prototype (951), the Tier 2 clusters that are its descendants comprise Cluster F Prototype (981) and Cluster G Prototype (991), and therefore their values can be averaged to more easily calculate the prototype values for Cluster C Prototype (951). Thus, Cluster C Prototype's (951) “ball” value is 7.75 since Cluster Prototypes F and G (981, 991) have “ball” values 6 and 9.5 respectively, and (6+9.5)/2=7.75. Cluster C Prototype's (951) “sports” value is 2.5 since Cluster Prototypes F and G (981, 991) have “sports” values 2.5 and 2.5 respectively, and (2.5+2.5)/2=2.5.
Cluster I (942) is the ancestor of Cluster K (962) and Cluster L (972). Cluster I (942) is also the ancestor of those documents that are descendants of the clusters that are Cluster I's (942) descendants. This means that Cluster I (942) is an ancestor of documents #1 and #7 (921, 927) because these documents are descendants of Cluster K (962) and Cluster K (962) is a descendant of Cluster I (942). This also means that Cluster I (942) is an ancestor of documents #3 and #5 (923, 925) because these documents are descendants of Cluster L (972) and Cluster L (972) is a descendant of Cluster I (942). A cluster that is the parent of other clusters uses the same dimensions as those of its children. In the case of Cluster I Prototype (942) these dimensions are the same as those used by Cluster K Prototype (962) and Cluster L Prototype (972), namely dimensions “points” and “win”. The value for these dimensions can be calculated either as the average of all the documents for which it is an ancestor, or as the average of the values of all the descendant clusters in the same Tier. In the case of Cluster I Prototype (942), the Tier 2 clusters that are its descendants comprise Cluster K Prototype (962) and Cluster L Prototype (972), and therefore their values can be averaged to more easily calculate the prototype values for Cluster I Prototype (942). Thus, Cluster I Prototype's (942) “points” value is 7.75 since Cluster Prototypes K and L (962, 972) have “points” values 9.5 and 6 respectively, and (9.5+6)/2=7.75. Cluster I Prototype's (942) “win” value is 8.5 since Cluster Prototypes K and L (962, 972) have “win” values 7.5 and 9.5 respectively, and (7.5+9.5)/2=8.5.
In the case of Cluster J Prototype (952), the Tier 2 clusters that are its descendants comprise Cluster M Prototype (982) and Cluster N Prototype (992), and therefore their values can be averaged to more easily calculate the prototype values for Cluster J Prototype (952). Thus, Cluster J Prototype's (952) “points” value is 4 since Cluster Prototypes M and N (982, 992) have “points” values 5.5 and 2.5 respectively, and (5.5+2.5)/2=4. Cluster J Prototype's (952) “win” value is 3.5 since Cluster Prototypes M and N (982, 992) have “win” values 2.5 and 4.5 respectively, and (2.5+4.5)/2=3.5.
Similar to how we calculated the dimensions and values of Tier 1 Cluster Prototypes (941, 951, 942, 952), we can calculate the dimensions and values of the Tier 0 Cluster Prototypes (931, 932). Cluster A Prototype (931) uses the “ball” and “sports” dimensions utilized by its descendant clusters (941, 951, 961, 971, 981, 991), and can take the value of the average of the Tier 1 Clusters that are its descendants, namely Cluster Prototypes B and C (941, 951). Thus, Cluster A Prototype (931) has a “ball” value of 5.5 since Cluster Prototypes B and C (941, 951) have “ball” values of 3.25 and 7.75 respectively, and (3.25+7.75)/2=5.5. Cluster A Prototype (931) has a “sports” value of 5.625 since Clusters B and C (941, 951) have “sports” values of 8.75 and 2.5 respectively, and (8.75+2.5)/2=5.625.
Cluster H Prototype (932) uses the “points” and “win” dimensions utilized by its descendant clusters (942, 952, 962, 972, 982, 992), and can take the value of the average of the Tier 1 Clusters that are its descendants, namely Cluster Prototypes I and J (942, 952). Thus, Cluster H Prototype (932) has a “points” value of 5.875 since Cluster Prototypes I and J (942, 952) have “points” values of 7.75 and 4 respectively, and (7.75+4)/2=5.875. Cluster H Prototype (932) has a “win” value of 6 since Clusters I and J (942, 952) have “win” values of 8.5 and 3.5 respectively, and (8.5+3.5)/2=6.
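The bottom-up averaging described above may be sketched as follows; the function and variable names are illustrative only, while the dimension values are taken directly from this example:

```python
def average_prototypes(children):
    """Compute a parent cluster's prototype as the per-dimension
    average of its child cluster prototypes."""
    dims = children[0].keys()
    return {d: sum(c[d] for c in children) / len(children) for d in dims}

# Tier 2 prototypes from the example (dimensions "ball" and "sports").
cluster_d = {"ball": 4.5, "sports": 10}
cluster_e = {"ball": 2, "sports": 7.5}
cluster_f = {"ball": 6, "sports": 2.5}
cluster_g = {"ball": 9.5, "sports": 2.5}

# Tier 1 prototypes average their Tier 2 children.
cluster_b = average_prototypes([cluster_d, cluster_e])  # ball 3.25, sports 8.75
cluster_c = average_prototypes([cluster_f, cluster_g])  # ball 7.75, sports 2.5

# The Tier 0 prototype averages the Tier 1 prototypes.
cluster_a = average_prototypes([cluster_b, cluster_c])  # ball 5.5, sports 5.625
```

The same function applies at every tier, which reflects the text's observation that averaging descendant cluster prototypes in the same tier reproduces the values obtained by averaging all descendant documents.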
Although Hierarchy 1 and Hierarchy 2 do not share input dimensions in this example, it is possible for hierarchies to share some input dimensions and keep some unique. It is also possible that they share all input dimensions and differ only in the clustering algorithm. Although the examples of this and previous figures utilize two input dimensions per hierarchy, it is possible for a hierarchy to cluster its inputs along hundreds, thousands, millions, or more dimensions. In one common scenario most of the dimensions contain zero values for most of the inputs. This is called a sparse representation, and the zero values can be stored more efficiently by simply noting which dimensions are nonzero rather than listing all of the zero dimensions. This technique is often used to save memory. Although measuring the distance between two vectors with dense representations (where the zero values and non-zero values do not differ in the means by which they are stored) is compatible with SIMD architectures for improved performance, the sparse representations may benefit from hardware that does not implement SIMD but has improved sparse memory lookups as well as improved unpredictable branching (such as with a short pipeline, or a pipeline whose ill branching effects are countered by multithreading of the pipeline) and/or conditional data movement operations. Thus some hierarchies may be best calculated on certain architectures, while other hierarchies will benefit from execution on different hardware. This circumstance will be illuminated in subsequent figures.
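A sparse representation of the kind described above may be sketched by storing only the nonzero dimensions of each vector, for example in a dictionary; the choice of Euclidean distance and the dimension names here are illustrative assumptions:

```python
from math import sqrt

def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors stored as
    {dimension: nonzero_value} dicts; absent dimensions are zero,
    so only the union of the nonzero dimensions is examined."""
    dims = set(a) | set(b)
    return sqrt(sum((a.get(d, 0) - b.get(d, 0)) ** 2 for d in dims))

# Two documents with mostly-zero dimensions stored sparsely.
doc1 = {"ball": 4, "sports": 10}
doc2 = {"ball": 5, "points": 1}
distance = sparse_distance(doc1, doc2)  # examines "ball", "sports", "points" only
```

Note that the loop iterates over an unpredictable set of keys rather than a fixed-length array, which is why such code tends to favor hardware with fast irregular memory lookups over SIMD-style dense vector units.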
Hierarchies may also use the cluster information of other hierarchies as input, such that the input dimension is specific to the hierarchy and tier of the cluster, and the specific cluster within that tier holds the value of that dimension. Distances between values in this dimension can be calculated and integrated into an overall distance calculation between two data records, or between a data record and a prototype, using various techniques. This will also be illuminated in subsequent figures. Although two hierarchies are listed in the example of this figure, dozens, hundreds, thousands, millions, or more hierarchies might be implemented, especially during the search for which hierarchies are most useful. We will show an automatic method of determining which hierarchies are useful, which can make the instantiation of a large number of hierarchies practical.
Finally, a unit of code implementing an algorithm that organizes data hierarchically may receive as input the raw data associated with each input element and may translate this to spatial coordinates or some other representation internally. In this preferred embodiment it may be the case that no other hierarchies are able to utilize any of the input dimensions utilized by that unit of code. In another preferred embodiment, said unit of code may provide the input dimensions to only those other units of code that are sold by the same vendor, such that the input dimensions are kept private to the vendor that has created said unit of code. In this way a vendor may keep private both the algorithm used to organize data hierarchically, and the mapping of data to dimensions used by that algorithm, such that the vendor may charge a fee relative to the total advantage that the input dimensions and algorithm provide in concert.
In step 1020 the “User creates or modifies a CQL query to search for a certain class of data”. This can be performed through a process with an interactive CQL query builder (420), which will be described in a subsequent diagram. This process can use the data uploaded or selected in step 1010. Once the CQL system has the query loaded and the user has designated that they would like to run it, the process proceeds immediately to step 1030 via link 1025.
In step 1030 the process branches based on the answer to the following question: “Is the query to be run on user-provided streaming data or an existing data stream?”. If the query is to be run on a user-provided stream that is not already loading, then the process proceeds via the “User-provided stream” link (1035) to step 1040. If an existing stream (already uploading) is to be used then the process proceeds via the “Existing stream” link (1036) to step 1050.
In step 1040 the “User uploads a stream of new data”. This data will be processed in real time by the query that was developed and/or designated in step 1020. In other words, in step 1050 the data uploaded in step 1040 will be processed by said query as it is uploaded. Step 1040 proceeds immediately to step 1050 via link 1045.
In step 1050 the “Query is run on incoming data stream”. The query that is run is the query or queries that were developed and/or designated in step 1020. The stream that is processed in real time by this query is the stream designated in step 1030 (in the case that it was a pre-existing stream) or that began uploading in step 1040 (in the case that it required new uploading). The query or queries are continuously run on the incoming data stream as a result of the default repetition of step 1050 via traversal of link 1055. In the case that the “Query no longer needs to continuously run” (1056) the process proceeds via link 1056 to the “End” step (1060).
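The continuous repetition of step 1050 may be sketched as a loop over an incoming stream; the predicate form of the query and all names here are illustrative assumptions, not the CQL implementation itself:

```python
def run_query_on_stream(query, stream, emit):
    """Continuously apply a query (modeled as a predicate) to each
    record of an incoming data stream as it arrives, emitting the
    records that pass; this corresponds to the default repetition
    of step 1050 via link 1055."""
    for record in stream:  # in a real system this blocks on arrival
        if query(record):
            emit(record)

# Toy usage: a query predicate over a two-record stream.
hits = []
run_query_on_stream(lambda r: r["sports"] > 5,
                    iter([{"sports": 10}, {"sports": 2}]),
                    hits.append)
```

Exiting the loop when the stream ends models traversal of link 1056 to the “End” step (1060).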
Step 1110 is the “Hierarchies H1-Hn adjust via partially-supervised algorithms P1-Pn respectively” step. In this step any supervised information is integrated into the hierarchy organization so that data of the same category tends to be clustered together at the higher tiers of the hierarchies, and data of different categories is made to be or remains in separate clusters. If the supervised data is designated by the user to not be relevant to the query under construction then this optimization does not occur. In the common cases that are anticipated there is little or no relevant supervised data; however, it is important that this step integrate such information if it is available. Such information might come from previous queries that have been built by this same user or by other users using the same input data. In this way users can leverage each other's query building to improve their own query building, which may prove to be essential under circumstances where the interactive query builder would otherwise require a lengthy process that results in low quality queries. This step proceeds to step 1115 via link 1111.
Step 1115 is the “User provides new input data or selects an existing piece of data. This data is an example of a desired result from the query” step. In this step the user provides an example that would be a good result from the query. This interaction allows the user to build the query using examples instead of by programming, avoiding the need for special training in programming or for bringing an engineer on staff who has undergone this special training. This step proceeds to step 1120 via link 1116.
Step 1120 is the “New data is organized according to hierarchies H1-Hn. Does the user have more examples of desired results?” step. In this step the hierarchies are generally not reorganized unless multiple examples have been produced by the user (i.e. step 1115 has executed at least twice). Once the system has multiple good examples of the query results, the hierarchies can be sorted by their intrinsic utility in clustering the positive examples together. For example, if three positive examples have been found and a hierarchy has these three examples clustered together in a cluster that contains a total of only four examples, then that cluster is already very similar to what the user would consider a good classifier for the query, and the fourth piece of data in the cluster is a good candidate for being a positive example of data that should pass through the filter (query). Clusters of sizes that hint at good utility tend to contain more positive examples than would be expected by random selection. Hierarchies that appear to have low utility (i.e. hierarchies that cluster together the positive examples with random-like probability) can be recognized as such and may be fixed by changing the input dimensions they examine, pruning the hierarchy, or changing its branching factor, etc. This step proceeds either via the “yes” link 1121 to step 1115 (in the case that the user has more positive examples to present the system), or via the “no” link 1122 to step 1125.
Step 1125 is the “The query is initialized with an initial “Given” clause including the IDs of all the example results. An “Unlike” clause is added to the query, which includes the IDs of any data indicated by the user to not be a desired result of the query” step. In this step the positive and negative examples that have been provided by the user are included in the CQL query text or its data structure and become intrinsic to the query. In a preferred embodiment, the “Given” and “Unlike” clauses of the CQL query are the only parts of the query that are outside the classic SQL syntax. They may be surrounded by comment symbols, such as curly braces “{ }” so that they do not violate the SQL syntax. This step proceeds to step 1130 via link 1126.
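The embedding of the “Given” and “Unlike” clauses inside comment symbols may be sketched as follows; the clause keywords and curly-brace comment convention follow the text, but the exact CQL syntax is an assumption for illustration:

```python
def build_cql(select_sql, given_ids, unlike_ids):
    """Assemble a CQL query: standard SQL text plus "Given" and
    "Unlike" clauses held inside comment symbols (curly braces, per
    the text) so the remainder still reads as classic SQL syntax.
    The clause format here is a sketch, not a specification."""
    given = "{ GIVEN (%s) }" % ", ".join(map(str, given_ids))
    unlike = "{ UNLIKE (%s) }" % ", ".join(map(str, unlike_ids))
    return "%s %s %s" % (select_sql, given, unlike)

query = build_cql("SELECT * FROM stream", [2, 6], [8])
```

An SQL-only consumer could strip the braced sections before parsing, while the interactive builder reads the example IDs back out of them to continue refining the query.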
Step 1130 is the “A current hierarchy HC is selected by one of a number of methods, e.g. the hierarchy is selected with the highest number of positive examples that are within a short distance (hierarchy path through closest shared ancestor) of another positive example” step. In one embodiment this step includes sorting of the hierarchies such that the hierarchy most likely to find a new positive example near multiple already-found positive examples is at the front of the hierarchy list. The front of this list indicates the hierarchy with the highest priority for integration into the query (i.e. the hierarchy that appears most promising in aiding the query builder towards achieving its goals). Indeed sorting may require far more computation than is actually necessary to obtain the hierarchy with the most promising organization, since it is not necessary that the least and second least promising hierarchies be identified and precisely ordered relative to each other. A Top-1 or Top-N sort may suffice such that sorting only occurs for those hierarchies that remain current candidates to be placed in the Top-1 (meaning only the most promising hierarchy is sorted and thus does not actually require a sort since it must only be sorted with itself) or Top-N respectively.
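The Top-1/Top-N observation above can be sketched with a selection routine that avoids a full sort; the scoring function is a hypothetical stand-in for whatever promise metric the embodiment uses:

```python
import heapq

def most_promising(hierarchies, score, n=1):
    """Return the Top-N hierarchies by promise score without fully
    sorting the list; only the Top-N results come back ordered, so
    the least promising hierarchies are never ranked against each
    other."""
    return heapq.nlargest(n, hierarchies, key=score)

# Toy promise scores keyed by hierarchy name (hypothetical values).
scores = {"H1": 0.9, "H2": 0.4, "H3": 0.7}
top = most_promising(list(scores), scores.get, n=1)
```

For Top-1 this reduces to a single linear scan for the maximum, which matches the text's point that precisely ordering the unpromising hierarchies is wasted computation.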
Negative examples may also be used to select the most promising hierarchies, or to eliminate otherwise promising hierarchies from consideration. For example, hierarchies that organize multiple positive examples into reasonably tight clusters would be considered promising, however, if this tight clustering includes negative examples, or includes more than a threshold number of negative examples, then the hierarchy may not be deemed promising. In one embodiment this negative example threshold is a percentage of the number of examples that have been found to be tightly clustered in the hierarchy. In another embodiment the threshold may be set higher or lower depending on what the user has determined to be the desired precision (probability the query returns positive results). This step proceeds to step 1135 via link 1131.
Step 1135 is the “A new “current filter” FC is created for HC that selects new examples, e.g. Data passes through the filter if the set of other data it is closest to within hierarchy HC includes a minimum number of examples that have been positively identified as good results.” step. In this step the aspect of hierarchy HC that caused it to be deemed promising in step 1130 is used to create a new filter. In one embodiment the selection of the hierarchy in step 1130 was not definitive, such as if too few examples have been presented by the user to allow the hierarchies to be properly sorted, such as in the case that only one example has been presented. In this case hierarchy HC is re-analyzed so that the aspect of the hierarchy that is most likely to correctly identify results for the query is selected. For example, consider a portion of the hierarchy that clusters data records together where those records cannot be easily compressed. The interactive query building system may use this as an indication that the data in that portion of the hierarchy may be of interest as it may have more information and/or less redundancy. This method also applies to the case where no examples have yet been presented by the user. The opposite method may also be utilized, so that data records that are clustered together and can be easily compressed signal an interesting cluster that may be of use as a component of a user's query. The history of success of using one or both of these techniques, or other information-based techniques, can be utilized whenever the set of positive and negative example data records results in an inconclusive choice for filtering. In another embodiment, a component of a hierarchy may be considered a good candidate for addition to the query as a filter if that hierarchy component, or a similar component in a similarly constructed hierarchy, was used in a previous query that is not known to be related to the current user's query.
In this way, as the system searches for the next component of the query being built, it is possible for the system to beat random selection techniques, even in the absence of information specific to the current query. This step proceeds to step 1140 via link 1136.
Step 1140 is the “Set total trials FMAX equal to minimum of TFMAX and number of results that pass through FC.” step. In this step the maximum number of trials that will be used to test the current filter, FMAX, is determined. Since the total number of trials cannot be larger than the number of unknown data that are returned by the filter, this is set as an upper bound. Another upper bound for this value is set as the maximum number of trials that should be necessary to determine if a filter is a reasonable addition to a query, which is defined as the TFMAX value. The TFMAX value may be set by the user or learned by the interactive CQL query builder (420) through previous interactive sessions. Previous interactive sessions that were recorded in the context of the current input data that is to be processed may be used to produce a TFMAX value by determining how often a filter became useful after a given number of interactions. Setting the TFMAX value such that all or nearly all of these filters would still be discovered as useful is one technique for deriving the TFMAX value. This step proceeds to step 1145 via link 1141.
Step 1145 is the “Select at random one of the results passing through FC. Present it to the user.” step. In this step the system selects an instance of data that passes through the current filter in order to present it to the user and determine if it is indeed a positive example. This step proceeds to step 1150 via link 1146.
Step 1150 is the “User responds with Yes if it is a desired result of the query, or No. The response is appended to the “Given” clause if Yes, otherwise it is appended to the “Unlike” clause.” step. Positive examples are recorded intrinsically in the query so that the current state of a CQL query aids in its own refinement and improvement. The “Given” clause, which may also be referred to as the “Like” clause, maintains positive examples of data that is desired to be returned by the query (i.e. pass through the filter). The “Unlike” clause of the CQL query maintains a list of results that are known to not be positive examples for the query. In one embodiment the user may also interact through an interface including responses of “Very Like”, “Like”, “Unlike”, and “Very Unlike” so that examples that are reasonably positive examples are separated from prototypical examples, and the same data collection is performed for negative examples (i.e. bad but not terrible examples are maintained in the “Unlike” clause, and the “Very Unlike” clause maintains the list of data that are detrimental to the system if they are returned by the query). If more than “Like” and “Unlike” clauses are included in the query building process then the system may be optimized to take into account this softer classification system. Such a classification system is anticipated to be better suited to queries where it is reasonable to return false positives of certain types but not of other types. In order to maintain SQL compatibility with the query, the CQL query may be stored in a form that is CQL-specific but capable of generating an SQL-compatible query, or it may be stored such that the non-SQL-compatible clauses are held in commented sections of the SQL query so that they do not conflict with the SQL syntax and therefore the CQL query is maintained in SQL-compatible form.
When the sender of a CQL query and the receiver both know that certain clauses are not needed by the receiver or downstream systems, then the sender may opt to not send those clauses that are unnecessary in order to more efficiently send SQL queries as messages, thereby enabling message passing with reduced bandwidth and lower total latency. This step proceeds to step 1155 via link 1151.
Step 1155 is the “Set the current confidence CC that an appropriate binomial distribution (see text) created the sequence of true and false positives identified by the user.” step. In this step the probability that the current filter should be added to the current query is calculated. An “appropriate” binomial distribution is one with a p-value (elemental probability of success) at least as high as the minimum precision selected by the user. The minimum precision that is allowable by the user is related to the maximum percentage of false-positives that are allowable (the probability of a false positive is one minus the precision). The binomial distribution formula simulates the probability of selecting X positive examples out of Y trials from a vessel holding positive and negative examples when the probability of choosing a positive example in any individual trial is p. This maps to the current vetting process (step 1150) such that the number of positive examples the user has identified for the current filter in step 1150 is X, and the total number of times step 1150 has been visited for the current filter is Y. We do not know the true probability p unless we test all of the data records that pass through the filter. We can use the binomial formula to calculate the probability of X given Y and p. If we set the value p to the minimum precision allowable by the user (which is related to the maximum tolerable false positive rate) then we can calculate the probability that the current filter has a p value at least as high as the minimum desired precision. In fact the cumulative binomial distribution function is able to calculate the probability that X or fewer positive examples would have been found, and one minus this value is the probability that at least X+1 values would be found.
We can calculate the desired value (the probability that at least the actual number of positive examples that were found would have been found) as one minus the cumulative distribution function calculated on a value X that is one less than the number of positive examples we have found so far. A number of methods exist for calculating bounds on the value of the cumulative distribution function, and table methods can be employed for a small number of trials, which is the case when the user's time is being optimized for (very many trials would be too cumbersome for the user and therefore an unrealistic use case for the novel system).
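The confidence calculation described above may be sketched directly from the binomial formula; the function names are illustrative, and `math.comb` is assumed available (Python 3.8+):

```python
from math import comb

def binomial_confidence(x, y, p):
    """Probability of observing at least x positive responses in y
    trials if the filter's precision were exactly p, i.e. the upper
    tail 1 - BinomialCDF(x - 1; y, p) described in the text."""
    return sum(comb(y, k) * p**k * (1 - p)**(y - k) for k in range(x, y + 1))

# Example: 4 of 5 presented results were confirmed positive, and the
# user's minimum precision is 0.5.
cc = binomial_confidence(4, 5, 0.5)  # (C(5,4) + C(5,5)) / 2**5 = 0.1875
```

Because the trial counts are small (the user's time is being optimized for), this direct summation is inexpensive and no table or bound approximation is needed.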
In this way we calculate the probability that confidence in the current filter's precision being as good as the goal precision would be well placed. In other words, it is possible that positive examples that have been identified by the user from the output of the current filter were accidental and not indicative that the filter is good at finding positive examples. The hypergeometric distribution (and its cumulative hypergeometric distribution function) is generally a more accurate estimator of the probabilities we desire to calculate in step 1155 because our presentation of data records to the user is generally “without replacement”. It is “without replacement” because we will not present the same data record to the user after they have already said whether the data record is a positive or negative example of the current query. Thus the use of the hypergeometric distribution is preferred; however, the binomial distribution is typically a reasonable estimate and may be preferred in certain instances such as when simpler formulas and calculations are desired. Furthermore, what constitutes a simpler formula or calculation is dependent on the software and hardware implementation and should be taken into account when selecting the binomial or hypergeometric functions. The hypergeometric function may introduce inaccuracy due to the fact that the precision of the filter on the initially uploaded input data is not the precision that the filter will have on the streaming data that will be presented later. Thus, the binomial distribution may have a built-in hedge against overly extrapolating from the development input data to the streaming data. The probability calculation in this step determines the level of confidence that should be placed in the filter that is currently under examination being sufficiently precise. This step proceeds to step 1160 via link 1156.
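The without-replacement counterpart may be sketched as the upper tail of the hypergeometric distribution; the parameter names are illustrative, and the population counts would come from the filter's output in practice:

```python
from math import comb

def hypergeometric_confidence(x, n, total, positives):
    """Probability of drawing at least x positives in n draws made
    without replacement from `total` records of which `positives`
    are true positives (upper tail of the hypergeometric CDF)."""
    return sum(comb(positives, k) * comb(total - positives, n - k)
               for k in range(x, min(n, positives) + 1)) / comb(total, n)

# Example: at least 3 positives in 4 presentations drawn from 10
# filter results that contain 5 true positives.
c = hypergeometric_confidence(3, 4, 10, 5)  # 55/210
```

Note the extra inputs (`total`, `positives`) relative to the binomial version: the hypergeometric form requires assuming how many true positives the filter's output contains, which is one reason the simpler binomial estimate may be preferred.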
Step 1160 is the “Is CC at least the minimum confidence CMIN?” step. In this step the probability/confidence value CC that was calculated in step 1155 is compared to a minimum confidence value to determine whether the filter is above the confidence threshold for addition to the query. This step proceeds either via “No” link 1161 to step 1165 (in the case that the threshold was not met), or via “Yes” link 1162 to step 1170 (in the case that the threshold of confidence has indeed been met).
Step 1165 is the “Is the likelihood LC of bringing CC to at least CMIN within TFMAX total trials above LGIVEUP?” step. In this step the confidence, which has been found to not be sufficient to add the current filter to the query without further interaction, is processed with respect to all of the user interactions that have been performed using this filter and all that might still be performed. If it is determined that it is unlikely (or likelihood LC is below a certain threshold LGIVEUP) that the current filter will be found to be of sufficient quality within the maximum number of interactions to be allowed TFMAX, then the process proceeds via “No” link 1167 to step 1130. If the process determines that the likelihood LC of identifying the current filter as worthy of addition to the query is sufficiently high within the maximum number of interactions TFMAX that have been previously determined (step 1140), then the process proceeds via “Yes” link 1166 to step 1145. In one embodiment a single negative feedback by the user is sufficient to cause abandonment of the current filter, and a single positive example is enough to allow its inclusion. One example where a single positive example is sufficient for inclusion is if the filter only allows a single value (or very few values) from the initial data upload to pass through. In one preferred embodiment a minimum number of user interactions per filter is used in the low-information cases where the likelihoods are being calculated from very few user interactions with the current filter. For example, the formulas might suggest that one positive and one negative example indicate a sufficiently low likelihood LC such that the filter should be given up on, and in this instance a minimum user interaction rule may be enacted for the specific case of one positive and one negative example for the given desired precision so that the current filter is not yet given up on.
Step 1170 is the “Append the current filter FC to the current query QC” step. In this step the filter is added to the current query so that results that pass through this filter (or are labeled as “passing” through the filter) will also pass (or be labeled as passing) through the query. Step 1170 is reached when the user interactions have indicated that the precision of the current filter is at least as high as the minimum allowable precision. In another preferred embodiment, certain clauses with precision almost as high as the desired precision are maintained as optional clauses for the query that may be added to the query in subsequent configuration. Such clauses may be integrated into the query in the case that a set of clauses is found to have precision higher than necessary, so that when the optional clauses are combined with the set of high precision clauses the total precision is maintained above the minimum allowable precision. This step proceeds to step 1175 via link 1171.
Step 1175 is the “Is the number of hits HC of the current query QC at least as much as the desired (goal) number of hits HG?” step. In this step the user is sent through the process of adding filters to the query until the desired number of results is achieved. In other words, filters of sufficient quality, with sufficiently high precision, are added to the query until enough results pass through the filter. The hits used in this step may either be calculated as true positives or as the sum of true and false positives. In the case where the user has a good estimate of the number or percent of examples that are positives then the hits may be calculated as the number of true positives so that the goal of the query is to find all or nearly all of the positive examples in the data. In the case that there is a limited amount of processing power for handling data records that pass through (or are labeled as passing through) the query then the number of hits may be calculated as the sum of the false positives and true positives, so that the total number of records identified by the system as positive is kept below some maximum number that can be processed.
In mathematical terms the query is like the disjunction of multiple clauses, where passage through any one clause is sufficient to pass through the entire query. In Boolean algebra this is called disjunctive normal form. In one embodiment, achievement of any clause that has sufficient precision may take most or all of the time that the system interacts with the user, and the discovery of any such clause is sufficient to make the query of sufficient quality. For example, a query that finds a very rare piece of data, but that data has extremely high signal for predicting a future outcome, may be sufficient to make the query useful on its own without additional clauses/filters added by means of disjunction. This step proceeds either via “No” link 1176 to step 1130, or via “yes” link 1177 to step 1180.
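The disjunctive form described above may be sketched as follows; the predicate filters are hypothetical stand-ins for clauses appended in step 1170:

```python
def query_passes(record, filters):
    """A query in disjunctive form: a record passes the query if it
    passes through any one of the appended filter clauses."""
    return any(f(record) for f in filters)

# Two toy filter clauses appended to a query (predicates illustrative).
filters = [lambda r: r.get("sports", 0) > 8,
           lambda r: r.get("win", 0) > 9]
hit = query_passes({"sports": 10}, filters)   # passes via the first clause
miss = query_passes({"win": 1}, filters)      # passes neither clause
```

Because passage through any single clause suffices, a lone high-precision clause can carry the whole query, as in the rare-but-high-signal example in the text.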
Step 1180 is the “Current query QC meets the desired precision with the desired level of confidence, and returns the desired number of results or more. Return current query to the user” step. In this step the query is returned to the user. This may also involve the storage of this query into a repository so that it can be loaded by the CQL system easily in the future, such as for processing a new input stream, for improvement via further interaction with the user, or for use by other users through the sale of its use by the user that originally created it. In step 1180 the user may also be presented with an option to enable the sale of the query, and, in this case, the user may also be presented with a number of possible fees to choose from. The interactive CQL query builder (420) may estimate which fees would deliver the best return for the user based on the fees of queries that have performed similar to how the user's new query is anticipated to perform. This estimate may be adjusted based on how the current fees being paid on the novel system relate to those that were previously recorded (e.g. to adjust for inflation or other market factors). This step proceeds to the “End” step 1185 via link 1181.
Step 1185 is the “End” step designating the end of the process of
The Input Data (1220) comprises multiple separate data records, which are rows in the grids of
Hierarchy 1 clusterer (1230) outputs via link 1235 the clusters that it has assigned using its internally stored hierarchy. The Input Data (1220) with a given Identifier (1221) maintains that same Identifier value (1251) in the output (1235). For data with a given Identifier (1251), the unstructured data (1222) that was associated with it has been processed by the Hierarchy 1 clusterer (1230) such that a cluster at each tier of Hierarchy 1 has been assigned to the data. We can see that the data record with Identifier (1251) equal to 1 has been assigned a Hierarchy 1—Level 2 (1252) value of B, and a Hierarchy 1—Level 3 (1253) value of D. This example can be understood as a continuation of the example of
At the beginning of the step depicted in
The selected document (document #2, 922) is presented to the user. In this example the user selects the option designating the document as a positive example of desirable results for the current query under construction. The document ID may be added to the “Given” or “Like” clause of the query. In another preferred embodiment all of the child units of Cluster D (961) receive signals, and these units ignore the signals if they have already been presented to the user. In another embodiment the units representing the documents may be distributed across multiple computer processors. Each processor may determine whether it contains the document that will be selected at random by generating a random number. If the distributed processors use the same random number generating algorithm and seed, and remain in synchrony, then they will all generate the same random number. This random number can be used to select the document in a distributed fashion. In another preferred embodiment a signal is sent only to the processor that is managing the unit representing the document that is chosen at random. This is another method that may accommodate distributed processing of the document selection.
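The shared-seed selection can be sketched as below. The function name, the partitioning of documents across two processors, and the draw-counting scheme are illustrative assumptions; the point is only that peers sharing an RNG algorithm, seed, and synchrony all draw the same value:

```python
# Sketch: each processor draws from an identically seeded RNG, so all
# processors agree on which document was chosen, and each can decide
# locally whether it owns that document. Names are hypothetical.
import random

def select_on_processor(local_doc_ids, all_doc_ids, seed, draw_number):
    rng = random.Random(seed)
    for _ in range(draw_number):              # replay draws to stay in synchrony
        chosen = rng.choice(sorted(all_doc_ids))
    return chosen if chosen in local_doc_ids else None

all_ids = {1, 2, 3, 4, 5, 6, 7, 8}
# Two processors, each holding half of the documents:
pick_a = select_on_processor({1, 2, 3, 4}, all_ids, seed=42, draw_number=1)
pick_b = select_on_processor({5, 6, 7, 8}, all_ids, seed=42, draw_number=1)
# Both agree on the draw; exactly one finds the document locally.
print(pick_a, pick_b)
```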
Note that the scale of the example had to be kept small such that it fit in a reasonable number of figures. Discovery of two successful matches in a clause may well be spurious and stronger statistical significance may be needed to justify the addition of a clause. For example, it may be that clauses with fewer than M number of positive examples cannot reach sufficient statistical significance and thus do not merit enquiry in the process performed by the interactive CQL query builder (420).
The walkthrough that began in
It is also possible that refinements to the clustering within a single hierarchy are sought as clauses. For example, if Cluster M (982) was found not to be a good clause to add, due to document #4 (924) having been found not to be a desirable result of the query, the search might continue by examining whether cluster J (952) is a desirable clause with Cluster M (982) excluded. Thus, instead of only having clauses that include all the documents that are children of a certain cluster, a clause might include all the documents under a certain cluster that do not also fall under another particular cluster. For example, if cluster H (932) is a sub-cluster of a larger hierarchy, then we may find that Cluster H is a good candidate for addition to the list of clauses comprising the current query, but only if cluster M is excluded as a special case. Thus, such a clause that includes Cluster H (932) would not be required to include documents #1-#8 (921-928) but instead could be limited to accepting documents #1, #7, #3, #5, #8, and #6. Searching for such exclusions to a clause must be weighed against whether that search is the best means of reaching the goals of the query in its current state, or whether an altogether different clause is more likely to benefit the query in a way that allows it to achieve its goals more quickly. Thus the system prefers and pursues the main purpose of the interactive CQL query builder (420), which is to minimize the amount of time the user must spend in order to create queries that achieve their goals.
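A clause refined by exclusion reduces to a set difference over cluster memberships. In the sketch below, the memberships are assumptions consistent with the walkthrough (documents #2 and #4 fall under Cluster M):

```python
# Sketch of a clause refined by exclusion: accept every document under one
# cluster except those that also fall under another. Memberships below are
# assumptions consistent with the example walkthrough.
cluster_h = {1, 2, 3, 4, 5, 6, 7, 8}   # documents #1-#8 under Cluster H
cluster_m = {2, 4}                      # documents that also fall under Cluster M

clause_h_minus_m = cluster_h - cluster_m
print(sorted(clause_h_minus_m))  # [1, 3, 5, 6, 7, 8]
```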
This process is very different from the process by which datasets have traditionally been labeled to produce supervised data. Such processes have traditionally not been optimized for user time when preparing unstructured data for downstream processing by both traditional and nontraditional database systems.
A summary of the walkthrough depicted in
The Subroutine Builder Interface implements the Interactive CQL Query Builder (420) interface and also implements further user interactions capable of configuring the Windowing (2030), Optimizer (2050) and Executors (2040). Furthermore, the Subroutine Builder Interface (2020) provides additional means of configuring the Filtering (2025) system beyond those described in
The User (2000) interacts with the Subroutine Builder Interface (2020) via link 2001 and selects a trend (2005) that he/she would like to predict in order to act upon those predictions. Thus, the User (2000) uses the Subroutine Builder Interface (2020) to select the Trend Target Data (2005) that will be used for the Subroutine that the user is building. The Subroutine Builder Interface (2020) then notifies the Trend Target Data (2005) via link 2006, which is either streaming in real time or held in storage, that it is to stream to the Optimizer (2050) via link 2007. The User (2000) also interfaces with the Subroutine Builder Interface (2020) via link 2001 in order to select what input data (2015) is to be utilized for the subroutine being created. The Subroutine Builder Interface (2020) then notifies the Input Data (2015) via link 2017 so that it is transmitted to the Filtering system (2025) via link 2016. The User (2000) then answers a series of prompts presented by the Subroutine Builder Interface (2020) and the Subroutine Builder Interface (2020) determines which subroutines (2085-2093) should be loaded from the Subroutine Repository Database (2010) into the Filtering (2025), Windowing (2030), Executors (2040), and Optimizer (2050) subroutine execution systems based on the goals, Trend Target Data (2005), Input Data (2015) selected by the User (2000), and the history of success associated with each of the subroutines (2085-2093) in the Subroutine Repository Database (2010).
The Subroutine Builder Interface (2020) selects one or more Filter Builders (2085) to load into the Filtering system (2025) via link 2021. The Filter Builders (2060, 2061) may then create the sets of filters F1 and F2 (2062, 2063). There will be at least one set of Filters that does not require the output of other filters as input. In this example, Set F1 (2062) is the set of Filters (2065, 2066, 2067) that does not require the output of any other Filters. In the preferred embodiment depicted in
The Filters (2065-2067 and 2080-2082) provide Column Data output (2027) to the Windowing system (2030) which includes a Set W1 of Windows (2035) comprising multiple Windows (2036) that collect statistics on the Column Data (2027) over time. The specific statistics collected by the Windowing system (2030) are determined by the Windows (2036) that are loaded by the Subroutine Builder Interface (2020) via link 2026. These Windows (2036) will have been selected from among the Windows (2087) available in the Subroutine Repository Database (2010) via link 2022 by prioritizing the loading of Windows (2087) that have previously proven useful with the User-designated Trend Target Data (2005) and Input Data (2015). The statistics collected by the Windows (2036) are output as Statistics (2031, 2051). The Optimizer (2050) receives the statistics input (2051) and processes it using its internal Statistics-to-Trend Target Comparator (2055), or STTC. The STTC (2055) may have multiple different instantiations housed in the Subroutine Repository Database (2010), any of which may be loaded by the Subroutine Builder Interface (2020) via link 2023. The STTC (2055) correlates the Trend Target Data (2005), provided via input 2007, with the Statistics input (2051) using the goals designated by the User (2000) which are communicated to the STTC (2055) by the Subroutine Builder Interface (2020) via link 2023.
Those statistics that are proving useful at predicting the Trend Target Data (2005) in accordance with the User's (2000) goals are identified through the processing of the STTC (2055). These identified statistics are transmitted back to the Windowing System (2030) via the Reinforcement link (2052). Subsequently, the Windowing system (2030) communicates back to the filtering system (2025) via the Reinforcement (2028) link. Those Filters on which useful statistics were collected according to the Reinforcement Signal will receive said Reinforcement (2028) from the Optimizer's (2050) STTC (2055).
The Filter Builders (2060, 2061) may then create more filters that are similar to the filters that have proven useful to the downstream systems. In order to make room for these filters, the Filter Builders (2060, 2061) may remove some unproven filters that have not proven useful after attempts to collect useful statistics over said unproven filters' outputs. The useful filters may then be transmitted from the Filtering system (2025) to the Subroutine Builder Interface (2020), via link 2021, and onward to the Subroutine Repository Database (2010), via link 2022. Once these useful filters have arrived at the Subroutine Repository Database (2010) they are stored in the repository so that they are available to the user for future subroutine building or for sale or trade to other users that may find them useful. Such third party users may desire to load these Filters (2086) if they are for sale in the case that said third party users are interested in predicting the same Trend Target Data (2005) using the same Input Data (2015), and that said Filters (2086) proved useful under those conditions. The Filtering system (2025) may have a direct link (not shown) to the Subroutine Repository Database (2010) in order to more efficiently retrieve and store Filters (2086) into the Subroutine Repository Database (2010). This extends to the Windowing (2030), Executor (2040), and Optimizer (2050) systems as well. If these direct links are present, the Subroutine Builder Interface (2020) does not need to transfer the data itself, but need only notify these systems (2025, 2030, 2040, 2050, 2010) of what data to send and which system should receive it.
Upon successfully predicting the Trend Target Data (2005) under the goal conditions designated by the User (2000), the set of useful statistics and the prediction configuration is sent from the Optimizer (2050) to the Executors system (2040) via the Configuration link (2053). The Executors system (2040) comprises one or more sets (2045) of Executors (2046) that receive statistics as input (2031) from the Windowing system (2030) and, according to the configuration (2053) performed by the Optimizer (2050), execute specific actions designated by the User (2000) in the case of successful prediction. Such actions might comprise sending coupons to users, changing a stock trading policy, retweeting a piece of news, modifying the proportion of purchases made from one supplier or another, or some other action.
In this example Document #1 (2101) arrives first, followed by Document #2 (2102), and so on until Document #8 (2108) arrives last. At the beginning of the example the Cache (2110) is empty. More commonly there will already be data in the cache (2110), and the oldest data will be removed from the oldest pole (2130) of the cache (2110) to make room for new entries, which will appear at the newest pole (2120). Because the cache (2110) starts out empty in our example, we begin adding new entries to the cache (2110) at the oldest pole (2130) and move newer entries in a given row toward the newest pole (2120) as necessary until the row is filled. If the cache (2110) were to overflow in our example, then entries would be removed from the oldest pole (2130) and added at the newest pole (2120). It is noteworthy that the poles are logical rather than physical, since sliding all of the data to the left whenever an entry is removed would be an expensive operation. Ring buffers can implement the cache (2110) in the way described without requiring expensive memory operations.
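The pole behavior of one cache row can be sketched with a bounded double-ended queue, which gives exactly the append-at-newest, evict-at-oldest semantics described; the row capacity of 4 is an assumption for illustration:

```python
# Sketch of one cache row as a ring buffer: entries are appended at the
# "newest pole" and evicted from the "oldest pole" without sliding data in
# memory. collections.deque with maxlen provides this behavior directly.
from collections import deque

row_b = deque(maxlen=4)          # row capacity is an illustrative assumption
for doc_id in [1, 2, 3, 4]:
    row_b.append(doc_id)         # entries grow toward the newest pole
print(list(row_b))               # [1, 2, 3, 4] — 1 sits at the oldest pole

row_b.append(9)                  # overflow: the oldest entry (1) is evicted
print(list(row_b))               # [2, 3, 4, 9]
```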
When the first data (2101) arrives at the Cache (2110) via link 2027 it brings with it labels of B, D, I, and K in columns 1272, 1273, 1274, and 1275 respectively. The empty cache (2110) stores a new data record's Identifier column value (1271) in the relevant rows. In our example, for each hierarchical cluster that a data record belongs to, an entry is inserted into the corresponding row of the cache (2141-2152). This entry is stored as the data record's Identifier value (1271). Thus, a value of 1 is stored in the cache for Document #1 (2101). The value 1 is appended, starting from the left, to the B, D, I, and K rows (2141, 2143, 2147, 2149) because document #1 (2101) is in clusters B, D, I and K (941, 961, 942, 962). We can see in the cache (2110) that the value of 1 is nearest the oldest pole (2130) line in these rows (2141, 2143, 2147, 2149) showing that the value 1 was appended to the appropriate rows in the empty cache (2110).
When the second document (2102) arrives in the cache (2110), its column values B, D, J, and M (for columns 1272, 1273, 1274, and 1275 respectively) result in the second document's (2102) ID value of 2 being stored in cache (2110) rows 2141, 2143, 2148, and 2151. Because document #1 (2101) arrived before it, document #2's (2102) entries are positioned closer to the newest pole (2120) of the cache (2110) than document #1's in those rows (2141, 2143) where documents #1 and #2 (2101, 2102) both have entries. Thus, “2” goes to the right of the “1” value in rows 2141 and 2143. Upon arrival of document #3 (2103) as input, the column values of B, E, I, and L (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “3” toward the right in rows 2141, 2144, 2147, and 2150 respectively. Upon arrival of document #4 (2104) as input, the column values of B, E, J, and M (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “4” toward the right in rows 2141, 2144, 2148, and 2151 respectively. Upon arrival of document #5 (2105) as input, the column values of C, F, I, and L (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “5” toward the right in rows 2142, 2145, 2147, and 2150 respectively. Upon arrival of document #6 (2106) as input, the column values of C, G, J, and N (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “6” toward the right in rows 2142, 2146, 2148, and 2152 respectively. Upon arrival of document #7 (2107) as input, the column values of C, F, I, and K (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “7” toward the right in rows 2142, 2145, 2147, and 2149 respectively.
Upon arrival of document #8 (2108) as input, the column values of C, G, J, and N (for columns 1272, 1273, 1274, and 1275 respectively) are stored in the cache (2110) by appending “8” toward the right in rows 2142, 2146, 2148, and 2152 respectively. Since Document #8 (2108) is the last to arrive we can see that in each of the rows in which it was appended it is the rightmost entry for that row.
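The arrival sequence above can be reproduced in a short sketch: each document's ID is appended to the row of every cluster label it carries, so the rightmost entry in each row is always the most recent arrival. The labels are taken from the walkthrough; the plain-list rows are a simplification of the ring-buffer rows:

```python
# Sketch reconstructing the cache rows from the arrival order described
# above: append each document's ID to the row of every label it carries.
from collections import defaultdict

arrivals = [
    (1, ["B", "D", "I", "K"]),
    (2, ["B", "D", "J", "M"]),
    (3, ["B", "E", "I", "L"]),
    (4, ["B", "E", "J", "M"]),
    (5, ["C", "F", "I", "L"]),
    (6, ["C", "G", "J", "N"]),
    (7, ["C", "F", "I", "K"]),
    (8, ["C", "G", "J", "N"]),
]

cache = defaultdict(list)
for doc_id, labels in arrivals:
    for label in labels:
        cache[label].append(doc_id)   # newer entries land toward the right

print(cache["B"])  # [1, 2, 3, 4]
print(cache["I"])  # [1, 3, 5, 7]
print(cache["N"])  # [6, 8] — document #8 is rightmost in its rows
```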
Statistics (2100) are gathered on these cache rows (2141-2152), which are sorted through time, and running tallies are kept for each of the different time slice spans (2165, 2170, 2175, 2180) that will be calculated for statistics (2100). In one preferred embodiment, time slice 1 (2165) is 1 minute, time slice 2 (2170) is 5 minutes, time slice 3 (2175) is 1 hour, and time slice 4 (2180) is 24 hours. In another preferred embodiment the statistics are the sum of the number of data records (documents) that have had a column value equal to the Data Column Label (2160) during a particular time slice (2165-2180). In such an example the time slice 4 column (2180) would always hold values at least as large as the adjacent time slice 3 (2175) values, time slice 3 (2175) would always hold values at least as large as the adjacent time slice 2 (2170) values, and time slice 2 (2170) would always hold values at least as large as the adjacent time slice 1 (2165) values. In another preferred embodiment, the difference between this sum and some parameter is calculated and output. In another embodiment the percentage of all data records that have a specific Data Column Label as a column value is measured. This technique would be valuable if the number of data records that arrive via link 2027 is affected by noise, since a percentage formula naturally adjusts to periods when less data arrives. The statistics (2100) are output whenever a change is made in one of the values they hold, and the change is output over link 2190. In another embodiment the full statistics data structure is output at a certain period, such as every 10 milliseconds or every minute. In another embodiment both techniques are used, where the periodic output serves as a keyframe to downstream systems that are monitoring data provided over link 2190. Updates between periodic keyframe updates would then only be required to send information regarding which data has changed, and what value it has changed to.
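The nested time-slice tallies can be sketched as below, using the slice lengths of the embodiment above (1 minute, 5 minutes, 1 hour, 24 hours); the arrival timestamps are invented for illustration. The nesting property (each wider slice holds a count at least as large as the narrower one) falls out naturally:

```python
# Sketch of per-label time-slice tallies: count the documents in one cache
# row whose arrival falls within each time span. Timestamps are assumptions.
SLICES = {"1min": 60, "5min": 300, "1h": 3600, "24h": 86400}  # seconds

def slice_counts(arrival_times, now):
    """arrival_times: seconds-since-epoch timestamps for one label's row."""
    return {name: sum(1 for t in arrival_times if now - t <= span)
            for name, span in SLICES.items()}

now = 100_000
row_b_times = [now - 30, now - 200, now - 4000]   # illustrative arrivals
counts = slice_counts(row_b_times, now)
print(counts)  # {'1min': 1, '5min': 2, '1h': 2, '24h': 3}
```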
Alternatively, the transmitted value can be expressed relative to the keyframe value or to its previous value; so long as the difference and the sign of the difference are sent, this may require fewer bits-per-changed-value to be transmitted. This might enable improved performance in cases where bandwidth is the limiting factor.
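The keyframe-plus-delta scheme can be sketched in a few lines; the statistics values are invented for illustration:

```python
# Sketch of keyframe-plus-delta updates: a full snapshot is sent
# periodically, and between keyframes only signed differences for the
# values that changed are sent.
def deltas(previous, current):
    """Emit (key, signed difference) only for values that changed."""
    return {k: current[k] - previous[k]
            for k in current if current[k] != previous.get(k)}

keyframe = {"B": 4, "C": 4, "I": 4}   # illustrative snapshot
update   = {"B": 5, "C": 4, "I": 3}   # state at the next change
print(deltas(keyframe, update))  # {'B': 1, 'I': -1} — C is unchanged, not sent
```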
A given CQL query may be implemented as a filter, where data records will be given a new column related to the name of the CQL query. Let's consider a CQL query named Q1. A given data record will have a value of “True” for column Q1 if the data record would be returned as a result by Q1, otherwise it may get a “False” value. An example Q1 could be:
(here TABLE_1270 is a reference to the table 1270 of
In another preferred embodiment, the table name is used to designate which hierarchy is being analyzed. Such a query might look like:
One can further imagine that third parties implement filters that assign a mood to a given piece of text. A CQL query operating on this data might well appear as:
Another means of leveraging the capabilities of the CQL queries appears when the window (2036) units are integrated. One method returns all of the data records when a particular statistic value is reached. For example:
This might select all of the records that cause the Time slice 2 statistic (2070) to exceed the value of 5. These data records could then possibly be processed further by downstream systems. Another method could be used to simply extract the event, rather than the data. For example:
These queries can be run indefinitely on incoming streams and the results of these queries, which may achieve insight into the unstructured portion of data records by using hierarchy filters or other filters within their clauses, can be inserted into traditional SQL databases. Thus CQL queries (and the subroutines that support them) may act as an adapter from unstructured data to structured data.
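The query-as-filter behavior described in this passage can be sketched as follows. The predicate standing in for Q1's clauses is an assumption for illustration (membership in hierarchy cluster B), not the actual Q1 text:

```python
# Sketch of a CQL query acting as a filter: a hypothetical query Q1 adds a
# boolean column named after itself, True for records it would return.
def apply_query_as_filter(records, query_name, predicate):
    for record in records:
        record[query_name] = predicate(record)   # new column per record
    return records

records = [
    {"id": 1, "level2": "B"},
    {"id": 5, "level2": "C"},
]
apply_query_as_filter(records, "Q1", lambda r: r["level2"] == "B")
print([(r["id"], r["Q1"]) for r in records])  # [(1, True), (5, False)]
```

The resulting boolean column is structured data, so downstream SQL systems can consume it directly, which is the adapter role described above.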
Filter Z (2290) is an example of a stand-alone filter, although it could be used as a component of a larger subroutine. Filter Z (2290) receives Data input (2210) and produces Column Data output (2296). Filter Y (2200) receives Data (2210) at its Input (2211) and propagates this input to the filters that use it, namely Filter V (2220) and Filter T (2230) via links 2212 and 2214 respectively. Filter V receives the Input (2221) and feeds it to Filter W (2223) via link 2222. Filter W in turn outputs Column Data via link 2224 which is provided as input to Filter X (2225). Filter X (2225) may make use of the original input (2221) as well as the Column data received via link 2224 in order to produce its own output which is sent via link 2226. This column data (2226) is sent to the output (2227) of Filter V (2220). Filter W (2223) may optionally send its output to the output (2227) of Filter V (2220), however in the example of
Filter T (2230) receives input (2231) and sends this to Filters Q and R (2233, 2234) via links 2232. Filter Q (2233) and Filter R (2234) receive link 2232 as input. Filter Q (2233) produces column data and outputs this via links 2236 and 2237 which are sent to Filter S (2238) and to the output (2240) of Filter T (2230) respectively. Filter R (2234) produces column data which is output via link 2235. Filter S (2238) receives input from the output of both Filter Q (2233) and Filter R (2234) via links 2236 and 2235 respectively. Filter S (2238) may also make use of the Input (2231) provided to its parent filter T (2230). Filter S (2238) then processes its input and produces column data output which is transferred via link 2239 to the output (2240) of Filter T (2230). The outputs (2227, 2240) from Filters V and T (2220, 2230) are sent to be processed by Filter U (2250) as input, via links 2228 and 2241 respectively. Filter U (2250) processes its inputs (2228, 2241), and may process the input (2211) to its parent Filter Y (2200) as well. Filter U (2250) then produces output (2260) which, along with the output (2240) from Filter T (2230) that is sent via link 2242, are received by Filter Y's (2200) output (2294). This column data is then sent from the output (2294) as Column Data output (2295). Thus Filter Y (2200) is a subroutine comprised of subroutines (2220, 2230, 2250), which may themselves comprise subroutines; and whenever a subroutine is comprised of other subroutines it organizes their inputs and outputs in a certain way to carry out the computation of the parent filter.
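The wiring of Filter T can be sketched as nested functions: Q and R both consume the parent input, S consumes Q's and R's outputs plus the parent input, and the parent's output carries the column data of both Q and S. The filter bodies below are placeholders, not the patent's actual transformations:

```python
# Sketch of a composite filter wired like Filter T: constituent subroutines
# pass column data to one another, and the parent organizes their inputs
# and outputs. Filter bodies are placeholder transformations.
def filter_q(data):
    return {"q": len(data)}                 # placeholder column data

def filter_r(data):
    return {"r": data[::-1]}                # placeholder column data

def filter_s(q_out, r_out, parent_input):
    # S consumes Q's and R's outputs AND the parent filter's input.
    return {"s": f"{parent_input}:{q_out['q']}:{r_out['r']}"}

def filter_t(data):
    q_out = filter_q(data)
    r_out = filter_r(data)
    s_out = filter_s(q_out, r_out, data)
    return {**q_out, **s_out}               # output carries Q's and S's columns

print(filter_t("abc"))  # {'q': 3, 's': 'abc:3:cba'}
```

Note that R's output never reaches the parent output directly, mirroring how link 2235 feeds only Filter S in the description above.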
The example subroutines (2200, 2220, 2223, 2225, 2230, 2233, 2234, 2238, 2250, 2290) of
The entry for Subroutine T (2230) makes use of other entries in the Subroutine Repository Database (2010), namely subroutines Q, R, and S (2233, 2234, 2238), by referencing these subroutines as Constituent Subroutines (2320). We can see that Filter T (2230) contains Filters Q, R, and S (2233, 2234, 2238) in
The row entry for subroutine T (2230) has a “First Constituent Inputs” (2330) value of “In”, which denotes that the First Constituent Subroutine Q (2233) receives a link from the Input to subroutine T (2230). This is represented in
Additional columns for Fourth Constituent etc. may also be included in a preferred embodiment. The columns that are not applicable to a particular entry may not require storage overhead for the “not applicable” symbol if they are stored in a sparse format. Such a format stores the column name, or another identifier of the column, with the value held in that column. A secondary means of not storing values for column-row pairs that would hold “not applicable” values is to use a reverse index wherein each value that occurs in a column is made to point to the list of rows that contain that value.
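Both storage strategies can be sketched briefly. The row contents below are abbreviated from the subroutine entries described in this section; the dictionary-of-dictionaries layout is an illustrative assumption:

```python
# Sketch of the two strategies above: a sparse row stores only applicable
# columns, and a reverse (inverted) index maps each column value to the
# rows containing it, so "not applicable" cells cost no storage.
rows = {
    "T": {"Type": "Filter", "Constituents": ["Q", "R", "S"]},
    "W": {"Type": "Filter", "Proven useful input": "Music Audio"},
}

# Build a reverse index over the "Type" column:
reverse_index = {}
for name, cols in rows.items():
    reverse_index.setdefault(cols.get("Type"), []).append(name)

print(sorted(reverse_index["Filter"]))   # ['T', 'W']
print("Constituents" in rows["W"])       # False — no cell stored for N/A
```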
Subroutine T (2230) further comprises a “Tweets” value for the “Proven useful input” column (2370), CPU and GPU values for the “Typical best hardware” column (2380), and a “Consumer Index” value for the “Linked trends” column (2390).
Subroutine U 2250 has “Proven useful input” column (2370) value of “Text and audio descriptors”, a “Typical best hardware” column (2380) value of “CPU”, and a “Linked trends” column (2390) value of “S&P 500”, whereas all other columns for subroutine U (2250), besides the subroutine column 2300 and Type column 2310, are not applicable.
The entry for Subroutine V (2220) makes use of other entries in the Subroutine Repository Database (2010), namely subroutines W and X (2223, 2225) by referencing these subroutines as Constituent Subroutines (2320). We can see that Filter V (2220) contains Filters W and X (2223, 2225) in
The row entry for subroutine V (2220) has a “First Constituent Inputs” (2330) value of “In” which denotes that the First Constituent Subroutine W (2223) receives a link from the Input to subroutine V (2220). This is represented in
Subroutine V (2220) further comprises an “Audio” value for the “Proven useful input” column (2370). Subroutine V (2220) also has a “Cognitive” value for the “Typical best hardware” column (2380), which indicates that the computer hardware based on the Cognitive architecture developed by Cognitive Electronics may best execute Filter V (2220). Subroutine V (2220) further has an “S&P 500” value for the “Linked trends” column (2390), which indicates that the value of the S&P 500 stock index has been successfully predicted using Filter V (2220).
Subroutine W (2223) has “Proven useful input” column (2370) value of “Music Audio”, a “Typical best hardware” column (2380) value of “Cognitive”, and a “Linked trends” column (2390) value of “S&P 500”; whereas all other columns for subroutine W (2223), besides the Subroutine column (2300) and Type column (2310), are not applicable. Subroutine X 2225 has a “Proven useful input” column (2370) value of “Audio”, a “Typical best hardware” column (2380) value of “Cognitive”, and a “Linked trends” column (2390) value of “S&P 500”; whereas all other columns for subroutine X (2225), besides the Subroutine column (2300) and Type column (2310), are not applicable.
The entry for Subroutine Y (2200) makes use of other entries in the Subroutine Repository Database (2010), namely subroutines V, T, and U (2220, 2230, 2250), by referencing these subroutines as Constituent Subroutines (2320). We can see that Filter Y (2200) contains Filters V, T, and U (2220, 2230, 2250) in
The row entry for subroutine Y (2200) has a “First Constituent Inputs” (2330) value of “In”, which denotes that the First Constituent Subroutine V (2220) receives a link from the Input to subroutine Y (2200). This is represented in
Subroutine Y (2200) further comprises “Tweets” and “RSS Feed Audio” values for the “Proven useful input” column (2370), “Cognitive”, CPU and GPU values for the “Typical best hardware” column (2380), and “Consumer Index” and “S&P 500” values for the “Linked trends” column (2390).
Subroutine Z (2290) has a “Proven useful input” column (2370) value of “Video”, a “Typical best hardware” column (2380) value of “Cognitive”, and a “Linked trends” column (2390) value of “Wireless usage”; which indicates that subroutine Z (2290) has previously been used successfully to predict the wireless usage (e.g. bandwidth consumed) in a particular environment. All other columns for subroutine Z (2290), besides the Subroutine column (2300) and Type column (2310) are not applicable.
The User (2000) interacts with the Subroutine Builder Interface (2020) via link 2001 in order to designate the input (2403), the preferred organization of the Filters and Windows (2025, 2030), if any, and other configurable parts of the novel system. The user may select, through the Subroutine Builder Interface (2020), which STTC (2480) should be used from the Subroutine Repository Database (2010). The Subroutine Repository Database (2010) houses multiple Subroutine Records (2412), which were previously described in
The User (2000) further configures the subroutine under construction with the selected Input Trend (2420), which is communicated to the Optimizer (2400) via link 2423. The User (2000) further configures the subroutine under construction with the Goal Configuration (2440) via link 2422, which describes the type of prediction that is to be made on the Input Trend (2420). Correlation between the Input Statistics (2410) and the Input Trend (2420) is calculated in the STTC (2430) that has been loaded into the Optimizer (2400). The Input Statistics (2410) and the Input Trend (2420) are communicated to the loaded STTC (2430) via links 2411 and 2424 respectively. Correlation is calculated by the STTC (2430) with the specific goal (2440) that has been specified by the User (2000), which may, for example, dictate how far into the future the Input Trend (2420) is to be predicted, the granularity at which the prediction is to be made, and how confidence in the prediction may be communicated. The goal configuration (2440) is communicated to the STTC (2430) via link 2441. The method used by the loaded STTC (2430) is specific to the STTC (2480) that was selected from the Subroutine Repository Database (2010) by the Subroutine Builder Interface (2020).
Estimated Statistics-to-Trend Relationship Strength (2450) is output by the loaded STTC (2430) via link 2431. The best statistics-to-trend correlations that have been stored in the Estimated Statistics-to-Trend Relationship Strength unit (2450) are reloaded into the STTC (2430) via link (2431) at which point the STTC (2430) creates a predictor of the Input Trend (2420) from specific Input Statistics (2410) according to the selected goals (2440). This predictor is called the Configured Optimizer (2460), and is output via link 2432. The Configured Optimizer (2460) is then loaded into the Subroutine Repository Database (2010) via link 2461, where it is stored as a New Configured Optimizer (2470). The New Configured Optimizer (2470) may then be loaded into an Executor (2046) that, with additional configuration by the User (2000), performs actions based on the predictions of the New Configured Optimizer (2470).
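One simple way an STTC might score the statistics-to-trend relationship is Pearson correlation at a goal-specified lead time. The data series and the one-step lead below are assumptions for illustration; real STTC instantiations may use other methods:

```python
# Sketch of an STTC correlation step: score a windowed statistic's
# usefulness by its Pearson correlation with the trend target at a
# goal-specified lead time. Data and lead time are assumptions.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

stat = [1, 2, 3, 4, 5, 6]           # a windowed statistic over time
trend = [0, 2, 4, 6, 8, 10, 12]     # the trend target, shifted one step later

lead = 1                            # goal: predict one step ahead
score = pearson(stat, trend[lead:lead + len(stat)])
print(round(score, 3))  # 1.0 — this statistic perfectly leads the trend
```

Statistics scoring highly under such a measure would be the ones reinforced back to the Windowing and Filtering systems.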
Step 2500 is the “Start” step. This step begins the process depicted in
Step 2504 is the “Is the trend data already loaded/loading?” step. In this step the flow of the process depicted in
Step 2508 is the “User uploads or begins uploading the trend data” step. In this step the user uploads historical trend data or begins uploading a continuous stream of trend data from which historical trend data will be gathered. From this step the process proceeds to step 2504 via link 2508.
Step 2512 is the “User selects the trend from the available trend data” step. In this step the user will be presented with a means of navigating their selection through the available trend data toward the trend data they would like the system to use. In a preferred embodiment the User (2000) began uploading a proprietary stream of real-time purchases in step 2508 and the User (2000) selects this trend data stream during this step. In another preferred embodiment the User (2000) is presented with some available trend data that has a fee associated with it, such as historical stock price data. In this embodiment the system may present the user with an indicator signaling that this trend data has a fee associated with it, and this signal may include the specific price associated with the data. In another preferred embodiment the user is presented with real-time streaming stock price trend data and the fee for this data may be amortized over all users or grouped with other trends and made available through a bundle with a discount relative to purchasing the trend data individually. From this step the process proceeds to step 2516 via link 2513.
Step 2516 is the “Has the trend data previously been predicted successfully?” step. In this step the history of successful predictions on the trend data is consulted so as to help the User (2000) make successful predictions on the trend data. Historically successful predictions on this trend data may be stored in the Subroutine Repository Database (2010) or in another storage medium. For trend data that has been successfully predicted many times and/or in many different ways, the relevant data from the Subroutine Repository Database (2010) may be condensed into summarized data so that all of the successful records do not need to be consulted whenever a User (2000) would like to make a new prediction of this trend data. Such a summarizing data structure may be updated whenever a new user or new type of prediction is successful at predicting the trend data. This step proceeds either to step 2520 via “No” link 2517 (in the case that the trend data has not previously been predicted successfully), or to step 2532 via “Yes” link 2518 (in the case that the trend data has in fact previously been predicted successfully).
Step 2520 is the “Is the input data already loaded/loading?” step. This step allows the process to diverge based on whether or not the input data is already loaded or loading. The process proceeds from this step to step 2524 via “No” link 2521 (in the case that the input data is not currently loaded or loading), or to step 2528 via “Yes” link 2522 (in the case that the input data is already loaded or loading).
Step 2524 is the “User uploads or begins uploading the input data” step. The process proceeds from this step to step 2520 via link 2525.
Step 2528 is the “User selects the input from the available data” step. In this step the user is presented with options for input data, which will be used to make predictions on the trend data. In one preferred embodiment the User (2000) may select Twitter data with particular tags as the input data. In another preferred embodiment the user may select the Twitter firehose (unfiltered Twitter data) should such data be available. In another preferred embodiment, the user may be presented with multiple free input data options, such as RSS feed updates or Wikipedia website updates, and multiple pay-for options, such as proprietary real-time social network user data. The process proceeds from step 2528 to step 2540 via link 2529.
Step 2532 is the “Is the input data that was previously used also going to be used in this optimization?” step. This step serves as a divergent step for the process depicted in
Step 2536 is the “Present user with previously successful prediction timespans and types of predictions” step. In this step the historical data related to the set of successful predictions that have been made using the selected trend data is processed by the system. The system may retrieve this data from the Subroutine Repository Database (2010) or from another medium on which these historically-successful predictions have been stored. The User (2000) is guided through the set of previously successful timespans and types of predictions so that the user may choose from amongst these prediction timespans and types of predictions. In the case that the user selects one of these previously successful prediction types and timespans, the prediction is considered more likely to succeed. This is because a use case very similar to the current User's (2000) use case was previously successful. Such a selection is considered “known-good”. The process proceeds from this step 2536 to step 2544 via the “User chooses known-good configuration” link (2537), or to step 2540 via the “User does not choose a known-good configuration” link (2538).
Step 2540 is the “User selects the desired timespan and type of prediction. This becomes the Goal Configuration” step. In this step the User (2000) chooses a timespan and type of prediction from the list of possible timespans and types of predictions, rather than from the list of known-good timespans and types of predictions. One way in which this differs from step 2536 is that the timespan and type of prediction may be chosen independently of each other, whereas in the selection from known-good prediction types and timespans the user was presented with paired options when a particular timespan was not known-good for all prediction types, or vice versa. The process proceeds from this step to step 2552 via link 2541.
Step 2544 is the “The STTC with the best performance at the desired prediction type & timespan is loaded from the Subroutine Repository Database into the Optimizer. The Configured Optimizer that resulted from the selected STTC instance may also be loaded from the Subroutine Repository Database into the Optimizer” step. In this step the system is configured to perform similar to the previously known-good configuration that was selected. The process proceeds from this step to step 2548 via link 2545.
Step 2548 is the “User selects the means by which filters and windows form statistics for input into the optimizer. If the user has not yet set up the means by which filters and windows form statistics for input into the optimizer then the user sets up an initial configuration of such. If the user has previously selected the “minimal interaction” mode then filters and windows will be automatically selected to process arbitrary data. (Once a statistic has been found that has signal relative to predicting the desired trend, then the optimizer's feedback to the windows will result in the creation of new filters similar to those that were found to have signal.)” step. The process proceeds from this step to step 2564 via link 2549.
Step 2552 is the “Has the selected input data previously been used to successfully predict trends?” step. The process proceeds from this step to step 2560 via “No” link 2554, or to step 2556 via “Yes” link 2553.
Step 2556 is the “Present the user with STTC that have previously operated on the selected input data if any. STTC that produced successful predictions of the same timespan and type are highlighted” step. The data presented to the user may be retrieved from the Subroutine Repository Database 2010 or from some other database storing the relevant information. The process proceeds from this step to step 2548 via link 2557.
Step 2560 is the “The User is presented with a list of input data types that have been processed previously and the user is asked which of the presented input data types are most like the new input data type that will be processed. If the default option previously selected by the user is the “minimal interaction” mode then the “Unknown” input data type is automatically selected. The STTC with the best performance at the desired prediction type & timespan for the type of data selected by the user is loaded from the Subroutine Repository Database into the Optimizer. The Configured Optimizer is initialized for processing of new input data” step. The process proceeds from this step to step 2548 via link 2561.
Step 2564 is the “Filters and Windows currently or previously under development process the input data in order to generate input for the optimizer” step. The process proceeds from this step to step 2568 via link 2565.
Step 2568 is the “The current statistic is set to the first statistic being input into the optimizer” step. The process proceeds from this step to step 2572 via link 2569.
Step 2572 is the “STTC performs an iteration over the current statistic in order to determine the level of signal present in the statistic useful for performing the desired predictions on the trend data” step. The process proceeds from this step to step 2576 via the “Statistic is found to not have sufficient signal” link 2575, or to step 2580 via the “Statistic is found to have sufficient signal” link 2574, or to itself (step 2572) via the “Further iterations are needed to determine if the statistic has sufficient prediction signal” link 2573.
Step 2576 is the “The current statistic pointer is then set to the next statistic being received as input to the optimizer” step. The process proceeds from this step to step 2572 via the “More statistics are to be processed” link 2577, or to step 2588 via the “All input statistics have been processed” link 2578.
Step 2580 is the “The current statistic is appended to the list of statistics from which prediction will be made, in the Estimated Statistics-to-Trend Relationship Strength unit. The current statistic pointer is then set to the next statistic being received as input to the optimizer” step. The process proceeds from this step to step 2588 if “All input statistics have been processed” via link 2582 or, in the alternative, to step 2584 via link 2581.
Step 2584 is the “The window, filters and filter builder responsible for creating the statistic are notified to create similar filters and windows and to build filters based on the original and new filters/windows in order to generate related statistics that may have more signal” step. This step leads to the creation of windows, filters, and filter builders that are similar to those already found to be “known-useful”. The process proceeds from this step to step 2572 via the “More statistics are to be processed” link 2585.
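The loop of steps 2568 through 2584 can be sketched as follows. This is a minimal illustration, not the claimed implementation: `signal_of` stands in for the STTC's per-statistic signal estimate, and `spawn_similar` stands in for the notification to the filter builder in step 2584 (both names are assumptions).

```python
def scan_statistics(statistics, signal_of, threshold, spawn_similar):
    """Sketch of the statistic-scanning loop (steps 2568-2584): each
    statistic is tested for prediction signal; those with sufficient
    signal are kept, and the filter builder is asked to create similar
    filters/windows that may yield related statistics with more signal."""
    selected = []  # statistics appended for prediction (step 2580)
    for stat in statistics:
        if signal_of(stat) >= threshold:
            selected.append(stat)   # kept in the Estimated STTC unit
            spawn_similar(stat)     # step 2584: request related filters
        # otherwise step 2576: advance to the next statistic
    return selected
```

In the full process the filters spawned by `spawn_similar` would feed new statistics back into the same loop.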
Step 2588 is the “Statistics with sufficient signal are loaded into the STTC from the Estimated Statistics-to-Trend Unit. Models are then trained on the relevant statistic data and trend data to accomplish the Goal Configuration. The trained models are saved in the Configured Optimizer and stored as a New Configured Optimizer in the Subroutine Repository Database so that they can be loaded in order to make the desired predictions.” step. The process proceeds from this step to the “End” step (2592) via link 2589, which concludes the process depicted in
The User (2600) interacts with the Subroutine Builder Interface (2605) via link 2601. The Subroutine Builder Interface (2605) is analogous to that (2020) depicted in
The Segments of Input Data (2611) are also sent to the Filter (2615) units, which produce Column Data (2616) that is sent to the Window (2620) units. In another preferred embodiment the Window systems may themselves send their statistics as segments of input data to downstream Filters (2615), which themselves feed into additional downstream window units (2620). Statistics (2621) are sent by window units (2620) to the Configured Optimizer Unit(s) (2625). The Configured Optimizer unit(s) (2625) also receive Current Trend Data (2622) and create predictions on that trend data, which are sent as Future Trend Predictions (2626) to the Executor unit(s) (2630). The Executor unit(s) (2630) then perform Actions (2631) that respond to the predicted future of the Trend Data (2626).
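The dataflow just described (Filters produce Column Data, a Window condenses columns into a Statistic, and the Configured Optimizer turns the Statistic into a prediction) can be sketched with plain callables standing in for the units; the function names and the trivial example filters are illustrative assumptions, not part of the described system.

```python
def run_pipeline(segments, filters, window, optimizer):
    """Sketch of the Filter (2615) -> Window (2620) -> Configured
    Optimizer (2625) dataflow, with callables standing in for units."""
    # Each Filter unit transforms each segment into Column Data (2616).
    columns = [f(seg) for seg in segments for f in filters]
    # The Window unit condenses Column Data into a Statistic (2621).
    statistic = window(columns)
    # The Configured Optimizer turns the Statistic into a prediction (2626).
    return optimizer(statistic)
```

For example, with a doubling filter, a summing window, and a threshold optimizer, two input segments yield a single boolean prediction that an Executor could act on.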
The Input Data (2700) is input into the Input Data Router (2710). The Input Data Router contains the Registered Consumer Subroutines (2715), which inform the Input Data Router (2710) as to which Server hosting Filters (2740, 2755), and Server hosting Segmenting Filter (2750) should receive a portion of Input data (2711, 2713, 2712). The Registered Consumer Subroutines (2715) are updated via the Configuration data (2717) sent from the Subroutine Host Server (2720). The Input Data Router (2710) in turn sends Data Rate information (2716), which informs the Subroutine Host Server (2720) of how much Input Data (2700) is arriving in real time. This allows the Subroutine Host Server (2720) to respond to the heavier workload that increased Input Data (2700) places on the system. The servers (2740, 2750, 2755, 2760, 2765, 2770, 2775, 2780, 2785), which are described by the bracket as server group 2724, in turn send Load information (2733) to the Subroutine Host Server (2720), which enables the Subroutine Host Server (2720) to correlate the Data Rate (2716) with the required server resources such that a sufficient number can be recruited to handle the current rate of the Input Data (2700).
When the Subroutine Host Server (2720) observes an increase in the Data Rate (2716) and anticipates that this will place a load on the currently recruited servers (2724) such that they may lose their real-time response rate, the Subroutine Host Server (2720) sends Recruitment Information (2721) to one or more Available Servers (2730). The set of Available Servers (2730) that are newly recruited to support the increased workload transition via the “Recruited Servers Going to Work” link 2731. The Subroutine Host Server (2720) then sends Configuration and Routing Information (2722) to the recruited servers (2724) such that the newly recruited servers receive a portion of data for processing. Thus, the newly recruited servers take over a portion of the work and relieve the previously recruited set of servers from having to handle the entire increased load of Input Data (2700).
Conversely, when the Subroutine Host Server (2720) detects from Load information (2733) or Data Rate information (2716) that the set of currently recruited servers (2724) is over-provisioned for the current workload, then Recruitment Relief Information (2723) is sent to the relevant servers that are being relieved. This causes the relieved servers to transition from the set of currently recruited servers (2724) back to the set of Available Servers (2730) via the “Servers leaving work” link (2732). The Subroutine Host Server (2720) must also send Configuration and Routing Information (2722) so that the relieved servers do not have any data processing workload routed to them. The Subroutine Host Server (2720) also notifies the Registered Consumer Subroutines (2715) via the Configuration link (2717) that Input Data (2711, 2712, 2713) should not be routed to the relieved servers.
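The recruit/relieve decision made by the Subroutine Host Server can be sketched as a simple capacity calculation. This is a hedged illustration under stated assumptions: the per-server capacity, the fixed headroom policy, and the function name are all illustrative, not the described system's actual policy.

```python
import math

def scaling_decision(data_rate, recruited, per_server_capacity, headroom=0.2):
    """Sketch of the Subroutine Host Server's recruit/relieve logic:
    keep enough servers that the incoming Data Rate fits within total
    capacity plus a safety headroom (headroom policy is an assumption)."""
    needed = max(1, math.ceil(data_rate * (1 + headroom) / per_server_capacity))
    if needed > recruited:
        # Anticipated overload: send Recruitment Information (2721).
        return ("recruit", needed - recruited)
    if needed < recruited:
        # Over-provisioned: send Recruitment Relief Information (2723).
        return ("relieve", recruited - needed)
    return ("hold", 0)
```

In a real deployment the decision would also weigh the Load information (2733) reported by the server group, not the data rate alone.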
For completeness, as previously described, the Server hosting Filters (2740), may send Column Data (2741, 2742) to other Server hosting Filters (2755, 2760). The Server hosting Filter (2755) also receives its Input data portion from the Input Data Router (2710) and produces Column Data output (2756), which is sent to the Server hosting Filter and Window (2765). The Server hosting Filter (2760) receives Column Data (2742, 2757) from the Server hosting Filter and Server hosting Segmenting Filter (2740, 2750), and may send Column Data output (2761, 2762) to Servers hosting Filter and Window (2765, 2770).
The Servers hosting Filters and Windows (2765, 2770) send Statistics (2761, 2771, 2772) to Optimizer and Executors (2775, 2780, 2785) depending on which Statistics are required by the particular Optimizer and Executor (2775, 2780, 2785). The Optimizer and Executor (2775, 2780, 2785) receive Trend Data input (2790) and, based on the predictions they produce, enact Actions (2776, 2780, 2786).
Once a compilation of a subroutine has been made, it can be tested in order to determine its performance and performance-per-watt on that system. It can be further tested for its bandwidth requirements. For example, different network topologies may be available for the same architecture, one with high bandwidth (2775) and one with less bandwidth between distant nodes (2780). Once the performance of the subroutines has been measured on the various systems (2790-2795), this Performance Data information (2756) is transmitted from these systems (2790-2795) to the Subroutine Host Server (2750, analogous to 2720), which stores aggregated summaries of this data back in the Subroutine Repository Database (2800) via link 2751.
In another preferred embodiment, performance at a subset of the total set of configurations is sufficient to estimate performance on the other systems, and so each subroutine need only be tested on a few systems, or some other non-exhaustive set. For example, poor performance of a subroutine on an AMD-based GPU system may be sufficient to predict poor performance on an Nvidia-based GPU system. In another embodiment poor performance on lower-bandwidth systems (2780) anticipates the possibility of better performance on higher-bandwidth systems (2775), which, with additional evidence, may support testing additional systems only among the very fat-tree networked systems. The bandwidth-to-work-completed correlation may be calculated by the Subroutine Host Server (2750) from the Performance Data (2756). In this way the required network (2775, 2780) can be predicted from the workload completion rate when the subroutine is run on different systems (2760, 2765, 2770).
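One plausible form of the bandwidth-to-work-completed correlation is an ordinary least-squares fit over the measured (bandwidth, work rate) pairs; the sketch below assumes a linear relationship purely for illustration, since the source does not specify the model.

```python
def fit_bandwidth_work_rate(samples):
    """Least-squares sketch of the bandwidth-to-work-completed correlation
    the Subroutine Host Server might compute from Performance Data (2756).
    samples: (bandwidth, work_rate) pairs measured on a few systems."""
    n = len(samples)
    mean_b = sum(b for b, _ in samples) / n
    mean_w = sum(w for _, w in samples) / n
    cov = sum((b - mean_b) * (w - mean_w) for b, w in samples)
    var = sum((b - mean_b) ** 2 for b, _ in samples)
    slope = cov / var
    intercept = mean_w - slope * mean_b
    # Returns a predictor for the work rate on an untested network.
    return lambda bandwidth: slope * bandwidth + intercept
```

With such a predictor, measurements on a non-exhaustive set of systems suffice to estimate whether the high-bandwidth topology (2775) is actually required.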
The novel system uses the summarized performance data stored in the Subroutine Repository Database (2800) to assign each subroutine (2741-2746) to the hardware on which it performs best. Subroutines that communicate with each other are assigned to the same subnetwork. For example, in system 2793, subroutines #2 and #3 (2742, 2743) are executed on the same subnetwork comprising nodes 2783 and 2784. In other cases subroutines may not need to run in the same subnetwork. This is the case with Subroutine 5 (2745) and Subroutine 4 (2744), which are run on separate networks (2792, 2795) and thus may not have high-bandwidth communication with each other. The subroutines would be allocated to hardware resources in this manner if it is anticipated that subnetwork separation will not decrease performance, which would be the case if these subroutines do not communicate with each other. In another preferred embodiment, a subroutine may be migrated from lower-performing computer hardware, such as a 2 GHz Intel Celeron processor, to a higher-performing version within the same architecture, such as a 3 GHz Intel Celeron processor. In this case additional hardware is not recruited; rather, higher-performing hardware is used only when it is needed, with the workload simply migrated from the lower-performing hardware. Such migration would be controlled by the demand placed on the system by the incoming data (2700).
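The assignment policy described above (best hardware per subroutine, then co-location of communicating subroutines) might be sketched greedily as below. All names, the score dictionaries, and the tie-breaking rule are illustrative assumptions; the actual system's allocation mechanism is not specified at this level of detail.

```python
def assign_subroutines(perf, comm_pairs):
    """Greedy sketch of hardware assignment from summarized performance
    data: each subroutine goes to the system where it scores best, then
    communicating pairs are pulled onto the same subnetwork.
    perf: {subroutine: {system: score}}; comm_pairs: [(sub_a, sub_b)]."""
    # Best individual placement from the summarized performance data.
    placement = {s: max(scores, key=scores.get) for s, scores in perf.items()}
    for a, b in comm_pairs:
        sys_a, sys_b = placement[a], placement[b]
        if sys_a != sys_b:
            # Co-locate the pair on whichever of their two systems
            # gives the higher combined score.
            combined = lambda sys: perf[a][sys] + perf[b][sys]
            best = sys_a if combined(sys_a) >= combined(sys_b) else sys_b
            placement[a] = placement[b] = best
    return placement
```

Non-communicating subroutines (such as Subroutine 4 and Subroutine 5 above) simply keep their individually best placements, even on separate networks.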
Another aspect of the novel system is that the interactive CQL query builder process of
It is also noteworthy that the Optimizer (2400) may continue to optimize the Configured Optimizer (2460) based on reports on real-time data from the STTC (2430). In this way the system continuously improves and also adjusts to changes in the input data stream. By using the Input Trend (2420) as supervised data (which the system merely tries to predict in advance), the system can adjust to changes in performance, since the supervised data allows performance to be monitored constantly.
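Because the Input Trend arrives as supervised data, each prediction can later be scored against the value actually observed. A minimal sketch of such continuous monitoring is a rolling accuracy over recent scores (the class name and window policy are illustrative assumptions).

```python
from collections import deque

class RollingAccuracy:
    """Sketch of continuous performance monitoring: each prediction is
    scored once the Input Trend (2420) reveals the observed value, and
    accuracy is tracked over the last `window` scored predictions."""

    def __init__(self, window=100):
        self._scores = deque(maxlen=window)  # oldest scores roll off

    def score(self, predicted, observed):
        self._scores.append(1.0 if predicted == observed else 0.0)

    def accuracy(self):
        """Rolling accuracy, or None before any prediction is scored."""
        if not self._scores:
            return None
        return sum(self._scores) / len(self._scores)
```

A sustained drop in this rolling accuracy is the kind of signal on which the Optimizer could trigger re-optimization of the Configured Optimizer.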
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 61/845,034, filed Jul. 11, 2013.
Number | Date | Country
---|---|---
61845034 | Jul 2013 | US