The present disclosure relates to computer-implemented methods, software, and systems for generating insights based on numeric and categorical data.
An analytics platform can help an organization with decisions. Users of an analytics application can view data visualizations, see data insights, or perform other actions. Through use of data visualizations, data insights, and other features or outputs provided by the analytics platform, organizational leaders can make more informed decisions.
The present disclosure involves systems, software, and computer implemented methods for generating insights based on numeric and categorical data. An example method includes: receiving a request for an insight analysis for a dataset, wherein the dataset includes at least one continuous feature and at least one categorical feature, wherein continuous features are numerical features that represent features that can have any value within a range of values and wherein categorical features are enumerated features that can have a value from a predefined set of values; receiving a selection of a first continuous feature for analysis; identifying at least one categorical feature for analysis; determining, for each identified categorical feature, a deviation factor that represents a level of deviation in the dataset between categories of the categorical feature in relation to the continuous feature; determining, for each identified categorical feature, a relationship factor that represents a level of informational relationship between the categorical and continuous feature; determining, based on the determined deviation factors and the determined relationship factors, an insight score, for each identified categorical feature, that combines the deviation factor and the relationship factor for the categorical feature; and providing the insight score for at least some of the identified categorical features.
While generally described as computer-implemented software embodied on tangible media that processes and transforms the respective data, some or all of the aspects may be computer-implemented methods or further included in respective systems or other devices for performing this described functionality. The details of these and other aspects and embodiments of the present disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
The volume of available data collected and stored by organizations is constantly increasing, which can result in time-consuming or even infeasible attempts by users to understand all of the data. Data mining techniques can be used to help users better handle significant amounts of data. However, challenges can exist when using data mining algorithms and techniques.
For instance, data mining can be affected by the quality of data. Efficiency is another consideration, since the efficiency and scalability of data mining depend on the efficiency of the underlying algorithms and techniques. As data volumes continue to multiply, efficiency and scalability can become critical. If algorithms and techniques are inefficiently designed, the data mining experience and scalability can be adversely affected, impacting algorithm adoption. Additionally, for some data mining approaches, mining massive datasets may require applying multiple methods, viewing the data from multiple perspectives, and extracting insights and knowledge. Often, an organization may have a shortage of users with the prerequisite knowledge and expertise required to harness algorithms in unison with the data to extract valuable knowledge and insights.
Accordingly, a desired data mining algorithm can be one that is efficient, scalable, applicable without requiring significant algorithm knowledge or expertise, and easily interpretable by users. For example, an insight framework can be used which can at least partially automate the process of discovering knowledge and insights through constraint guided mining. Specifically, a continuous feature of a dataset can be selected, and behavioral and informational relationships between the continuous feature and one or more categorical features of the dataset can be determined.
The insight framework can efficiently discover interesting insights identifying deviational behavior within the categorical features based on the selected continuous feature, while gathering knowledge about each categorical feature's informational relationship with the continuous feature. The underlying algorithm provided by the framework can integrate the produced insights and knowledge to output an insight score per categorical feature. The insight score can enable the ranking of categorical features relative to the continuous feature. The output from the framework can increase knowledge regarding the selected continuous feature, with the discovered knowledge capable of being utilized in further analysis.
In summary, the framework can provide an algorithm that can produce an insight score indicating a ranked relationship between a continuous feature and categorical feature(s), incorporating mined deviation knowledge. The framework can be a generic framework that can semi-automate a knowledge extraction process through constraint guided mining. Framework outputs can be interpretable by users without significant algorithm knowledge or expertise.
The framework algorithm(s) can be efficient and scalable. For instance, a cloud native algorithm and framework can be capable of efficiently mining knowledge on massive amounts of data, scaling in a reasonable manner as the number of categorical features increases. A cloud native architecture can make the framework inherently scalable and applicable to massive concurrent parallel execution, enabling the framework to process multiple categorical features in parallel without impacting efficiency.
The system 100 can provide an efficient, scalable, and interpretable data mining solution that extracts useful information, insights, and knowledge for an organization. The system 100 can provide solutions that at least partially automate a process of knowledge discovery and insight extraction, through a constraint guided data mining process.
For instance, a user of the client device 104 can use an application 108 to send a request for an insight analysis to the server 102. The request can be to perform an insight analysis on a dataset 110 that is either stored at or accessible by the server 102. The dataset 110 can include continuous feature(s) 112 and categorical feature(s) 114, and the user can select a continuous feature 112 using the application 108, for example, for analysis. The user can select a subset of categorical feature(s) 114 or can accept a default of having all categorical features 114 analyzed. The selected continuous feature 112 and the selected (or defaulted) categorical features 114 can constrain the data mining analysis (e.g., other non-selected continuous features 112 or categorical features 114 can be omitted from analysis).
A continuous feature 112 can be defined as numeric data in which (conceptually) any numeric value within a specified range may be a valid value. An example of a continuous feature 112 is temperature. In some cases, a continuous feature 112 may be a numerical feature for which an aggregation of the values may be any numeric value within a specified range of values. For instance, a feature may be ages, wage amounts, or counts of some item (which, for example, may be whole numbers), but averages or other aggregations of these features (e.g., over time) can be floating point numbers that can have any value (subject to limitations of a particular floating point precision used in a physical implementation). Accordingly, features such as age, dollar amounts, or counts may be considered continuous.
Categorical features 114 can be defined as data in which values are available from a predefined set of possible category values. Category values can be items in a predefined enumeration of values, for example. Categorical data may be ordered (e.g., days of week) or unordered (e.g., gender).
Once a continuous feature 112 is selected, an analysis framework 116 can extract behavioral and informational relationship information between the continuous feature 112 and categorical features 114 that exist within the dataset 110. For example, a deviation factor calculator 118 can discover insights by identifying deviational behavior (represented as deviation factors 120) for the categorical features 114 based on the selected continuous feature 112. A higher amount of deviation for a categorical feature 114 can indicate a more interesting feature, as compared to categorical features 114 that have less deviation.
In addition to analyzing for deviation, the analysis framework 116 can, using a relationship factor calculator 122, determine relational information that may exist between the categorical feature 114 and the continuous feature 112. Relationship factors 124 can indicate how good a categorical feature 114 is (e.g., on average) at predicting values of the continuous feature 112.
An insight score calculator 126 can combine deviation factors 120 and corresponding relationship factors 124 to determine insight scores 128 for each categorical feature 114. A higher insight score 128 can indicate a higher level of insight (e.g., more interest) for a categorical feature 114. Accordingly, categorical features 114 can be ranked by their insight scores 128. Categorical features 114 that have both a relatively high deviation factor 120 and a relatively high relationship factor 124 will generally have higher insight scores 128 than categorical features 114 that have either a lower deviation factor 120 or a lower relationship factor 124 (or low values for both factors).
An analysis report 130 that includes ranked insight scores 128 for analyzed categorical features 114 and the selected continuous feature 112 can be sent to the client device 104 for presentation in the application 108. In some cases, only highest ranked score(s) or a set of relatively highest ranked scores are provided. In general, insight scores 128 can be provided to users and/or can be provided to other systems (e.g., to be used in other data mining or machine learning processes).
The system 100 can be configured for efficiency, scalability, and parallelization. For instance, an efficiency level can be maintained even as a size of the dataset 110 (or other datasets) grows. A cloud native architecture can be used for the system 100, which can provide scalability and enable, for example, massively concurrent parallelization. For instance, rather than have categorical features processed in sequence, different servers, systems, or components can process categorical features 114 in parallel and provide insight scores 128 to the analysis framework 116 (which can be implemented centrally), which can rank categorical features 114 by insight scores 128 once insight scores 128 have been received. The deviation factor calculator 118, the relationship factor calculator 122, and the insight score calculator 126 can be implemented on multiple different nodes, for example.
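The parallel arrangement described above can be illustrated with a minimal sketch. It assumes a hypothetical worker function, `score_feature`, that stands in for the per-feature work of the calculators, and it uses a local thread pool as a stand-in for separate nodes. The names and the sample factor values are illustrative only and are not taken from the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def score_feature(feature_name, factors):
    """Hypothetical per-feature worker.

    In a distributed deployment this work would run on a separate node;
    here the deviation and relationship factors are simply passed in and
    combined by multiplication (one of the approaches the text mentions).
    """
    deviation, relationship = factors
    return feature_name, deviation * relationship


def rank_in_parallel(feature_factors, max_workers=4):
    """Score each categorical feature concurrently, then rank centrally."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda item: score_feature(*item), feature_factors.items())
        )
    # Central step: rank only after every insight score has been received.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Illustrative (made-up) factors: feature -> (deviation, relationship).
ranked = rank_in_parallel(
    {"region": (0.8, 1.5), "channel": (0.1, 1.0), "segment": (0.5, 1.9)}
)
```

Because each feature is scored independently, the only synchronization point in this sketch is the final sort.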
As used in the present disclosure, the term “computer” is intended to encompass any suitable processing device. For example, although
Interfaces 150 and 152 are used by the client device 104 and the server 102, respectively, for communicating with other systems in a distributed environment—including within the system 100—connected to the network 106. Generally, the interfaces 150 and 152 each comprise logic encoded in software and/or hardware in a suitable combination and operable to communicate with the network 106. More specifically, the interfaces 150 and 152 may each comprise software supporting one or more communication protocols associated with communications such that the network 106 or interface's hardware is operable to communicate physical signals within and outside of the illustrated system 100.
The server 102 includes one or more processors 154. Each processor 154 may be a central processing unit (CPU), a blade, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another suitable component. Generally, each processor 154 executes instructions and manipulates data to perform the operations of the server 102. Specifically, each processor 154 executes the functionality required to receive and respond to requests from the client device 104, for example.
Regardless of the particular implementation, “software” may include computer-readable instructions, firmware, wired and/or programmed hardware, or any combination thereof on a tangible medium (transitory or non-transitory, as appropriate) operable when executed to perform at least the processes and operations described herein. Indeed, each software component may be fully or partially written or described in any appropriate computer language including C, C++, Java™, JavaScript®, Visual Basic, assembler, Perl®, any suitable version of 4GL, as well as others. While portions of the software illustrated in
The server 102 includes memory 156. In some implementations, the server 102 includes multiple memories. The memory 156 may include any type of memory or database module and may take the form of volatile and/or non-volatile memory including, without limitation, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), removable media, or any other suitable local or remote memory component. The memory 156 may store various objects or data, including caches, classes, frameworks, applications, backup data, business objects, jobs, web pages, web page templates, database tables, database queries, repositories storing business and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto associated with the purposes of the server 102.
The client device 104 may generally be any computing device operable to connect to or communicate with the server 102 via the network 106 using a wireline or wireless connection. In general, the client device 104 comprises an electronic computer device operable to receive, transmit, process, and store any appropriate data associated with the system 100 of
The client device 104 further includes one or more processors 158. Each processor 158 included in the client device 104 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another suitable component. Generally, each processor 158 included in the client device 104 executes instructions and manipulates data to perform the operations of the client device 104. Specifically, each processor 158 included in the client device 104 executes the functionality required to send requests to the server 102 and to receive and process responses from the server 102.
The client device 104 is generally intended to encompass any client computing device such as a laptop/notebook computer, wireless data port, smart phone, personal data assistant (PDA), tablet computing device, one or more processors within these devices, or any other suitable processing device. For example, the client device 104 may comprise a computer that includes an input device, such as a keypad, touch screen, or other device that can accept user information, and an output device that conveys information associated with the operation of the server 102, or the client device 104 itself, including digital data, visual information, or a GUI 160.
The GUI 160 of the client device 104 interfaces with at least a portion of the system 100 for any suitable purpose, including generating a visual representation of the application 108. In particular, the GUI 160 may be used to view and navigate various Web pages, or other user interfaces. Generally, the GUI 160 provides the user with an efficient and user-friendly presentation of business data provided by or communicated within the system. The GUI 160 may comprise a plurality of customizable frames or views having interactive fields, pull-down lists, and buttons operated by the user. The GUI 160 contemplates any suitable graphical user interface, such as a combination of a generic web browser, intelligent engine, and command line interface (CLI) that processes information and efficiently presents the results to the user visually.
Memory 162 included in the client device 104 may include any memory or database module and may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), removable media, or any other suitable local or remote memory component. The memory 162 may store various objects or data, including user selections, caches, classes, frameworks, applications, backup data, business objects, jobs, web pages, web page templates, database tables, repositories storing business and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto associated with the purposes of the client device 104.
There may be any number of client devices 104 associated with, or external to, the system 100. For example, while the illustrated system 100 includes one client device 104, alternative implementations of the system 100 may include multiple client devices 104 communicably coupled to the server 102 and/or the network 106, or any other number suitable to the purposes of the system 100. Additionally, there may also be one or more additional client devices 104 external to the illustrated portion of system 100 that are capable of interacting with the system 100 via the network 106. Further, the terms “client,” “client device,” and “user” may be used interchangeably as appropriate without departing from the scope of this disclosure. Moreover, while the client device 104 is described in terms of being used by a single user, this disclosure contemplates that many users may use one computer, or that one user may use multiple computers.
The insight discovery pre-processing component 204 can be used to filter the input dataset 202, thereby guiding a knowledge extraction process. The insight discovery pre-processing component 204 includes a feature selector 208. The feature selector 208 can be used to filter the input dataset 202 by identifying a continuous feature for constrained data mining to be applied against and categorical feature(s) for which insight discovery analysis is to be performed. The selected continuous feature and the selected categorical feature(s) can be provided to the insight discovery analysis framework 206.
The insight discovery analysis framework 206 includes a deviation factor calculator 210, a relationship factor calculator 212, and an insight incorporator 214. The deviation factor calculator 210 can be applied to the selected dataset features to calculate a factor for each selected categorical feature that represents a level of deviation that exists between the categorical feature items (e.g., categories) of the categorical feature in relation to the continuous feature. The relationship factor calculator 212 can be applied to the selected dataset features to calculate a factor for each selected categorical feature that represents a level of information the categorical feature explains in relation to the continuous feature. The insight incorporator 214 can take as input a deviation factor and a relationship factor for each categorical feature and calculate an insight score 216, for each categorical feature, that reflects the relationship of the categorical feature to the continuous feature.
At 304, a continuous feature is selected for insight discovery analysis from the input dataset 302. The selected continuous feature is provided as a first output 305. At 306, as an optional step, a subset of categorical features is selected for insight discovery analysis from the available categorical features within the input dataset 302. If no subset selection is performed, all categorical features within the input dataset are selected for insight discovery analysis. A second output 308 can be either all N categorical features or a selected subset of categorical features. The first output 305 and the second output 308 can represent a constrained dataset that can be passed to the insight discovery analysis framework 206, for example.
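The selection steps above can be sketched as a small filtering function. This is a minimal illustration, assuming the dataset is held as a list of dictionaries; the function name `select_features` and the sample column names are hypothetical.

```python
def select_features(rows, continuous, categorical_all, categorical_subset=None):
    """Build a constrained dataset from one continuous feature and the
    selected categorical features (all of them when no subset is given)."""
    chosen = list(categorical_subset) if categorical_subset else list(categorical_all)
    unknown = [name for name in chosen if name not in categorical_all]
    if unknown:
        raise ValueError(f"not categorical features of the dataset: {unknown}")
    keep = [continuous] + chosen
    # Non-selected features (here, "units") are dropped from every row.
    return [{key: row[key] for key in keep} for row in rows]


rows = [
    {"revenue": 120.0, "region": "EMEA", "channel": "web", "units": 3},
    {"revenue": 80.0, "region": "APJ", "channel": "store", "units": 2},
]
# Default: all categorical features are analyzed.
constrained = select_features(rows, "revenue", ["region", "channel"])
# Optional step: only a subset is analyzed.
subset = select_features(rows, "revenue", ["region", "channel"], ["region"])
```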
At 406, an aggregation is applied to the continuous feature, grouping all row values of the continuous feature to form a single aggregated value. Examples of aggregate functions include sum, count, minimum, maximum, and average. A particular aggregation type to use can be predefined (e.g., defaulted) or can be selected.
At 408, a first iteration loop is initiated to iterate over each categorical feature. For a first iteration, a first categorical feature is selected. At 410, a second iteration loop is initiated to iterate, for a given categorical feature, over the categories within the categorical feature. For a first iteration, a first category of the first categorical feature can be selected.
At 412, for a current category (e.g., categorical feature item), the selected aggregation is applied to aggregate the continuous feature values that exist within the categorical feature item to determine a categorical feature item contribution to the aggregated continuous feature value.
At 414, a determination is made as to whether there are additional unprocessed categories of the current categorical feature. If not all of the categories have been processed for the categorical feature, a next category is selected at 415.
At 416, after all categories of the categorical feature have been processed, a deviation factor is calculated for the current categorical feature based on the categorical feature item contributions to the aggregated continuous feature value of the categories within the categorical feature. Deviation factor determination is discussed in more detail below.
At 418, a determination is made as to whether there are additional unprocessed categorical features. If not all of the categorical features have been processed, a next categorical feature is selected, at 419.
At 420, once all categorical features have been processed, an output 420, of a set of deviation factors for the categorical features, can be provided (e.g., to an insight incorporator, as described below).
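The nested iteration of steps 408 through 415 can be sketched as follows. This is a minimal illustration, assuming a list-of-dictionaries dataset and treating each category's aggregated value as its contribution; the function name, the `AGGREGATIONS` table, and the sample data are hypothetical.

```python
from collections import defaultdict

# Aggregate functions named in the text: sum, count, minimum, maximum, average.
AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "min": min,
    "max": max,
    "avg": lambda values: sum(values) / len(values),
}


def category_contributions(rows, continuous, categorical_features, aggregation="sum"):
    """Aggregate the continuous feature per category, for each categorical feature."""
    aggregate = AGGREGATIONS[aggregation]
    contributions = {}
    for feature in categorical_features:  # outer loop over categorical features
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[feature]].append(row[continuous])
        # Inner loop over categories: one aggregated value per category.
        contributions[feature] = {
            category: aggregate(values) for category, values in grouped.items()
        }
    return contributions


rows = [
    {"revenue": 100.0, "region": "EMEA", "channel": "web"},
    {"revenue": 50.0, "region": "APJ", "channel": "web"},
    {"revenue": 25.0, "region": "EMEA", "channel": "store"},
]
contribs = category_contributions(rows, "revenue", ["region", "channel"])
```

The resulting per-category values are the inputs from which a deviation factor for each categorical feature would then be derived.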
In further detail, the categorical feature item contributions discussed above can be utilized in derivation of deviation factors for the categorical features. An algorithm that can be used to derive a deviation factor is shown below:
where:
That is, a value a can be set to either a maximum or a minimum of categorical feature item contributions based on whether an average of the categorical feature item contributions is positive or negative, respectively. A deviation factor can thus represent how far a largest (negative or positive) value deviates from an average value for the categorical feature. In other words, a deviation factor for a categorical feature can represent how far a category with a largest value deviates from the average of all categories for the categorical feature.
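The selection of the value a and its comparison with the average can be illustrated with a short sketch. The exact formula is not reproduced in this text, so the sketch makes two assumptions: the deviation is taken as the plain absolute distance between a and the average (any scaling or normalization in the original formula is not recoverable here), and an average of exactly zero is treated like a positive average. The function name is hypothetical.

```python
def deviation_factor(contributions):
    """Distance of the most extreme category contribution from the average.

    Assumptions (not confirmed by the text): absolute distance with no
    normalization, and a zero average handled like a positive one.
    """
    values = list(contributions.values())
    if not values:
        raise ValueError("at least one category contribution is required")
    average = sum(values) / len(values)
    # a is the maximum when the average is positive, the minimum when negative.
    a = max(values) if average >= 0 else min(values)
    return abs(a - average)


similar = deviation_factor({"A": 20.0, "B": 20.0, "C": 20.0})
skewed = deviation_factor({"A": 10.0, "B": 10.0, "C": 40.0})
negative = deviation_factor({"A": -10.0, "B": -10.0, "C": -40.0})
```

Under these assumptions, categories with similar contributions yield a factor near zero, while a single dominant category yields a larger factor.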
At 510, ancillary statistics are generated for the current category. Ancillary statistics for the current category can include a mean, variance, variance relative to the dataset, and a record count.
The mean for the category can be computed using a formula of:
$$\mathrm{mean}_{\mathrm{category}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where x is the value of the continuous measure where the categorical feature equals the category and n is the number of records where the categorical feature equals the category.
The variance for the category can be computed using a formula of:
where
The variance for the category relative to the dataset can be computed using a formula of:
where
where nds is the number of records in the entire dataset.
The record count of the category reflects a count of rows in which the category occurs, and can be computed using a formula of:
$$\mathrm{recordcount}_{\mathrm{category}}(x) = \sum_{i=1}^{n_{ds}} \mathbf{1}(s_i = x)$$
where x is the category to be counted and si is a category at row i.
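Two of the ancillary statistics, the category mean and the record count, follow directly from the definitions above and can be sketched as shown here. The variance formulas are not reproduced in this text, so they are deliberately left out rather than guessed. The function name and the sample data are hypothetical.

```python
def mean_and_record_count(rows, continuous, feature, category):
    """Return (mean, record count) for one category of a categorical feature."""
    # Record count: number of rows i whose category s_i equals the target category.
    record_count = sum(1 for row in rows if row[feature] == category)
    if record_count == 0:
        raise ValueError(f"category {category!r} does not occur in {feature!r}")
    # Mean: average of the continuous values over the rows in the category.
    total = sum(row[continuous] for row in rows if row[feature] == category)
    return total / record_count, record_count


rows = [
    {"revenue": 100.0, "region": "EMEA"},
    {"revenue": 50.0, "region": "APJ"},
    {"revenue": 20.0, "region": "EMEA"},
]
emea_mean, emea_count = mean_and_record_count(rows, "revenue", "region", "EMEA")
```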
At 512, primary metrics are derived for the current category using the ancillary metrics for the category. Primary metrics can include a Sum of Square Residual (SSR) and Sum of Square Total (SST).
The SSR for a category can be computed using a formula of:
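$$\mathrm{SSR}_{\mathrm{category}}(x) = \mathrm{var}_{\mathrm{category}}(x) \cdot \bigl(\mathrm{recordcount}_{\mathrm{category}}(x) - (1 - \mathrm{relativesample}_{\mathrm{category}}(x))\bigr)$$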
The SST for a category can be computed using a formula of:
$$\mathrm{SST}_{\mathrm{category}}(x) = \mathrm{var}_{\mathrm{category\ relative}}(x) \cdot \mathrm{recordcount}_{\mathrm{category}}(x)$$
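The two primary metrics can be written directly from the formulas above. In this minimal sketch the ancillary statistics are passed in as plain numbers, because the definitions of the variances and of the relative sample value are not reproduced in this text. The function and parameter names are hypothetical.

```python
def ssr_category(var_category, record_count, relative_sample):
    """Sum of Square Residual for one category, per the stated formula."""
    return var_category * (record_count - (1 - relative_sample))


def sst_category(var_category_relative, record_count):
    """Sum of Square Total for one category, per the stated formula."""
    return var_category_relative * record_count


# Illustrative (made-up) ancillary statistics for a single category.
ssr = ssr_category(var_category=4.0, record_count=10, relative_sample=0.25)
sst = sst_category(var_category_relative=5.0, record_count=10)
```

With these inputs, the residual term works out to 4.0 × (10 − 0.75) and the total term to 5.0 × 10.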
At 514, a determination is made as to whether there are additional unprocessed categories of the current categorical feature. If not all of the categories have been processed for the categorical feature, a next category is selected, at 516.
At 518, after all categories of the current categorical feature have been processed, a relationship factor is calculated for the current categorical feature. A first step in calculating the relationship factor can include computing a principal relationship factor (PRF) that reflects a relationship between the categorical feature and the continuous feature. The principal relationship factor can be computed using a formula of:
For the principal relationship factor, a value near 1 suggests a strong relationship exists between the categorical feature and the continuous feature, with a factor value near zero suggesting the absence of a relationship.
A second step in calculating the relationship factor can include computing an adjusted principal relationship factor (APRF) for the categorical feature that adjusts for the cardinality of the categorical feature. The adjusted principal relationship factor can be computed using a formula of:
where nds is the number of records in the dataset and ncategories is the cardinality of the categorical feature. Similar to the principal relationship factor, for the adjusted principal relationship factor, a value near 1 suggests that a strong relationship exists between the categorical feature and the continuous feature, with a factor value of near zero suggesting the absence of a relationship.
Utilizing the adjusted principal relationship factor, the relationship factor is then calculated for the categorical feature. The algorithm to produce the relationship factor can be defined as:
For the relationship factor, a value near one suggests the absence of a relationship between the categorical feature item and the continuous feature, with a factor value of near two suggesting a strong relationship.
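The three formulas for this step are not reproduced in this text, so the following sketch is only one reading that is consistent with the stated value ranges. It assumes, by analogy with the coefficient of determination, that the principal relationship factor is one minus the ratio of the summed SSR to the summed SST, that the cardinality adjustment mirrors an adjusted R-squared, and that the relationship factor is one plus the adjusted value clamped to the range zero to one. All three forms and all names are assumptions, not the disclosed algorithm.

```python
def principal_relationship_factor(ssr_by_category, sst_by_category):
    """Assumed form: 1 - sum(SSR) / sum(SST), analogous to R-squared."""
    total_sst = sum(sst_by_category)
    if total_sst == 0:
        return 0.0  # assumption: no variation means no measurable relationship
    return 1.0 - sum(ssr_by_category) / total_sst


def adjusted_principal_relationship_factor(prf, n_ds, n_categories):
    """Assumed form: adjusted-R-squared style correction for cardinality."""
    if n_ds <= n_categories:
        raise ValueError("need more records than categories")
    return 1.0 - (1.0 - prf) * (n_ds - 1) / (n_ds - n_categories)


def relationship_factor(aprf):
    """Assumed form: 1 + APRF, clamped so the result lies between 1 and 2."""
    return 1.0 + min(1.0, max(0.0, aprf))


prf = principal_relationship_factor([10.0, 10.0], [50.0, 50.0])
aprf = adjusted_principal_relationship_factor(prf, n_ds=101, n_categories=2)
rf = relationship_factor(aprf)
```

Under these assumptions a feature that explains nothing maps to a factor of 1.0, which leaves a multiplied deviation factor unchanged, and a feature that explains everything maps to 2.0.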
At 520, a determination is made as to whether there are additional unprocessed categorical features. If not all of the categorical features have been processed, a next categorical feature is selected, at 522, and processed (e.g., at steps 506 to 518).
At 524, once all categorical features have been processed, an output 524, of a set of relationship factors for the categorical features, can be provided (e.g., to an insight incorporator, as described below).
At 606, the first input 602 and the second input 604 are merged, according to categorical feature, to create a merged list of inputs. At 608, an iteration is started that loops over each item in the merged list. For instance, inputs for a first categorical feature can be obtained from the merged list of inputs. The first categorical feature can be a current categorical feature being processed in the iteration.
At 610, a deviation factor for the current categorical feature and a relationship factor for the current categorical feature are incorporated into an insight score for the current categorical feature. Different approaches can be used during incorporation. For instance, the insight score for the current categorical feature can be determined by multiplying the deviation factor for the current categorical feature by the relationship factor for the current categorical feature.
At 612, a determination is made as to whether all categorical features have been processed. If not all categorical features have been processed, inputs are retrieved, at 614, from the merged list of inputs, for a next categorical feature. At 610, the deviation factor for the next categorical feature and the relationship factor for the next categorical feature are incorporated into an insight score for the next categorical feature.
Once all categorical features have been processed, the insight incorporator 600 can provide (e.g., to a user or to an application or system) a ranked list 616 of categorical features indicating association with the continuous feature. The ranked list 616 can rank the categorical features in terms of a level of insight and relationship information in relation to the selected continuous feature. Categorical features that have a stronger informational relationship with the continuous feature can be ranked higher in the ranked list 616 than other categorical features.
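The merge, incorporation, and ranking steps can be sketched as below, using multiplication as the incorporation approach named in the text. The function name and the sample factor values are hypothetical.

```python
def incorporate_insights(deviation_factors, relationship_factors):
    """Merge both inputs by categorical feature, multiply, and rank."""
    # Merge: keep features that appear in both inputs.
    merged = {
        feature: (deviation_factors[feature], relationship_factors[feature])
        for feature in deviation_factors
        if feature in relationship_factors
    }
    # Incorporate: insight score = deviation factor * relationship factor.
    scores = {
        feature: deviation * relationship
        for feature, (deviation, relationship) in merged.items()
    }
    # Rank: highest insight score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked_list = incorporate_insights(
    {"region": 2.0, "channel": 0.0, "segment": 1.5},
    {"region": 1.0, "channel": 1.5, "segment": 2.0},
)
```

In this example "region" has the largest deviation factor but a relationship factor of 1.0, so its score equals its deviation factor and it ranks below "segment", whose deviation is amplified by a strong relationship.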
The insight algorithm can be applied to various datasets. For instance,
The deviation factor 762 being substantially close to zero indicates a relatively small amount of deviation. The relationship factor 764 being substantially close to the value of one indicates that the relationship factor 764 reasonably identifies and represents the absence of a relationship existing between the categorical feature and the continuous feature. Furthermore, given that for the first example dataset, aggregated values of the continuous feature are similar across each category (e.g., suggesting no significant deviational behavior), the deviation factor 762 being substantially close to zero is appropriate. An output product of the deviation factor 762 and the relationship factor 764 results in the insight score 766 being substantially close to zero, which accurately and collectively reflects the low deviation and the categorical feature's insignificant relationship with the continuous feature.
The relationship factor 864 computed as 1.0 reasonably identifies and represents the absence of a relationship existing between the categorical feature and the continuous feature. Furthermore, the second example dataset includes a pattern of aggregated values of the continuous feature for one category (the category 804) being significantly greater than for all other categories. Accordingly, the deviation factor 862 is substantially greater than, for example, the deviation factor 762.
An output product of the deviation factor 862 and the relationship factor 864 results in the insight score 866. The insight score 866 matching the deviation factor 862 suggests that while a significant deviation factor may be present in the second example dataset, without an informational relationship existing with the continuous feature, a categorical feature relationship with the continuous feature is insignificant (thus, the insight score 866 is not raised from the deviation factor 862).
At 1202, a request is received for an insight analysis for a dataset. The dataset includes at least one continuous feature and at least one categorical feature. Continuous features are numerical features that represent features that can have any value within a range of values and categorical features are enumerated features that can have a value from a predefined set of values.
At 1204, a selection is received of a first continuous feature for analysis.
At 1206, at least one categorical feature is identified for analysis. All categorical features can be identified or a subset of categorical features can be received.
At 1208, a deviation factor is determined for each identified categorical feature. A deviation factor represents a level of deviation in the dataset between categories of the categorical feature in relation to the continuous feature.
At 1210, a relationship factor is determined for each identified categorical feature. A relationship factor represents a level of informational relationship between the categorical and continuous feature.
At 1212, an insight score is determined for each categorical feature, based on the determined deviation factors and the determined relationship factors. An insight score combines the deviation factor and the relationship factor for the categorical feature. The level of informational relationship for a categorical feature can indicate how well the categorical feature predicts values of the continuous feature. An insight score for a given categorical feature can be determined by multiplying the deviation factor for the categorical feature by the relationship factor for the categorical feature. A higher insight score for a categorical feature represents a higher level of insight in relation to the continuous feature.
At 1214, insight scores are provided for at least some of the categorical features. The insight scores can be ranked and at least some of the ranked insight scores can be provided.
The preceding figures and accompanying description illustrate example processes and computer-implementable techniques. But system 100 (or its software or other components) contemplates using, implementing, or executing any suitable technique for performing these and other tasks. It will be understood that these processes are for illustration purposes only and that the described or similar techniques may be performed at any appropriate time, including concurrently, individually, or in combination. In addition, many of the operations in these processes may take place simultaneously, concurrently, and/or in different orders than as shown. Moreover, system 100 may use processes with additional operations, fewer operations, and/or different operations, so long as the methods remain appropriate.
In other words, although this disclosure has been described in terms of certain embodiments and generally associated methods, alterations and permutations of these embodiments and methods will be apparent to those skilled in the art. Accordingly, the above description of example embodiments does not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.