Big data mining has been a big buzzword in numerous industries, including higher education. Most data mining projects entail building predictive models to stratify a population, e.g., students, based on risk scores. As an example, U.S. Patent Application Publication No. 2010/0009331 A1 by Yaskin et al. describes a method for improving student retention rates by identifying students at risk and permitting students to raise flags if they think they are at risk. As another example, Purdue's Course Signals, as described in “PURDUE SIGNALS Mining Real-Time Academic Data to Enhance Student Success” by Pistilli and Arnold, uses a set of business rules to identify students at risk. As another example, Canadian Patent Application Serial No. CA2782841 by Essa, Hanan, and Ayad describes performance prediction systems based on user-engagement activities, social connectedness, attendance activities, participation, task completion, and preparedness.
By focusing on prediction accuracies and subsequent risk-based stratification, the current approaches do not tie in insight-driven actions, thereby failing to provide a linkage between insights and outcomes from actions taken. Instead, they treat insights and action outcomes as two distinctly separate processes, resulting in ad hoc, suboptimal, tribal solutions that are difficult to implement globally across an institution. Furthermore, since features are optimized for predictive accuracy, they often fail to provide meaningful insights in guiding interventions for maximum return on investment (ROI).
Another complicating factor is the varying degree of data availability for students. For example, at most institutions incoming freshmen have very little data, although some may have American College Test (ACT) scores, SAT scores, and application data stored in a student information system (SIS). A similar situation applies to transfer students, for whom most institutions may have only transfer credits and possibly a grade point average (GPA), without enrollment-level grades. This variability in data availability hampers the ability to develop high-accuracy models with great insights, as insightful features may apply to only a small subset of the student population, which prevents them from winning the combinatorial feature-ranking war.
As an example, U.S. Pat. No. 8,392,153 by Pednault and Natarajan describes segmentation-based predictive models, but they rely on a decision-tree approach by segmenting valid data into an appropriate number of segments for model building tailored to each segment. As another example, U.S. Pat. No. 8,484,085 by Wennberg discusses a patient-profile segmentation based on a range of susceptibility to different surgery risk events so that models can be optimized for each risk event.
However, none of these approaches addresses the fundamental problem of some segments of the population having only a limited subset of data. Furthermore, there can exist a variety of data-availability combinations since some students take ACT or SAT, some students have transfer credits, some students take a leave of absence and return later, etc.
What's needed is an automatic way to combine population segmentation based on data or feature availability with clustering to find natural clusters within each population segment in order to maximize both predictive accuracy and extraction of insights that can lead to interventions with high likelihood for positive outcomes.
An automation analytics system and method for building analytical models for an education application uses data-availability segments of students, which are clustered into segment clusters, to create the analytical models for the segment clusters using a machine learning process. The analytical models can be used to identify at least actionable insights.
A method for building analytical models for an education application in accordance with an embodiment of the invention comprises extracting features from data of students, segmenting the students into data-availability segments, for each data-availability segment, determining a subset of features based on model performance, clustering the students within each data-availability segment into segment clusters using one or more features in the subset of features, for each segment cluster, determining another subset of features based on model performance, and creating the analytical models for the segment clusters using a machine learning process, the analytical models providing at least actionable insights. In some embodiments, the steps of this method are performed when program instructions contained in a computer-readable storage medium are executed by one or more processors.
An automation analytics system in accordance with an embodiment of the invention comprises a feature extraction module configured to extract features from data of students, a segmentation module configured to segment the students into data-availability segments, a segment feature optimizing module configured to determine a subset of features based on model performance for each data-availability segment, a clustering module configured to cluster the students within each data-availability segment into segment clusters using one or more features in the subset of features, a cluster feature optimizing module configured to determine another subset of features based on model performance for each segment cluster, and a model building module configured to create analytical models for the segment clusters using a machine learning process, the analytical models providing at least actionable insights.
Other aspects and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrated by way of example of the principles of the invention.
It will be readily understood that the components of the embodiments as generally described herein and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of various embodiments, as represented in the figures, is not intended to limit the scope of the present disclosure, but is merely representative of various embodiments. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussions of the features and advantages, and similar language, throughout this specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, in light of the description herein, that the invention can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present invention. Thus, the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
Embodiments of the invention relate to an automated and modular framework or system for extracting insights and building models for higher-education institutions, leveraging their Student Information System (SIS), Learning Management System (LMS), and ancillary data sources. The automation analytics system in accordance with embodiments of the invention comprises (1) complex time-series event processing, (2) feature extraction to infer cognitive and non-cognitive factors, (3) computationally efficient segmentation of static and dynamic features based on data and feature availability, (4) global feature optimization for each segment, (5) clustering of each segment, (6) separate feature optimization and predictive-model building for each cluster, and (7) marriage of predictive and propensity-score models for outcomes-driven insights. Furthermore, this automation analytics system facilitates course-specific and event-based course grade, sentiment, behavioral, and social network analyses to help identify toxic/synergistic course combinations, optimize course scheduling, and determine emerging influencers, all designed to help students succeed. Thus, the automation analytics system facilitates higher-education insight and action analyses. The automation analytics system provides a pathway between insights derived from processing institutional data and actions that take advantage of the insights to produce positive outcomes. The automation analytics system uses both impact prediction and post-action impact measurements using propensity scores for self-learning. Embodiments of this invention allow institutions to extract value from insights derived from predictive analytics.
The automation analytics system integrates both insights and insight-driven actions, i.e., interventions to improve student outcomes, using features and other information extracted from education-related data for students. These features are built to maximize both prediction accuracy and insights through time-series event processing and by differentiating performance-focused features from those that offer insights on population segments for intervention opportunities. The automation analytics system includes prediction and outcomes analytics while providing provisions for the exploration of insightful (not necessarily important for prediction accuracy) features.
These derived features from time-series event processing are computed in a modular fashion to accommodate different stages of data readiness for different clients that may utilize the automation analytics system. The client's SIS and LMS data assets can be projected onto a number of representations through ETL (Extract, Transform, and Load) and signal processing to facilitate rapid analyses of a variety of orthogonal views of student records and activities over time. From multi-year historical data, the automation analytics system can extract thousands of features and an institution-specific number of dependent variables that the automation analytics system attempts to predict. In certain cases, the automation analytics system may use external data to understand which external factors influence student success. Once the factors are identified, these factors can be embedded into a student's academic journey through an application questionnaire and/or smartphone applications to capture such factors in real time with real-time feedback.
Turning now to
The data transformation module 102 is configured to transform student data into a usable format. The data transformation module uses data from the SIS, learning management system (LMS), Customer Relationship Management (CRM), and other data sources. In particular, raw student records are transformed to enrollment, session (multiple overlapping sessions in a term), and term (for example, semester or quarter) levels for extracting features at several levels of abstraction. At the same time, raw transactional records are transformed to orthogonal views, consisting of, but not limited to, student-faculty activity-intervention-performance (AIP) maps, student-faculty/student-student interactions (such as, but not limited to, discussion boards or Facebook applications designed for on-ground courses) for natural language processing and social network analysis, and course-combination matrices.
The modular feature extraction module 104 is configured to extract modular features from each transformation space, followed by derived features that combine information from multiple of the earlier modular features. Examples of extracted features include, but are not limited to, GPA standard deviation over terms, fraction of credits earned, and credit accumulation pattern. Examples of derived features include, but are not limited to, affordability gap, cramming index, social network features, and Learning Management System time-series trend and change features.
The dependent variable extraction module 106 is configured to extract various dependent variables from the same data set so that multiple predictive models can be built simultaneously. Examples of dependent variables encompass, but are not limited to, lead-to-application conversion, incoming student success, persistence, course grade, successful course completion, graduation, student engagement, student satisfaction, and career performance.
The segmentation module 108 is configured to divide the students into segments based on feature availability and/or user definitions. Since students have different records based on how long they have been with the institution (SIS) and time since session start (LMS), data-availability segmentation is performed to group students based on what features are valid. For each student-term-offset, there may be a row of 1's and 0's based on feature validity. Typically, there may be, but is not limited to, a binary matrix representation B of dimension (Σn=1N Tn)×Nfeatures, where N is the number of students, Tn is the number of term-offsets for student n, and Nfeatures is the number of features.
In order to find a unique set of data-availability combinations, B can be multiplied by a random vector r of dimension Nfeatures×1, and the output (B*r) grouped by the unique numbers in B*r. Each unique number represents a set of student-terms or time snapshots that have the same valid-feature combination. Depending on the number of features, fast feature ranking based on entropy measures or Fisher's discriminant ratio can be used to prune the feature set.
The first pass described above looks for 100% similarity in valid-feature combination. For modeling and insight purposes, the requirement can be relaxed by performing secondary similarity-based clustering on the unique valid-feature combination set with a similarity threshold <1. This step ensures that there is a manageable number of data-availability segments for next-level processing.
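For illustration, the first-pass grouping by B*r can be sketched as follows (a minimal sketch using NumPy; the matrix values, feature count, and rounding precision are hypothetical and not taken from the specification):

```python
import numpy as np

def data_availability_segments(B, seed=0):
    """Group rows of a binary validity matrix B into data-availability
    segments: rows with identical valid-feature combinations receive the
    same segment id. Multiplying B by a random vector r yields, with high
    probability, a distinct scalar key per distinct row pattern."""
    rng = np.random.default_rng(seed)
    r = rng.random((B.shape[1], 1))    # random vector of dimension Nfeatures x 1
    keys = (B @ r).ravel()             # one scalar key per student-term row
    # Rows with equal keys share the same valid-feature combination.
    _, segment_ids = np.unique(np.round(keys, 12), return_inverse=True)
    return segment_ids

# Four hypothetical student-term rows over three features.
B = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
ids = data_availability_segments(B)
```

In this sketch, the first two rows share a valid-feature pattern and therefore fall into the same segment; the secondary similarity-based clustering with a threshold below 1 would then merge near-identical patterns into a manageable number of segments.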
The segmentation module 108 is also configured to divide the entire feature matrix data into separate training and test data sets for training and out-of-set testing for model performance validation. In general, time-dependent partitioning may be used to stay on the conservative side.
The segment feature optimizing module 110 is configured to perform, for each data-availability segment, feature optimization and ranking using various methods including, but not limited to, combinatorial feature analysis, such as add-on, stepwise regression, and Viterbi. Performance rank-order curves can be plotted as a function of feature dimension to identify the point of diminishing returns, which prevents overfitting. Thus, the segment feature optimizing module operates to select a number of features to define an optimal feature subset for each data-availability segment. The optimal feature subset for each data-availability segment is denoted as Ω(i), where i is the data-availability segment index. The same methods can be applied if the data are segmented manually or not at all.
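As one hedged illustration of combinatorial (add-on) feature analysis, the following sketch greedily adds the feature with the largest cross-validated gain and stops at the point of diminishing returns; the learner, fold count, gain threshold, and synthetic data are assumptions, not values prescribed by the specification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def add_on_selection(X, y, max_features=10, min_gain=0.005):
    """Greedy 'add-on' feature selection: at each step, add the candidate
    feature that most improves cross-validated accuracy; stop when the
    gain falls below min_gain (the point of diminishing returns)."""
    selected, scores = [], []
    remaining = list(range(X.shape[1]))
    best = 0.0
    while remaining and len(selected) < max_features:
        trial = []
        for f in remaining:
            cols = selected + [f]
            s = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=3).mean()
            trial.append((s, f))
        s, f = max(trial)
        if s - best < min_gain:          # diminishing returns: stop here
            break
        best = s
        selected.append(f)
        remaining.remove(f)
        scores.append(s)
    return selected, scores

# Hypothetical data in which only feature 0 drives the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
selected, scores = add_on_selection(X, y)
```

Plotting `scores` against the number of selected features produces the performance rank-order curve described above.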
The clustering module 112 is configured to group the students in each of the segments into segment-clusters. Using one or more of the top features in Ω(i), the clustering module performs clustering using various methods, such as, but not limited to, k-means, expectation-maximization, and self-organizing Kohonen map. After clustering, small clusters with membership sizes below a preset threshold can be merged to increase within-cluster similarity. This two-step process ensures that each final cluster has enough samples for model robustness and insights.
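A minimal sketch of the two-step process described above, using k-means followed by merging of small clusters (scikit-learn assumed; the cluster count, membership threshold, and data are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_merge(X, k, min_size, seed=0):
    """Two-step clustering: k-means, then merge clusters whose membership
    falls below min_size into the nearest surviving centroid, so every
    final segment-cluster has enough samples for robust modeling."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    labels = km.labels_.copy()
    sizes = np.bincount(labels, minlength=k)
    keep = np.flatnonzero(sizes >= min_size)
    if keep.size == 0:                   # degenerate case: nothing to merge into
        return labels
    centers = km.cluster_centers_
    for c in np.flatnonzero(sizes < min_size):
        members = labels == c
        # Distance from each small-cluster member to each kept centroid.
        d = np.linalg.norm(X[members][:, None, :] - centers[keep][None, :, :],
                           axis=2)
        labels[members] = keep[d.argmin(axis=1)]
    return labels

# Hypothetical segment: one large natural cluster and one tiny one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(5, 2))])
labels = cluster_and_merge(X, k=2, min_size=10)
```

Here the five-member cluster is below the threshold and is absorbed by the large cluster, leaving a single robust final cluster.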
Similar to the segment feature optimizing module 110, the cluster feature optimizing module 114 is configured to perform, for each segment-cluster, feature optimization and ranking using various methods. Thus, for each data-availability (DA) segment-cluster, the process of feature optimization and ranking is repeated so that each segment-cluster model has its own set of optimized features for model accuracy, robustness, and insights. This framework facilitates outcomes-based or prediction-driven clustering with combinatorial feature optimization to ensure that the clustering vector space is populated with orthogonal, insightful features.
The model building module 116 is configured to create analytical models to extract insights and effective interventions for students at risk. The model building module computes meta-features, such as good-feature distributions and their moments, on top features to characterize good-feature distributions in terms of normality, modality (unimodal vs. multimodal), and boundary complexity. In addition, learning algorithms are assigned based on a meta-learning algorithm that maps relationships between meta-feature characteristics and appropriate learning algorithms. For example, if class-conditional good-feature distributions are unimodal and Gaussian, a simple multivariate Gaussian algorithm will suffice. However, if the distributions are highly nonlinear or multimodal, the model building module uses nonparametric learning algorithms with an objective function that rewards accuracy and punishes model complexity. This is done to ensure that the resulting models are robust with high accuracy in the presence of some data mismatches over time. Furthermore, since segments and clusters are involved, the model building module keeps track of membership distances to look for significant departures from historical data characteristics by using the membership Mahalanobis distance. Any significant departure serves as a signal to retrain models to reflect changes in data caused possibly by policy changes, new interventions, a changing student mix, etc.
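The membership-distance check can be sketched as follows (a hedged illustration; the distance threshold of 3 and the synthetic data are assumptions, not values from the specification):

```python
import numpy as np

def mahalanobis_distances(X_hist, X_new):
    """Mahalanobis distance of new observations from the historical data
    of a segment-cluster; large distances signal departures from historical
    data characteristics that may warrant model retraining."""
    mu = X_hist.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_hist, rowvar=False))  # pseudo-inverse for stability
    diff = X_new - mu
    # Quadratic form diff^T * cov_inv * diff, evaluated row by row.
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Hypothetical historical cluster features and two new student vectors.
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 2))
X_new = np.array([[0.0, 0.0],      # typical point
                  [10.0, 10.0]])   # significant departure
d = mahalanobis_distances(X_hist, X_new)
retrain_flags = d > 3.0            # assumed retraining threshold
```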
In order to provide predictive and intervention insights, the model building module 116 explores one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) feature density and scatter plots, and identifies through alternating binary partitioning (similar to progressive wavelet decomposition in image compression) regions where actual and predicted outcomes distributions are substantially different. Such discrepancies provide hints on how to improve models further.
In the 1D space, the model building module 116 looks for features that show separation in class-conditional probability density functions (PDFs) in any sub-regions. In order to ensure that outcomes differences are attributable to an intervention, the model building module builds propensity-score models using the top features with good separation and orthogonality. The model building module matches students across discrete outcomes (e.g., continuing vs. non-continuing) in the propensity-score space to ensure that the matching is done in the good-feature vector space. The matching in propensity score improves the probability that differences in outcomes can be attributed to the intervention under consideration.
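One way to sketch the propensity-score matching on top predictive features is greedy 1:1 nearest-neighbor matching within a caliper (scikit-learn assumed; the caliper width and synthetic data are hypothetical illustrations, not the specification's method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def match_controls(X, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.
    A logistic model estimates each student's propensity to receive the
    intervention from predictive covariates X; each treated student is
    matched, without replacement, to the closest untreated student whose
    score lies within the caliper."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    control_pool = list(np.flatnonzero(treated == 0))
    pairs = []
    for t in np.flatnonzero(treated == 1):
        if not control_pool:
            break
        dists = np.abs(ps[control_pool] - ps[t])
        j = int(dists.argmin())
        if dists[j] <= caliper:
            pairs.append((int(t), int(control_pool.pop(j))))
    return pairs, ps

# Hypothetical data with selection bias on the first covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
p_treat = 1.0 / (1.0 + np.exp(-X[:, 0]))
treated = (rng.random(300) < p_treat).astype(int)
pairs, ps = match_controls(X, treated)
```

Outcome differences between the matched pairs are then more plausibly attributable to the intervention than raw group comparisons would allow.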
In the 2D space, the model building module 116 usually works with, but is not limited to, 4 quadrants, separated by the centroid in the 2D vector space. The same process is repeated for 3 features in the 3D vector space, where the model building module usually works with, but is not limited to, 8 cubes.
Such visualizations and drill-down analyses provide further insights into why students who seem good or poor on the surface perform in the opposite direction. These insights help tailor interventions down to a micro-segment level for effective personalization.
The automation analytics system 100 provides a fundamental suite of tools, visualizations, and models with which to perform additional drill-down analyses for extracting deeper insights and identifying intervention opportunities.
The automation analytics system 100 provides the following innovations:
(1) Automated, Data-Adaptive, Hierarchical Model Building
The automation analytics system 100 builds the predictive models in five stages. During the first stage, time-series and derived features are scanned to identify a manageable number of data-availability segments, with global feature optimization providing weighting during segmentation. Next, during the second stage, the automation analytics system identifies key student-success drivers for each data-availability segment. During the third stage, the automation analytics system uses the optimized feature subset to find student clusters within each data-availability segment, where each cluster contains a relatively homogeneous subset of students for transparency. Next, during the fourth stage, the automation analytics system performs feature optimization and model training for each segment-cluster combination, thereby identifying key drivers for success in each segment-cluster for transparency, actionable insights, and model robustness. Finally, during the fifth stage, the automation analytics system performs sensitivity analysis at a student or student-enrollment level to surface key drivers for success at that level. That is, the automation analytics system computes the relative contribution of each key driver to the student's success and rank-orders the segment-cluster-level key drivers for that student based on the relative level of contribution of each key driver or feature.
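The fifth-stage sensitivity analysis can be sketched, for a linear model, as ranking features by their contribution relative to the cluster mean (the feature names, coefficients, and values below are hypothetical illustrations only):

```python
import numpy as np

def rank_key_drivers(coefs, x_student, cluster_mean, names):
    """Rank segment-cluster-level key drivers for one student by each
    feature's contribution to a linear score, measured relative to the
    student's cluster mean."""
    contrib = coefs * (x_student - cluster_mean)
    order = np.argsort(-np.abs(contrib))       # largest contribution first
    return [(names[i], float(contrib[i])) for i in order]

# Hypothetical cluster-level model and one student's feature vector.
names = ['gpa_trend', 'credits_earned_frac', 'lms_logins']
coefs = np.array([1.2, 0.8, 0.3])
x_student = np.array([-0.5, 0.9, 1.1])
cluster_mean = np.array([0.0, 0.8, 1.0])
drivers = rank_key_drivers(coefs, x_student, cluster_mean, names)
```

In this toy case, the below-average GPA trend dominates the student's risk, so it would surface as the top key driver for that student.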
(2) Marriage of Predictive Models and Propensity Score Models for Outcomes Analysis
Most observational studies or small-sample randomized controlled trials (RCT) may suffer from selection bias, regression to the mean, and too many confounding variables without proper matching between test and control subjects in highly predictive covariates or features. Most straight propensity-score matching (PSM) methods may be inadequate if matching variables have little-to-no predictive power. A paper by P. C. Austin titled “A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003” reports that a majority of PSM-based clinical research papers failed to use appropriate statistical methods in balancing treated and untreated subjects. In order to address these issues simultaneously within the automation framework of the system, predictive models are combined with PSM so that an “on the fly” matching control group can be created that is indistinguishable from the intervention population, i.e., an apple-to-apple comparison, in the highly predictive covariate vector space, which can encompass inclusion/exclusion criteria. The system accomplishes the apple-to-apple comparison as follows:
(3) Course-Success Prediction
The automation analytics system 100 uses multiple techniques—for example, course/student similarity analyses, collaborative filtering, clustering of students based on the most predictive feature subset for course success and identifying similar courses similar students have taken, and dynamic feature-based prediction—to predict initial course success for guidance during advising sessions. In addition, using dynamic features as a term progresses, the models continuously update course-success predictions as well as time-dependent key drivers for engaging students and driving interventions. Course-grade prediction using the automation analytics system in accordance with an embodiment of the invention is now described in detail.
(4) Course Combination and Pathway Analysis
Using various representations of concurrent-course combinations and their grades along with key student attributes for success, the automation analytics system 100 looks for course-combination clusters that lead to unusual outcomes in comparison with when they were taken separately in different combinations. By using predicted course success as a proxy for student skills, the system can estimate inherent course difficulties adjusted for student skills to identify gatekeeper courses, and toxic or synergistic course combinations. These findings form the foundation of course-schedule optimization over time that can lead to student success and graduation. Optimizing course schedule using the system in accordance with an embodiment of the invention is now described in detail.
(5) Activity-Intervention-Performance Heat Maps
In health care, a patient's health heat map derived from various claims and clinical data has been used to provide not only the patient's risk scores, but also ongoing disease progression as a function of interventions and lifestyle parameters using a dynamic Bayesian network framework. Similarly, the automation analytics system 100 uses data from LMS, SIS, Customer Relationship Management (CRM), and other data sources to produce a student's heat map along his or her education journey in accordance with an embodiment of the invention as follows:
(6) Inferring Non-Cognitive Factors from the AIP Map
Inferring non-cognitive factors from the AIP map using the automation analytics system 100 in accordance with an embodiment of the invention is now described in detail.
(7) Faculty Engagement and Influence Scores
The system's construct for faculty engagement and influence scores is based on the following core tenets.
While traditional professional profiling algorithms focus on the cost of care adjusted for patient severity for physicians or on determining and then predicting the level of expertise, the approach used by the automation analytics system 100 looks for multiple outcomes variables, such as course success, withdrawal, continuation, improvements in these measures in comparison to predictions, and measurable changes in student behaviors/activities throughout the course and after student-faculty interactions. Based on these tenets, the system constructs the faculty engagement and influence scores as follows:
The following describes examples of insights that can be derived using the system. The first example is related to features that are good for prediction accuracy and/or insights. This example is described with reference to
The second example of insights is a 2×2 quadrant view with drill-down analysis. This example is described with reference to
A method for building analytical models for an education application in accordance with an embodiment of the invention is now described with reference to the process flow diagram of
In an embodiment, the methods or processes described herein are provided as a cloud-based service that can be accessed via Internet-enabled computing devices, which may include personal computers, laptops, tablets, smartphones, or any other device that can connect to the Internet.
It should be noted that at least some of the operations for the methods or processes described herein may be implemented using software instructions stored on a computer useable storage medium for execution by a computer using one or more processors. As an example, an embodiment of a computer program product includes a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform operations, as described herein.
Furthermore, embodiments of at least portions of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer-useable or computer-readable medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Current examples of optical disks include a compact disk with read only memory (CD-ROM), a compact disk with read/write (CD-R/W), and a digital video disk (DVD).
In the above description, specific details of various embodiments are provided. However, some embodiments may be practiced with less than all of these specific details. In other instances, certain methods, procedures, components, structures, and/or functions are described in no more detail than necessary to enable the various embodiments of the invention, for the sake of brevity and clarity.
This application is a Continuation application of U.S. patent application Ser. No. 14/592,821, filed on Jan. 8, 2015, which claims the benefit of U.S. Provisional Patent Application Ser. No. 61/925,186, filed on Jan. 8, 2014, the contents of which are incorporated herein by reference.
Number | Date | Country
---|---|---
61925186 | Jan 2014 | US
| Number | Date | Country
---|---|---|---
Parent | 14592821 | Jan 2015 | US
Child | 17400797 | | US