The disclosed technology relates generally to learning assessments, and more particularly various embodiments relate to systems and methods for multidimensional composite assessment scoring using machine learning.
Assessment examinations have been used to monitor and affect learning progressions in students. Techniques for gathering sufficient data from enough students to validate a learning progression have posed challenges. For example, it is costly and time consuming to score responses that include written explanations addressing key practices, such as arguing from evidence, and cross-cutting concepts, such as patterns. In the context of science assessments, it is difficult to reliably monitor relationships in a conceptual learning progression, such as the learning of cause and effect, matter cycles, and energy fluxes. Additionally, the Next Generation Science Standards (NGSS) call for integrating multiple dimensions of learning: science and engineering practices, cross-cutting concepts in science, and disciplinary core ideas.
Moreover, composite items that include both forced choice and extended response portions are becoming more widely used to provide additional diagnostic information about the test taker. Traditional scoring approaches evaluate forced choice and constructed responses separately. Thus, while composite questions that combine forced choice and constructed response types can provide additional insight into a learning progression, no methods have been available to efficiently score such composite examinations in a reliable and consistent manner that enables a holistic analysis across the different response types.
Systems and methods for enhanced monitoring of learning progression are provided. An example method of enhanced monitoring of learning progression may include generating a written exemplar worksheet (WEW), a rubric used to train human coders, for the human-scored composite examinations based on examination features associated with each question; using the WEW to code enough responses to create a training set; training a scoring model using a machine learning algorithm with the training set data; and validating the reliability of the scoring model using confusion matrices. A machine learning model may be applied to evaluate composite items with forced choice and constructed responses as a whole to accurately score the student response and provide a sub-score indicator that can be used as formative assessment feedback to guide future instruction. The features, parameters, and inputs to the machine learning scoring model may be modified until the model meets a reliability parameter (e.g., until a learning performance parameter generated from analysis of the confusion matrices exceeds a threshold value).
Some embodiments of the present disclosure provide a computer implemented method for enhancing monitoring of learning progression. In some examples, the method includes obtaining a first set of examinations and a first set of responses corresponding to the first set of examinations. For example, the first set of examinations may be learning assessments comprising questions eliciting forced choice responses, constructed responses, and/or mixed responses. The method may include generating a first set of examination assessments by critiquing each of the first set of responses, and compiling the critiqued responses in a relational database. In some examples, critiquing the first set of responses may be performed using a graphical user interface (e.g., by a human grader).
In some embodiments, the method includes training a multidimensional scoring model based on the first set of examinations, the first set of responses, and the first set of examination assessments. For example, the multidimensional scoring model may be a machine learning model. The method may include generating a confusion matrix based on the first set of examination assessments. For example, the confusion matrix may be a table used to describe the performance of a classification model on a set of data for which the true values are known. Some examples of the method include determining a performance assessment value from the confusion matrix. For example, the performance assessment value may be a Kappa, a quadratic weighted Kappa, an F score, a Matthews correlation coefficient, informedness, a ROC curve, a null error rate, a positive predictive value, a prevalence, a precision, a specificity, a sensitivity, a true positive rate, a misclassification rate, a false omission rate, a false discovery rate, a fall-out, a miss rate, a negative predictive value, an accuracy, or other performance assessment values calculated from confusion matrices as known in the art.
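As a hedged illustration (not part of the disclosure itself), a confusion matrix of the kind described above can be tallied from paired human-graded ("true") scores and model-predicted scores; the level labels `L1`–`L3` below are hypothetical:

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, labels):
    """Return a dict-of-dicts: counts[true_label][predicted_label]."""
    pairs = Counter(zip(true_labels, predicted_labels))
    return {t: {p: pairs[(t, p)] for p in labels} for t in labels}

# Hypothetical human-graded vs. model-predicted progression levels.
human = ["L1", "L2", "L2", "L3", "L1", "L3"]
model = ["L1", "L2", "L3", "L3", "L1", "L2"]
cm = confusion_matrix(human, model, labels=["L1", "L2", "L3"])
# Diagonal entries count agreements; off-diagonal entries count disagreements.
```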
In some embodiments, the method includes determining that the multidimensional scoring model has been sufficiently trained if the performance assessment value exceeds the selected threshold value. For example, in the case of a performance assessment value being a quadratic weighted Kappa, the selected threshold value may be 0.6 in some examples. In some examples, the selected threshold value for a quadratic weighted Kappa may be 0.7. In some embodiments, the selected threshold value for the performance assessment value may be entered using the graphical user interface, selected randomly, or pre-coded into the machine learning model.
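A minimal sketch of one such performance assessment value, the quadratic weighted Kappa, together with the threshold check described above; the sample ratings are invented for illustration:

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Cohen's Kappa with quadratic weights over integer levels 0..n_levels-1."""
    n = len(rater_a)
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1.0 / n
    # Marginal distributions of each rater.
    pa = [sum(row) for row in observed]
    pb = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    w = lambda i, j: ((i - j) ** 2) / ((n_levels - 1) ** 2)
    disagreement = sum(w(i, j) * observed[i][j]
                       for i in range(n_levels) for j in range(n_levels))
    expected = sum(w(i, j) * pa[i] * pb[j]
                   for i in range(n_levels) for j in range(n_levels))
    return 1.0 - disagreement / expected

# Invented ratings: human grader vs. model on four responses, three levels.
qwk = quadratic_weighted_kappa([0, 0, 1, 2], [0, 1, 1, 2], n_levels=3)
sufficiently_trained = qwk >= 0.6
```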
In some embodiments, the method may include obtaining a second set of examinations and a second set of responses corresponding to the second set of examinations. The method may include applying the trained multidimensional scoring model to each of the second set of responses and second set of examinations to determine a learning progression level associated with each response of the second set of responses. For example, the learning progression level may be associated with an individual learner (e.g., a student). The learning progression level may be used to indicate the learner's competence with respect to one or more subjects and to provide recommendations to the learner for improvement. In some embodiments, the learning progression level may be a grade or examination score. Some example methods also include displaying the learning progression level on the graphical user interface. The method may also include applying the trained multidimensional scoring model to each of the second set of responses and second set of examinations to determine a learning progression sublevel representing a learner error type.
Other features and aspects of the disclosed technology will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the features in accordance with embodiments of the disclosed technology. The summary is not intended to limit the scope of any inventions described herein, which are defined solely by the claims attached hereto.
The technology disclosed herein, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The drawings are provided for purposes of illustration only and merely depict typical or example embodiments of the disclosed technology. These drawings are provided to facilitate the reader's understanding of the disclosed technology and shall not be considered limiting of the breadth, scope, or applicability thereof. It should be noted that for clarity and ease of illustration these drawings are not necessarily made to scale.
The figures are not intended to be exhaustive or to limit the invention to the precise form disclosed. It should be understood that the invention can be practiced with modification and alteration, and that the disclosed technology be limited only by the claims and the equivalents thereof.
Embodiments of the technology disclosed herein are directed toward systems and methods for multidimensional composite assessment scoring using machine learning. Disclosed embodiments provide scalable systems and methods for electronically scoring composite examinations that include questions eliciting both forced choice and constructed responses to holistically assess, track, and enhance learning progression of individual learners. The multiple dimensions of the multidimensional composite assessment scoring systems and methods may include, for example, science and engineering practices, cross cutting concepts in science, and disciplinary core content. Each of these dimensions may be tested individually using various question formats, e.g., questions that elicit forced choice, constructed response, and/or mixed response formats.
A multidimensional scoring model may be applied to assess performance across multiple dimensions by designing assessments with composite question formats, using mixed question format types to test for more than one dimension at the same time. The scoring of these composite examinations may be enhanced using a machine learning algorithm to correlate responses and progression from each learner across the multiple dimensions, because those multiple dimensions are integrated in related fashion within individual questions.
The examinations may include multiple question formats. For example, some questions may elicit forced choice responses. Some questions may elicit constructed responses. Some questions may elicit a mixed format response, such as a constructed or freeform response to one part of the question, and a forced choice response to another part of the question related to the constructed response. Assessment inputs 110 may also include examination assessments obtained from learner interface 140 and/or data store 120. For example, examination assessments may include critiqued responses to the examinations, wherein the critiquing is performed by a human grader to provide a scoring rubric for each examination and response thereto. N-dimensional learning logical circuit 132 may then analyze answers to the examinations together with examination assessments to learn how a particular examination and response should be scored. N-dimensional learning logical circuit 132 may apply machine learning models such as a convolutional neural network, a decision tree, a logistic regression, a Bayes network, or other machine learning algorithms as known in the art. The trained version of the N-dimensional learning model may be stored in data store 120.
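As a rough, hypothetical sketch of one of the simpler algorithms in this family (a multinomial naive Bayes text classifier, standing in here for the "Bayes network or other machine learning algorithms"), trained on invented example responses and levels:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesScorer:
    """Hypothetical sketch: assign a progression level to a free-text
    response via multinomial naive Bayes over its words."""

    def fit(self, responses, levels):
        self.level_counts = Counter(levels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, level in zip(responses, levels):
            words = text.lower().split()
            self.word_counts[level].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total = sum(self.level_counts.values())
        best, best_lp = None, -math.inf
        for level, count in self.level_counts.items():
            lp = math.log(count / total)  # log prior for this level
            denom = sum(self.word_counts[level].values()) + len(self.vocab)
            for word in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                lp += math.log((self.word_counts[level][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = level, lp
        return best

# Invented training responses with hypothetical level codes.
scorer = NaiveBayesScorer().fit(
    ["the plant uses sunlight to make food", "plants eat soil for food"],
    ["L3", "L1"],
)
```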
In some embodiments, learning analytics server 130 includes assessment scoring logical circuit 134. Assessment scoring logical circuit 134 may obtain unscored composite examinations and responses thereto from data store 120 and/or learner interface 140. Assessment scoring logical circuit 134 may apply the trained N-dimensional scoring model to the unscored examinations and responses to determine learner progression for individual learners.
In some examples, learner interface 140 may be a graphical user interface. Learner interface 140 may be integrated on learning analytics server 130, or may be a remote workstation, computer, laptop, tablet, mobile device, scanner, fax machine, or other input device as known in the art. Data store 120 may be a local data storage device, a network data storage device, a cloud-based data storage device, or other storage device as known in the art. Learning analytics server 130 may communicate with learner interface 140, data store 120, and/or assessment inputs 110, over a direct connection, local area network, wide area network, wireless network, or other network communication system as known in the art. In some examples, learning analytics server 130 is operated from the cloud.
Method 200 may also include generating first sets of examination assessments at step 210. For example, generating examination assessments may be performed by one or more human graders through a user interface. As such, the human graders may critique each set of responses for each examination and provide examination assessments for each examination. The examination assessments may be compiled into a scoring rubric. For example, multiple response sets for the same examination administered to different learners may be compiled to provide example examination and response pairs for each level of critique. The human-graded critiques may be scored based on written exemplar worksheets (WEWs). WEWs may be created for particular examination and response sets by one or more master graders. WEWs are rubrics based on multidimensional learning, and may be used to train human graders to create the training set.
In some examples, human graders may score student responses by assigning a learning progression level and sublevel to each response. The sublevel may be tied to a specific student error, misconception, or omission. Once a set that contains a statistically significant data set for each sublevel is created, the training set may be considered complete. In some examples, the statistically significant number of responses to be scored for each sublevel may be 25 or more. In some examples, the statistically significant number of responses to be scored for each sublevel may be about 70. Varying numbers of responses per sublevel may be scored depending on a desired level of statistical confidence in the results.
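A minimal sketch of the completeness check described above, assuming the coded training set is a list of (response, sublevel) pairs; the sublevel codes and the per-sublevel minimum of 2 (rather than the 25 or more the text suggests) are toy values:

```python
from collections import Counter

def training_set_complete(coded_responses, min_per_sublevel=25):
    """coded_responses: list of (response_text, sublevel) pairs scored by
    human graders. The set is treated as complete once every sublevel that
    appears has at least min_per_sublevel scored responses (a sublevel with
    no responses at all is not detected by this simple check)."""
    counts = Counter(sublevel for _, sublevel in coded_responses)
    return all(n >= min_per_sublevel for n in counts.values())

coded = [("resp a", "1A"), ("resp b", "1A"), ("resp c", "2B")]
complete = training_set_complete(coded, min_per_sublevel=2)
# complete is False here: sublevel "2B" has only one scored response.
```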
In some embodiments, method 200 includes training a multidimensional scoring model based on the examination assessments at step 215. Training the multidimensional scoring model may employ machine learning techniques such as a convolutional neural network, a decision tree, a logistic regression, a Bayes network, or other machine learning techniques as known in the art. The training set may be converted to a standardized format (e.g., a standard document or database type, with standardized character types, language, etc.). Features from the responses may then be extracted. In some examples, the sublevel code may be selected as a nominal category rather than a numerical category. The method may include extracting forced choice responses as an additional feature to be evaluated by the scoring logical circuit (e.g., N-dimensional scoring logical circuit 132), and the value of the extracted forced choice response may be holistically evaluated with the scores of constructed responses. As such, the forced choice responses may not be scored as dichotomous or polytomous data and summed. Features of interest may be selected and/or identified and extracted.
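A hedged sketch of this feature extraction step, in which the forced choice selection is carried along as one more nominal feature instead of being scored dichotomously and summed; the vocabulary and sample response are invented:

```python
def extract_features(constructed_text, forced_choice, vocabulary):
    """Bag-of-words counts for the constructed response, with the raw
    forced-choice selection appended as one more (nominal) feature rather
    than being converted to a right/wrong point value."""
    words = constructed_text.lower().split()
    vector = [words.count(term) for term in vocabulary]
    vector.append(forced_choice)  # e.g., "B" kept as a category, not a score
    return vector

# Invented vocabulary and composite-item response.
vocab = ["energy", "matter", "sunlight"]
features = extract_features("Sunlight provides energy", "B", vocab)
```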
Some embodiments may include performing a regression of the machine-predicted evaluation assessment data and corresponding human-graded evaluation assessment data with N-dimensional scoring logical circuit 132. In some examples, a subsequent training pass may be performed on a smaller subset of features.
Method 200 may include generating a confusion matrix for the examination assessments. The confusion matrix may be a table that is used to describe the performance of a classification model on a set of test data for which true values are known. In this case, the true values for the examinations and responses may be the scoring rubric of compiled human-graded examination assessments. The classification model may be the multidimensional scoring model. In some examples, the multidimensional scoring model may be applied to examinations and responses to predictively critique those responses, and the machine-graded results may then be compiled in the confusion matrix as compared with the human-graded results as described herein.
Method 200 may include determining a performance assessment value from the confusion matrix at step 225. For example, the performance assessment value may be a Kappa, a quadratic weighted Kappa, an F score, a Matthews correlation coefficient, informedness, a ROC curve, a null error rate, a positive predictive value, a prevalence, a precision, a specificity, a sensitivity, a true positive rate, a misclassification rate, a false omission rate, a false discovery rate, a fall-out, a miss rate, a negative predictive value, an accuracy, or other performance assessment values calculated from confusion matrices as known in the art. In some embodiments, the method may include calculating a sublevel accuracy, level accuracy, Kappa, and Quadratic Weighted Kappa (QWK). In some examples, other combinations of performance assessment values may be calculated.
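As an illustrative sketch (not the disclosed implementation), two of the listed values, overall accuracy and an unweighted Cohen's Kappa, can be derived directly from a confusion matrix of counts:

```python
def metrics_from_confusion(cm):
    """cm[i][j] = count of responses with true level i predicted as level j.
    Returns overall accuracy and Cohen's (unweighted) Kappa."""
    n_levels = len(cm)
    total = sum(sum(row) for row in cm)
    agree = sum(cm[i][i] for i in range(n_levels)) / total
    row = [sum(cm[i]) / total for i in range(n_levels)]
    col = [sum(cm[i][j] for i in range(n_levels)) / total for j in range(n_levels)]
    # Chance agreement from the marginal distributions.
    chance = sum(row[i] * col[i] for i in range(n_levels))
    kappa = (agree - chance) / (1 - chance)
    return {"accuracy": agree, "kappa": kappa}

# Toy two-level confusion matrix: 8 agreements, 2 disagreements.
m = metrics_from_confusion([[4, 1], [1, 4]])
```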
Embodiments of method 200 may include determining that the multidimensional scoring model has been sufficiently trained if the performance assessment value exceeds a threshold level at step 230. The threshold level may be pre-determined (e.g., coded in the multidimensional scoring model), or may be obtained from a user (e.g., through a graphical user interface). In some examples, the performance assessment value includes a QWK. The multidimensional scoring model may be trained while varying the number of features used by the model to find a maximum QWK value. In some examples, the multidimensional scoring model may be considered sufficiently trained if QWK is greater than about 0.6. In some examples, the multidimensional scoring model may be considered sufficiently trained if QWK is greater than about 0.7. If the threshold value is not reached, a different machine learning model may be selected, e.g., a convolutional neural network, decision tree, logistic regression, Bayes network, or other machine learning model as known in the art.
In some embodiments, method 200 may include applying the multidimensional scoring model to unscored examination responses at step 250. For example, unscored responses may be examinations taken by one or more learners which have not been human graded. For example, referring to
Method 250 for applying the multidimensional scoring model to unscored examination responses may include applying the trained multidimensional scoring model to examinations and responses from the second sets of examinations and responses at step 260. No human grading is necessary at this step. However, human grading may still be applied for quality assurance, to verify anomalous results, and/or to continue to train the multidimensional scoring model.
In some embodiments, method 250 includes determining a learning progression level at step 265. The learning progression level may be a score or grade generated by the multidimensional scoring model for one or more scored responses. In some examples, a learning progression level may be generated for multiple dimensions of learning, e.g., science and engineering practices, cross cutting concepts in science, and/or disciplinary core ideas. The learning progression level(s) may be displayed on learner interface 140 at step 270. For example, learner interface 140 may include a graphical user interface configured to display individualized scoring results. In some examples, learner interface 140 may provide a learner with recommendations for studying, test preparation, and/or test taking based on the learning progression level(s).
Some examples of method 250 include determining a learning progression sublevel at step 275. For example, the sublevel may be tied to specific student errors, misconceptions, or omissions. Method 250 may also include displaying the learning progression sublevel(s) on learner interface 140.
Method 400 may include training a multidimensional scoring model with the training set using a full set of selected features at step 415 in a first training pass. In some examples, method 400 may also include applying a machine learning algorithm, e.g., a logistic regression, to a smaller subset of features. Method 400 may also include extracting a confusion matrix at step 420 and determining a learning progression parameter (e.g., a QWK) from the confusion matrix at step 425. The multidimensional scoring model may be trained until the learning progression parameter exceeds a selected threshold (i.e., training continues until the learning progression parameter meets or exceeds the selected threshold) at step 430.
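A hypothetical sketch of such a retraining loop, with stand-in `train` and `evaluate` callables; the feature subsets and scores below are invented, and in practice `evaluate` would compute a QWK from a confusion matrix as described above:

```python
def train_until_reliable(feature_subsets, train, evaluate, threshold=0.6):
    """Retrain on successive feature subsets and stop once the learning
    progression parameter (e.g., QWK) meets or exceeds the threshold."""
    for features in feature_subsets:
        model = train(features)
        score = evaluate(model)
        if score >= threshold:
            return model, score
    return None, None  # threshold never reached; try a different model type

# Toy stand-ins: each "model" just records its feature count, and the
# evaluator looks up an invented QWK that improves as features are pruned.
subsets = [["w%d" % i for i in range(n)] for n in (200, 50, 10)]
fake_scores = {200: 0.41, 50: 0.58, 10: 0.66}
model, qwk = train_until_reliable(
    subsets,
    train=lambda feats: {"n_features": len(feats)},
    evaluate=lambda m: fake_scores[m["n_features"]],
)
```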
As used herein, the terms logical circuit and engine might describe a given unit of functionality that can be performed in accordance with one or more embodiments of the technology disclosed herein. As used herein, either a logical circuit or an engine might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up an engine. In implementation, the various engines described herein might be implemented as discrete engines or the functions and features described can be shared in part or in total among one or more engines. In other words, as would be apparent to one of ordinary skill in the art after reading this description, the various features and functionality described herein may be implemented in any given application and can be implemented in one or more separate or shared engines in various combinations and permutations. Even though various features or elements of functionality may be individually described or claimed as separate engines, one of ordinary skill in the art will understand that these features and functionality can be shared among one or more common software and hardware elements, and such description shall not require or imply that separate hardware or software components are used to implement such features or functionality.
Where components, logical circuits, or engines of the technology are implemented in whole or in part using software, in one embodiment, these software elements can be implemented to operate with a computing or logical circuit capable of carrying out the functionality described with respect thereto. One such example logical circuit is shown in
Referring now to
Computing system 500 might include, for example, one or more processors, controllers, control engines, or other processing devices, such as a processor 504. Processor 504 might be implemented using a general-purpose or special-purpose processing engine such as, for example, a microprocessor, controller, or other control logic. In the illustrated example, processor 504 is connected to a bus 502, although any communication medium can be used to facilitate interaction with other components of logical circuit 500 or to communicate externally.
Computing system 500 might also include one or more memory engines, simply referred to herein as main memory 508. For example, random access memory (RAM) or other dynamic memory might be used for storing information and instructions to be executed by processor 504. Main memory 508 might also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Logical circuit 500 might likewise include a read only memory (“ROM”) or other static storage device coupled to bus 502 for storing static information and instructions for processor 504.
The computing system 500 might also include one or more various forms of information storage mechanism 510, which might include, for example, a media drive 512 and a storage unit interface 520. The media drive 512 might include a drive or other mechanism to support fixed or removable storage media 514. For example, a hard disk drive, a floppy disk drive, a magnetic tape drive, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive might be provided. Accordingly, storage media 514 might include, for example, a hard disk, a floppy disk, magnetic tape, cartridge, optical disk, a CD or DVD, or other fixed or removable medium that is read by, written to or accessed by media drive 512. As these examples illustrate, the storage media 514 can include a computer usable storage medium having stored therein computer software or data.
In alternative embodiments, information storage mechanism 510 might include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into logical circuit 500. Such instrumentalities might include, for example, a fixed or removable storage unit 522 and an interface 520. Examples of such storage units 522 and interfaces 520 can include a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory engine) and memory slot, a PCMCIA slot and card, and other fixed or removable storage units 522 and interfaces 520 that allow software and data to be transferred from the storage unit 522 to logical circuit 500.
Logical circuit 500 might also include a communications interface 524. Communications interface 524 might be used to allow software and data to be transferred between logical circuit 500 and external devices. Examples of communications interface 524 might include a modem or softmodem, a network interface (such as an Ethernet, network interface card, WiMedia, IEEE 802.XX or other interface), a communications port (such as for example, a USB port, IR port, RS232 port, Bluetooth® interface, or other port), or other communications interface. Software and data transferred via communications interface 524 might typically be carried on signals, which can be electronic, electromagnetic (which includes optical) or other signals capable of being exchanged by a given communications interface 524. These signals might be provided to communications interface 524 via a channel 528. This channel 528 might carry signals and might be implemented using a wired or wireless communication medium. Some examples of a channel might include a phone line, a cellular link, an RF link, an optical link, a network interface, a local or wide area network, and other wired or wireless communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as, for example, memory 508, storage unit 522, media 514, and channel 528. These and other various forms of computer program media or computer usable media may be involved in carrying one or more sequences of one or more instructions to a processing device for execution. Such instructions embodied on the medium are generally referred to as “computer program code” or a “computer program product” (which may be grouped in the form of computer programs or other groupings). When executed, such instructions might enable the logical circuit 500 to perform features or functions of the disclosed technology as discussed herein.
Although
While various embodiments of the disclosed technology have been described above, it should be understood that they have been presented by way of example only, and not of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosed technology, which is done to aid in understanding the features and functionality that can be included in the disclosed technology. The disclosed technology is not restricted to the illustrated example architectures or configurations, but the desired features can be implemented using a variety of alternative architectures and configurations. Indeed, it will be apparent to one of skill in the art how alternative functional, logical or physical partitioning and configurations can be implemented to implement the desired features of the technology disclosed herein. Also, a multitude of different constituent engine names other than those depicted herein can be applied to the various partitions.
Additionally, with regard to flow diagrams, operational descriptions and method claims, the order in which the steps are presented herein shall not mandate that various embodiments be implemented to perform the recited functionality in the same order unless the context dictates otherwise.
Although the disclosed technology is described above in terms of various exemplary embodiments and implementations, it should be understood that the various features, aspects and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described, but instead can be applied, alone or in various combinations, to one or more of the other embodiments of the disclosed technology, whether or not such embodiments are described and whether or not such features are presented as being a part of a described embodiment. Thus, the breadth and scope of the technology disclosed herein should not be limited by any of the above-described exemplary embodiments.
Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; the terms “a” or “an” should be read as meaning “at least one,” “one or more” or the like; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. Likewise, where this document refers to technologies that would be apparent or known to one of ordinary skill in the art, such technologies encompass those apparent or known to the skilled artisan now or at any time in the future.
The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent. The use of the term “engine” does not imply that the components or functionality described or claimed as part of the engine are all configured in a common package. Indeed, any or all of the various components of an engine, whether control logic or other components, can be combined in a single package or separately maintained and can further be distributed in multiple groupings or packages or across multiple locations.
Additionally, the various embodiments set forth herein are described in terms of exemplary block diagrams, flow charts and other illustrations. As will become apparent to one of ordinary skill in the art after reading this document, the illustrated embodiments and their various alternatives can be implemented without confinement to the illustrated examples. For example, block diagrams and their accompanying description should not be construed as mandating a particular architecture or configuration.
The technology disclosed herein was developed with government support under Contract No. NSF 14-522 awarded by the U.S. National Science Foundation. The government has certain rights in the invention.