Tests of various forms are often used to assess the proficiency of a test-taker with regard to a specific skill or to assess the knowledge they have acquired through study. However, most methodologies used in scoring a test rely on determining the number of “correct” responses. While this is helpful, it may not be fully reflective of a test-taker's ability, as some questions may be much more difficult than others or may require specific skills that demonstrate greater ability on the part of the user (who may be a test taker or subject learner, as examples). Thus, the relative difficulty of a test item (i.e., a test question or task) and the skills required to complete it may be important factors to consider when evaluating a person's performance.
Conventional educational and psychological modeling (with applications to both instruction (teaching) and assessment (testing)) relies on methods such as item response theory (IRT)1 or knowledge tracing (KT)2, along with pilot or operational test data, to estimate item parameters (e.g., difficulty) and test-taker ability parameters. However, results obtained in this manner are often norm-referenced rather than criterion-referenced, meaning they are interpretable relative to the pilot or test-taking population rather than to the characteristics of the underlying construct. This may reduce the utility and/or validity of the interpretation of the test or item results, as it ignores the relative difficulty of an item when considering the test-taker's abilities and knowledge. 1 In psychometrics, item response theory (IRT) (also known as latent trait theory, strong true score theory, or modern mental test theory) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. See the Wikipedia entry for “Item Response Theory”. 2 Knowledge tracing may be treated as the task of modeling student knowledge over time so that one can accurately predict how students will perform on future interactions.
Previously, Settles et al. (2020)3 described a method that alleviates the need for pilot or operational data to create such models by using data labeled by subject matter or domain experts to train machine learning models for estimations of item difficulty. However, while an improvement, this approach is not an optimal solution. For example, once a test has been operationalized and significant observational item responses are available, the method described by Settles et al. cannot directly combine both expert-annotated data and the operational test data (i.e., the item responses collected during previous administrations of the test). This is a disadvantage, as without operational test data, the item difficulty estimates are less accurate than what can often be achieved with IRT methods. Furthermore, the method described in Settles et al. can only generate an estimate of item difficulty, and typically cannot be used to estimate item discrimination or other item parameters that may be relevant to modeling the relationship between test-taker ability and item responses. Finally, test scores derived from the Settles et al. method are inherently criterion-referenced (based on a rubric) rather than norm-referenced (based on the test-taking population). 3 B. Settles, G. T. LaFlair, and M. Hagiwara. 2020. Machine Learning Driven Language Assessment. Transactions of the Association for Computational Linguistics, vol. 8, pp. 247-263.
Embodiments of the disclosure overcome these and other disadvantages of conventional approaches to evaluating the performance of a test-taker, both collectively and individually.
The terms “invention,” “the invention,” “this invention,” “the present invention,” “the present disclosure,” or “the disclosure” as used herein are intended to refer broadly to all the subject matter disclosed in this document, the drawings or figures, and to the claims. Statements containing these terms do not limit the subject matter disclosed or the meaning or scope of the claims. Embodiments covered by this disclosure are defined by the claims and not by this summary. This summary is a high-level overview of various aspects of the disclosure and introduces some of the concepts that are further described in the Detailed Description section below. This summary is not intended to identify key, essential or required features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, to any or all figures or drawings, and to each claim.
In some embodiments, the disclosed system, apparatuses, and methods take into consideration both domain expert annotations and operational test data produced by actual test takers. In one embodiment, this is accomplished by using a “multi-task” machine learning (ML) approach that extends standard or conventional psychometric models. This results in a model that is both criterion-referenced and norm-referenced and yields more reliable results with stronger validity evidence than either conventional approach can achieve independently.
In some embodiments, the disclosed system, apparatuses, and methods bridge a gap between the two conventional approaches mentioned (IRT and KT), by explaining or describing item parameters in terms of item features, which may be interpreted as, or linked to, sub-skills that test-takers may need to master in order to answer a test item correctly.
In some embodiments, the approach disclosed herein produces a model representing both (1) rubric or classification systems used by domain (subject matter) experts and (2) the psychometric properties that may be inferred from pilot or operational testing data. Thus, an important benefit of the described approach is that it enables joint use of both (1) expert-annotated data describing the construct of the educational or psychological domain being modeled or tested (facilitating criterion-referenced interpretations), and (2) empirical pilot or operational test item response data from the test-taking population (facilitating norm-referenced interpretations).
In some embodiments, the disclosed approach may provide a potential solution to the “cold start problem”, which occurs when initially only expert annotation data may be available. The disclosed approach may also provide a solution to the “fast start problem”, which occurs when operational data may be available but are limited, such as with computer-adaptive tests (CATs) where item exposure is controlled to maintain test security and integrity. In these situations, the disclosed model's item parameter estimates can gradually be refined and improved as pilot or operational data becomes available in a sufficient quantity.
The disclosed approach may also provide a solution to the “jump start problem”, which occurs when there are large amounts of operational data available for an existing set of test items, and one wishes to use that data to estimate parameters for new items for which little or no operational data is available. In this case, the disclosed approach may be used to generalize item parameter estimates for existing items to make more accurate estimates of item parameters for new items. The disclosed approach also provides a way to interpret the results in terms of both criterion-referenced (from expert data) and norm-referenced (from pilot/operational data) values. This may provide greater insight into a test taker's performance and abilities.
In some embodiments, the disclosure is directed to a method for more effectively assessing the performance of a person on a test or at completing a task. The method may include model design, parameter estimation, and exam administration phases, as examples. In one embodiment, the disclosed method may include the following steps, stages, or operations:
Estimating Item Parameter Feature Function (IPFF) weights to jointly predict a test-taker response to a test item and a subject-matter expert's evaluation of the test item difficulty;
In one embodiment, the disclosure is directed to a system for more effectively assessing the performance of a person on a test or at completing a task. The system may include a set of computer-executable instructions, a memory or data storage element (such as a non-transitory computer-readable medium) on (or in) which the instructions are stored, and one or more electronic processors or co-processors. When executed by the processors or co-processors, the instructions cause the processors or co-processors (or a device of which they are part) to perform a set of operations that implement an embodiment of the disclosed method or methods.
In one embodiment, the disclosure is directed to a non-transitory computer-readable medium containing a set of computer-executable instructions, wherein when the set of instructions is executed by one or more electronic processors or co-processors, the processors or co-processors (or a device of which they are part) perform a set of operations that implement an embodiment of the disclosed method or methods.
In some embodiments, the systems and methods disclosed herein may provide services through a SaaS or multi-tenant platform. The platform provides access to multiple entities, each with a separate account and associated data storage. Each account may correspond to a user, a set of users, an entity, a set or category of entities, a set or category of users, a set or category of tests or tasks, an industry, or an organization, for example. Each account may access one or more services, a set of which are instantiated in their account, and which implement one or more of the methods or functions described herein.
Other objects and advantages of the systems, apparatuses, and methods disclosed will be apparent to one of ordinary skill in the art upon review of the detailed description and the included figures. Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the embodiments disclosed or described herein are susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and will be described in detail herein. However, the exemplary or specific embodiments are not intended to be limited to the forms disclosed. Rather, the present disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
Embodiments of the disclosure are described with reference to the drawings, in which:
Note that the same numbers are used throughout the disclosure and figures to reference like components and features.
One or more embodiments of the disclosed subject matter are described herein with specificity to meet statutory requirements, but this description does not limit the scope of the claims. The claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other existing or later developed technologies. The description should not be interpreted as implying any required order or arrangement among or between various steps or elements except when the order of individual steps or arrangement of elements is explicitly noted as being required.
Embodiments of the disclosed subject matter will be described more fully herein with reference to the accompanying drawings, which show by way of illustration, example embodiments by which the disclosed systems, apparatuses, and methods may be practiced. However, the disclosure may be embodied in different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy the statutory requirements and convey the scope of the disclosure to those skilled in the art.
Among other forms, the subject matter of the disclosure may be embodied in whole or in part as a system, as one or more methods, or as one or more devices. Embodiments may take the form of a hardware implemented embodiment, a software implemented embodiment, or an embodiment combining software and hardware aspects. For example, in some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented by a suitable processing element or elements (such as a processor, microprocessor, co-processor, CPU, GPU, TPU, OPU, state machine, or controller, as non-limiting examples) that are part of a client device, server, network element, remote platform (such as a SaaS platform), an “in the cloud” service, or other form of computing or data processing system, device, or platform.
The processing element or elements may be programmed with a set of executable instructions (e.g., software instructions), where the instructions may be stored on (or in) one or more suitable non-transitory data storage elements. In some embodiments, the set of instructions may be conveyed to a user over a network (e.g., the Internet) through a transfer of instructions or an application that executes a set of instructions.
In some embodiments, the systems and methods disclosed herein may provide services to end users through a SaaS or multi-tenant platform. The platform provides access to multiple entities, each with a separate account and associated data storage. Each account may correspond to a user, a set of users, an entity, a set or category of entities, a set or category of users, a set or category of tests or tasks, an industry, or an organization, for example. Each account may access one or more services (such as applications or functionality), a set of which are instantiated in their account, and which implement one or more of the methods, process, operations, or functions disclosed herein.
In some embodiments, one or more of the operations, functions, processes, or methods disclosed herein may be implemented by a specialized form of hardware, such as a programmable gate array, application specific integrated circuit (ASIC), or the like. Note that an embodiment of the disclosed methods may be implemented in the form of an application, a sub-routine that is part of a larger application, a “plug-in”, an extension to the functionality of a data processing system or platform, or other suitable form. The following detailed description is, therefore, not to be taken in a limiting sense.
In some embodiments, the disclosure is directed to systems, apparatuses, and methods for training a model to jointly predict (a) test item annotations (e.g., a domain or subject-matter expert's assessment of the difficulty level of a test item) and (b) test-taker responses. The disclosed approach may be used to estimate item parameters used in tests for evaluating the proficiency of a test taker, and more specifically, as part of a language proficiency test or examination. As discussed, item parameters are often estimated through item-response theory (IRT) frameworks, where such an approach has the disadvantage of requiring extensive response data from test takers.
Embodiments overcome this disadvantage in at least two ways: 1) by adopting an explanatory IRT framework that estimates test item parameters in terms of item features, enabling representation sharing across items to reduce the amount of test-taker response data needed, and 2) by generalizing conventional IRT model estimation techniques into a supervised multi-task learning model training framework that can leverage both test-taker responses and subject-matter expert annotations in the parameter estimation process. Among other advantages, the disclosed approach is an effective and practical solution to the cold start, fast start, and/or jump start problems common to test design.
In a broad sense, some embodiments incorporate one or more of the following aspects and provide corresponding advantages to test developers:
Embodiments may comprise a trained, multi-task machine learning (ML) model that combines both supervised learning of test item parameters (by training the model on subject-matter expert-labeled data using a rubric or construct) and item response theory (by training the model on observational item response data). This combined approach allows the model to refine item parameter estimates by incorporating test-taker response data but does not require it to produce high-quality estimates, even for novel test items.
With regard to an Item Parameter Feature Function (IPFF), an embodiment may employ a generalized linear model, a neural network, or another mathematical function that depends on item features and includes one or more differentiable weights. The type of model or technique used may depend on the application conditions and on the empirical performance of the different approaches (e.g., manual features vs. automatic identification using a neural network). In simpler cases, the IPFF may be a linear function (i.e., a weighted sum of features) or a log-linear function (i.e., the log of a weighted sum of features).
With this general definition of an IPFF function, one can also implement some item parameters that are learned per-item as in conventional non-explanatory IRT models, and not based on item features. This can be accomplished by using one-hot encoding on items and including those in the item features. Similarly, one can implement item parameters that are learned as a shared parameter across an entire set of items, which may be accomplished by using an IPFF that ignores item features and includes a single learnable weight.
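For illustration only, the following Python sketch shows how the three IPFF variants just described (a weighted sum of features, a per-item parameter recovered via one-hot features, and a single shared parameter) might be expressed in code; the function names and feature layout are hypothetical and are not drawn from the disclosure.

```python
import numpy as np

def ipff_linear(weights, item_features):
    """Weighted-sum IPFF: item parameter = sum_k w_k * phi_k(item)."""
    return float(np.dot(weights, item_features))

def one_hot_features(item_index, num_items):
    """One-hot item encoding; combined with ipff_linear this recovers a
    per-item parameter, as in a conventional non-explanatory IRT model."""
    phi = np.zeros(num_items)
    phi[item_index] = 1.0
    return phi

def ipff_shared(shared_weight, item_features=None):
    """IPFF that ignores item features and exposes a single learnable
    weight shared by every item in the set."""
    return float(shared_weight)

# Example: difficulty of item 3 (of 10) under a per-item parameterization.
rng = np.random.default_rng(0)
per_item_weights = rng.normal(size=10)  # hypothetical learned weights
difficulty_item_3 = ipff_linear(per_item_weights, one_hot_features(3, 10))
```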
As described, in some embodiments, the constructed model(s) and training process may allow for direct comparison between item difficulty, test-taker proficiency, and a domain-specific framework (such as the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001)) on a common logit scale.4 4 In statistics, the logit function or the log-odds is the logarithm of the odds; logit(p) = log(p/(1−p)) = log(p) − log(1−p) = −log(1/p − 1), where p is a probability. It is a function that maps probability values from (0,1) to (−∞,+∞).
In some embodiments, the item features may include arbitrary, rich linguistic (or other domain-relevant) features, such as counts or embeddings generated from passage-based test items. Additionally, some forms of features may assist in the interpretation of test item parameters in terms of linguistic theories and provide evidence that supports interpretation of test scores. For example, for language tests, such a model may utilize passage or contextual word embeddings (such as those generated by an embodiment of ELMo, Embeddings from Language Models, or BERT, Bidirectional Encoder Representations from Transformers) that facilitate strong generalization for an item type and address the cold start, fast start, and jump start problems by reducing the pilot testing required to introduce new items. Furthermore, the item parameters derived from using some of the described techniques may correlate with lexico-grammatical features of passages or words that are known to correlate well with reading complexity.
Using such item parameters in test design and administration can provide evidence that supports the interpretation of a test-taker's attained language comprehension and related skills. In general, increased reading complexity correlates with increased difficulty estimated by the model. As such, test takers with patterns of correct answers to higher complexity items (and therefore higher difficulty items under the model) can be inferred to have a higher relative proficiency or ability.
In some embodiments, the disclosed approach extends a conventional Rasch IRT model5 in two ways. First, in contrast to the standard IRT model which uses a single parameter for each item to represent task or test item difficulty, embodiments of the disclosure deconstruct item difficulty (or other item-specific parameter) and represent it as an Item Parameter Feature Function (IPFF), which may take the form of a weighted sum of item features. Second, embodiments incorporate this extended Rasch model into an ordinal-logistic regression multi-task learning framework to generate a unified estimation of test item difficulty across CEFR-labeled data and test-taker response data. 5 A standard Rasch model is a special case of logistic regression with one parameter per student i (their proficiency αi, an element of R) and one parameter per test item j (its difficulty βj, an element of R): p(y=1 | i, j) = σ(αi − βj).
This formulation allows a test creator to take advantage of both ordinal labels of items (e.g., subject-matter expert annotation rubrics based on a proficiency standard) and dichotomous responses to those items (e.g., correct/incorrect test taker responses to individual test items). The disclosed approach may also take advantage of additional item annotation data: for example, combining test-taker responses with labels from multiple, related standards-based rubrics (e.g., both the CEFR and ACTFL or another language proficiency framework). These labeled items can be disjoint or overlapping in the training data; however, it is important that the annotated items share the same feature representation as the items in the test-taker response data.
In some embodiments, the disclosed approach uses a common underlying logit scale to fit two or more separate, but related data sets. For example (and without loss of generality), the multi-task approach disclosed herein allows both ordinal annotations from subject-matter experts and dichotomous responses from test takers to be used as part of a single unified model.
If other domain-relevant annotation data exists, it may be incorporated into an embodiment or implementation of the disclosed approach. Such auxiliary data might include the domain or constructs of items, such as the subject matter, narrative style and purpose, or readability scores provided by experts, as non-limiting examples. These data and labels could provide additional signal(s) to the model in a multi-task framework, similarly to how the CEFR data can provide information to the task of predicting a user's likelihood of getting an item correct, and vice versa. The auxiliary data could be incorporated either as features of a test item (if the auxiliary data is known for all items) or as an entirely separate prediction task (e.g., given an item, predict its readability score as estimated by 3 experts). As an example, one could use input texts that are known to exhibit “beginner” instead of “advanced” language for training a language testing embodiment, even if these texts are not strictly aligned by domain experts to the proficiency standard.
Some embodiments may employ other generalizations of the Rasch or IRT model (also known as a “one parameter” or “1PL” IRT model), such as 2PL, 3PL, or 4PL models, to incorporate parameters for item-level discrimination, guess-ability, or slippage in addition to difficulty. Embodiments may also use IRT models that use alternative item response functions, such as continuous item response functions or item response functions that accommodate mixtures of continuous and discrete responses. As with the item difficulty parameter, these item parameters may be deconstructed and represented by an Item Parameter Feature Function (IPFF). Embodiments of the disclosure may implement these generalizations and are fully compatible with other IRT frameworks or representations as well.
In one embodiment, the disclosed IRT model uses a 2-parameter logistic response function. It has two parameters, each of which may be represented as a weighted-sum of features (or other form of IPFF). To adapt the disclosed IRT model to other forms of IRT models, one would replace the 2-parameter logistic response function with the appropriate item response function for that model. The function's test item parameters would be deconstructed in the same manner as disclosed herein for item difficulty and discrimination.
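As a non-authoritative sketch of the 2-parameter case just described, the code below evaluates a 2PL item response function in which both the difficulty and the discrimination are produced by weighted-sum IPFFs; the exponential used to keep the discrimination positive, along with all numeric values, is an assumption made for the example rather than the exact parameterization of the disclosure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_pl_probability(ability, item_features, w_difficulty, w_discrimination):
    """p(correct) = sigma(a_j * (alpha_i - b_j)), where
    b_j = w_difficulty . phi(j) and a_j = exp(w_discrimination . phi(j))."""
    b_j = float(np.dot(w_difficulty, item_features))
    a_j = float(np.exp(np.dot(w_discrimination, item_features)))
    return sigmoid(a_j * (ability - b_j))

# Hypothetical item with three features and a moderately able test taker.
phi_j = np.array([1.0, 0.4, 0.2])
print(two_pl_probability(0.5, phi_j,
                         w_difficulty=np.array([0.3, 0.1, -0.2]),
                         w_discrimination=np.array([0.0, 0.5, 0.5])))
```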
In some embodiments, the disclosed approach enumerates exam sessions {1, . . . S}, and test items {1, . . . N}, where S and N denote the number of sessions and number of items, respectively. In the traditional Rasch model (a 1-parameter logistic IRT model), the probability of a test-taker in exam session i∈{1, . . . S} correctly responding to item j∈{1, . . . N} is modeled as an item response function (IRF) of the form:
p(Yij=1; Θ)=σ(αi−βj)=exp(αi−βj)/(1+exp(αi−βj)),
where the parameter Θ = (α, β) is an element of R^(S+N) and represents the proficiency of each test-taker and the difficulty of each test item, respectively. Note that the symbol “σ” is used to represent the sigmoid function. As can be understood from the formula, the more proficient the student (i.e., the higher their skill level or ability), the higher the probability of a correct response. From another perspective, the model suggests that the greater a test taker's “proficiency” is relative to the difficulty of an item or task, the better the test taker is expected to perform (as indicated by the probability (p) approaching a normalized value of 1).
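To make the formula above concrete, here is a minimal sketch of the Rasch item response function; the numeric values are arbitrary examples, not parameters from the disclosure.

```python
import math

def rasch_probability(proficiency, difficulty):
    """p(Y_ij = 1) = sigma(alpha_i - beta_j) for the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))

# A test taker whose proficiency exceeds the item difficulty by one logit
# is expected to respond correctly about 73% of the time.
print(rasch_probability(1.5, 0.5))  # ~0.731
```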
In some embodiments, to model test-taker response data, the Rasch or standard IRT model may be generalized (extended) into a linear logistic test model (LLTM) by deconstructing the item difficulty, βj, into multiple features. Each feature may be represented by a function (ϕk) with a corresponding weight (ωk). Note that though this example embodiment represents the overall task or item difficulty as a weighted combination of item features, other forms of combining the individual features may be used, as disclosed herein. Based on the assumed form of the test item or task difficulty, the equation or formula is extended to model the probability of the user in exam session i responding correctly to item j as:
p(Yij=1; Θ)=σ(αi−Σ[ωk.ϕk(j)]),
where the sum is over k=1 to K (meaning a total of K features are considered), the parameter Θ = (α, ω) is an element of R^(S+K), and the approach uses K feature functions ϕ to extract features of an item. For example, ϕk may represent features of linguistic complexity, word frequency, and/or topical relevance for embodiments in a language assessment setting. In LLTMs, these item features are typically interpreted as “skills” associated with responding to the item, but the log-linear formulation allows for arbitrary numerical features to be incorporated (e.g., textual embeddings such as BERT, or other domain-relevant indexes).
In some embodiments, skills may be represented as item features by using Boolean (0/1) indicators to denote whether a skill is required or not required for a given item. The disclosed model also supports arbitrary numeric values. In this context, the feature functions that extract features may be a user-defined “function” that can be computed on an item, such as an expert-labeled response to “does this item require understanding the past participle”, which would be a binary function (and similar to a “skill” as defined above). Feature functions may instead be something of the form “what is the length of the longest clause in this sentence” with a whole number value response, or “what is the ratio of pronouns to non-pronouns” which would have a fractional result. Further, the feature functions may be automatically extracted values and have no specific, individual human-understandable interpretation, such as those produced by BERT or another form of language processing.
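The following sketch, offered only as an illustration, implements feature functions of the kinds mentioned above (a binary expert-labeled skill indicator, a whole-number clause-length measure, and a fractional pronoun ratio) and combines them into an LLTM-style difficulty; the item fields, feature choices, and weights are invented for the example.

```python
import math

def requires_past_participle(item):
    # Binary, expert-labeled "skill" feature (hypothetical field name).
    return 1.0 if item.get("requires_past_participle", False) else 0.0

def longest_clause_length(item):
    # Whole-number feature: length, in words, of the longest clause.
    return float(max((len(c.split()) for c in item.get("clauses", [])), default=0))

def pronoun_ratio(item):
    # Fractional feature: pronouns divided by non-pronouns (guarding against zero).
    return item.get("pronoun_count", 0) / max(item.get("non_pronoun_count", 1), 1)

FEATURE_FUNCTIONS = [requires_past_participle, longest_clause_length, pronoun_ratio]

def lltm_probability(proficiency, item, weights):
    """p(Y_ij = 1) = sigma(alpha_i - sum_k w_k * phi_k(j))."""
    difficulty = sum(w * phi(item) for w, phi in zip(weights, FEATURE_FUNCTIONS))
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))

item = {"requires_past_participle": True,
        "clauses": ["the dog that the cat chased ran away"],
        "pronoun_count": 1, "non_pronoun_count": 7}
print(lltm_probability(1.0, item, weights=[0.8, 0.05, 0.5]))
```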
As an example of domain or subject matter expert annotations, the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001) defines guidelines for the language proficiency of non-native language learners. Its six levels of proficiency are (from lowest to highest) A1, A2, B1, B2, C1, and C2. Unlike the categories in some traditional classification tasks, the set of CEFR categories is ordered, so the disclosed model can treat predicting an item or passage's CEFR level (z) as an ordinal regression problem. This enables an embodiment of the disclosed approach to model a CEFR level prediction task (i.e., inference of the CEFR level corresponding to a test item) in a generalized linear model by learning a set of cutpoints (level boundaries) that divide the logit scale representing item difficulty among the CEFR levels. In the scenario described, the CEFR labels are being used as an auxiliary prediction task in the disclosed multi-task learning framework.
In some embodiments, the disclosed approach models a discrete probability distribution of an item passage's CEFR level (or other criterion-referenced reading difficulty scale), z, using ordinal regression. To do this, the approach implements one or more of the following operations or processes:
The transformation step (logit into a (log-) probability) requires the inverse of a “link” function. For linear regression, the link is the identity function; for logistic and softmax regression, it is the logit function (whose inverse is the sigmoid or softmax, respectively). For the ordinal regression case, the disclosed approach uses what has alternatively been referred to as the proportional odds, cumulative logit, or logistic cumulative link. This describes a function that transforms the model's internal result (the logit) into the desired output scale (for example, a value between 0 and 1).
Applying this form of link function results in defining the probability of level z as:
where the approach relies on a sorted vector λ of C−1 cutpoints to divide the logit scale representing difficulty into C levels according to ξz, and σ represents the sigmoid function. An item's most likely category, z, is determined on the logit scale by the cutpoints between which its difficulty falls. For example, for language tests using a proficiency scale such as the CEFR levels, there will be a cutpoint on the underlying logit line that separates CEFR level A1 from CEFR level A2, and another similarly separating CEFR level A2 from CEFR level B1. Any item whose difficulty parameter on the logit scale is in between these two cutpoints would be predicted to be CEFR level A2.
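Because the equation itself is not reproduced above, the sketch below assumes the standard cumulative-logit (proportional odds) form, p(z = c) = σ(λc − βj) − σ(λc−1 − βj) with λ0 = −∞ and λC = +∞; this assumed form is consistent with the description of sorted cutpoints and the sigmoid, but the exact expression in the disclosure may differ. The cutpoint values are hypothetical and for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def level_probabilities(difficulty, cutpoints):
    """Assumed cumulative-logit model: P(z <= c) = sigma(lambda_c - beta);
    P(z = c) is the difference between adjacent cumulative probabilities.
    cutpoints is a sorted list of C - 1 values on the logit scale."""
    cumulative = [sigmoid(c - difficulty) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for p in cumulative:
        probs.append(p - previous)
        previous = p
    return dict(zip(CEFR_LEVELS, probs))

def most_likely_level(difficulty, cutpoints):
    # Equivalent to reading off which pair of cutpoints the difficulty falls between.
    probs = level_probabilities(difficulty, cutpoints)
    return max(probs, key=probs.get)

cutpoints = [-2.0, -0.8, 0.3, 1.4, 2.5]   # hypothetical cutpoints dividing the logit scale
print(most_likely_level(0.0, cutpoints))  # 0.0 lies between the A2/B1 and B1/B2 cutpoints -> "B1"
```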
As described, in the case that item difficulties are modeled as a linear combination of item features, the approach computes the item or passage difficulty as βj=Σ[ωk.ϕk(j)], where the sum is from k=1 to K (where K is the number of features). The formula, βj=Σ[ωk.ϕk(j)], is an example of an Item Parameter Feature Function (IPFF). A machine learning model is trained to jointly learn (i.e., predict or infer) the passage-based item difficulty and the CEFR level cutpoints that contextualize them. As a result, the model can directly generate a value representing a test-taker's proficiency or an item's difficulty in terms of its CEFR level. This is because the model represents both the passage's ordinal CEFR level and the test-taker's proficiency on a common (logit) scale. The probability computation therefore depends on a vector, ω, of weights that govern the (relative) contribution of each item feature to the overall difficulty of the task. The weights are shared between the two prediction tasks (test taker test item response and CEFR level assigned to the test item by a subject-matter expert), so feature weights learned from the CEFR level prediction task can refine the item parameter estimates for the other task.
Because the trained model “learns” to predict test taker responses and the CEFR label using the same weighted combination of input features, the learned weighted combination can be aligned to both the CEFR scale and to an IRT difficulty scale, the latter of which can also be used to represent test taker ability. The disclosed approach relates CEFR level to a common scale by representing CEFR levels as a segmentation of the logit scale. This enables the expression of CEFR level, item difficulty, and test-taker proficiency on a common scale. Although the common scale is continuous, and CEFR is a discrete ordinal classification, the CEFR classes are aligned to the continuous scale, thereby allowing use of CEFR labels to better estimate item difficulties and to allow an interpretation of test taker abilities in terms of the CEFR scale.
For item featurization (to identify an item's features and generate the functions (ϕk)), an embodiment may use a text embedding representation such as the Bidirectional Encoder Representations from Transformers (BERT), which can implicitly represent the linguistic content of input text. For k an element of {1, 2, . . . , 768}, let ϕk(j) be defined as BERT(text(j))k, where the “text” function extracts the section of text for item j and the BERT function extracts the 768-dimensional embedding of the classification token (CLS) from the BERT network's output. In this sense, BERT computes a vector to represent a section of input text. In one embodiment, the disclosed approach “tunes” the parameters of a joint model Θ = (α, λ, ω) by maximum a posteriori (MAP) estimation to better predict both test taker success on a test item based on a section of text and the section's CEFR level, where α may be used to represent the test-taker proficiency estimates. As described, the approach uses a common logit scale segmented into CEFR levels to enable both the computation and a more effective comparison of a test-taker's ability and the CEFR level of a test item (which is indicative of the skill or ability expected to be needed to properly respond to the test item).
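A minimal sketch of joint MAP estimation in the spirit of the description above, assuming precomputed 768-dimensional CLS embeddings as the item features, a Bernoulli (Rasch-style) likelihood for test-taker responses, an assumed cumulative-logit likelihood for CEFR labels, and Gaussian priors expressed as L2 penalties. It uses PyTorch for automatic differentiation; the tensor shapes, prior scales, and training details are illustrative assumptions rather than the disclosure's exact procedure.

```python
import torch

D, S, C = 768, 100, 6                        # embedding size, exam sessions, CEFR levels (assumed sizes)
w = torch.zeros(D, requires_grad=True)       # shared IPFF weights (omega)
alpha = torch.zeros(S, requires_grad=True)   # test-taker proficiency estimates
lam = torch.nn.Parameter(torch.linspace(-2.0, 2.0, C - 1))  # cutpoints lambda, ordered by initialization

def response_nll(phi, session_idx, correct):
    """Negative log-likelihood of dichotomous responses: p = sigma(alpha_i - w . phi_j)."""
    beta = phi @ w
    logits = alpha[session_idx] - beta
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, correct.float(), reduction="sum")

def cefr_nll(phi, levels):
    """Assumed cumulative-logit likelihood for expert CEFR labels (integers 0..C-1)."""
    beta = phi @ w
    cum = torch.sigmoid(lam.unsqueeze(0) - beta.unsqueeze(1))          # (B, C-1): P(z <= c)
    upper = torch.cat([cum, torch.ones(beta.shape[0], 1)], dim=1)
    lower = torch.cat([torch.zeros(beta.shape[0], 1), cum], dim=1)
    probs = (upper - lower).clamp_min(1e-9)                            # P(z = c)
    return -torch.log(probs.gather(1, levels.unsqueeze(1))).sum()

def map_loss(resp_batch, cefr_batch, prior_scale=1.0):
    """Joint MAP objective: both likelihoods plus Gaussian (L2) priors on the parameters."""
    phi_r, sess, y = resp_batch
    phi_c, z = cefr_batch
    prior = prior_scale * (w.pow(2).sum() + alpha.pow(2).sum())
    return response_nll(phi_r, sess, y) + cefr_nll(phi_c, z) + prior

optimizer = torch.optim.Adam([w, alpha, lam], lr=0.01)
# Training loop (data loading omitted):
#   optimizer.zero_grad(); loss = map_loss(resp_batch, cefr_batch); loss.backward(); optimizer.step()
```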
Embodiments of the disclosed approach or methodology may implement one or more of the following steps or stages to produce a trained model that combines supervised learning of criterion-referenced item level (which uses expert-labeled annotation data to train the model) and item response theory (which uses observational response data of test takers to train the model). This allows the model to refine or interpret item parameter estimates by incorporating test-taker response data.
In some embodiments, this may be accomplished by implementing one or more of the following steps, stages, processes, operations, or functions:
Reporting or interpreting results in terms of criterion-referenced levels (e.g., the CEFR) as well as relative norm-referenced test-taker ability, because the latent output scale for the multi-task model is jointly learned; that is, each norm-referenced difficulty estimate (or test-taker ability estimate) has a corresponding criterion-referenced CEFR level as well.
In some embodiments, individual words within a passage that a test-taker must respond to may be modeled as individual items. For example, each damaged (altered) word within a c-test (a question format used to measure language proficiency) passage may be modeled as an individual item whose parameters are estimated in terms of its features using Item Parameter Feature Functions (IPFF). In such cases, word-level features, such as contextual word embeddings extracted via BERT models, may be used.
In some embodiments, multilingual language embeddings may be used as features to generalize estimates of item parameters across tests with test items in different languages. For example, if one has a large set of item responses to c-test items written in English, such an embodiment could be used to estimate the item parameters of c-test items written in Spanish, French, or other language supported by a multilingual embedding model.
In some embodiments, learnable item-level residual parameters that adjust the feature-based item parameter estimates for individual items may be included in a model. In such embodiments, the residual parameters allow the parameter estimates of a test item to be refined, independent of the item's features, as more test-taker response data is collected for that item. Such residual parameters may use Gaussian priors or another regularization method to avoid adjusting the parameter estimates too much until sufficient test-taker response data is collected.
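A brief sketch, under assumed notation, of the residual adjustment just described: the item parameter is the feature-based estimate plus a per-item residual, and a Gaussian prior on the residual is expressed as an L2 penalty that keeps it near zero until enough response data accumulates.

```python
import numpy as np

def item_difficulty(weights, item_features, residual):
    """beta_j = w . phi(j) + r_j, where r_j is a learnable per-item residual."""
    return float(np.dot(weights, item_features) + residual)

def residual_penalty(residuals, prior_variance=0.25):
    """Gaussian prior on the residuals, added to the training loss as an L2 term.
    A small prior variance discourages large residuals until sufficient
    test-taker response data supports moving away from the feature-based estimate."""
    return float(np.sum(np.square(residuals)) / (2.0 * prior_variance))
```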
In some embodiments, an IRT model with item parameters deconstructed into features via an Item Parameter Feature Function (IPFF) may be treated as a generalized non-linear mixed effects model (GNLMM). In such cases, the model parameters (such as IPFF weights and test-taker proficiency estimates) may be estimated using statistical methods germane to those kinds of models.
In some embodiments, fixed proxy estimates for test-taker proficiencies may be used when estimating item parameters as deconstructions of item features. Such proxy estimates may be derived from performance on items other than those whose parameters are being estimated by the model. In such cases, the proxy estimates may be used in place of the test-taker proficiencies that would normally be estimated jointly with the Item Parameter Feature Function (IPFF) weights. Alternatively, one may employ a Bayesian approach by using the proxy estimates as means for Gaussian priors on the test-taker proficiency estimates when estimating them jointly with item parameters.
Both sets of data are provided as inputs to a multi-task machine learning algorithm 106, which is used to generate a trained model. Given appropriate features (e.g., BERT embeddings or other linguistic indexes, in the case of language testing embodiments), the output of the trained model may be represented as a projection 108 combining the level annotations (as defined by a framework or rubric, such as the CEFR) with test-taker proficiency αi and item difficulty βj (both of which are inferred from operational test administration data) along the same ordered scale. In one sense, the output of the trained model represents how a test-taker's response relates to a value in a set of ordered indications of competency.
As non-limiting examples, Model Definition phase 150 may comprise processes, functions, or operations including:
As non-limiting examples, Machine Learning Model Training phase 160 may comprise processes, functions, or operations including:
As non-limiting examples, the Test Administration phase 170 may comprise processes, functions, or operations including:
In general, an embodiment may be implemented using a set of software instructions that are designed to be executed by a suitably programmed processing element (such as a GPU, CPU, TPU, state machine, microprocessor, processor, co-processor, or controller, as non-limiting examples). In a complex application or system, such instructions are typically arranged into “modules”, with each such module typically performing a specific task, process, function, or operation. The entire set of modules may be controlled or coordinated in their operation by an operating system (OS) or other form of organizational platform.
Each application module or submodule may correspond to a particular function, method, process, or operation that is implemented by the module or submodule. Such function, method, process, or operation may include those used to implement one or more aspects of the described systems and methods.
The application modules and/or submodules may include a suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, co-processor, or CPU, as examples), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language.
Modules may contain one or more sets of instructions for performing a method or function described with reference to the Figures, and the descriptions or disclosure of the functions and operations provided in the specification. These modules may include those illustrated but may also include a greater number or fewer number than those illustrated. As mentioned, each module may contain a set of computer-executable instructions. The set of instructions may be executed by a programmed processor contained in a server, client device, network element, system, platform, or other component.
A module may contain instructions that are executed by a processor contained in more than one of a server, client device, network element, system, platform or other component. Thus, in some embodiments, a plurality of electronic processors, with each being part of a separate device, server, or system may be responsible for executing all or a portion of the software instructions contained in an illustrated module. Thus, although
As shown in
In some embodiments, the modules may comprise computer-executable software instructions that when executed by one or more electronic processors or co-processors cause the processors or co-processors (or a system or apparatus containing the processors or co-processors) to perform one or more of the steps or stages of:
In some embodiments, the trained machine learning model may be used to evaluate a user's proficiency or skill level based on the user's response(s) to an item or set of items and the relative difficulty of those items. This approach may also be used as part of a decision regarding the contents of an item, the expected performance of a set of users when asked to respond to an item, and the performance of a user compared to others asked to respond to the different items.
In addition to the specific use case or application described herein (evaluating the proficiency of a test-taker), embodiments of the approach and methodology described herein may be used in one or more of the following contexts:
In some embodiments, the functionality and services provided by the system and methods described herein may be made available to multiple users by accessing an account maintained by a server or service platform. Such a server or service platform may be termed a form of Software-as-a-Service (SaaS).
In some embodiments, the system or service(s) described herein may be implemented as micro-services, processes, workflows, or functions performed in response to a user request. The micro-services, processes, workflows, or functions may be performed by a server, data processing element, platform, or system. In some embodiments, the services may be provided by a service platform located “in the cloud”. In such embodiments, the platform is accessible through APIs and SDKs. The described model development and test-taker evaluation processing and services may be provided as micro-services within the platform for each of multiple users or companies. The interfaces to the micro-services may be defined by REST and GraphQL endpoints. An administrative console may allow users or an administrator to securely access the underlying request and response data, manage accounts and access, and in some cases, modify the processing workflow or configuration.
Note that although
Although in some embodiments, a platform or system of the type illustrated in
System 310, which may be hosted by a third party, may include a set of services 312 and a web interface server 314, coupled as shown in
In some embodiments, the set of services or applications available to a company or user may include one or more that perform the functions and methods described herein with reference to the enclosed figures. As examples, in some embodiments, the set of applications, functions, operations or services made available through the platform or system 310 may include:
The platform or system shown in
The distributed computing service/platform (which may also be referred to as a multi-tenant data processing platform) 408 may include multiple processing tiers, including a user interface tier 416, an application server tier 420, and a data storage tier 424. The user interface tier 416 may maintain multiple user interfaces 417, including graphical user interfaces and/or web-based interfaces. The user interfaces may include a default user interface for the service to provide access to applications and data for a user or “tenant” of the service (depicted as “Service UI” in the figure), as well as one or more user interfaces that have been specialized/customized in accordance with user specific requirements (e.g., represented by “Tenant A UI”, . . . , “Tenant Z UI” in the figure, and which may be accessed via one or more APIs).
The default user interface may include user interface components enabling a tenant to administer the tenant's access to and use of the functions and capabilities provided by the service platform. This may include accessing tenant data, launching an instantiation of a specific application, causing the execution of specific data processing operations, etc.
Each application server 422 or processing tier 420 shown in the figure may be implemented with a set of computers and/or components including computer servers and processors, and may perform various functions, methods, processes, or operations as determined by the execution of a software application or set of instructions. The data storage tier 424 may include one or more data stores, which may include a Service Data store 425 and one or more Tenant Data stores 426. Data stores may be implemented with any suitable data storage technology, including structured query language (SQL) based relational database management systems (RDBMS).
Service Platform 408 may be multi-tenant and may be operated by an entity in order to provide multiple tenants with a set of business-related or other data processing applications, data storage, and functionality. For example, the applications and functionality may include providing web-based access to the functionality used by a business to provide services to end-users, thereby allowing a user with a browser and an Internet or intranet connection to view, enter, process, or modify certain types of information. Such functions or applications are typically implemented by one or more modules of software code/instructions that are maintained on and executed by one or more servers 422 that are part of the platform's Application Server Tier 420. As noted with regards to
As mentioned, rather than build and maintain such a platform or system themselves, a business may utilize systems provided by a third party. A third party may implement a business system/platform as described above in the context of a multi-tenant platform, where individual instantiations of a business' data processing workflow (such as the data processing and model training described herein) are provided to users, with each company/business representing a tenant of the platform. One advantage to such multi-tenant platforms is the ability for each tenant to customize their instantiation of the data processing workflow to that tenant's specific business needs or operational methods. Each tenant may be a business or entity that uses the multi-tenant platform to provide business services and functionality to multiple users.
As noted,
For example, users may interact with user interface elements to access functionality and/or data provided by application and/or data storage layers of the example architecture. Examples of graphical user interface elements include buttons, menus, checkboxes, drop-down lists, scrollbars, sliders, spinners, text boxes, icons, labels, progress bars, status bars, toolbars, windows, hyperlinks, and dialog boxes. Application programming interfaces may be local or remote and may include interface elements such as parameterized procedure calls, programmatic objects, and messaging protocols.
The application layer 510 may include one or more application modules 511, each having one or more submodules 512. Each application module 511 or submodule 512 may correspond to a function, method, process, or operation that is implemented by the module or submodule (e.g., a function or process related to providing data processing and services to a user of the platform). Such function, method, process, or operation may include those used to implement one or more aspects of the inventive system and methods, such as for one or more of the processes or functions disclosed herein and described with reference to the Figures:
The application modules and/or submodules may include any suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language. Each application server (e.g., as represented by element 422 of
The data storage layer 520 may include one or more data objects 522 each having one or more data object components 521, such as attributes and/or behaviors. For example, the data objects may correspond to tables of a relational database, and the data object components may correspond to columns or fields of such tables. Alternatively, or in addition, the data objects may correspond to data records having fields and associated services. Alternatively, or in addition, the data objects may correspond to persistent instances of programmatic data objects, such as structures and classes. Each data store in the data storage layer may include each data object. Alternatively, different data stores may include different sets of data objects. Such sets may be disjoint or overlapping.
Note that the example computing environments depicted in
This disclosure includes the following embodiments or clauses:
1. A method of estimating one or more test item parameters for an item response theory model used to evaluate a person's performance on a set of test items, comprising:
2. The method of clause 1, further comprising:
3. The method of clause 2, wherein the test items are used in a language proficiency test.
4. The method of clause 2, wherein each of the plurality of test items after a first test item is selected based on the test-taker's proficiency estimate derived from the test-taker's graded responses to the previously provided items.
5. The method of clause 1, wherein the subject-matter expert's annotation of the test item is a criterion-referenced level of test-taker proficiency needed to correctly answer the test item and the common scale is both criterion-referenced and norm-referenced.
6. The method of clause 5, wherein the criterion-referenced level is Common European Framework of Reference for Languages (CEFR).
7. The method of clause 1, wherein the test item parameters comprise one or more of test item difficulty, test item discrimination, or chance.
8. The method of clause 1, wherein one or more of the test item features are derived from a language embedding model.
9. A method of estimating one or more test item parameters for an item response theory model used to evaluate a person's performance on a set of test items, comprising:
10. The method of clause 9, wherein the test item features are expressed as multilingual language embeddings.
11. The method of clause 9, wherein the test item parameters comprise one or more of test item difficulty, test item discrimination, or chance.
12. The method of clause 9, wherein the test items are c-test items.
13. A system for estimating one or more test item parameters for an item response theory model used to evaluate a person's performance on a set of test items, comprising:
14. The system of clause 13, wherein the computer-executable instructions further comprise instructions that cause the one or more electronic processors to:
15. The system of clause 13, wherein the subject-matter expert's annotation of the test item is a criterion-referenced level of test-taker proficiency needed to correctly answer the test item and the common scale is both criterion-referenced and norm-referenced.
16. The system of clause 15, wherein the criterion-referenced level is Common European Framework of Reference for Languages (CEFR).
17. The system of clause 13, wherein the test item parameters comprise one or more of test item difficulty, test item discrimination, or chance.
18. The system of clause 13, wherein one or more of the test item features are derived from a language embedding model.
19. The system of clause 13, wherein the test items are used in a language proficiency test.
20. The system of clause 14, wherein each of the plurality of test items after a first test item is selected based on the test-taker's proficiency estimate derived from the test-taker's graded responses to the previously provided items.
Embodiments of the disclosure may be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will recognize other ways and/or methods to implement an embodiment using hardware, software, or a combination of hardware and software.
In some embodiments, certain of the methods, models, processes, or functions disclosed herein may be embodied in the form of a trained neural network or other form of model derived from a machine learning algorithm. The neural network or model may be implemented by the execution of a set of computer-executable instructions and/or represented as a data structure. The instructions may be stored in (or on) a non-transitory computer-readable medium and executed by a programmed processor or processing element. The set of instructions may be conveyed to a user through a transfer of instructions or an application that executes a set of instructions over a network (e.g., the Internet). The set of instructions or an application may be utilized by an end-user through access to a SaaS platform, self-hosted software, on-premise software, or a service provided through a remote platform.
In general terms, a neural network may be viewed as a system of interconnected artificial “neurons” or nodes that exchange messages between each other. The connections have numeric weights that are “tuned” during a training process, so that a properly trained network will respond correctly when presented with an image, pattern, or set of data. In this characterization, the network consists of multiple layers of feature-detecting “neurons”, where each layer has neurons that respond to different combinations of inputs from the previous layers.
Training of a network is performed using a “labeled” dataset of inputs, that is, an assortment of representative input patterns (or datasets) associated with their intended output responses. Training uses general-purpose methods to iteratively determine the weights for intermediate and final feature neurons. In terms of a computational model, each neuron calculates the dot product of its inputs and weights, adds a bias, and applies a non-linear trigger or activation function (for example, a sigmoid response function).
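The following one-function sketch illustrates the neuron computation described above (dot product of inputs and weights, plus a bias, passed through a sigmoid activation); the numeric values are arbitrary.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Single artificial neuron: sigmoid(inputs . weights + bias)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(inputs, weights) + bias)))

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.3, 0.8, -0.1]), bias=0.1))  # ~0.32
```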
Machine learning (ML) is used to analyze data and assist in making decisions in multiple industries. To benefit from using machine learning, a machine learning algorithm is applied to a set of training data and labels to generate a “model”, which represents what the application of the algorithm has “learned” from the training data. Each element (or example) of the set of training data, in the form of one or more parameters, variables, characteristics, or “features”, is associated with a label or annotation that defines how the element should be classified by the trained model. A machine learning model can predict or infer an outcome based on the training data and labels and be used as part of a decision process. When trained, the model will operate on a new element of input data to generate the correct label or classification as an output.
Any of the software components, processes or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as Python, Java, JavaScript, C++, or Perl using procedural, functional, object-oriented, or other techniques. The software code may be stored as a series of instructions, or commands in (or on) a non-transitory computer-readable medium, such as a random-access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive, or an optical medium such as a CD-ROM. In this context, a non-transitory computer-readable medium is almost any medium suitable for the storage of data or an instruction set aside from a transitory waveform. Any such computer readable medium may reside on or within a single computational apparatus and may be present on or within different computational apparatuses within a system or network.
According to one example implementation, the term processing element or processor, as used herein, may be a central processing unit (CPU), or conceptualized as a CPU (such as a virtual machine). In this example implementation, the CPU or a device in which the CPU is incorporated may be coupled, connected, and/or in communication with one or more peripheral devices, such as a display. In another example implementation, the processing element or processor may be incorporated into a mobile computing device, such as a smartphone or tablet computer.
The non-transitory computer-readable storage medium referred to herein may include a number of physical drive units, such as a redundant array of independent disks (RAID), a flash memory, a USB flash drive, an external hard disk drive, a thumb drive, pen drive, or key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, a Holographic Digital Data Storage (HDDS) optical disc drive, a synchronous dynamic random-access memory (SDRAM), or similar devices or other forms of memories based on similar technologies. Such computer-readable storage media allow the processing element or processor to access computer-executable process steps, application programs, and the like, stored on removable and non-removable memory media, to off-load data from a device, or to upload data to a device. As mentioned, with regards to the embodiments described herein, a non-transitory computer-readable medium may include almost any structure, technology, or method apart from a transitory waveform or similar medium.
Certain implementations of the disclosed technology are described herein with reference to block diagrams of systems, and/or to flowcharts or flow diagrams of functions, operations, processes, or methods. It will be understood that one or more blocks of the block diagrams, or one or more stages or steps of the flowcharts or flow diagrams, and combinations of blocks in the block diagrams and stages or steps of the flowcharts or flow diagrams, respectively, may be implemented by computer-executable program instructions. Note that in some embodiments, one or more of the blocks, or stages or steps may not necessarily need to be performed in the order presented or may not necessarily need to be performed at all.
These computer-executable program instructions may be loaded onto a general-purpose computer, a special purpose computer, a processor, or other programmable data processing apparatus to produce a specific example of a machine, such that the instructions that are executed by the computer, processor, or other programmable data processing apparatus create means for implementing one or more of the functions, operations, processes, or methods described herein. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable data processing apparatus to function in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement one or more of the functions, operations, processes, or methods described herein.
While certain implementations of the disclosed technology have been described in connection with what is presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not to be limited to the disclosed implementations. Instead, the disclosed implementations are intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This written description uses examples to disclose certain implementations of the disclosed technology, and to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural and/or functional elements that do not differ from the literal language of the claims, or if they include structural and/or functional elements with insubstantial differences from the literal language of the claims.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and/or were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and similar referents in the specification and in the following claims are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “having,” “including,” “containing” and similar referents in the specification and in the following claims are to be construed as open-ended terms (e.g., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value inclusively falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein may be performed in any suitable order unless otherwise indicated herein or clearly contradicted by context. The use of all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation to the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to each embodiment of the present invention.
As used herein (i.e., the claims, figures, and specification), the term “or” is used inclusively to refer to items in the alternative and in combination.
Different arrangements of the components depicted in the drawings or described above, as well as components and steps not shown or described are possible. Similarly, some features and sub-combinations are useful and may be employed without reference to other features and sub-combinations. Embodiments of the invention have been described for illustrative and not restrictive purposes, and alternative embodiments will become apparent to readers of this patent. Accordingly, the present invention is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications may be made without departing from the scope of the claims below.
This application claims the benefit of U.S. Provisional Application No. 63/246,125, filed Sep. 20, 2021, and titled “System and Methods for Educational and Psychological Modelling and Assessment”, the disclosure of which is incorporated in its entirety by this reference.
Number | Date | Country
--- | --- | ---
63/246,125 | Sep. 20, 2021 | US