The technology described herein relates generally to test generation and more specifically to generation of adaptive tests.
Accountability theory is a rational approach to improving the educational status of a nation. Accountability theory includes a set of goals the educational system wishes to achieve, a set of measures to assess how well those goals are met, a feedback loop for forwarding information to decision makers, such as teachers and administrators, based on those measures, and a systemic change mechanism for acting on the feedback and changing the system as necessary to achieve the goals.
Recent legislation has established a goal of achieving high levels of proficiency in a number of subject areas. Progress toward that goal is assessed every year at consecutive grade levels. Content standards define what an examinee should know, and achievement standards define how much an examinee should know. Tests (“exams”) are designed to determine how well an examinee measures up to these standards, and examinees are categorized according to their performance on the designed tests. The present inventors have observed a need for improving testing and assessment of examinees through better adaptive testing.
Systems and methods are provided for assigning an examinee to one of a plurality of scoring levels based on an adaptive exam that generates one or more questions of the exam subsequent to the start of administration of the exam to the examinee. A first exam question may be provided to the examinee and a first exam answer is received from the examinee. The first exam question may request a constructed response from the examinee. A score for the first exam answer may be generated, and a second exam question may be generated, where the difficulty of the second exam question is based on the score for the first exam answer. The examinee may be assigned to one of a plurality of scoring levels, where the examinee is excluded from assignment to one or more of the plurality of scoring levels based on the first exam answer without consideration of the second exam answer.
As another example, a computer-implemented method of assigning an examinee to one of a plurality of scoring levels based on an adaptive exam that generates one or more questions of the exam subsequent to the start of administration of the exam to the examinee may include providing a first exam question to the examinee and receiving a first exam answer from the examinee, where the first exam question requests a constructed response from the examinee and where the first exam answer is a constructed response. A score for the first exam answer may be generated, and a second exam question may be generated subsequent to receiving the first exam answer, where a difficulty of the second exam question is based on the score for the first exam answer. The second exam question may be provided to the examinee, a second exam answer may be received from the examinee, and a score may be generated for the second exam answer. The examinee may then be assigned to one of the plurality of scoring levels based on the score for the first exam answer and the score for the second exam answer.
As another example, a computer-implemented system of assigning an examinee to one of a plurality of scoring levels based on an adaptive exam that generates one or more questions of the exam subsequent to the start of administration of the exam to the examinee may include a processor and a computer-readable memory encoded with instructions for commanding the processor to execute steps of a method that includes providing a first exam question to the examinee and receiving a first exam answer from the examinee, where the first exam question requests a constructed response from the examinee and where the first exam answer is a constructed response. A score for the first exam answer may be generated, and a second exam question may be generated subsequent to receiving the first exam answer, where a difficulty of the second exam question is based on the score for the first exam answer. The second exam question may be provided to the examinee, a second exam answer may be received from the examinee, and a score may be generated for the second exam answer. The examinee may then be assigned to one of the plurality of scoring levels based on the score for the first exam answer and the score for the second exam answer.
As a further example, a computer-readable memory may be encoded with instructions for commanding a processor to execute steps of a method that includes providing a first exam question to an examinee and receiving a first exam answer from the examinee, where the first exam question requests a constructed response from the examinee and where the first exam answer is a constructed response. A score for the first exam answer may be generated, and a second exam question may be generated subsequent to receiving the first exam answer, where a difficulty of the second exam question is based on the score for the first exam answer. The second exam question may be provided to the examinee, a second exam answer may be received from the examinee, and a score may be generated for the second exam answer. The examinee may then be assigned to one of the plurality of scoring levels based on the score for the first exam answer and the score for the second exam answer.
By virtue of incorporating such procedures, methods, and concepts, the resulting test may have psychometric properties that are difficult to duplicate otherwise. That is, the test design can be optimal or near-optimal in a psychometric sense, i.e., in the sense that scores based on that design have desirable, specified attributes, such as that the conditional standard error of measurement be held at a certain value or that the assignment to levels of achievement reach a specified level of decision consistency compared to other possible designs.
Traditionally, cutscores are determined after a test has been designed. However, by failing to explicitly incorporate cutscores into the design of which items comprise the test, the opportunity to design an optimal test, given the available set of items, is lost. Given a database of previously calibrated items and a set of cutscores, which may be determined by any of a variety of methods, optimization theory can be applied to select the items that would yield scores with the desired optimal characteristics. The same design approach can be used when using item models in place of or in conjunction with pre-generated test items. An item model is a general procedure for generating items with specified psychometric characteristics. Traditionally, item generation during a test would be discouraged because conventional wisdom dictates that items should be pre-tested prior to administration to estimate those items' psychometric characteristics. Once an item model has been pre-tested, however, the present inventors have observed that the item model can be used to generate items that have known psychometric attributes without pre-testing each generated item.
Item models can be constructed to generate multiple-choice items or constructed-response items. In the case of multiple-choice items, the scoring may be accomplished using a lookup table. Adaptive tests have traditionally been limited to multiple-choice or true/false questions, as responses to those types of questions can be quickly and accurately scored. According to approaches described herein, adaptive tests can also be generated to include questions requiring constructed responses. An exemplary adaptive test generator 104 may generate multiple-choice test items and/or test items requesting a constructed response to be administered to an examinee. A question requesting a constructed response requires more than a single number or character response; for instance, it may call for a free-form response such as a written or spoken phrase, sentence, or paragraph. In the case of a constructed response, scoring has traditionally been done by human scorers. An example adaptive test generator 104 may perform automated scoring of constructed responses by utilizing a scoring engine in the form of a software module implementing suitable scoring approaches such as those described elsewhere herein. The scoring engine may return a score in near real time because the score for each item in an adaptive test needs to be known to make such adaptations possible.
The users 102 can interact with the adaptive test generator 104 through a number of ways, such as over one or more networks 108. Server(s) 106 accessible through the network(s) 108 can host the adaptive test generator 104. One or more data stores 110 can store the data to be analyzed by the adaptive test generator 104 as well as any intermediate or final data generated by the adaptive test generator 104. The one or more data stores 110 may contain many different types of data associated with the process, including pre-generated exam questions 112, item models 114, as well as other data. The adaptive test generator 104 can be an integrated web-based reporting and analysis tool that provides users flexibility and functionality for generating and administering an adaptive test. It should be understood that the adaptive test generator 104 could also be provided on a stand-alone computer for access by a user 102.
For an examinee routed to the less than proficient branch 406, an easier test 410, E, is administered during the second stage 412. The easier second stage test 410 is optimized to determine whether an examinee is at the basic or below-basic level. Based on the one or more questions of the easier second stage test 410, the examinee is further classified at 416. The further classification at 416 may provide a final assignment of the examinee to one of a plurality of scoring levels or bins based on the examinee's performance at the first stage 402 and the second stage 412. Alternatively, as depicted in
For an examinee routed to the proficient or above branch 408, a harder test 414, H, is administered during the second stage 412. The harder second stage test 414 is optimized to determine whether an examinee is proficient or advanced. Based on the one or more questions of the harder second stage test 414, the examinee is further classified at 424. The further classification at 424 may provide a final assignment of the examinee to one of a plurality of scoring levels or bins based on the examinee's performance at the first stage 402 and the second stage 412. Alternatively, as depicted in
Multi-stage adaptive testing, consisting of fixed or variable length blocks, as described further herein, is a specific variety of adaptive testing that selects one item or set of items at a time and administers short forms at a level of difficulty based on the student's previous performance. When the goal of an exam is to divide students into a plurality of levels, a reasonable logic in constructing an adaptive test is to take those levels into account to maximize the consistency of proficient classifications. R, the routing test, therefore needs to be designed with that in mind. Ideally, the other classifications are equally consistent, but not at the expense of consistency of the proficient classification. A variable-length approach may also be utilized.
Traditionally, the cutscores that define the achievement levels are not known ahead of time and, therefore, cutscores are typically not seen as design factors. However, the present inventors have determined that the cutscores or close approximations of those cutscores can be established as operational parameters at the design stage to ensure that the assessment eventually produced is optimal for the task at hand, e.g., classifying students into achievement levels based on the assessment policy. The cutscores can be determined during the design process or by a preliminary administration, for example. Having the cutscores for the various achievement levels defined during the development process can ensure “an adequate exercise pool” because the psychometric attributes of the items to be produced are known prior to the start of the item production process, which translates to ensuring accurate and consistent classifications.
A further consideration is the item format. The assessment described in
Another consideration is the nature of the decision rule for assigning second stage tests. As noted above, the routing test may be optimized to classify students into proficient-or-above and below-proficient levels. One approach to implementing a routing decision is to estimate ability or proficiency based on the statistical attributes of the items responded to at the end of the routing test and assign form H if the estimate exceeds the cutscore for proficient. Alternatively, a sumscore could be computed, which can be as effective as more complex routing rules that utilize the estimation of ability.
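One way such a routing rule could be expressed is sketched below. The function name, score values, and cut values are hypothetical and are not drawn from the source; the sketch simply routes an examinee to the harder or easier second-stage form using either a sumscore or an IRT ability estimate compared against the proficient cutscore.

```python
def route_second_stage(item_scores, use_sumscore=True, sumscore_cut=12,
                       theta_cut=0.0, estimate_theta=None):
    """Route an examinee to the harder (H) or easier (E) second-stage form."""
    if use_sumscore:
        # Simple sumscore routing: total the routing-test item scores and
        # compare against a cut value on the raw-score scale.
        routed_high = sum(item_scores) >= sumscore_cut
    else:
        # Ability-estimate routing: compare an IRT theta estimate against the
        # proficient cutscore expressed on the theta metric.
        routed_high = estimate_theta(item_scores) >= theta_cut
    return "H" if routed_high else "E"

# Example: an examinee whose routing-test item scores sum to 14 is routed to form H.
print(route_second_stage([3, 4, 3, 4]))
```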
The adaptive test generator 502 may perform a variety of functions. For example, the adaptive test generator may provide and/or generate exam questions, as depicted at 516. The exam questions utilized by the adaptive test generator may be pre-generated exam questions 510, such as those stored in the one or more data stores 508, as well as items generated during the administration of an exam (on-the-fly).
Item generation can be a straightforward process of inserting values into variables of an item model or may be more complex. Item generation can include the production of items by algorithmic means in such a way that the psychometric attributes (e.g., difficulty and discriminating power) of the generated items are predictable rather than simply being mass produced with unknown psychometric attributes. Items that have similar psychometric attributes are referred to as isomorphs. Another type of generated item, variants, differs predictably in some respect, such as difficulty. The distinction between isomorphs and variants is one of convenience, in that it is possible to conceive of an item generation process that encompasses both cases; holding psychometric attributes constant, for example, is the special case in which the “variants” are isomorphs.
Approaches for generating a large number of items algorithmically, such as those described below, improve efficiency and cost effectiveness. Such algorithms should be capable of rendering items that include graphics, which are notoriously expensive to produce by conventional means. The availability of items that appear different to the examinee but have similar psychometric attributes is beneficial to test security because it is less feasible for examinees to anticipate the content of the test. This approach, in turn, makes it possible to create comparable forms and to administer effectively distinct and yet comparable forms for each individual test taker.
The dependability of student-level classifications can be reduced to the extent there is lack of isomorphicity because lack of isomorphicity becomes part of the definition of error of measurement. From generalizability theory it is known that if the objective of the assessment is to rank students, generalizability is given by the following coefficient.
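In its standard form, with $\sigma^2(\tau)$ denoting universe-score (true) variance (this notation is assumed here for illustration), the coefficient is

$$E\rho_1^2 = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)}$$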
where $\sigma^2(\delta)$ is composed of a subset of the sources of error variability. By contrast, when the measurement goal is to make categorical or absolute decisions, such as classifying students into achievement levels, dependability is given by the following coefficient.
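Again in standard form, with the same assumed notation but with the absolute error variance $\sigma^2(\Delta)$ in place of $\sigma^2(\delta)$,

$$E\rho_2^2 = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)}$$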
where $\sigma^2(\Delta)$ includes all sources of error variability, including lack of isomorphicity. In that case, $\sigma^2(\Delta) > \sigma^2(\delta)$ and, therefore, $E\rho_1^2 \geq E\rho_2^2$ to the extent there is lack of isomorphicity.
Similarly, from item response theory (IRT) it is known that lack of isomorphicity is tantamount to the case where the same item has multiple item characteristic curves (ICCs), one for each instance of an item model. Since it is not known ahead of time which instance will be presented, the expectation of the ICCs is one representation of the multiple ICCs that could be used as a parameterization of the item model. Expected response functions can be used for that purpose. To the extent the ICCs for different instances differ in difficulty but have the same discriminating power (slope), the discriminating power of the expected response function will be less than the discriminating power of the individual instances. When estimating ability, the conditional standard error of measurement will be larger as a result of the increased uncertainty. In short, lack of isomorphicity has a price, namely reducing the certainty of estimates of test performance, whether viewed from a generalizability or an IRT perspective.
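The reduction in discriminating power can be seen numerically. The following sketch, with item parameter values assumed purely for illustration, averages 2PL instance ICCs that share a slope but differ in difficulty into an expected response function and compares the maximum slopes.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 401)
a = 1.2                          # common discrimination across instances (assumed value)
b_instances = [-0.5, 0.0, 0.5]   # instance difficulties vary around zero (assumed values)

# The expected response function is the average of the instance ICCs.
erf = np.mean([icc_2pl(theta, a, b) for b in b_instances], axis=0)

# The expected response function is flatter than any single instance,
# i.e., it has lower discriminating power.
max_slope_instance = np.max(np.gradient(icc_2pl(theta, a, 0.0), theta))
max_slope_erf = np.max(np.gradient(erf, theta))
print(max_slope_instance, max_slope_erf)  # the second value is smaller
```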
One effective mechanism for generating isomorphic items algorithmically and on-the-fly is an item model. Item models are oriented to producing instances that are isomorphic. Instances are the actual items presented to test takers. Item models can be embedded in an on-the-fly adaptive testing system so that the items are produced from item models at run time. Existing items may become the basis for construction of item models. The adaptive test generator 502 may instantiate items from an item model as needed during the adaptive item selection process.
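As a minimal sketch of the idea, an item model can be represented as a stem template with variable slots whose values are drawn from constrained ranges at run time. The template, value ranges, and scoring key below are hypothetical and are chosen only for illustration.

```python
import random

# Hypothetical item model: a stem template with variable slots and value ranges
# constrained so that generated instances remain psychometrically similar.
ITEM_MODEL = {
    "stem": "A train travels {d} miles in {t} hours. What is its average speed in miles per hour?",
    "variables": {"d": range(120, 241, 12), "t": [2, 3, 4]},
    "key": lambda values: values["d"] // values["t"],
}

def instantiate(model, rng=random):
    """Produce one instance (the item actually shown to a test taker) and its scoring key."""
    values = {name: rng.choice(list(domain)) for name, domain in model["variables"].items()}
    return model["stem"].format(**values), model["key"](values)

stem, key = instantiate(ITEM_MODEL)
print(stem)
print("key:", key)
```

Constraining the variable ranges, as in the sketch, is what keeps instances close to isomorphic; widening those ranges would move the model toward producing variants.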
When the goal is to produce isomorphs, a key step in the development process is verifying that the item model produces instances that are sufficiently psychometrically isomorphic and yet appear to be distinct items. In one example, the 3-PL item parameter estimates typically used in admission tests are obtained from experimental sections of the test devoted to pre-testing. The resulting parameter estimates are then used in the adaptive test. Those parameters may be attenuated by means of expected response functions. Fitting an expected response function to instances of an item model acknowledges the variability in the true parameters of the instances. This has the effect of attenuating or reducing the discriminating power of the expected response function as a function of the variability of the instances.
Item model development may begin by conducting a construct analysis by inspecting groups of items that measure similar skills. A set of source items is ultimately selected. Item models may be broadly or narrowly defined. Broadly-defined item models may be deliberately designed to generate instances that vary with respect to their surface characteristics, their psychometric characteristics, or both. Narrowly-defined item models are designed to generate instances that are isomorphic. Isomorphic instances vary with respect to their surface features, but they share a common mathematical structure and similar psychometric characteristics.
With suitable design of the item models it is possible to generate sufficiently exchangeable or isomorphic items. A natural extension of this idea is the form model. A form model as referred to herein is an array of item models not unlike a test blueprint. However, a form model may go beyond a test blueprint in that a set of item models, rather than more general specifications, may define the form model. Forms generated from a form model may be parallel to the extent that the item models that comprise the form model can be written to generate sufficiently isomorphic instances or items. That is, the forms produced from a form model do not have to be explicitly equated because by design the scores from different forms are comparable. Extending that reasoning to the adaptive test depicted in
One design issue for tests intended to classify students is where to “peak” the information function. That is, where to concentrate the discriminating power of the test, given the goal of classifying students as consistently and accurately as possible rather than obtaining a point estimate of their ability. Peaking the information at the cutscore leads to more consistent classification than peaking it at the mean of the population.
Application of optimization theory utilizes explication of a design space, an objective function, and a set of constraints to formulate a form model. A design space as referred to herein is the array of candidate item models and information about each item model, and can be represented as a matrix. The columns of the design space are attributes of the item models that are thought to be relevant to determining the objective function and satisfying the stated constraints; there is a column for each task attribute that will be considered in the design. The optimization design problem is finding a subset of the rows of the design space, each row corresponding to an item model or an item, that meets a prescribed decision consistency level. The objective function is a means for navigating the search space (i.e., the rows of the design space). Many, if not most, of the possible designs are infeasible because they violate design constraints, such as exceeding some pre-specified maximum length. In principle, the objective function can be applied to each possible design based on a given design space to identify ideal candidate solutions. In practice, the space of possible solutions is too large to search exhaustively, and optimization methods may be used instead to solve such problems.
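For example, one plausible objective function, sketched here for illustration (the exact functional form used in a given design may differ), minimizes the total discrepancy between target and actual content proportions over the $C$ content categories of the test blueprint,

$$\min \sum_{c=1}^{C} \left| P_c - p_c \right|$$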
where $P_c$ refers to the target proportion and $p_c$ refers to the actual proportion in a candidate form.
Maximizing construct representation by minimizing the discrepancies of the content against the target is desirable, but restrictions may be needed to obtain an operationally feasible form model. Such restrictions, or constraints, could include desirable characteristics of the distribution of task models, the maximum testing time, and co-occurrence constraints where certain task models may appear or not appear with each other. For illustration purposes, attention is limited to the time demands and the information function at the cutscores.
To express that a form should not be longer than a class period of 50 minutes, for example, the following constraint, $F_T$, can be defined for a form consisting of $J$ item models.
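A sketch of such a time constraint, with $t_j$ denoting the expected response time in minutes for the $j$-th item model (notation assumed here for illustration), is

$$F_T:\quad \sum_{j=1}^{J} t_j \leq 50$$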
For purposes of illustration, it can be assumed that three cutscores have been defined at values of $\theta = -1$, $0$, and $1$. It can also be assumed that the information function values at those values of ability are known from suitably calibrated item parameters. The information function for the polytomous items can be based on the generalized partial credit model.
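Under that model, the standard item information expression (with $P_{jk}(\theta)$ denoting the probability of a response in category $k$ of item $j$, $m_j$ the highest score category, and $a_j$ the discrimination parameter; this notation is supplied here for illustration) is

$$I_j(\theta) = a_j^2\left[\sum_{k=0}^{m_j} k^2\, P_{jk}(\theta) - \left(\sum_{k=0}^{m_j} k\, P_{jk}(\theta)\right)^{2}\right]$$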
The information factors can be coded as separate constraints, one per cutscore.
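A sketch of such constraints, using an assumed target information value $T_c$ at each cutscore $\theta_c$, is

$$\sum_{j=1}^{J} I_j(\theta_c) \geq T_c, \qquad \theta_c \in \{-1, 0, 1\}$$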
A rationale for the choice of information values is to control a level of decision consistency. An IRT approach may be desirable for estimating the proportion of misclassified students between two adjacent classifications, given a set of item parameter estimates, a cutscore expressed on the theta metric, and the conditional standard error of measurement at the cutscore. Given item parameter estimates, the conditional standard error of measurement at a cutscore $\theta_c$ can be obtained from the test information function.
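In standard IRT form, with $I(\theta_c)$ denoting the test information evaluated at the cutscore,

$$\mathrm{CSEM}(\theta_c) = \frac{1}{\sqrt{I(\theta_c)}}$$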
where $\theta_c$ is the value of $\theta$ corresponding to a cutscore. That is, by specifying a design that meets the information targets, the corresponding conditional standard error of measurement is also specified, which in turn yields the desired level of decision consistency.
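As a concrete illustration of the design-space search described above, the following sketch selects a subset of candidate item models that satisfies the time and cutscore-information constraints while minimizing the content-proportion discrepancy objective. The design-space values, content targets, and brute-force search are hypothetical; an operational system would more typically use mixed-integer programming or a heuristic search, as is common in automated test assembly.

```python
from itertools import combinations

# Hypothetical design space: each row is a candidate item model with its
# content category, expected time (minutes), and information at each cutscore.
design_space = [
    {"id": 1, "content": "algebra",  "time": 6, "info": {-1: 0.5, 0: 0.9, 1: 0.4}},
    {"id": 2, "content": "geometry", "time": 8, "info": {-1: 0.3, 0: 0.7, 1: 0.8}},
    {"id": 3, "content": "algebra",  "time": 5, "info": {-1: 0.8, 0: 0.6, 1: 0.2}},
    {"id": 4, "content": "geometry", "time": 7, "info": {-1: 0.4, 0: 0.8, 1: 0.9}},
    {"id": 5, "content": "algebra",  "time": 9, "info": {-1: 0.6, 0: 1.0, 1: 0.7}},
]
target_props = {"algebra": 0.5, "geometry": 0.5}   # target content proportions P_c
info_targets = {-1: 1.0, 0: 1.5, 1: 1.0}           # target information at each cutscore
max_time = 50                                      # class-period time limit in minutes

def objective(form):
    """Sum of absolute discrepancies between target and actual content proportions."""
    props = {c: sum(m["content"] == c for m in form) / len(form) for c in target_props}
    return sum(abs(target_props[c] - props[c]) for c in target_props)

def feasible(form):
    """Check the time constraint and the information constraints at the cutscores."""
    if sum(m["time"] for m in form) > max_time:
        return False
    return all(sum(m["info"][c] for m in form) >= t for c, t in info_targets.items())

best = None
for size in range(1, len(design_space) + 1):
    for form in combinations(design_space, size):
        if feasible(form) and (best is None or objective(form) < objective(best)):
            best = form
print([m["id"] for m in best] if best else "no feasible form")
```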
With reference back to
Automated scoring can be implemented into the adaptive test generator 502 in a variety of ways. For example, Educational Testing Service® offers its m-rater™ product that can automatically score mathematics expressions and equations, as well as some graphs. For example, if the key to an item is
m-rater can score student responses such as
or any other mathematical equivalent as correct. m-rater™ can also assess numerical equivalence. For example, if the key to an item is 3/2, responses such as 1.5, 6/4, or any other numerical equivalent will be scored as correct. Another product of Educational Testing Service®, c-rater™, can automatically score short text responses. In general, automated scoring of constructed responses can be carried out using approaches known to those of ordinary skill in the art, such as those described in U.S. Pat. No. 6,796,800, entitled “Methods for Automated Essay Analysis,” and U.S. Pat. No. 7,392,187, entitled “Method and System for the Automatic Generation of Speech Features for Scoring High Entropy Speech,” the entireties of which are herein incorporated by reference.
An additional level of complexity is added when on-the-fly question generation is incorporated into an adaptive test that utilizes constructed responses. To score constructed responses, a scoring key for the constructed response may be generated when the question that requests the constructed response is generated. For example, for a text constructed response, a concept-based scoring rubric may be generated. Certain key concepts, when present in an examinee's response, may provide evidence for a particular score level. Because there are often multiple approaches to solving a problem or providing an explanation, a concept-based scoring rubric specifies alternative sets of concepts that should be present at a particular score level. A next step may be to human-score a sample of student responses in accordance with the concept-based scoring rubric. Typically, this sample consists of 100-200 responses, and the responses are scored by two human scorers working independently. The concept-based scoring rubric and the human-scored responses may be loaded into a computer for generation of a scoring model that provides scoring consistent with the human-scored sample.
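The sketch below illustrates one way a concept-based rubric with alternative concept sets could be represented and applied. The rubric contents, the naive keyword matching, and the score levels are hypothetical and stand in for the far more sophisticated natural-language analysis an operational scoring engine would perform.

```python
# Hypothetical concept-based rubric: each score level lists alternative sets of
# concepts, any one of which is sufficient evidence for that level.
rubric = {
    2: [{"photosynthesis", "sunlight", "glucose"},
        {"light energy", "converted", "sugar"}],
    1: [{"plants", "sunlight"},
        {"plants", "energy"}],
}

def detect_concepts(response_text, known_concepts):
    """Naive keyword matching as a stand-in for real concept detection."""
    text = response_text.lower()
    return {c for c in known_concepts if c in text}

def score_response(response_text):
    all_concepts = {c for sets in rubric.values() for s in sets for c in s}
    found = detect_concepts(response_text, all_concepts)
    # Award the highest score level for which some alternative concept set is fully present.
    for level in sorted(rubric, reverse=True):
        if any(concept_set <= found for concept_set in rubric[level]):
            return level
    return 0

print(score_response("Plants use sunlight in photosynthesis to make glucose."))  # 2
```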
To score mathematics-based constructed responses, a first step is to define a concept-based rubric. A second step is to create simulated scored student responses. Because mathematics constructed responses are expressed in mathematical form, it may be more straightforward to predict representative student responses, and simulating them is typically sufficient for the purpose of building a model. The concept-based rubric and the scored responses are used to build the scoring model.
Following administration of a number of exam questions 504 and receipt and scoring of the associated exam answers 506, the adaptive test generator 502 assigns examinees to a scoring level or otherwise provides examinees a score at 520. The scores assigned by the adaptive test generator 502 may be stored in the one or more data stores 508, as indicated at 514.
It should be noted that questions at the stages of an adaptive test may be provided in a variety of ways. For example, each stage could consist of a single question, where an examinee is routed based on a score generated for the examinee's response to that single question. Such a configuration could be viewed as having many stages. Alternatively, a block of questions may be provided at a single stage, and the examinee may be routed based on their scores for that block of multiple questions at the stage. The questions of such a stage could be dictated by a form model. As another example, blocks of questions for a stage could be of varying length, with the length of a block determined based on an IRT ability estimate. In other words, questions are provided at a stage until a degree of confidence is reached in the forthcoming classification for the next stage.
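A minimal sketch of such a variable-length stopping rule is shown below. The callables, cutscore, and confidence threshold are a hypothetical interface assumed for illustration; the idea is simply to keep administering items until the ability estimate is far enough from the cutscore, relative to its standard error, to support a confident routing decision.

```python
def administer_variable_length_stage(next_item, respond_and_score, estimate_theta_and_sem,
                                     cutscore=0.0, confidence_z=1.64, max_items=10):
    """Present items at a stage until the classification relative to the cutscore is
    confident enough for routing, or until the maximum block length is reached.

    next_item, respond_and_score, and estimate_theta_and_sem are callables assumed to be
    supplied by the surrounding adaptive testing system (hypothetical interface).
    """
    scores = []
    theta = None
    for _ in range(max_items):
        item = next_item(scores)                     # select the next item for this stage
        scores.append(respond_and_score(item))       # administer the item and score the answer
        theta, sem = estimate_theta_and_sem(scores)  # current IRT ability estimate and its SEM
        # Stop once the estimate is clearly above or below the cutscore.
        if abs(theta - cutscore) >= confidence_z * sem:
            break
    return scores, theta
```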
Communications between the examinee and the adaptive test generator 902 (i.e., the modality of the stimulus and response) may take a variety of forms or combination of forms in addition to the transmission of questions and receipt of answers using text or numbers as described above. For example, communications may be performed using audio and speech. A test item prompt may be provided to an examinee via recorded speech or synthesized speech. An examinee could respond vocally, and the examinee's speech could be captured and analyzed using speech recognition technology. The content of the examinee's speech could then be evaluated and scored, and a next question could be provided to the examinee based on the determined score. Communications may be performed numerically, graphically, aurally, in writing, or in a variety of other forms.
A disk controller 1160 interfaces one or more optional disk drives to the system bus 1152. These disk drives may be external or internal floppy disk drives such as 1162, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 1164, or external or internal hard drives 1166. As indicated previously, these various disk drives and disk controllers are optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 1160, the ROM 1156 and/or the RAM 1158. Preferably, the processor 1154 may access each component as required.
A display interface 1168 may permit information from the bus 1152 to be displayed on a display 1170 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 1172.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 1173, or other input device 1174, such as a microphone, remote control, pointer, mouse and/or joystick.
This written description uses examples to disclose the invention, including the best mode, and also to enable a person skilled in the art to make and use the invention. The patentable scope of the invention may include other examples. For example, the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It may be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate situations where only the disjunctive meaning may apply.
The disclosure has been described with reference to particular exemplary embodiments. However, it will be readily apparent to those skilled in the art that it is possible to embody the disclosure in specific forms other than those of the embodiments described above. The embodiments are merely illustrative and should not be considered restrictive. The scope of the disclosure is given by the appended claims, rather than the preceding description, and all variations and equivalents which fall within the range of the claims are intended to be embraced therein.
This application claims priority to U.S. Provisional Application No. 61/236,319, filed Aug. 24, 2009, entitled “Form Models Implemented into an Adaptive Test,” the entirety of which is herein incorporated by reference.