This disclosure relates generally to automated information retrieval, and more particularly to automated retrieval and selection of media materials appropriate for developing test items.
Developing test items, such as those for reading and listening proficiency tests, may be labor-intensive and time-intensive. Test developers often spend hours browsing for examples online to get inspiration for new test items, as real-life examples can help developers diversify the genre or topic of test items and make the language used sound more natural. Typically, test developers query for audio and video files using conventional search engines (e.g., Google.com), review the search results, and select examples as seed materials for developing new test items. The process of reviewing search results is especially time-consuming when the results are audio and/or video clips.
Test designers looking for test ideas often search online for audio/video materials. Unfortunately, they may have to spend considerable time sifting through materials that are unsuitable or inappropriate for test-item development. To minimize the time wasted, this disclosure describes a system, apparatus, and method of retrieving media materials for generating test items. In one example, the system may query one or more data sources based on search criteria for retrieving media materials, and receive candidate media materials based on the query, each of which includes an audio portion. The system may obtain a transcription of the audio portion of each of the candidate media materials. The system may analyze the transcription for each candidate media material to identify characteristics of the associated candidate media material. The candidate media materials may be filtered based on the identified characteristics to derive a subset of the candidate media materials. A report identifying one or more of the candidate media materials in the subset may then be generated for the user. Exemplary systems comprising a processing system and a memory for carrying out the method are also described. Exemplary non-transitory computer-readable media having instructions adapted to cause a processing system to execute the method are also described.
The technology described herein relates to systems and methods for retrieving and selecting appropriate media materials (e.g., containing audio and/or video in addition to text) for developing test items, such as for language proficiency tests. In some implementations, the system may receive a keyword query from a user (e.g., a test developer) and use it to retrieve media materials that include speech audio. The retrieved materials may differ substantially in terms of audio quality (if they are audio or video files), vocabulary difficulty, syntactic complexity, distribution of technical terms and proper names, and/or other content and linguistic features that may influence the materials' usefulness to the user. Rather than returning all the retrieved materials to the user, the system may automatically filter out the materials with undesirable characteristics and only return a selected set that is more likely to be of use to the user. The information retrieval system described herein may therefore significantly reduce the amount of time spent by a test developer reviewing inadequate materials.
In an exemplary implementation where the server 120 carries out the operations, the server 120 may retrieve relevant media materials (e.g., containing audio, video, and/or text) based on the user's specification (e.g., keyword entry or selection). The materials may be retrieved from any source 130, such as the World Wide Web, a specific third-party source (e.g., YouTube.com), a repository of previously collected materials hosted remotely or locally, and/or the like. The server 120 may also retrieve training materials from a repository 140 (local or remote). The training materials may be existing test items similar to what the test developer wishes to develop, or they may be samples selected by experts. As will be described in further detail below, the retrieved materials may undergo a variety of filtering and selection operations, some of which may utilize the training materials, to identify materials that are most likely to be useful to the user 100. The server 120 may then return the results to the user's computer 110, which may in turn display the results to the user 100. The user 100 may review and use the returned materials to develop new test items.
The system may generate a query based on the user input 200 and use it to retrieve relevant media materials (e.g., audio, video, and/or text) 210. In addition to using the user input 200, the system in some implementations may also automatically add synonyms and closely related terms as search parameters (e.g., if the user entered the keyword “film,” the system may also search for “movies”). The system may query any combination of data sources, including the World Wide Web, private networks, specific databases, etc. The retrieval may be carried out using Application Programming Interfaces (APIs) provided by online service providers, web scraping algorithms, audio/video search engines, and/or the like. For example, in some implementations the search may be based on a comparison of the user-entered keywords to a material's title, file name, metadata, hyperlink, contextual information (e.g., the content of the webpage where the media is found), user remarks, audiovisual indexes created by the hosting service, and other indicia of content. The retrieved materials may be considered candidates for the final set of materials presented to the user.
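By way of illustration only, the synonym expansion described above could be sketched using the WordNet lexical database accessed through NLTK; the choice of NLTK/WordNet is an assumption for this example, and any thesaurus- or embedding-based expansion could be substituted.

```python
# Minimal sketch of keyword expansion with synonyms, assuming NLTK and its
# WordNet corpus are installed; the keyword "film" is an illustrative input.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # ensure the WordNet corpus is available

def expand_query(keywords):
    """Return the original keywords plus WordNet synonyms (lemma names)."""
    expanded = set(keywords)
    for keyword in keywords:
        for synset in wn.synsets(keyword):
            for lemma in synset.lemmas():
                expanded.add(lemma.name().replace("_", " "))
    return sorted(expanded)

# Example: the keyword "film" yields related terms such as "movie".
print(expand_query(["film"]))
```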
The retrieved media materials are then filtered based on any number and combination of characteristics associated with the materials, such as, but not limited to, audio quality, transcription quality, content relevance to the user's search criteria, appropriateness of the linguistic features used, etc. The filtering modules described in detail below provide additional examples of how some characteristics are identified and analyzed for purposes of filtering out undesirable media materials.
The audio quality of some of the retrieved candidate media materials may be unacceptably poor since the retrieval algorithm may not have taken into consideration audio quality. A material with poor audio quality may be unsuitable for use by the test developer or by the system (e.g., poor audio quality may hamper the system's ability to use speech recognition technology to transcribe the content). Therefore, in some embodiments the system may filter the retrieved materials based on audio quality 220.
Another audio feature that may be used for assessing audio quality is jitter (i.e., irregularities/deviations in pitch periods), which is undesirable if excessive. Any conventional method for extracting jitter information from audio may be used. For example, the speech analysis software PRAAT (developed at the University of Amsterdam) may be used to measure jitter information 330 from each of the audio/video materials 310. In some implementations, local frame-to-frame jitter may be measured, which in general is the average absolute difference between consecutive periods, divided by the average period. The jitter measurement may, in some implementations, be used as a feature in the statistical model for determining sufficiency of audio quality (e.g., at 350).
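For illustration, local jitter could be measured programmatically through PRAAT's scripting interface. The sketch below assumes the third-party parselmouth package (a Python wrapper around PRAAT); the file name and the argument values passed to the jitter command are illustrative defaults, not values prescribed by this disclosure.

```python
# Hedged sketch: measuring local jitter with PRAAT via the parselmouth wrapper.
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("candidate_clip.wav")  # hypothetical candidate file
# Extract glottal pulses (pitch floor 75 Hz, ceiling 500 Hz -- typical defaults).
point_process = call(sound, "To PointProcess (periodic, cc)", 75, 500)
# Local jitter: mean absolute difference between consecutive periods divided by
# the mean period (PRAAT's customary argument values shown).
local_jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
print(f"Local jitter: {local_jitter:.4f}")
```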
In addition to the above, any other conventionally known measures of audio or speech features may be employed. For example, the pitch contour 340 of each audio/video material 310 may be measured. In some implementations, the pitch contour may be compared to sample human pitch contours in the target language of the test items (e.g., English, Spanish, etc.). A similarity measure may be calculated based on, e.g., the root-mean-square deviations between the measured pitch contour and the sample pitch contours. The similarity measure of the pitch contours may also be used as a feature in the statistical model for assessing audio quality 350.
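A minimal sketch of the pitch-contour comparison might look like the following, assuming the contours have already been extracted and resampled to a common length; the numeric values are hypothetical.

```python
# Sketch: similarity between a measured pitch contour and sample contours,
# based on root-mean-square deviation (contours assumed to be equal length).
import numpy as np

def rms_deviation(contour_a, contour_b):
    """Root-mean-square deviation between two equal-length pitch contours (Hz)."""
    a, b = np.asarray(contour_a, float), np.asarray(contour_b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def contour_similarity(measured, samples):
    """Similarity feature: negative of the smallest RMSD to any sample contour."""
    return -min(rms_deviation(measured, s) for s in samples)

# Illustrative values only.
measured = [180.0, 190.0, 175.0, 160.0]
samples = [[185.0, 195.0, 170.0, 158.0], [120.0, 118.0, 115.0, 110.0]]
print(contour_similarity(measured, samples))
```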
As another example, estimations of the signal-to-noise ratio 345 of each audio/video material 310 may be used. In situations where separate measurements of the “signal” and the “noise” for the audio/video materials are unavailable, the signal-to-noise ratio of the materials may be estimated based on assumptions about signal behavior and noise behavior. For example, the NIST STNR utility (a signal-to-noise ratio estimation tool from the National Institute of Standards and Technology) and the WADA method (Waveform Amplitude Distribution Analysis), developed at Carnegie Mellon University, may be used to estimate the signal-to-noise ratio of the audio/video materials. The estimated signal-to-noise ratio may again be used as a feature in the statistical model 350.
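The following is a rough, assumption-based sketch of such an estimate (it is not an implementation of the NIST STNR or WADA algorithms): it treats the quietest frames of the audio as an approximation of the noise floor and the loudest frames as an approximation of the signal.

```python
# Rough, illustrative SNR estimate based on frame-energy percentiles.
import numpy as np

def estimate_snr_db(samples, frame_len=1024):
    """Estimate SNR in dB from raw audio samples (a 1-D sequence of floats)."""
    x = np.asarray(samples, float)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    noise_power = np.mean(energies[: max(1, n_frames // 10)])    # quietest 10%
    signal_power = np.mean(energies[-max(1, n_frames // 10):])   # loudest 10%
    return 10.0 * np.log10(signal_power / max(noise_power, 1e-12))
```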
The audio feature measurements (e.g., 320, 330, 340, 345) for each audio/video material 310 may be input into a statistical model 350 to determine whether the material 310 should be filtered out or kept as a candidate for further analysis. In some implementations, the statistical model may be trained using training audio/video materials of known quality (e.g., as determined by human reviewers). For example, a model may be represented by a linear combination of weighted audio feature measurements (i.e., the independent variables) that predicts a value representing audio quality (i.e., the dependent variable). During training, the known quality of each training material, which may be represented numerically, would replace the dependent variable of the model, and the training material's audio feature measurements (e.g., obtained using the aforementioned audio metrics) would replace the independent variables. The goal of the training is to find weights for the independent variables that optimize the predictability of the dependent variable. Regression analysis or any other model-training process known to one of ordinary skill in the art may be used to determine the proper weights for the independent variables in the model. Once the model has been trained, the audio feature measurements of an audio/video material may be input into the model to obtain an audio quality score 350. Based on the score, the audio/video material may be retained as a candidate or filtered out 360. For example, if an audio quality score fails to meet a predetermined threshold, then the corresponding audio/video material may be filtered out of the group of candidate materials. The predetermined threshold may be based on empirical observations or be specified by the user.
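A minimal sketch of such a model, assuming scikit-learn's linear regression and hypothetical human-scored training data, might look like the following; the feature columns stand in for the audio measurements described above (e.g., jitter, pitch-contour similarity, estimated signal-to-noise ratio), and the threshold value is illustrative.

```python
# Sketch of the audio-quality model: linear regression over audio features.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data: rows = training clips, columns = audio features
# (e.g., jitter, pitch-contour similarity, estimated SNR in dB).
X_train = np.array([[0.012, -8.5, 25.0],
                    [0.040, -30.2, 9.0],
                    [0.018, -12.1, 18.5]])
y_train = np.array([0.9, 0.2, 0.7])  # human-assigned quality scores

model = LinearRegression().fit(X_train, y_train)

def keep_candidate(features, threshold=0.5):
    """Retain the material only if its predicted quality score meets the threshold."""
    return float(model.predict([features])[0]) >= threshold

print(keep_candidate([0.015, -10.0, 20.0]))
```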
Rather than training and using a model to analyze the audio measurements (e.g., at 350), an assessment of audio quality may be performed by directly comparing the audio measurements (e.g., 320, 330, 340, 345) to benchmark characteristics or values. Based on the comparison of the audio measurements to their respective benchmarks, the corresponding audio/video material may be retained or discarded. For example, in some implementations a material may be discarded for having any substandard audio measurement (e.g., a material may be filtered out if its estimated signal-to-noise ratio fails to meet a predetermined threshold).
Referring again to
In some implementations, an initial screening of the transcriptions may be used to filter out unsuitable materials 240.
In cases where no existing transcription is available, the accuracy of an ASR-generated transcription may be scrutinized by using any confidence measure (CM) algorithm 440, such as the normalized acoustic score and N-best based confidence score, as described in L. Chase, “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition” (1997) and T. J. Hazen et al., “Recognition Confidence Scoring and Its Use in Speech Understanding Systems” (2002), both of which are expressly incorporated by reference herein. Depending on the CM, the corresponding candidate material may be filtered out or retained 445. For example, if the CM of an ASR-generated transcription fails to meet a predetermined threshold (e.g., the CM is too low), then the corresponding material may be filtered out from the candidate group.
The candidate materials may also be scrutinized for including excessive undesirable/inappropriate terms.
In addition to filtering based on audio quality and transcription quality, the content of the materials may be compared against the user-entered search criteria to identify materials with the best match. In some implementations, the system may first parse the user's search criteria (e.g., from step 200) and determine whether the user has specified a desired topic or text type 250. For example, the words in the user's search criteria may be classified by comparing them to a collection of topic labels and a collection of text-type words. Alternatively, the system's user interface may allow the user to enter keywords or make selections in separate topic and text-type forms. Based on the classification of the user's search criteria, an appropriate filter module may be invoked. For example, if the search criteria specify a topic, a topic filter module 260 may be invoked to identify audio/video materials that are sufficiently similar to the user-specified topic.
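As a purely illustrative sketch, the routing of the user's search criteria to the appropriate filter module could be expressed as follows; the label collections and module names are hypothetical placeholders.

```python
# Sketch: classify the user's keywords against topic and text-type collections
# to decide which filter modules to invoke (collections are hypothetical).
TOPIC_LABELS = {"finance", "biology", "travel", "sports"}
TEXT_TYPE_WORDS = {"lecture", "news", "presentation", "interview"}

def route_search_criteria(keywords):
    """Return the filter modules to invoke based on the user's keywords."""
    modules = []
    if any(k.lower() in TOPIC_LABELS for k in keywords):
        modules.append("topic_filter")
    if any(k.lower() in TEXT_TYPE_WORDS for k in keywords):
        modules.append("text_type_filter")
    return modules

print(route_search_criteria(["finance", "lecture"]))  # both modules invoked
```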
Referring again to
The training materials in the target group and the garbage group may be used to train a classification model for classifying a given material's transcription into either of the groups 630. In some implementations, the classification model may use TF-IDF (Term Frequency-Inverse Document Frequency) values of words in a transcription as features for predicting whether the transcription belongs in the target or garbage group (TF-IDF is a numerical statistic that is intended to reflect how important a word is to the document). In other words, the classification model's independent variables may correspond to the TF-IDF values and the dependent variable may correspond to an indication of whether a transcription belongs in the target group or garbage group. Once the model has been trained, it can be applied to the collection of candidate materials to identify those that match the user-specified text type (i.e., those that fall into the target group) 640. The ones matching the user-specified text type may remain a candidate, and the ones that do not (i.e., those that fall into the garbage group) may be discarded 650.
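A minimal sketch of such a two-class classifier, assuming scikit-learn, might look like the following; the logistic-regression classifier and the toy training transcriptions are illustrative assumptions, as the disclosure does not prescribe a particular classification algorithm.

```python
# Sketch of the target-vs-garbage classifier using TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training transcriptions and labels (1 = target group, 0 = garbage).
train_texts = ["today's lecture covers supply and demand",
               "welcome back to my vlog, don't forget to subscribe",
               "in this lecture we examine monetary policy"]
train_labels = [1, 0, 1]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# Keep candidates predicted to match the user-specified text type.
candidates = ["this lecture introduces exchange rates", "unboxing my new phone"]
kept = [t for t, label in zip(candidates, classifier.predict(candidates)) if label == 1]
print(kept)
```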
In cases where the user's search criteria include only a topic but not a text type, it may be desirable to return a collection of topic-relevant audio/video materials categorized by text type. For example, if the user is interested in materials relating to finance, he or she may be presented with categories of financial materials drawn from lectures, presentations, news, etc. This may be implemented using a classification method similar to the one described above, but instead of training the classification model based on two categories (i.e., a target group and a garbage group), the training would be based on the training materials' text-type labels (e.g., lecture, conference article, journal, etc.). Thus, when the classification model is applied to an audio/video material, it would output a prediction of which text type the material would likely fall under.
The text-type filter module 600 may also use clustering algorithms to determine whether a material's text type matches the user-specified text type. For example, k-means clustering (e.g., as implemented by Apache Mahout) and/or Expectation-Maximization algorithms may be used to automatically cluster the remaining candidate audio/video materials into groups. As known by persons of ordinary skill in the art, the k-means clustering algorithm iteratively clusters data around the k closest cluster centers. In general, the algorithm is given a number k and a set of data (e.g., text documents) represented by numeric features in n-dimensional space 660. Where the data is text, the numeric features may be TF-IDF vector values, as previously mentioned. Typically, the algorithm begins by randomly selecting k cluster centers in the n-dimensional space and then clustering the given data around those k cluster centers (e.g., based on the calculated distances between the data points and the centers). However, since the goal of the text-type filter module 600 is to find materials of a specific user-specified text type, in some implementations the initial k cluster centers may be explicitly set rather than randomly selected. For example, each of the initial k cluster centers may correspond to a known text type 670 (e.g., one cluster center may be derived from a collection of lectures, another cluster center may be derived from a collection of presentations, etc.). Having provided initial cluster centers that correspond to text types, the algorithm may then cluster the transcriptions of the audio/video materials around those cluster centers 680. The clustering algorithm may then recalculate each cluster's center based on the data clustered around it 685, and again cluster the data around the new centers 680. This process may iterate for a specified number of times or until the cluster centers stabilize 690. In some implementations, the audio/video materials represented by the final cluster associated with the user-specified text type would be retained 695.
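The seeded clustering step could be sketched as follows, assuming scikit-learn's KMeans implementation rather than Apache Mahout; the documents are hypothetical, and each initial cluster center is the TF-IDF centroid of documents of a known text type, as described above.

```python
# Sketch of k-means clustering with explicitly seeded (non-random) centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

lectures = ["today's lecture covers photosynthesis", "this lecture reviews calculus"]
news = ["breaking news from the capital today", "markets fell sharply in news today"]
candidates = ["in this lecture we discuss grammar", "tonight's top news stories"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures + news + candidates).toarray()
# One initial center per known text type (TF-IDF centroid of its examples).
centers = np.vstack([X[:2].mean(axis=0), X[2:4].mean(axis=0)])

kmeans = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)
candidate_clusters = kmeans.labels_[4:]  # cluster assignments of the candidates
print(candidate_clusters)  # cluster indices correspond to the seeded centers
```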
In other implementations, the aforementioned k cluster centers may be randomly selected, and the transcriptions would be placed into k clusters according to the k-means algorithm. After the k clusters of transcriptions have been determined, any cluster labeling algorithm may be used to pick descriptive labels for each of the clusters. In one example, cluster labeling may be based on external knowledge such as pre-categorized documents (e.g., human-assigned labels to existing test items or training documents). The process in some implementations may start by extracting linguistic features from the transcriptions in each cluster. The features may then be used to retrieve and rank n-nearest pre-categorized documents (e.g., pre-categorized documents with similar linguistic features). One of the n-nearest pre-categorized documents may be selected (e.g., the one with the best rank), and the pre-determined words (e.g., the category titles) used to describe that document may be used as the cluster label for the corresponding cluster of transcriptions. Each cluster of transcriptions may be labeled in this manner. Thereafter, the cluster labels may be compared to the user-specified text type, using any conventional semantics similarity algorithm, to identify the best-matching cluster. The final materials presented to the user may be selected from the best-matching cluster.
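For illustration, labeling a cluster from its nearest pre-categorized document might be sketched as follows; TF-IDF features and cosine similarity are assumptions made for this example, and the pre-categorized documents shown are hypothetical.

```python
# Sketch: label a cluster of transcriptions by its most similar pre-categorized document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categorized_docs = ["a university lecture on thermodynamics",
                    "evening news report on the election"]
categorized_labels = ["lecture", "news"]

def label_cluster(cluster_transcripts):
    """Return the label of the pre-categorized document closest to the cluster."""
    vectorizer = TfidfVectorizer().fit(categorized_docs + cluster_transcripts)
    doc_vecs = vectorizer.transform(categorized_docs)
    cluster_vec = vectorizer.transform([" ".join(cluster_transcripts)])
    sims = cosine_similarity(cluster_vec, doc_vecs)[0]
    return categorized_labels[int(sims.argmax())]

print(label_cluster(["today's lecture covers entropy",
                     "this lecture reviews heat engines"]))
```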
Referring again to
In another implementation, candidate audio/video materials may be filtered based on the formality level of the speech therein. For example, some materials may use speech that is overly formal (e.g., in news reporting or business presentations) or overly informal (e.g., conversations at a playground or bar) for purposes of test item generation. In one implementation, a model for predicting formality level may be trained, similar to the process described above with respect to complexity levels. For example, a collection of training materials with predetermined formality levels (e.g., as labeled by humans) may be retrieved, and various linguistic features of the training materials may be extracted. A model (e.g., represented by a linear combination of variables) may then be trained using the extracted linguistic features as values for the independent variables and the predetermined formality levels as values for the dependent variable. In some implementations, linear regression may be used to determine the optimal weights/coefficients for the independent variables. The set of optimal weights/coefficients may then be incorporated into the model for predicting formality levels. The model may be applied to the transcriptions (specifically, the linguistic features of the transcriptions) of the candidate audio/video materials to predict the formality level of the speech contained therein. The candidate audio/video materials may then be filtered based on the formality levels and predetermined selection criteria (e.g., formality levels above and/or below a certain threshold may be filtered out).
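The sketch below illustrates, under assumed features, how such a formality model might be trained; the two linguistic features (contraction rate and average word length) and the toy training data are illustrative assumptions rather than a prescribed feature set.

```python
# Sketch: a linear model over simple linguistic features predicting formality.
import re
import numpy as np
from sklearn.linear_model import LinearRegression

def formality_features(transcript):
    """Toy features: rate of contractions and average word length."""
    tokens = re.findall(r"[A-Za-z']+", transcript)
    contractions = sum(1 for t in tokens if "'" in t)
    avg_word_len = np.mean([len(t) for t in tokens]) if tokens else 0.0
    return [contractions / max(len(tokens), 1), avg_word_len]

train_texts = ["Good evening. The committee has reached its decision.",
               "Hey, don't worry, it's gonna be fine, I'm sure."]
train_formality = [0.9, 0.1]  # hypothetical human-assigned formality levels

model = LinearRegression().fit([formality_features(t) for t in train_texts],
                               train_formality)
print(model.predict([formality_features("The report will be published tomorrow.")]))
```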
In some instances it may also be desirable to filter out audio/video materials based on the level of inclusion of inappropriate words, such as offensive words or words indicating that the topic relates to religion or politics. In some implementations, a list of predetermined inappropriate words may be retrieved. Each transcript may then be analyzed to calculate the frequency with which the inappropriate words appear. Based on the frequency of inappropriate-word occurrences (e.g., as compared to a predetermined threshold), the corresponding candidate audio/video material may be filtered out.
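A minimal sketch of such a filter might look like the following; the word list and threshold are hypothetical placeholders.

```python
# Sketch: filter transcripts by the rate of occurrence of listed words.
import re

INAPPROPRIATE_WORDS = {"offensiveword1", "offensiveword2"}  # hypothetical list

def inappropriate_word_rate(transcript):
    """Fraction of transcript tokens that appear in the predetermined word list."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in INAPPROPRIATE_WORDS)
    return hits / len(tokens)

def keep_transcript(transcript, threshold=0.01):
    """Retain the material only if the inappropriate-word rate is at or below the threshold."""
    return inappropriate_word_rate(transcript) <= threshold
```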
Referring again to
As one of ordinary skill in the art would recognize, the filters described herein may be applied in any sequence and are not limited to any of the exemplary embodiments. For example, the linguistic filter module may be applied first, followed by the transcription-quality filter, followed by the audio-quality filter, and followed by the text-type filter and topic filter. In addition, one or more of the filters may be processed concurrently using parallel processing. For example, each of the filters may be processed on a separate computer/server and the end results (e.g., similarity scores, model outputs, filter recommendations, etc.) may collectively be analyzed (e.g., using a model) to determine whether a media material ought to be filtered out. Furthermore, the retrieval system may utilize a subset or all of the filters described herein.
Additional examples will now be described with regard to additional exemplary aspects of implementation of the approaches described herein.
A disk controller 860 interfaces one or more optional disk drives to the system bus 852. These disk drives may be external or internal floppy disk drives such as 862, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 864, or external or internal hard drives 866. As indicated previously, these various disk drives and disk controllers are optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 860, the ROM 856 and/or the RAM 858. Preferably, the processor 854 may access each component as required.
A display interface 868 may permit information from the bus 852 to be displayed on a display 870 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 873.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 872, or other input device 874, such as a microphone, remote control, pointer, mouse and/or joystick.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein, and may be provided in any suitable language such as C, C++, or Java, for example, or any other suitable programming language. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Further, as used in the description herein and throughout the claims that follow, the meaning of “each” does not require “each and every” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate situations where only the disjunctive meaning may apply.
The present application claims the benefit of U.S. Provisional Application Ser. No. 61/897,360, entitled “Automated Authentic Listening Passage Selection System for the Language Proficiency Test,” filed Oct. 30, 2013, the entirety of which is hereby incorporated by reference.