The present disclosure generally relates to a musical performance evaluation system, a performance evaluation method, and a non-transitory computer-readable medium storing a musical performance evaluation program. More specifically, the present disclosure relates to providing feedback on the presence of errors in a musical performance.
A commonly held notion in automatic music performance analysis (MPA) research is that deviations of music performances from their underlying music score can be regarded as performance errors. However, some music pedagogy research suggests that some deviations in music performances are more apparent to a listener than others. For example, a chord that is voiced differently from that written in the score might be overlooked, but missing a note in a characteristic motif, or playing a note that clashes with the underlying harmony, would stand out. Thus, some errors in music performances are more noticeable than others.
In recent years, the music industry has developed various computer-aided devices intended to teach and/or improve a student's ability to play an instrument such as the piano. For example, various instrument teaching software and apps have been developed to teach and/or improve a student's ability to play a musical instrument. Such teaching software and apps do not distinguish between the different types of errors in music performances. Rather, all errors in music performances are treated in the same manner.
Music education software that provides analysis founded solely on rigid note-level rhythmic and pitch correctness has been challenged on the basis that users might end up too focused on playing as correctly as possible (almost robotically) to attain the highest scores. It has been discovered that there are many considerations for designing useful music education software that automatically assesses a musical performance. For example, it has been discovered that beginning and intermediate students need different feedback on their performances than advanced students do. The present disclosure is basically directed to a musical performance evaluation system which provides feedback to beginning and intermediate students on their performances. In particular, in musical performances, some mistakes or errors may stand out to listeners, whereas other mistakes may go unnoticed. How noticeable an error is depends on factors including the contextual appropriateness of the error and a listener's degree of familiarity with the piece that is being performed. A conspicuous error or mistake is considered to be an error or mistake where there is something obviously wrong with the performance to a listener, regardless of the listener's degree of knowledge of the piece that is being performed. More specifically, a conspicuous error is considered to be a performance error that can be detected by the majority of listeners with formal music training, regardless of their degree of knowledge about the underlying music score of a performed piece. Of course, the perception of conspicuous errors also depends on the listener's knowledge of the piece and the proficiency of the performer. Furthermore, conspicuous error and expression are two sides of the same coin. For example, hitting an adjacent key can come across either as an expressive ornament or as a conspicuous error. This suggests that conspicuous error detection should inherently be conditioned on the style, the level of the listener, and the player's proficiency.
One aspect of the present disclosure is to infer a time sequence of binary labels for evaluating a musical performance by indicating the presence of conspicuous errors at a given time for a given sequence of music.
Another aspect of the present disclosure is to provide a musical performance evaluation system having a score-independent conspicuous error detector to aid beginner to intermediate students by evaluating their musical performances.
In accordance with one aspect of the present disclosure, a musical performance evaluation system is provided that basically comprises an audio input, a notification device, a computer-readable storage medium, and at least one processor. The audio input is configured to input a musical performance. The computer-readable storage medium stores reference data of segments of musical performances containing errors. The notification device is configured to output an evaluation of the musical performance. The at least one processor is operatively coupled with the computer-readable storage medium and the notification device. The at least one processor is configured to execute a musical performance evaluation program to identify errors in the musical performance based on the reference data of segments of musical performances containing errors, classify the errors in the musical performance as either a conspicuous error or an inconspicuous error, and instruct the notification device to output the evaluation of the musical performance identifying a presence of conspicuous errors differently from inconspicuous errors.
In accordance with another aspect of the present disclosure, a computer-implemented musical performance evaluation method is provided for providing an evaluation of a musical performance. The computer-implemented musical performance evaluation method comprises acquiring a musical performance played by a user; identifying errors in the musical performance based on reference data of segments of musical performances containing errors using at least one processor; classifying the errors in the musical performance as either a conspicuous error or an inconspicuous error using the at least one processor; and instructing a notification device to output an evaluation of the musical performance by identifying a presence of conspicuous errors differently from inconspicuous errors.
In accordance with another aspect of the present disclosure, a non-transitory computer-readable medium is provided that stores a musical performance evaluation program which, when executed by a computing device, causes the computing device to perform operations comprising: acquiring a musical performance of an instrument played by a user; identifying errors in the musical performance based on reference data of segments of musical performances containing errors using at least one processor of the computing device; classifying the errors in the musical performance as either a conspicuous error or an inconspicuous error using the at least one processor of the computing device; and instructing a notification device to output an evaluation of the musical performance by identifying a presence of conspicuous errors differently from inconspicuous errors.
Also, other objects, features, aspects and advantages of the disclosed musical performance evaluation system will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses preferred embodiments of the musical performance evaluation system.
Referring now to the attached drawings which form a part of this original disclosure.
Selected embodiments will now be explained with reference to the drawings. It will be apparent to those skilled in the musical field from this disclosure that the following descriptions of the embodiments are provided for illustration only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
Referring initially to
Basically, as explained below in more detail, the musical performance evaluation system 10 illustrated in
Also, in the exemplary embodiment of
Here, in the exemplary embodiment of
Here, in the exemplary embodiment of
Referring to
The term “processor” as used herein refers to hardware that executes a software program, and does not include a human being. The term “computer-readable storage medium” as used herein refers to any non-transitory computer storage device and does not include a transitory propagating signal. For example, the computer-readable storage device 32 includes a storage medium such as a hard disk, an optical disk, a magnetic disk, or a memory card, and stores a learning model as described below.
Preferably, the computer-readable storage device 32 includes RAM (Random Access Memory) and ROM (Read Only Memory). The RAM of the computer-readable storage device 32 is a volatile memory that is used as a work area for the processor 30 and for temporarily storing various data. The ROM is a non-volatile memory that stores various support programs.
The processor 30 and the computer-readable storage device 32 are formed of one or more semiconductor chips that are mounted on a circuit board. The processor 30 is electrically connected to the computer-readable storage device 32. The processor 30 is also operatively coupled with the display 16 (i.e., the notification device), the keyboard 18, the mouse 20, the external speaker 22 (i.e., the notification device), the communication cable 24 (i.e., the audio input) and the microphone 26 (i.e., the audio input).
The processor 30 implements various functions by executing various programs stored in the computer-readable storage device 32. In the illustrated example of
The digital audio workstation 34 is a software program stored in the computer-readable storage device 32. Alternatively, the digital audio workstation 34 can be stored in the computer-readable storage device of the remote server 28 (e.g., a network server or a cloud server). The digital audio workstation 34 is typically configured to compose, produce, record, mix and edit audio and MIDI data. The digital audio workstation 34 can be any commercially available software or proprietary software so long as the software can at least record a musical performance played on a musical instrument that is to be evaluated. Since digital audio workstations are well known, the digital audio workstation 34 will not be discussed in further detail herein. In any case, the digital audio workstation 34 is configured to record the musical performance and provide MIDI note events of the musical performance as an input to the at least one processor 30. Preferably, the MIDI note events of the musical performance include a start time, an end time, a pitch and a velocity for each of the note events of the musical performance. Here, the digital audio workstation 34 creates at least one piano roll of the musical performance as the input to the at least one processor 30. Preferably, as described below, the at least one piano roll of the musical performance includes a first piano roll providing data on note onsets of the musical performance and a second piano roll providing data on sustained portions due to key depression of the musical performance.
Referring now to
If the musical performance evaluation system 10 is connected to a network such as the Internet, the learning model can be stored in the server 28 (e.g., a network server or a cloud server) instead of the computer-readable storage device 32. As explained below in more detail, the musical performance evaluation program 36 is preferably a score independent evaluation program. However, the musical performance evaluation program 36 is not limited to a score independent evaluation program as mentioned below.
Basically, in the musical performance evaluation system 10, the musical performance is evaluated by comparing the musical performance against reference data that is stored in the computer-readable storage device 32 or the remote server 28 (e.g., a network server or a cloud server). The reference data is collected to identify errors or mistakes that were made by performers playing musical performances. In the main embodiment, the reference data does not include complete musical scores of musical performances that are error free (i.e., sheet music of musical performances). Rather, in the main embodiment, the reference data comprises a plurality of inaccurate musical performances having errors, wherein the errors in the inaccurate musical performances have been identified. For example, in the case where the musical instrument is a piano, the reference data can be obtained by acquiring MIDI data of a plurality of actual piano performances performed by pianists of a variety of skill levels, and/or by creating synthetic data of a plurality of piano performances with procedurally generated mistakes. In either case, the errors or mistakes in the musical performances of the MIDI data and/or synthetic data are identified and classified. In particular, the reference data is annotated at the regions considered to contain one or more conspicuous errors. Thus, the term “reference data” as used herein refers to a set of data representing a plurality of segments of musical performances that includes annotations at the regions considered to contain one or more conspicuous errors.
Basically, in the musical performance evaluation system 10, the computer-readable storage medium 32 has reference data of segments of musical performances containing errors. Alternatively, the reference data of segments of musical performances containing errors can be stored on the remote server 28 (e.g., a network server or a cloud server). The notification device (e.g., the display 16 and/or the speaker 22) outputs an evaluation of the musical performance received via the audio input (e.g., the digital connection 24 and/or the microphone 26). The at least one processor 30 executes the musical performance evaluation program 36 to identify errors in the musical performance based on the reference data stored in the computer-readable storage medium 32, classify the errors in the musical performance as either a conspicuous error or an inconspicuous error, and instruct the notification device (e.g., the display 16 and/or the speaker 22) to output the evaluation of the musical performance identifying a presence of conspicuous errors differently from inconspicuous errors. Preferably, the at least one processor 30 is configured to instruct the notification device (e.g., the display 16 and/or the speaker 22) to output the evaluation of the musical performance by only identifying the presence of the conspicuous errors.
Now, some examples of reference data that can be used in evaluating a musical performance using the musical performance evaluation system 10 will be described. As mentioned above, the reference data can include MIDI data of a plurality of musical performances and/or synthetic data of a plurality of musical performances with procedurally generated mistakes. In the case of acquiring MIDI data of actual piano performances, for example, three qualitatively different sets of piano playing MIDI data can be acquired to identify errors or mistakes that were made by performers in playing musical performances on a piano. A first set of piano playing MIDI data can include a predetermined number of sight-reading performances by beginning and intermediate adult pianists with formal music training. The first set of piano playing MIDI data is referred to herein as sight-reading data (SR). The performances of the sight-reading data can be comprised of mostly piano reductions of popular classical pieces arranged for beginner to intermediate players. A second set of piano playing MIDI data can include a predetermined number of performances by late beginner to early advanced pianists. The second set of piano playing MIDI data is referred to herein as performance data (PF). The performances of the performance data are approximately 3 minutes each, and were collected from a digital piano recording app. Not all performed pieces in the performance data are known, but most of them are pop and classical pieces that are either read from a score or semi-improvised. While user attributes are unknown in the performance data, the performance data suggests that the skill levels range between late beginner and early advanced. A third set of piano playing MIDI data can include a predetermined number of performances from Bergmüller's 25 Etudes by an advanced pianist. The third set of piano playing MIDI data is referred to herein as Bergmüller data (BM). Each of the etudes can be recorded twice on a digital piano by an advanced pianist who had previously played the etudes and who practiced each etude briefly before recording the two takes.
For example, a test was performed on piano playing MIDI data including: (1) 103 sight-reading sessions by beginning and intermediate adult pianists with formal music training; (2) 245 performances by late beginner to early advanced pianists on a digital piano; and (3) 50 etude performances by an advanced pianist. The data of this test was annotated at the regions considered to contain conspicuous errors. In this test, the temporal convolutional network 40 was used to detect the sites of such conspicuous errors from a piano roll of a musical performance. The output from the temporal convolutional network 40 for each detected error was then processed using the classifier head 42 to determine the probability of each of the detected errors being a conspicuous error. Finally, a piano roll of the musical performance was displayed identifying only the conspicuous errors in the musical performance.
The total time for the sight-reading data is a first predetermined amount of time. The total time for the performance data is a second predetermined amount of time. The total time for the Bergmüller data is a third predetermined amount of time. The first predetermined amount of time, the second predetermined amount of time and the third predetermined amount of time can be set as needed and/or desired. Non-overlapping splits of the sight-reading data and the performance data are used for training, validation, and testing, whereas the Bergmüller data is kept exclusively for testing. In each set of data, regions containing conspicuous errors were annotated. One example of an annotation procedure will be described herein. Alternative annotation procedures can be used as needed and/or desired.
In the illustrated example, preferably at least two annotators are used. The first annotator is a person who is an experienced classical piano teacher. The second annotator is a person who is training in music production and is also an intermediate level pianist. The first annotator labeled the sight-reading data and the Bergmüller data, while the second annotator labeled the performance data. In each case, the first annotator and the second annotator indicate (yes/no) whether they know the piece being performed. For the sight-reading data and the performance data, the first annotator and the second annotator are given instructions to annotate obvious performance mistakes (referred to herein as conspicuous errors) that can be recognized even without checking the score, and it is left up to the first annotator and the second annotator to decide what is a conspicuous error. The annotation of the MIDI data can be done with music production software such as Cubase. The first annotator and the second annotator add an annotation at MIDI note 0 covering the span of a time window which they judge as pertaining to an error.
The Bergmüller data was treated differently because it was played from known musical score data. First, the performances were automatically annotated with sites of score deviations using a score alignment system. Then, the first annotator manually reviewed the labels by listening to the performance while looking at the corresponding sheet music, and added missing deviations from the score or removed those which did not reflect errors. The first annotator simultaneously labeled each error manually as conspicuous or not. Although the ratio of annotated regions to total performance time may be small in the Bergmüller data, this annotation approach allows investigation of the relationship between the set of errors obtained by comparing with a score (presumably all errors) and the set of conspicuous errors.
In this methodology, some types of errors may be labeled more consistently than others. The more common errors include insertions and deletions of notes that do not fit the musical context, abrupt pauses, and unstable rhythm coming from hesitations during playing. The common errors will be annotated with reasonable consistency in terms of label location and span when the common errors are relatively short and local, after which the player recovers into their playing flow. However, more compound deviations were labeled ambiguously. For example, sometimes after an error a player would “sneak in” some practice before resuming the flow of the piece. In such examples, if the short phrase being practiced sounds out of context, but in itself is coherent, an open question is where the label should be, and whether it should be one continuous label or an intermittent one. Moreover, unannotated conspicuous mistakes may be present in the data, because there is an inherent ambiguity in how one would distinguish between a “bad but acceptable” performance and an “erroneous” performance. If a region contrasts with the annotator's expectation of the music given how that performer is playing, then it will be annotated. This opens the possibility that the annotators have calibrated what should count as a mistake based on the individual performance. Silence regions are one of the main sources of ambiguity, since silences between correct portions are unannotated regardless of their length, but silences within or surrounding mistake portions often receive a mistake label.
Since annotations of actual musical performances are time consuming to produce and difficult to obtain, two pre-training methods to overcome data scarcity are proposed. The first pre-training method trains a part of the model as an autoencoder, while the second pre-training method uses synthetic data with procedurally generated errors. Experimental evaluation shows that the temporal convolutional network 40 performs at an F-measure of 78% without pretraining on the sight-reading data. However, the proposed pretraining methods improve the F-measure on the performance data and the Bergmüller data to the extent of approaching that of conspicuous error labels by a human annotator.
First, the pre-training method of using an autoencoder 44 will be described. The autoencoder 44 is used to train the feature extractor in an unsupervised manner by using a collection of piano performances. The autoencoder 44 comprises an encoder 46 and a decoder 48. Specifically, in the illustrated example, the feature extraction portion of the temporal convolutional network 40 is used as the encoder 46, and a temporal convolutional network with transposed 1d convolutions instead of 1d convolutions is used as the decoder 48. MIDI data of musical performances of unknown performance qualities are input into the encoder 46. The encoder 46 learns how to interpret the input (e.g., piano roll MIDI data) and compress it to an internal representation defined by the bottleneck layer, and the decoder 48 takes the output of the encoder 46 (the bottleneck layer) and attempts to recreate the input from the compressed version. Once the autoencoder 44 is trained, the decoder 48 is discarded and only the encoder 46 is kept and used to compress examples of input to vectors output by the bottleneck layer. The encoder 46 can then be used as a data preparation technique to perform feature extraction on raw data that is used to train the machine learning model of the temporal convolutional network 40. In this way, the feature space ϕ is pre-trained so as to model the space of piano performances within a given receptive field of the temporal convolutional network 40. This method could be useful if a large data set of performances of unknown performance qualities is obtainable.
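To make the pre-training concrete, the following is a minimal sketch of the autoencoder idea in PyTorch. The layer sizes, dilation pattern, and training step are illustrative assumptions rather than the actual architecture of the temporal convolutional network 40; only the overall encoder/decoder arrangement with 1d and transposed 1d convolutions follows the description above.

import torch
import torch.nn as nn

class TCNEncoder(nn.Module):
    """Dilated 1d-convolutional feature extractor (encoder 46)."""
    def __init__(self, in_ch=256, hidden=128, n_layers=4):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(n_layers):
            # padding = dilation keeps the output length equal to T
            layers += [nn.Conv1d(ch, hidden, kernel_size=3,
                                 padding=2 ** i, dilation=2 ** i),
                       nn.ReLU()]
            ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, 256, T) piano rolls
        return self.net(x)         # (batch, hidden, T) features

class TCNDecoder(nn.Module):
    """Mirror of the encoder using transposed 1d convolutions (decoder 48)."""
    def __init__(self, out_ch=256, hidden=128, n_layers=4):
        super().__init__()
        layers = []
        for i in reversed(range(n_layers)):
            last = (i == 0)
            layers.append(nn.ConvTranspose1d(hidden, out_ch if last else hidden,
                                             kernel_size=3,
                                             padding=2 ** i, dilation=2 ** i))
            if not last:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

encoder, decoder = TCNEncoder(), TCNDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
x = torch.rand(8, 256, 512)               # stand-in batch of piano roll segments
recon = decoder(encoder(x))               # decoder tries to recreate the input
loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
loss.backward()
opt.step()
# After training, the decoder is discarded and the encoder initializes
# the feature extractor of the temporal convolutional network 40.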
Now, the pre-training method of using synthetic data with procedurally generated errors will be described. Here, the machine learning model is pre-trained on performance data onto which errors are simulated and corresponding error labels are inserted to match the expected format of data that would otherwise be obtained by human annotators listening to musical performances and annotating them. Specifically, systematic adjustments are applied to a set of mistake-free performances, modifying the note events in a manner inspired by performance mistakes made by beginning adult pianists. For example, the mistake-free performances can be commercial MIDI piano data containing mostly jazz and classical piano MIDI performances. For each note event, with probability pc, the note is modified in one of the following ways:
In this example, the probabilities are set as follows: pc=5%, po=10%, pi=39%, pr=39%, ps=2%, and pp=10%. Furthermore, for note replacement and insertion, the pitch offset n is chosen so that n=1, 2 are selected with probabilities of 22% each and n=4, 6 with probabilities of 2% each. This method is useful if many performances that are known to be relatively error-free are obtainable. Furthermore, this method can be used for data augmentation, since not all synthetic errors sound conspicuous. A sketch of this procedure is given below.
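The following is a minimal sketch of the procedural error generation, assuming note events are given as (start, end, pitch, velocity) tuples. Since the full list of modification types is not reproduced above, the modification behaviors below, and the meanings assumed for po, ps, and pp, are hypothetical placeholders; only the sampling probabilities and pitch offsets follow the stated values.

import random

P_CORRUPT = 0.05                  # pc: chance a given note event is modified
MOD_PROBS = {"octave": 0.10,      # po (assumed: octave displacement)
             "insert": 0.39,      # pi: insert an extra nearby note
             "replace": 0.39,     # pr: replace the pitch with a nearby one
             "skip": 0.02,        # ps (assumed: drop the note)
             "pause": 0.10}       # pp (assumed: hesitate before the note)

def sample_offset():
    # Replacement/insertion offsets: n = 1, 2 at 22% each; n = 4, 6 at 2% each.
    n = random.choices([1, 2, 4, 6], weights=[0.22, 0.22, 0.02, 0.02])[0]
    return random.choice([-n, n])

def corrupt(notes):
    out, labels = [], []
    for (s, e, p, v) in notes:
        if random.random() >= P_CORRUPT:
            out.append((s, e, p, v))            # note left untouched
            continue
        kind = random.choices(list(MOD_PROBS), weights=list(MOD_PROBS.values()))[0]
        labels.append((s, e, kind))             # error label in annotation format
        if kind == "replace":
            out.append((s, e, p + sample_offset(), v))
        elif kind == "insert":
            out.append((s, e, p, v))
            out.append((s, e, p + sample_offset(), v))
        elif kind == "skip":
            pass                                # note is deleted
        else:                                   # "octave", "pause": placeholders
            out.append((s, e, p, v))
    return out, labels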
As mentioned above and as seen in
As mentioned above and as seen in
In this example of
Also, in this example of
Each of these training methods was evaluated to create trained machine learning models. The trained machine learning models were then validated on the sight-reading data and the performance data, and evaluated on a test split of the sight-reading data and the performance data, and on the entire Bergmüller data. As the metrics in these tests, the transcription precision, recall, and F-measure were evaluated using mir_eval, treating the estimated and the ground-truth annotations as note events occurring at a predefined pitch. When computing the transcription metrics, the note onset and offset tolerances were set to 2 seconds. Furthermore, based on the validation set, the ends of the estimated segments were padded by 0.2 seconds and overlapping segments were merged. The test results of the trained machine learning models are set forth in Tables 1, 2 and 3.
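The following is a minimal sketch of this metric computation using the mir_eval package. Annotated regions are treated as note events at an arbitrary fixed pitch; the 0.2-second padding, overlap merging, and 2-second tolerances follow the text, while the exact mapping onto mir_eval parameters is an assumption.

import numpy as np
import mir_eval

def merge_and_pad(segments, pad=0.2):
    """Pad segment ends by `pad` seconds and merge overlapping segments."""
    merged = []
    for s, e in sorted((s, e + pad) for s, e in segments):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return np.array(merged)

def segment_scores(ref_segments, est_segments):
    ref = np.asarray(ref_segments, dtype=float)
    est = merge_and_pad(est_segments)
    pitch = 440.0                    # placeholder pitch in Hz for all events
    p, r, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref, np.full(len(ref), pitch),
        est, np.full(len(est), pitch),
        onset_tolerance=2.0,         # 2 s onset tolerance
        offset_ratio=0.0,            # make the offset tolerance absolute
        offset_min_tolerance=2.0)    # 2 s offset tolerance
    return p, r, f1

# Example: one reference error region and one slightly shifted estimate.
print(segment_scores([(10.0, 12.5)], [(10.8, 12.0)]))   # -> (1.0, 1.0, 1.0)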
For the performance dataset and the Bergmüller dataset, the augmentation strategies offer some improvements. The two training methods proposed, i.e., the use of synthetic data and the autoencoder 44, also result in improvements. In general, both training methods tend to improve the recall rate, suggesting that they provide similar qualitative improvements, and either one can be used depending on the data available.
As seen in Tables 2 and 3, despite the augmentation strategies, the F-measures for the performance data and the Bergmüller data were relatively low compared to those for the sight-reading data, even taking into account the ambiguities of conspicuous errors. The differences in the F-measure between the sight-reading dataset and the other two datasets show that the performance data and the Bergmüller data are more difficult to infer.
As another example, the validation F-measure of the trained machine learning models on the synthetic dataset is about 60%. This suggests that the model is moderately capable of pinpointing the ground-truth labels when they are easy to classify, or are generated stochastically but systematically. At the same time, however, with a best F-measure of 38% on the performance dataset, the trained machine learning model falls short of the target F-measure of 43%.
The trained machine learning model for the sight-reading data performs the best. This is likely due to most of the mistakes being quite conspicuous in a sight-reading situation, especially compared to the performance data and the Bergmüller data, both of which contain mostly beginner-intermediate performances with occasional mistakes. The performance of the trained machine learning model tends to drop as more pretraining steps are added. This likely occurs because the pretraining data mostly contain data of the same type as the performance dataset, increasing the disparity between the training data and the test data. In sight-reading situations, the results suggest that it is sufficient simply to train on a dataset that solely contains data from the same setting, instead of pretraining or augmenting the dataset with typical amateur performances containing some conspicuous errors.
The above methods used in the learning model tend to capture repetition, pauses, hesitations, and note insertions that occur in narrow pitch intervals as mistakes. At the same time, however, the very same properties arising from musical expression or composition are detected as false positives, such as repeated motifs, ornaments, and grand pauses. Even though such musical aspects are superficially similar to the aforementioned mistakes when performed, humans are capable of differentiating between genuine performance mistakes and similar events within musical contexts. Thus, the learning model can be improved by modeling the underlying composition better. Accordingly, to define manifestations of conspicuous errors, a midpoint is preferably found between a rule-based approach and one learned from empirical labels. The outcome is preferably a set of error descriptions, some of which happen at particular time instants and some over longer windows, whether continuous windows or a longer span of intermittent labels. However, since the conspicuousness of errors is inspired by a perceptual idea, these errors are preferably defined through an empirical process. Also, synthetic data is helpful for improving performance. However, in certain cases, some synthesized mistakes may sound unnatural. For example, in the case of induced pitch insertions, it seems unlikely that a performer could play with such confidence and tempo despite the extent of out-of-context pitch insertions. It has been observed that beginners make errors and employ recovery strategies that are more complex than synthetic data that merely creates simple errors such as repetitions, pauses, hesitations, and note insertions. Thus, it is desirable to study beginning pianists' behavior to create synthetic data having more natural sounding errors.
Referring to
In particular, here, the digital audio workstation 34 outputs data representing at least one piano roll from the musical performance played on the digital piano 12. Preferably, the digital audio workstation 34 outputs data representing a first piano roll 50 and a second piano roll 52 to the temporal convolutional network 40. In other words, in the illustrated embodiment, two piano rolls are extracted for a given sequence of piano note events in the musical performance. The first piano roll 50 is for the note onsets, while the second piano roll 52 is for the sustained portions according to the key depression. Specifically, suppose a set of I MIDI note events (start time, end time, pitch, velocity), given as {(sᵢ, eᵢ, pᵢ, vᵢ)} for i = 1, …, I, and a sampling rate R are given. Then, a 256-dimensional piano roll X ∈ ℝ^(256×T) is computed, such that X(pᵢ, round(R·sᵢ)) = vᵢ, and X(128+pᵢ, round(R·s)) = vᵢ for s ∈ [sᵢ, eᵢ]. Then, for example, a Python package such as Partitura is used for the computation, and R is set to 16 Hz. Partitura can load musical scores (in MEI, MusicXML, Kern, and MIDI formats), MIDI performances, and score-to-performance alignments. The package includes some tools for music analysis, such as automatic pitch spelling, key signature identification, and voice separation. Also, the sustain pedal information is ignored in the computation of the second piano roll 52. It is preferable to ignore the sustain pedal information to prevent the second piano roll 52 of the sustained portion from smearing, since a beginning pianist has a tendency to keep the pedal depressed, which causes an excessive elongation of the computed note durations.
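The computation above can be sketched as follows, assuming note events are given as (start, end, pitch, velocity) tuples; plain NumPy is used here for illustration in place of the Partitura-based pipeline.

import numpy as np

def piano_rolls(notes, rate=16.0):
    """Compute the 256-dimensional piano roll X at sampling rate R = 16 Hz.

    Rows 0-127 hold the onset roll (first piano roll 50); rows 128-255 hold
    the sustain roll (second piano roll 52). Sustain pedal data is ignored.
    """
    T = int(round(rate * max(e for _, e, _, _ in notes))) + 1
    X = np.zeros((256, T))
    for s, e, p, v in notes:
        X[p, int(round(rate * s))] = v                  # onset frame
        lo, hi = int(round(rate * s)), int(round(rate * e))
        X[128 + p, lo:hi + 1] = v                       # key held down
    return X

# Example: middle C held for one second, then E for half a second.
X = piano_rolls([(0.0, 1.0, 60, 80), (1.0, 1.5, 64, 75)])
print(X.shape)          # (256, 25)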
Referring to
Basically, the evaluation process of the flowchart in
Still referring to
Then in step S2 of the evaluation process, the processor 30 creates musical performance data that is indicative of the notes played by the instrument. Here, the processor 30 computes the first piano roll 50 and the second piano roll 52, which were mentioned above, using the digital audio workstation 34. Of course, the processor 30 can create the musical performance data in other formats as needed and/or desired.
Next, in step S3 of the evaluation process, the processor 30 identifies any errors in the musical performance using the temporal convolutional network 40. Basically, the temporal convolutional network 40 compares the musical performance data (e.g., the first piano roll 50 and the second piano roll 52) against the reference data stored in the computer-readable storage device 32 or the remote server 28 (e.g., a network server or a cloud server). In this way, the errors in the musical performance are identified.
Next, in step S4 of the evaluation process, the processor 30 determines if there are any conspicuous errors in the musical performance using the classifier head 42. More specifically, the errors in the musical performance that are identified by the temporal convolutional network 40 are inputted to the classifier head 42. The classifier head 42 then determines a conspicuous error probability for each error using the reference data. Then, the conspicuous error probability for each error is compared to a conspicuous error threshold value for classifying each error as either conspicuous or inconspicuous. In other words, if the conspicuous error probability is equal to or above the conspicuous error threshold value, then the error is determined to be a conspicuous error. On the other hand, if the conspicuous error probability is below the conspicuous error threshold value, then the error is determined to be an inconspicuous error. The conspicuous error threshold value can be a single value for all types of errors or can be a different value for different types of errors. For example, in the case where the classifier head 42 determines that the conspicuous error probability is 50% for a particular error and the conspicuous error threshold value is set to 39% for that type of error, then the error is determined to be a conspicuous error.
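The thresholding in step S4 can be sketched as follows, assuming the classifier head 42 yields one probability per detected error; the error-type names and the per-type threshold table are illustrative assumptions, with the 39% value taken from the example above.

DEFAULT_THRESHOLD = 0.5
THRESHOLDS = {"insertion": 0.39}      # example per-type threshold from the text

def classify_errors(detected):
    """detected: list of (start, end, error_type, probability) tuples."""
    results = []
    for start, end, etype, prob in detected:
        thr = THRESHOLDS.get(etype, DEFAULT_THRESHOLD)
        label = "conspicuous" if prob >= thr else "inconspicuous"
        results.append((start, end, etype, label))
    return results

# Example from the text: probability 50% against a 39% threshold.
print(classify_errors([(3.2, 4.1, "insertion", 0.50)]))
# -> [(3.2, 4.1, 'insertion', 'conspicuous')]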
Next, in step S5 of the evaluation process, the processor 30 outputs a notification of the conspicuous error, if any, in the musical performance. The notification for each segment of the musical performance can be output in real time (or nearly real time) or can be stored for output at a later time. The notification from the processor 30 can be conveyed to the performer using a notification device such as the display 16 and the external speaker 22.
After step S5 of the evaluation process, the processor 30 returns to step S1 to repeat the evaluation process on the next segment of the musical performance. The evaluation process ends once the performer or another person stops the recording of the musical performance.
Once the first piano roll 50 and the second piano roll 52 for a given sequence of piano note events in the musical performance are inputted into the temporal convolutional network 40, the presence of errors is detected in the musical performance by comparing the first piano roll 50 and the second piano roll 52 against the reference data on which the temporal convolutional network 40 and the classifier head 42 have been pretrained.
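Putting the pieces together, the following is a minimal sketch of the inference path from the two piano rolls to a time sequence of binary conspicuous-error labels, assuming PyTorch modules stand in for the temporal convolutional network 40 and the classifier head 42; the frame-to-segment conversion shown is an illustrative assumption.

import numpy as np
import torch

def detect_conspicuous_errors(tcn, head, onset_roll, sustain_roll,
                              rate=16.0, threshold=0.5):
    """onset_roll, sustain_roll: (128, T) arrays (piano rolls 50 and 52).

    Returns a list of (start, end) times in seconds for regions whose
    frames are classified as containing conspicuous errors.
    """
    x = torch.as_tensor(np.concatenate([onset_roll, sustain_roll]),
                        dtype=torch.float32)[None]       # (1, 256, T)
    with torch.no_grad():
        probs = torch.sigmoid(head(tcn(x)))[0, 0]        # per-frame probability
    flags = (probs >= threshold).tolist()                # binary label sequence
    segments, start = [], None
    for t, f in enumerate(flags):
        if f and start is None:
            start = t                                    # segment opens
        elif not f and start is not None:
            segments.append((start / rate, t / rate))    # segment closes
            start = None
    if start is not None:
        segments.append((start / rate, len(flags) / rate))
    return segments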
Referring to
Referring to
Referring to
Referring to
Referring to
Referring now to
One example of the evaluation of the musical performance is illustrated in
Referring now to
Referring now to
In understanding the scope of the present invention, the term “comprising” and its derivatives, as used herein, are intended to be open ended terms that specify the presence of the stated features, elements, components, groups, integers, and/or steps, but do not exclude the presence of other unstated features, elements, components, groups, integers and/or steps. The foregoing also applies to words having similar meanings such as the terms, “including”, “having” and their derivatives.
Also, it will be understood that although the terms “first” and “second” may be used herein to describe various components, these components should not be limited by these terms. These terms are only used to distinguish one component from another. Thus, for example, a first component discussed above could be termed a second component and vice versa without departing from the teachings of the present invention.
While only selected embodiments have been chosen to illustrate the present invention, it will be apparent to those skilled in the art from this disclosure that various changes and modifications can be made herein without departing from the scope of the invention as defined in the appended claims. For example, unless specifically stated otherwise, the size, shape, location or orientation of the various components can be changed as needed and/or desired so long as the changes do not substantially affect their intended function. Unless specifically stated otherwise, components that are shown directly connected or contacting each other can have intermediate structures disposed between them so long as the changes do not substantially affect their intended function. The functions of one element can be performed by two, and vice versa unless specifically stated otherwise. The structures and functions of one embodiment can be adopted in another embodiment. It is not necessary for all advantages to be present in a particular embodiment at the same time. Every feature which is unique from the prior art, alone or in combination with other features, also should be considered a separate description of further inventions by the applicant, including the structural and/or functional concepts embodied by such feature(s). Thus, the foregoing descriptions of the embodiments according to the present invention are provided for illustration only, and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.