The present disclosure relates to a technique for analyzing performance by a user.
For example, in related art, there is proposed a technique of analyzing fingering when a user plays a musical instrument such as a keyboard instrument. For example, JP2007-241034A discloses a configuration for determining fingering from a note sequence of a musical composition by a probabilistic method using the Hidden Markov Model (HMM).
However, it is actually difficult to determine the fingering with high accuracy using only the note sequence. In consideration of the above circumstances, an object of one aspect of the present disclosure is to estimate fingering of a user with high accuracy.
The present disclosure provides a performance analysis method implemented by a computer system, the performance analysis method including: obtaining performance data representing performance by a performer on a musical instrument; obtaining a performance image of the performer's fingers playing the musical instrument; generating finger position data representing a position of each of the performer's fingers from the performance image; generating fingering data representing fingering in a performance by the performer based on the performance data and the finger position data; and displaying a performance image based on the generated fingering data.
The present disclosure provides a performance analysis system including: an image capturing device for capturing a performance image of a performer's fingers playing a musical instrument; a display device; a memory storing instructions; and a control device including at least one processor that implements the instructions to: obtain performance data representing performance of the performer playing the musical instrument; obtain the performance image; generate finger position data representing a position of each of the performer's fingers from the performance image; generate fingering data representing fingering in a performance by the performer based on the performance data and the finger position data; and display a performance image based on the generated fingering data.
The present disclosure provides a non-transitory computer-readable medium storing a program executable by a computer to execute a performance analysis method, the method including: obtaining performance data representing performance of a performer playing a musical instrument; obtaining a performance image of the performer's fingers playing the musical instrument; generating finger position data representing a position of each of the performer's fingers from the performance image; generating fingering data representing fingering in a performance by the performer based on the performance data and the finger position data; and displaying a performance image based on the generated fingering data.
The present disclosure will be described in detail based on the following figures.
The performance analysis system 100 is a computer system that analyzes the performance of the keyboard instrument 200 by the user. Specifically, the performance analysis system 100 analyzes fingering of the user. The fingering is how the user uses each finger of the left hand and the right hand in the performance of the keyboard instrument 200. That is, information as to which finger the user uses to operate each key 21 of the keyboard instrument 200 is analyzed as the fingering of the user.
The performance analysis system 100 includes a control device 11, a storage device 12, an operation device 13, a display device 14 and an image capturing device 15. The performance analysis system 100 is implemented by, for example, a portable information device such as a smart phone or a tablet terminal, or a portable or stationary information device such as a personal computer. The performance analysis system 100 may be implemented as a single device, or as a plurality of devices configured separately from each other. The performance analysis system 100 may also be installed in the keyboard instrument 200.
The control device 11 includes one or more processors that control each element of the performance analysis system 100. For example, the control device 11 is implemented by one or more types of processors such as a central processing unit (CPU), a sound processing unit (SPU), a digital signal processor (DSP), a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.
The storage device 12 includes one or more memories that store programs executed by the control device 11 and various types of data used by the control device 11. The storage device 12 may be implemented by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media. As the storage device 12, a portable recording medium that can be attached to and detached from the performance analysis system 100, or a recording medium (for example, a cloud storage) that the control device 11 can write to and read from via a communication network such as the Internet, may be used.
The operation device 13 is an input device that receives an instruction from the user. The operation device 13 includes, for example, an operator operated by the user or a touch panel that detects contact by the user. The operation device 13 (for example, a mouse or a keyboard), which is separated from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
The display device 14 displays images under control of the control device 11. For example, various display panels such as a liquid crystal display panel or an organic EL (Electroluminescence) panel are used as the display device 14. The display device 14, which is separated from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
The image capturing device 15 is an image input device that generates a time series of image data D1 by capturing an image of a subject. The time series of the image data D1 is moving image data representing moving images. For example, the image capturing device 15 includes an optical system such as an imaging lens, an imaging element for receiving incident light from the optical system, and a processing circuit for generating the image data D1 in accordance with an amount of light received by the imaging element. The image capturing device 15, which is separated from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
The user adjusts a position or an angle of the image capturing device 15 with respect to the keyboard instrument 200 so that an image capturing condition recommended by a provider of the performance analysis system 100 is achieved. Specifically, the image capturing device 15 is disposed above the keyboard instrument 200 and captures images of the keyboard 22 of the keyboard instrument 200 and the left hand and the right hand of the user. Therefore, the performance image G1 represented by the image data D1 includes a keyboard image g1 representing the keyboard 22 of the keyboard instrument 200 and a finger image g2 representing the left hand and the right hand of the user.
The display control unit 40 causes the display device 14 to display various images. For example, the display control unit 40 causes the display device 14 to display an image 61 indicating a result of analysis by the performance analysis unit 30. As used herein, the image 61 may also be referred to as an “analysis screen”.
The analysis screen 61 includes a note image 611 for each note played by the user. In the note image 611 of each note, a code 612 corresponding to the finger number k specified for the note by the fingering data Q is arranged. As used herein, the code 612 may also be referred to as a "fingering code". The letter "L" in the fingering code 612 means the left hand, and the letter "R" in the fingering code 612 means the right hand. The number in the fingering code 612 means a corresponding finger. Specifically, a number "1" in the fingering code 612 means the thumb, a number "2" means the index finger, a number "3" means the middle finger, a number "4" means the ring finger, and a number "5" means the little finger. Therefore, for example, the fingering code 612 "R2" refers to the index finger of the right hand, and the fingering code 612 "L4" refers to the ring finger of the left hand. The note image 611 and the fingering code 612 are displayed in different modes (for example, different hues or different gradations) for the right hand and the left hand. The display control unit 40 causes the display device 14 to display the analysis screen 61.
Among the plurality of note images 611 in the analysis screen 61, the note image 611 of a note with low reliability in an estimation result of the finger number k is displayed in a manner (for example, a dashed frame line) different from a normal note image 611, and a specific code, such as “??”, is displayed to indicate that the estimation result of the finger number k is invalid.
The control device 11 functions as the performance analysis unit 30 and the display control unit 40 by executing the programs stored in the storage device 12. The performance analysis unit 30 analyzes the fingering of the user, and includes a finger position data generation unit 31 and a fingering data generation unit 32.
A: Finger Position Data Generation Unit 31
The finger position data generation unit 31 includes an image extraction unit 311, a matrix generation unit 312, a finger position estimation unit 313 and a projective transformation unit 314.
Finger Position Estimation Unit 313
The finger position estimation unit 313 estimates the position c[h, f] of each finger of the left hand and the right hand of the user by analyzing the performance image G1 represented by the image data D1. The position c[h, f] of each finger is a position of each fingertip in an x-y coordinate system set in the performance image G1. The position c[h, f] is expressed by a combination (x[h, f], y[h, f]) of a coordinate x[h, f] on an x-axis and a coordinate y[h, f] on a y-axis in the x-y coordinate system of the performance image G1. A positive direction of the x-axis corresponds to a right direction of the keyboard 22 (a direction from low tones to high tones), and a negative direction of the x-axis corresponds to a left direction of the keyboard 22 (a direction from high tones to low tones). The symbol h is a variable indicating either the left hand or the right hand (h=1, 2). Specifically, the numerical value "1" of the variable h means the left hand, and the numerical value "2" of the variable h means the right hand. The variable f is the number of each finger in each of the left hand and the right hand (f=1 to 5). The number "1" of the variable f means the thumb, the number "2" means the index finger, the number "3" means the middle finger, the number "4" means the ring finger, and the number "5" means the little finger. Therefore, for example, the position c[1, 2] means the position of the index finger of the left hand of the user.
The image analysis processing Sa1 is processing of estimating the position c[h, f] of each finger on one of the left hand and the right hand of the user and the position c[h, f] of each finger on the other of the left hand and the right hand of the user by analyzing the performance image G1. As used herein, the one of the left hand and the right hand may also be referred to as a "first hand" and the other thereof may also be referred to as a "second hand". Specifically, the finger position estimation unit 313 estimates the positions c[h, 1] to c[h, 5] of the fingers of the first hand and the positions c[h, 1] to c[h, 5] of the fingers of the second hand through image recognition processing that estimates a skeleton or joints of the user from the image. For the image analysis processing Sa1, known image recognition processing such as MediaPipe or OpenPose may be used. When no fingertip is detected from the performance image G1, the coordinate x[h, f] of the fingertip on the x-axis is set to an invalid value such as "0".
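As a concrete illustration of the image analysis processing Sa1, the following sketch extracts fingertip coordinates from a single frame with the MediaPipe Hands API (the mediapipe Python package); the function name, the fixed detector settings, and the conversion of normalized landmarks to pixel coordinates are illustrative assumptions rather than the actual implementation of the finger position estimation unit 313.

    import cv2
    import mediapipe as mp

    # Indices of the fingertip landmarks in the MediaPipe Hands model
    # (thumb, index finger, middle finger, ring finger, little finger).
    FINGERTIP_LANDMARKS = [4, 8, 12, 16, 20]

    def estimate_fingertip_positions(frame_bgr):
        """Return a list of detected hands; each hand is a list of five (x, y)
        fingertip coordinates in the pixel coordinate system of the image."""
        hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        detected = []
        if result.multi_hand_landmarks:
            height, width = frame_bgr.shape[:2]
            for hand_landmarks in result.multi_hand_landmarks:
                tips = []
                for index in FINGERTIP_LANDMARKS:
                    landmark = hand_landmarks.landmark[index]
                    # Landmarks are normalized to [0, 1]; convert to pixels.
                    tips.append((landmark.x * width, landmark.y * height))
                detected.append(tips)
        hands.close()
        return detected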
In the image analysis processing Sa1, the positions c[h, 1] to c[h, 5] of the fingers of the first hand and the positions c[h, 1] to c[h, 5] of the fingers of the second hand of the user are estimated, but the image analysis processing Sa1 alone cannot specify which of the first hand and the second hand corresponds to the left hand or the right hand of the user. Since the right arm and the left arm of the user may cross during the performance of the keyboard instrument 200, it is not appropriate to determine the left hand or the right hand only from the coordinate x[h, f] of each position c[h, f] estimated by the image analysis processing Sa1. If an image of a portion including the arms and body of the user were captured by the image capturing device 15, the left hand and the right hand of the user could be distinguished in the performance image G1 based on the coordinates of the shoulders and arms of the user. However, this approach requires the image capturing device 15 to capture a wider range and increases the processing load of the image analysis processing Sa1.
In consideration of the above circumstances, the finger position estimation unit 313 of the first embodiment executes the left-right determination processing Sa2. In the left-right determination processing Sa2, the finger position estimation unit 313 first calculates a determination index γ[h] for each of the first hand and the second hand by the calculation of Equation (1) (Sa21).
The symbol μ[h] in Equation (1) is a mean value (for example, simple mean) of the coordinates x[h, 1] to x[h, 5] of the five fingers of each of the first hand and the second hand. As can be understood from Equation (1), when the coordinate x[h, f] decreases from the thumb to the little finger (left hand), the determination index γ[h] is a negative number, and when the coordinate x[h, f] increases from the thumb to the little finger (right hand), the determination index γ[h] is a positive number. Therefore, the finger position estimation unit 313 determines that the hand, of the first hand and the second hand, having a negative determination index γ[h] is the left hand, and sets the variable h to the numerical value “1” (Sa22). The finger position estimation unit 313 determines that the hand, of the first hand and the second hand, having a positive determination index γ[h] is the right hand, and sets the variable h to the numerical value “2” (Sa23). According to the left-right determination processing Sa2 described above, the position c[h, f] of each finger of the user can be distinguished between the right hand and the left hand by simple processing using a relation between the position of the thumb and the position of the little finger.
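Equation (1) itself is not reproduced in this excerpt. The sketch below uses one plausible determination index that behaves as described, namely a covariance-like value between the finger number f and the coordinate x[h, f] that is negative when the coordinates decrease from the thumb toward the little finger and positive when they increase; the exact form of Equation (1) may differ.

    def classify_left_right(x_first_hand, x_second_hand):
        """Each argument holds the x coordinates [x1, ..., x5] of the thumb to the
        little finger of one detected hand. Returns a dict mapping h (1 = left hand,
        2 = right hand) to the corresponding coordinate list."""
        def determination_index(x):
            mu = sum(x) / len(x)  # mean of the five coordinates (cf. the symbol mu[h])
            # Negative when x decreases from thumb to little finger, positive otherwise.
            return sum((f - 3) * (xf - mu) for f, xf in enumerate(x, start=1))

        hands = {}
        for x in (x_first_hand, x_second_hand):
            gamma = determination_index(x)
            hands[1 if gamma < 0 else 2] = x  # Sa22: negative -> left; Sa23: positive -> right
        return hands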
The position c[h, f] of each finger of the user is estimated for each unit period by the image analysis processing Sa1 and the left-right determination processing Sa2. However, the position c[h, f] may not be properly estimated due to various circumstances such as noise existing in the performance image G1. Therefore, when the position c[h, f] is missing in a specific unit period (hereinafter referred to as “missing period”), the finger position estimation unit 313 calculates the position c[h, f] in the missing period by the interpolation processing Sa3 using the positions c[h, f] in the unit periods before and after the missing period. For example, when the position c[h, f] is missing in a central unit period (missing period) among three consecutive unit periods on the time axis, a mean of the position c[h, f] in the unit period immediately before the missing period and the position c[h, f] in the unit period immediately after that is calculated as the position in the missing period.
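A minimal sketch of the interpolation processing Sa3 described above, assuming that a missing position in the time series is represented by None:

    def interpolate_missing_positions(positions):
        """positions: time series of (x, y) fingertip positions per unit period,
        with None for a missing period. A missing entry whose neighbors are both
        valid is replaced by the mean of the neighboring positions."""
        filled = list(positions)
        for t in range(1, len(filled) - 1):
            if filled[t] is None and filled[t - 1] is not None and filled[t + 1] is not None:
                previous_pos, next_pos = filled[t - 1], filled[t + 1]
                filled[t] = ((previous_pos[0] + next_pos[0]) / 2.0,
                             (previous_pos[1] + next_pos[1]) / 2.0)
        return filled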
Image Extraction Unit 311
As described above, the performance image G1 includes the keyboard image g1 and the finger image g2. The image extraction unit 311 extracts a specific area B, which includes the keyboard image g1 and the finger image g2, from the performance image G1, thereby generating a performance image G2 corresponding to the specific area B (image extraction processing). The image extraction processing includes area estimation processing Sb1 and area extraction processing Sb2.
The area estimation processing Sb1 is processing of estimating the specific area B for the performance image G1 represented by the image data D1. Specifically, the image extraction unit 311 generates an image processing mask M indicating the specific area B from the image data D1 by the area estimation processing Sb1. The image processing mask M is a mask in which elements corresponding to the specific area B are set to the numerical value "1" and elements corresponding to the area other than the specific area B are set to the numerical value "0".
The image extraction unit 311 generates the image processing mask M by inputting the image data D1 into an estimation model 51. The estimation model 51 is a model established in advance by machine learning, for example, a deep neural network such as a convolutional neural network (CNN).
Additional elements such as long short-term memory (LSTM) may also be included in the estimation model 51.
A plurality of pieces of learning data T is used for the machine learning of the estimation model 51. Each of the plurality of pieces of learning data T is a combination of image data Dt for learning and image processing mask Mt for learning. The image data Dt represents an already-captured image including the keyboard image g1 of the keyboard instrument and an image around the keyboard instrument. A model of the keyboard instrument and the image capturing condition (for example, the image capturing range or the image capturing direction) differ for each piece of image data Dt. That is, the image data Dt is prepared in advance by capturing an image of each of a plurality of types of keyboard instruments under different image capturing conditions. The image data Dt may be prepared by a known image synthesizing technique. The image processing mask Mt of each piece of learning data T is a mask indicating the specific area B in the already-captured image represented by the image data Dt of the learning data T. Specifically, elements in an area corresponding to the specific area B in the image processing mask Mt are set to the numerical value “1”, and elements in an area other than the specific area B are set to the numerical value “0”. That is, the image processing mask Mt means a correct answer that the estimation model 51 is to output in response to input of the image data Dt.
The machine learning system 900 calculates an error function representing an error between the image processing mask M output by an initial or provisional model 51a in response to input of the image data Dt of each piece of learning data T and the image processing mask Mt of the learning data T. As used herein, the model 51a may also be referred to as a "provisional model". The machine learning system 900 then updates a plurality of variables of the provisional model 51a so that the error function is reduced. The provisional model 51a obtained after the above processing is repeated for each of the plurality of pieces of learning data T is determined as the estimation model 51. Therefore, the estimation model 51 can output a statistically valid image processing mask M for image data D1 to be captured in the future, in accordance with a latent relation between the image data Dt and the image processing mask Mt in the plurality of pieces of learning data T. That is, the estimation model 51 is a trained model that learns the relation between the image data Dt and the image processing mask Mt.
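The architecture of the estimation model 51 and the exact error function are not specified beyond the description above. The following PyTorch sketch illustrates the general training procedure under arbitrary assumptions: a small fully convolutional network is trained with a per-pixel binary cross-entropy error between its output and the image processing mask Mt of each piece of learning data T.

    import torch
    import torch.nn as nn

    class MaskEstimationModel(nn.Module):
        """Toy fully convolutional model: image data in, one-channel mask logits out."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, kernel_size=1),
            )

        def forward(self, x):
            return self.net(x)

    def train_estimation_model(model, loader, epochs=10, learning_rate=1e-3):
        """loader yields (image_dt, mask_mt) pairs, where mask_mt holds 1 inside the
        specific area B and 0 elsewhere (the correct answer of each piece of learning data)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        criterion = nn.BCEWithLogitsLoss()  # error between the output mask and the mask Mt
        for _ in range(epochs):
            for image_dt, mask_mt in loader:
                optimizer.zero_grad()
                loss = criterion(model(image_dt), mask_mt)
                loss.backward()
                optimizer.step()  # update the variables of the provisional model
        return model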
As described above, in the first embodiment, the image processing mask M indicating the specific area B is generated by inputting the image data D1 of the performance image G1 into the machine-learned estimation model 51. Therefore, the specific area B can be specified with high accuracy for various performance images G1 to be captured in the future.
The area extraction processing Sb2 is processing of generating the performance image G2 by extracting, using the image processing mask M, the specific area B from the performance image G1.
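A minimal sketch of the area extraction processing Sb2, assuming that the image processing mask M is a binary array with the same height and width as the performance image G1; cropping to the bounding box of the specific area B is an illustrative assumption.

    import numpy as np

    def extract_specific_area(performance_image_g1, mask_m):
        """Keep only the pixels inside the specific area B indicated by mask_m
        (1 inside B, 0 outside) and crop to its bounding box."""
        masked = performance_image_g1 * mask_m[..., None]  # zero out pixels outside B
        ys, xs = np.nonzero(mask_m)
        if ys.size == 0:
            return masked  # no specific area detected; return the masked image as-is
        return masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]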
Projective Transformation Unit 314
The position c[h, f] of each finger estimated by the finger position estimation processing is a coordinate in the x-y coordinate system set in the performance image G1. The image capturing condition for the keyboard instrument 200 by the image capturing device 15 may differ depending on various circumstances such as the usage environment of the keyboard instrument 200. For example, the actual image capturing range or image capturing direction may deviate from an ideal image capturing condition. Therefore, the projective transformation unit 314 transforms the position c[h, f] in the x-y coordinate system, which depends on the image capturing condition, into a position C[h, f] in an X-Y coordinate system.
The X-Y coordinate system is set in a predetermined reference image Gref. The reference image Gref is an image of a keyboard of a standard keyboard instrument captured under an ideal image capturing condition. As used herein, the keyboard instrument appearing in the reference image Gref may also be referred to as a "reference instrument". Reference data Dref representing the reference image Gref is stored in the storage device 12.
The auxiliary data A is data specifying a combination of an area Rn of the reference image Gref and the pitch n corresponding to the key 21. The area Rn is an area in which each key 21 of the reference instrument exists. As used herein, the area Rn may also be referred to as a “unit area”. That is, the auxiliary data A can also be said to be data defining the unit area Rn corresponding to each pitch n in the reference image Gref.
In the transformation from the position c[h, f] in the x-y coordinate system to the position C[h, f] in the X-Y coordinate system, projective transformation using a transformation matrix W, as expressed by the following Equation (2), is used. The symbol X in Equation (2) means a coordinate on an X-axis, and the symbol Y means a coordinate on a Y-axis in the X-Y coordinate system. The symbol s is an adjustment value for matching the scale between the x-y coordinate system and the X-Y coordinate system.
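Equation (2) itself is not reproduced in this excerpt. As an illustration of the transformation it describes, the sketch below applies a generic 3×3 projective transformation matrix W to a fingertip position in homogeneous coordinates and recovers the adjustment value s from the third coordinate; this is one common formulation consistent with the surrounding description.

    import numpy as np

    def transform_position(w_matrix, position_xy):
        """Map a position c[h, f] = (x, y) in the performance image into the
        X-Y coordinate system of the reference image Gref using a 3x3 matrix W."""
        x, y = position_xy
        X, Y, s = w_matrix @ np.array([x, y, 1.0])  # homogeneous coordinates
        return (X / s, Y / s)  # divide by the scale adjustment value s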
Matrix Generation Unit 312
The matrix generation unit 312 generates the transformation matrix W used for the projective transformation from the x-y coordinate system into the X-Y coordinate system, through matrix generation processing.
The matrix generation processing includes initialization processing Sc1 and matrix updating processing Sc2. The initialization processing Sc1 is processing of setting an initial matrix W0, which is an initial setting of the transformation matrix W. Details of the initialization processing Sc1 will be described later.
The matrix updating processing Sc2 is processing of generating a transformation matrix W by iteratively updating the initial matrix W0. That is, the projective transformation unit 314 iteratively updates the initial matrix W0 to generate the transformation matrix W such that the keyboard image g1 of the performance image G2 approximates the reference image Gref by projective transformation using the transformation matrix W. For example, the transformation matrix W is generated so that a coordinate X/s on the X-axis of a specific point in the reference image Gref approximates or matches a coordinate x on the x-axis of a point corresponding to the point in the keyboard image g1, and a coordinate Y/s on the Y axis of a specific point in the reference image Gref approximates or matches a coordinate y on the y axis of a point corresponding to the point in the keyboard image g1. That is, the transformation matrix W is generated so that a coordinate of the key 21 corresponding to a specific pitch in the keyboard image g1 is transformed into a coordinate of the key 21 corresponding to the pitch in the reference image Gref by the projective transformation to which the transformation matrix W is applied. An element (matrix generation unit 312) for generating the transformation matrix W is implemented by the control device 11 executing the matrix updating processing Sc2 illustrated above.
One possible approach for the matrix updating processing Sc2 is processing (such as the Scale-Invariant Feature Transform (SIFT)) of updating the transformation matrix W so that an image feature amount of the reference image Gref and that of the keyboard image g1 approximate each other. However, since the keyboard image g1 contains a repeated pattern in which the plurality of keys 21 are arranged in a similar manner, there is a possibility that the transformation matrix W cannot be properly estimated by the approach using the image feature amount.
Considering the above circumstances, in the matrix updating processing Sc2, the matrix generation unit 312 of the first embodiment iteratively updates the initial matrix W0 so as to increase (ideally maximize) an enhanced correlation coefficient (ECC) between the reference image Gref and the keyboard image g1. According to the present embodiment, as compared with the above-described configuration using the image feature amount, it is possible to generate an appropriate transformation matrix W capable of approximating the keyboard image g1 to the reference image Gref with high accuracy. The generation of the transformation matrix W using the enhanced correlation coefficient is also disclosed in Georgios D. Evangelidis and Emmanouil Z. Psarakis, “Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization”, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 10, October 2008. As described above, the enhanced correlation coefficient is suitable for generating the transformation matrix W used for the transformation of the keyboard image g1, but the transformation matrix W may be generated by processing such as SIFT so that the image feature amount of the reference image Gref and that of the keyboard image g1 approximate each other.
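OpenCV provides an implementation of alignment by enhanced correlation coefficient maximization (findTransformECC). The following sketch shows one way the matrix updating processing Sc2 could be realized with it, assuming single-channel versions of the reference image Gref and of the extracted keyboard image, and an initial matrix W0 obtained from the initialization processing Sc1.

    import cv2
    import numpy as np

    def update_transformation_matrix(reference_gray, keyboard_gray, w0,
                                     iterations=200, eps=1e-6):
        """Iteratively refine the initial matrix W0 so that the enhanced correlation
        coefficient between the reference image and the warped keyboard image increases.
        Both input images are single-channel arrays of the same dtype (e.g. float32)."""
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iterations, eps)
        warp = np.asarray(w0, dtype=np.float32)  # 3x3 matrix for a homography
        _, warp = cv2.findTransformECC(reference_gray, keyboard_gray, warp,
                                       cv2.MOTION_HOMOGRAPHY, criteria)
        return warp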
The projective transformation unit 314 executes projective transformation processing using the transformation matrix W. In the projective transformation processing, the projective transformation unit 314 generates a transformed image by projective transformation of the performance image G1, and transforms the position c[h, f] of each finger into the position C[h, f] in the X-Y coordinate system, thereby generating finger position data F representing the position C[h, f] of each finger.
The display control unit 40 causes the display device 14 to display the transformed image generated by the projective transformation processing. For example, the display control unit 40 causes the display device 14 to display the transformed image and the reference image Gref in an overlapping state. As described above, the area corresponding to the key 21 of each pitch n in the transformed image and the unit area Rn corresponding to the pitch n in the reference image Gref overlap each other.
As described above, in the first embodiment, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approximates the reference image Gref, and the projective transformation processing using the transformation matrix W is performed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be transformed into the transformed image corresponding to the image capturing condition of the reference instrument in the reference image Gref.
In the initialization processing Sc1 of the first embodiment, the user operates the operation device 13 to designate, in the performance image G1 displayed on the display device 14, an area corresponding to the key 21 of a desired pitch. As used herein, the designated area may also be referred to as a "target area 621", and the desired pitch may also be referred to as a "target pitch n". The projective transformation unit 314 specifies one or more unit areas Rn designated by the auxiliary data A for the target pitch n in the reference image Gref represented by the reference data Dref (Sc13). Then, the projective transformation unit 314 calculates, as the initial matrix W0, a matrix for applying a projective transformation to transform the target area 621 in the performance image G1 into the one or more unit areas Rn specified from the reference image Gref (Sc14). As can be understood from the above description, the initialization processing Sc1 of the first embodiment is processing of setting the initial matrix W0 so as to approximate the target area 621 instructed by the user in the keyboard image g1 to the unit area Rn corresponding to the target pitch n in the reference image Gref by projective transformation using the initial matrix W0.
The setting of the initial matrix W0 is important for generating an appropriate transformation matrix W by the matrix updating processing Sc2. Especially in the embodiment in which the enhanced correlation coefficient is used for the matrix updating processing Sc2, the suitability of the initial matrix W0 tends to affect the suitability of the final transformation matrix W. In the first embodiment, the initial matrix W0 is set so that the target area 621 corresponding to the instruction from the user in the performance image G1 approximates the unit area Rn corresponding to the target pitch n in the reference image Gref. Therefore, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy. In the first embodiment, the area designated by the user by operating the operation device 13 in the performance image G1 is used as the target area 621 for setting the initial matrix W0. Therefore, an appropriate initial matrix W0 can be generated while reducing the processing load, as compared with, for example, a configuration in which the area corresponding to the target pitch n in the performance image G1 is estimated by arithmetic processing. In the above description, the initialization processing Sc1 is executed for the performance image G1, but the initialization processing Sc1 may be executed for the performance image G2.
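As an illustration of this initialization, the sketch below computes an initial matrix W0 that maps the four corners of the user-designated target area 621 onto the four corners of the unit area Rn taken from the auxiliary data A; representing both areas by four corner points is an assumption made for the example.

    import cv2
    import numpy as np

    def initial_matrix_from_target_area(target_area_corners, unit_area_corners):
        """target_area_corners: four (x, y) corners of the target area 621 in the performance image.
        unit_area_corners: four (X, Y) corners of the unit area Rn in the reference image Gref.
        Returns a 3x3 initial matrix W0 for the projective transformation."""
        src = np.asarray(target_area_corners, dtype=np.float32)
        dst = np.asarray(unit_area_corners, dtype=np.float32)
        return cv2.getPerspectiveTransform(src, dst)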
B: Fingering Data Generation Unit 32
The fingering data generation unit 32 generates the fingering data Q representing the fingering of the user based on the performance data P and the finger position data F. The fingering data generation unit 32 includes a probability calculation unit 321 and a fingering estimation unit 322. In the following description, the ten fingers of the left hand and the right hand of the user are distinguished by a finger number k (k=1 to 10), and the position of the finger with the finger number k represented by the finger position data F is expressed as a position C[k].
Probability Calculation Unit 321
The probability calculation unit 321 calculates, for each finger number k, a probability p that the pitch n specified by the performance data P is played by the finger with the finger number k. The probability p is an index of a probability (likelihood) that the finger with the finger number k operates the key 21 of the pitch n. The probability calculation unit 321 calculates the probability p in accordance with whether the position C[k] of the finger with the finger number k exists within the unit area Rn of the pitch n. The probability p is calculated for each unit period on the time axis. Specifically, when the performance data P specifies the pitch n, the probability calculation unit 321 calculates the probability p (C[k]|ηk=n) by the calculation of Equation (3) exemplified below.
The condition “ηk=n” in the probability p (C[k]|ηk=n) means a condition that the finger with the finger number k plays the pitch n. That is, the probability p (C[k]|ηk=n) means a probability that the position C[k] is observed for the finger under the condition that the finger with the finger number k plays the pitch n.
The symbol I (C[k]∈Rn) in Equation (3) is an indicator function that is set to a numerical value of "1" when the position C[k] exists within the unit area Rn, and is set to a numerical value of "0" when the position C[k] exists outside the unit area Rn. The symbol |Rn| means an area of the unit area Rn. The symbol v (0, σ2E) means observation noise, and is expressed by a normal distribution with a mean of 0 and a variance of σ2. The symbol E is a unit matrix of 2 rows and 2 columns. The symbol * means a convolution with the observation noise v (0, σ2E).
As can be understood from the above description, the probability p (C[k]|ηk=n) calculated by the probability calculation unit 321 is a probability that, under a condition that the pitch n specified by the performance data P is played by a finger with the finger number k, the position of the finger is the position C[k] specified by the finger position data F for the finger. Therefore, the probability p (C[k]|ηk=n) is maximized when the position C[k] of the finger with the finger number k is within the unit area Rn in a playing state, and decreases as the position C[k] is further away from the unit area Rn.
On the other hand, when the performance data P does not specify any pitch n, that is, when the user does not operate any of the N keys 21, the probability calculation unit 321 calculates the probability p (C[k]|ηk=0) of each finger by the following Equation (4).
The symbol |R| in Equation (4) means the total area of the N unit areas R1 to RN in the reference image Gref. As can be understood from Equation (4), when the user does not operate any key 21, the probability p (C[k]|ηk=0) is set to a common numerical value (1/|R|) for all finger numbers k.
As described above, within a period in which the performance data P specifies the pitch n, a plurality of probabilities p (C[k]|ηk=n) corresponding to different fingers are calculated for each unit period on the time axis. On the other hand, in each unit period within a period in which the performance data P does not specify any pitch n, the plurality of probabilities p (C[k]|ηk=0) corresponding to the different fingers are each set to a sufficiently small fixed value (1/|R|).
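The sketch below computes a probability of the kind described above, namely a uniform density over the unit area Rn convolved with isotropic Gaussian observation noise, under the simplifying assumption that the unit area Rn is an axis-aligned rectangle; in that case the convolution reduces to a product of one-dimensional Gaussian CDF differences. Equation (3) and Equation (4) themselves are not reproduced here.

    import math

    def gaussian_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def probability_finger_plays_pitch(position_c, rect_rn, sigma):
        """position_c: position C[k] of the finger (X, Y) in the reference coordinate system.
        rect_rn: unit area Rn given as (X_min, Y_min, X_max, Y_max).
        sigma: standard deviation of the observation noise.
        Returns p(C[k] | eta_k = n): the uniform density 1/|Rn| over Rn convolved with the noise."""
        cx, cy = position_c
        x0, y0, x1, y1 = rect_rn
        area = (x1 - x0) * (y1 - y0)
        px = gaussian_cdf((x1 - cx) / sigma) - gaussian_cdf((x0 - cx) / sigma)
        py = gaussian_cdf((y1 - cy) / sigma) - gaussian_cdf((y0 - cy) / sigma)
        return (px * py) / area

    def probability_no_key_pressed(unit_areas):
        """When no pitch is specified, a common value 1/|R| is used,
        where |R| is the total area of the N unit areas R1 to RN."""
        total = sum((x1 - x0) * (y1 - y0) for (x0, y0, x1, y1) in unit_areas)
        return 1.0 / total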
Fingering Estimation Unit 322
The fingering estimation unit 322 estimates the fingering of the user. Specifically, the fingering estimation unit 322 estimates, based on the probability p (C[k]|ηk=n) of each finger, the finger (finger number k) that plays the pitch n specified by the performance data P. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p (C[k]|ηk=n) of each finger is calculated (that is, for every unit period). Specifically, the fingering estimation unit 322 specifies the finger number k corresponding to the maximum value among the plurality of probabilities p (C[k]|ηk=n) corresponding to the different fingers. Then, the fingering estimation unit 322 generates the fingering data Q that specifies the pitch n specified by the performance data P and the finger number k specified from the probability p (C[k]|ηk=n).
When the maximum value among the plurality of probabilities p (C[k]|ηk=n) falls below a predetermined threshold within the period in which the performance data P specifies the pitch n, the reliability of the fingering estimation result is low. Therefore, in a unit period in which the maximum value among the plurality of probabilities p (C[k]|ηk=n) is below the threshold, the fingering estimation unit 322 sets the finger number k to an invalid value indicating that the estimation result is invalid. For a note with the finger number k set to the invalid value, the display control unit 40 displays the note image 611 in a manner different from that of a normal note image 611, as described above with respect to the analysis screen 61.
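A minimal sketch of the fingering estimation described above for one unit period; the threshold value is an arbitrary assumption.

    INVALID_FINGER = 0  # invalid value used when the estimation result is unreliable

    def estimate_finger_number(probabilities, threshold=1e-4):
        """probabilities: dict mapping each finger number k to p(C[k] | eta_k = n).
        Returns the finger number with the maximum probability, or the invalid value
        when that maximum falls below the threshold."""
        best_k = max(probabilities, key=probabilities.get)
        return best_k if probabilities[best_k] >= threshold else INVALID_FINGER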
When the performance analysis processing is started, the control device 11 (image extraction unit 311) executes the image extraction processing. That is, the control device 11 generates the performance image G2 by extracting the specific area B from the performance image G1.
After executing the image extraction processing, the control device 11 (matrix generation unit 312) executes the matrix generation processing. That is, the control device 11 sets the initial matrix W0 by the initialization processing Sc1, and generates the transformation matrix W by the matrix updating processing Sc2.
After the transformation matrix W is generated, the control device 11 repeats processing (S13 to S18) exemplified below for each unit period. First, the control device 11 (finger position estimation unit 313) executes the finger position estimation processing (S13). That is, the control device 11 estimates the position c[h, f] of each finger of the user through the image analysis processing Sa1, the left-right determination processing Sa2, and the interpolation processing Sa3.
The control device 11 (projective transformation unit 314) executes the projective transformation processing (S14). That is, the control device 11 generates the transformed image by projective transformation of the performance image G1 using the transformation matrix W. In the projective transformation processing, the control device 11 transforms the position c[h, f] of each finger of the user into the position C[h, f] in the X-Y coordinate system, and generates the finger position data F representing the position C[h, f] of each finger.
After generating the finger position data F by the above processing, the control device 11 (probability calculation unit 321) executes the probability calculation processing (S15). That is, the control device 11 calculates the probability p (C[k]|ηk=n) that the pitch n specified by the performance data P is played by each finger with the finger number k. Then, the control device 11 (fingering estimation unit 322) executes the fingering estimation processing (S16). That is, the control device 11 estimates the finger number k of the finger that plays the pitch n from the probability p (C[k]|ηk=n) of each finger, and generates the fingering data Q that specifies the pitch n and the finger number k.
After the fingering data Q is generated by the above processing, the control device 11 (display control unit 40) updates the analysis screen 61 in accordance with the fingering data Q (S17). The control device 11 then determines whether a predetermined end condition is satisfied (S18). For example, when the user inputs an instruction to end the performance analysis processing by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the end condition is not satisfied (S18: NO), the control device 11 repeats the processing from the finger position estimation processing onward (S13 to S18) for the immediately following unit period. On the other hand, if the end condition is satisfied (S18: YES), the control device 11 ends the performance analysis processing.
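Putting the above steps together, the per-unit-period flow of the performance analysis processing can be summarized as in the following skeleton; the object passed in and all of its method names are placeholders for the processing units described above.

    def performance_analysis_processing(system):
        """Skeleton of the processing order: image extraction and matrix generation once,
        then the processing of each unit period (S13 to S18) repeated until the end condition."""
        g2 = system.extract_specific_area(system.capture_frame())   # image extraction processing
        w = system.generate_transformation_matrix(g2)               # matrix generation processing
        while not system.end_condition_satisfied():                 # S18
            frame = system.capture_frame()
            positions = system.estimate_finger_positions(frame)     # S13: Sa1, Sa2 and Sa3
            finger_data_f = system.projective_transformation(w, positions)  # S14
            probabilities = system.calculate_probabilities(
                system.performance_data(), finger_data_f)           # S15
            fingering_q = system.estimate_fingering(probabilities)  # S16
            system.update_analysis_screen(fingering_q)               # S17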
As described above, in the first embodiment, the finger position data F generated by analyzing the performance image G1 and the performance data P representing the performance by the user are used to generate the fingering data Q. Therefore, the fingering can be estimated with high accuracy compared with a configuration in which the fingering is estimated only from the performance data P.
In the first embodiment, the position c[h, f] of each finger estimated by the finger position estimation processing is transformed using the transformation matrix W for the projective transformation that approximates the keyboard image g1 to the reference image Gref. That is, the position C[h, f] of each finger is estimated based on the reference image Gref. Therefore, the fingering can be estimated with high accuracy compared with a configuration in which the position c[h, f] of each finger is not transformed to a position based on the reference image Gref.
In the first embodiment, the specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, as described above, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy. Extracting the specific area B can improve usability of the performance image G1. In the first embodiment, the specific area B including the keyboard image g1 and the finger image g2 is particularly extracted from the performance image G1. Therefore, it is possible to generate the performance image G2 in which appearance of the keyboard 22 of the keyboard instrument 200 and appearance of the fingers of the user can be efficiently and visually confirmed.
The second embodiment will be described. In each embodiment exemplified below, elements having the same functions as those of the first embodiment are denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof are appropriately omitted.
In the first embodiment, the probability p (C[k]|ηk=n) is calculated in accordance with whether the position C[k] of the finger with the finger number k exists within the unit area Rn of the pitch n. Assuming that only one finger exists in the unit area Rn, the fingering can be estimated with high accuracy even in the first embodiment. However, in an actual performance of the keyboard instrument 200, it is assumed that the positions C[k] of a plurality of fingers exist within one unit area Rn.
For example, when the fingers of the user cross during the performance (for example, when the thumb passes under another finger), the positions C[k] of a plurality of fingers may exist within one unit area Rn. In such a situation, the fingering cannot be estimated with sufficient accuracy from only whether each position C[k] exists within the unit area Rn. Therefore, in the second embodiment, the fingering data generation unit 32 further includes a control data generation unit 323, and the fingering is estimated using a relative position C′[k] of each finger (for example, the position C[k] expressed relative to the unit area Rn of the pitch n).
The control data generation unit 323 generates N pieces of control data Z[1] to Z[N] corresponding to the different pitches n.
In addition to the pitch n, the control data Z[n] corresponding to the pitch n includes a position mean Za[n, k], a position variance Zb[n, k], a velocity mean Zc[n, k], and a velocity variance Zd[n, k] for each of the plurality of fingers. The position mean Za[n, k] is a mean of the relative positions C′[k] within a period of a predetermined length including the current unit period. As used herein, the period of the predetermined length may also be referred to as an "observation period". The observation period is, for example, a period of a plurality of consecutive unit periods on the time axis ending with the current unit period. The position variance Zb[n, k] is a variance of the relative positions C′[k] within the observation period. The velocity mean Zc[n, k] is a mean of the velocities (that is, the rates of change) at which the relative position C′[k] changes within the observation period. The velocity variance Zd[n, k] is a variance of the velocities at which the relative position C′[k] changes within the observation period.
As described above, the control data Z[n] includes information (Za[n, k], Zb[n, k], Zc[n, k], and Zd[n, k]) about the relative position C′[k] for each of the plurality of fingers. Therefore, the control data Z[n] is data reflecting the positional relationship among the plurality of fingers of the user. The control data Z[n] also includes information (Zb[n, k], Zd[n, k]) about variation in the relative position C′[k] for each of the plurality of fingers. Therefore, the control data Z[n] is data reflecting the variation over time in the position of each finger.
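The following sketch computes the four statistics described above for one finger over the observation period; the relative position C′[k] is passed in as a precomputed time series, since its precise definition is not reproduced in this excerpt.

    import numpy as np

    def control_data_features(relative_positions):
        """relative_positions: array of shape (T, 2) holding the relative position of one
        finger for the T unit periods of the observation period (current period last, T >= 2).
        Returns (position mean Za, position variance Zb, velocity mean Zc, velocity variance Zd)."""
        pos = np.asarray(relative_positions, dtype=float)
        vel = np.diff(pos, axis=0)  # rate of change between adjacent unit periods
        return pos.mean(axis=0), pos.var(axis=0), vel.mean(axis=0), vel.var(axis=0)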
In the probability calculation processing by the probability calculation unit 321 of the second embodiment, a plurality of estimation models 52[k] (52[1] to 52[10]) prepared in advance for different fingers are used. The estimation model 52[k] of each finger is a trained model that learns a relation between the control data Z[n] and a probability p[k] of the finger. The probability p[k] is an index (likelihood) that the pitch n specified by the performance data P is played by the finger with the finger number k. The probability calculation unit 321 calculates the probability p[k] by inputting the N pieces of control data Z[1] to Z[N] to the estimation model 52[k] for each of the plurality of fingers.
The estimation model 52[k] corresponding to any one finger number k is a logistic regression model represented by Equation (5) below.
The variable βk and the variable ωk, n in Equation (5) are set by machine learning by the machine learning system 900. That is, each estimation model 52[k] is established by machine learning by the machine learning system 900, and each estimation model 52[k] is provided to the performance analysis system 100. For example, the variable βk and the variable ωk, n of each estimation model 52[k] are sent from the machine learning system 900 to the performance analysis system 100.
A finger positioned above a key-pressing finger, or a finger moving above or below a key-pressing finger, tends to move more than the key-pressing finger. Considering the above tendency, the estimation model 52[k] learns the relation between the control data Z[n] and the probability p[k] so that the probability p[k] becomes small for a finger with a high rate of change in the relative position C′[k]. The probability calculation unit 321 calculates a plurality of probabilities p[k] regarding different fingers for each unit period by inputting the control data Z[n] to each of the plurality of estimation models 52[k].
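Equation (5) is not reproduced in this excerpt; the sketch below evaluates a generic logistic regression over the flattened control data Z[1] to Z[N], which is one plausible realization of the estimation model 52[k] described above, not the actual model.

    import numpy as np

    def probability_from_logistic_model(control_data, beta_k, omega_k):
        """control_data: list of N feature vectors corresponding to Z[1] to Z[N].
        beta_k: bias of the estimation model 52[k]; omega_k: list of N weight vectors.
        Returns the probability p[k] for the finger with the finger number k."""
        score = beta_k + sum(np.dot(w_n, z_n) for w_n, z_n in zip(omega_k, control_data))
        return 1.0 / (1.0 + np.exp(-score))  # logistic (sigmoid) function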
The fingering estimation unit 322 estimates the fingering of the user through the fingering estimation processing to which the plurality of probabilities p[k] are applied. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that plays the pitch n specified by the performance data P from the probability p[k] of each finger. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p[k] of each finger is calculated (that is, for every unit period). Specifically, the fingering estimation unit 322 specifies the finger number k corresponding to the maximum value among the plurality of probabilities p[k] corresponding to the different fingers. Then, the fingering estimation unit 322 generates the fingering data Q that specifies the pitch n specified by the performance data P and the finger number k specified from the probability p[k].
The control device 11 (probability calculation unit 321) calculates the probability p[k] corresponding to the finger number k by the probability calculation processing of inputting the N pieces of control data Z[1] to Z[N] into each estimation model 52[k] (S15). The control device 11 (fingering estimation unit 322) estimates the fingering of the user by the fingering estimation processing to which the plurality of probabilities p[k] are applied (S16). The operations (S11 to S14, S17, and S18) of elements other than the fingering data generation unit 32 are the same as those in the first embodiment.
The second embodiment also achieves effects including the same effect as the first embodiment. In the second embodiment, the control data Z[n] input to the estimation model 52[k] includes the mean Za[n, k] and the variance Zb[n, k] of the relative position C′[k], and the mean Zc[n, k] and the variance Zd[n, k] of the rate of change in the relative position C′[k] of each finger. Therefore, even when a plurality of fingers overlap each other due to, for example, finger crossing, the fingering of the user can be estimated with high accuracy.
In the above description, the logistic regression model is exemplified as the estimation model 52[k], but the type of the estimation model 52[k] is not limited to the above example. For example, a statistical model such as a multilayer perceptron may be used as the estimation model 52[k]. A deep neural network such as a convolutional neural network or a recurrent neural network may also be used as the estimation model 52[k]. A combination of a plurality of types of statistical models may be used as the estimation model 52[k]. The various estimation models 52[k] exemplified above are comprehensively expressed as trained models that learn the relation between the control data Z[n] and the probability p[k].
In the third embodiment, the control device 11 determines whether the keyboard instrument 200 is being played by the user (S21). If the keyboard instrument 200 is being played (S21: YES), the control device 11 generates the finger position data F (S13 and S14), generates the fingering data Q (S15 and S16), and updates the analysis screen 61 (S17) as in the first embodiment. On the other hand, if the keyboard instrument 200 is not being played (S21: NO), the control device 11 proceeds with the processing to the step S18. That is, the generation of the finger position data F (S13 and S14), the generation of the fingering data Q (S15 and S16), and the update of the analysis screen 61 (S17) are not executed.
The third embodiment achieves effects including the same effect as the first embodiment. In the third embodiment, when the keyboard instrument 200 is not being played, the generation of the finger position data F and the fingering data Q is stopped. Therefore, the processing load necessary for generating the fingering data Q can be reduced compared with a configuration in which the generation of the finger position data F is continued regardless of whether the keyboard instrument 200 is being played. The third embodiment can also be applied to the second embodiment.
The fourth embodiment is an embodiment in which the initialization processing Sc1 in each of the above-described embodiments is modified.
When the initialization processing Sc1 is started, the user operates, by a specific finger, the key 21 corresponding to a desired pitch n among the plurality of keys 21 of the keyboard instrument 200. As used herein, the desired pitch may also be referred to as a “specific pitch”. The specific finger is, for example, a finger (for example, the index finger of the right hand) of which the user is notified by the display on the display device 14 or an instruction manual or the like of the keyboard instrument 200. As a result of the performance by the user, the performance data P specifying the specific pitch n is supplied from the keyboard instrument 200 to the performance analysis system 100. The control device 11 acquires the performance data P from the keyboard instrument 200, thereby recognizing the performance of the specific pitch n by the user (Sc15). The control device 11 specifies the unit area Rn corresponding to the specific pitch n among the N unit areas R1 to RN of the reference image Gref (Sc16).
On the other hand, the finger position data generation unit 31 generates the finger position data F through the finger position estimation processing. The finger position data F includes the position C[h, f] of the specific finger used by the user to play the specific pitch n. The control device 11 acquires the finger position data F to specify the position C[h, f] of the specific finger (Sc17).
The control device 11 sets the initial matrix W0 by using the unit area Rn corresponding to the specific pitch n and the position C[h, f] of the specific finger represented by the finger position data F (Sc18). That is, the control device 11 sets the initial matrix W0 so that the position C[h, f] of the specific finger represented by the finger position data F approximates the unit area Rn of the specific pitch n in the reference image Gref. Specifically, as the initial matrix W0, a matrix for applying a projective transformation to transform the position C[h, f] of the specific finger into a center of the unit area Rn is set.
The fourth embodiment also achieves effects including the same effect as the first embodiment. In the fourth embodiment, when the user plays the desired specific pitch n with the specific finger, the initial matrix W0 is set so that the position c[h, f] of the specific finger in the performance image G1 approximates a portion (unit area Rn) corresponding to the specific pitch n in the reference image Gref. Since the user only needs to play the desired pitch n, the working load required of the user for setting the initial matrix W0 is reduced as compared with the first embodiment, in which the user needs to select the target area 621 by operating the operation device 13. On the other hand, according to the first embodiment in which the user designates the target area 621, it is not necessary to estimate the position C[h, f] of the finger of the user, and therefore, an appropriate initial matrix W0 can be set while reducing the influence of estimation errors as compared with the fourth embodiment. The fourth embodiment can be similarly applied to the second embodiment or the third embodiment.
In the fourth embodiment, it is assumed that the user plays one specific pitch n, but the user may play a plurality of specific pitches n with specific fingers. The control device 11 sets the initial matrix W0 for each of the plurality of specific pitches n so that the position C[h, f] of the specific finger when playing the specific pitch n approximates the unit area Rn of the specific pitch n.
The control device 11 of the performance analysis system 100 functions as the performance analysis unit 30 by executing the programs stored in the storage device 12. The performance analysis unit 30 generates the fingering data Q using the sound signal V supplied from the sound collection device 16 and the image data D1 supplied from the image capturing device 15. As in the first embodiment, the fingering data Q specifies the pitch n corresponding to the key 21 operated by the user and the finger number k of the finger used to operate the key 21 by the user. Although the pitch n is specified by the performance data P in the first embodiment, the sound signal V in the fifth embodiment is not a signal that directly specifies the pitch n. Therefore, the performance analysis unit 30 simultaneously estimates the pitch n and the finger number k using the sound signal V and the image data D1.
For the estimation of the pitch n and the finger number k, a latent variable wt, n, k is assumed. The symbol t is a variable indicating time. A single unit period on the time axis may be indicated by the variable t. The finger number k in the fifth embodiment can be set to one of 11 numbers including 10 numbers (k=1 to 10) corresponding to the different fingers and a predetermined invalid value (k=0).
The latent variable wt, n, k is prepared for each combination of the pitch n and the finger number k. The latent variable wt, n, k is a variable for a one-hot expression that is set to either of the binary values "0" and "1". A value "1" of the latent variable wt, n, k means that the pitch n is played by the finger with the finger number k, and a value "1" of the latent variable wt, n, 0 corresponding to the invalid value (k=0) means that the pitch n is not played by any finger.
A posterior probability Ut, n and a probability πt, n, k are also assumed. The posterior probability Ut, n is a posterior probability that the pitch n is sounded at a time t under a condition that the sound signal V is observed. Therefore, a probability (1−Ut, n) corresponds to a probability that the latent variable wt, n, 0 is the value "1" under the condition that the sound signal V is observed (a probability that the pitch n is not being played). The posterior probability Ut, n may be estimated by a known estimation model that learns a relation between the sound signal V and the posterior probability Ut, n. The estimation model is a trained model for automatic transcription. A deep neural network, such as a convolutional neural network or a recurrent neural network, is used as the estimation model for estimating the posterior probability Ut, n. The probability πt, n, k is a probability that the pitch n is played by the finger with the finger number k when the pitch n is being played.
A probability p(w|V, π) of the latent variable wt, n, k when the sound signal V and the probability πt, n, k are observed is expressed by the following Equation (6).
The first term on the right side of Equation (6) means a probability that none of the pitches n is sounded, and the second term means a probability that the pitch n is played by the finger with the finger number k when the pitch n is sounded.
A probability p (C[k]|w) of observing the position C[k] from the performance image G1 when the latent variable wt, n, k is observed is expressed by the following Equation (7).
The probability p (C[k]|σ2, Rn) in Equation (7) is a probability expressed by Equation (3) or Equation (4) above.
A symmetric Dirichlet distribution (Dir) expressed by the following Equation (8) is assumed as prior distribution of the probability πt, n, k.
The symbol α in Equation (8) is a variable that defines the shape of the symmetric Dirichlet distribution.
Under the above assumption, by executing Maximum A Posteriori (MAP) estimation that maximizes the posterior probability p (w|V, π, C[k]) of the latent variable wt, n, k, the presence or absence of the pitch n and the finger number k can be estimated simultaneously. However, since it is difficult to calculate the probability distribution of the posterior probability p (w|V, π, C[k]) exactly, mean field approximation (variational Bayesian estimation) is used in the fifth embodiment.
Specifically, among the distributions that are factorized as in the following Expression (9), the distribution that best approximates the probability distribution of the posterior probability p(w|V, π, C[k]) is specified. For example, a distribution that minimizes the Kullback-Leibler (KL) distance from the posterior probability p(w|V, π, C[k]) is specified.
Specifically, the performance analysis unit 30 repeats calculation of the following Equation (10) and Equation (11).
The symbol c in Equation (10) is a coefficient that normalizes probability distributions ρt, n, k over the plurality of finger numbers k so that the sum of the probability distributions ρt, n, k becomes “1”. The symbol < > means an expected value.
Specifically, the performance analysis unit 30 repeats the calculation of Equation (10) and Equation (11) for all possible combinations of the pitch n and the finger number k for one time t on the time axis. The performance analysis unit 30 determines, as the probability distribution ρt, n, k of the latent variable wt, n, k, a calculation result of Equation (10) when the calculation of Equation (10) and Equation (11) is repeated for a predetermined number of times. The probability distribution ρt, n, k is calculated for each time t on the time axis.
However, in the embodiment in which the pitch n and the finger number k are calculated for each time t from the probability distribution ρt, n, k calculated individually for each time t on the time axis, during the period in which the user plays one note, the finger number k may change before and after the time t, or the period during which the pitch n continues may be excessively short. Therefore, the performance analysis unit 30 of the fifth embodiment uses the Hidden Markov Model (HMM), to which the probability distribution ρt, n, k is applied, to generate a time series of combinations of the pitch n and the finger number k (that is, to generate a time series of the fingering data Q).
Specifically, the HMM for fingering estimation includes a latent state corresponding to each of the sounding (key pressing) of the pitch n and silence, and a plurality of latent states corresponding to the different finger numbers k. Only three types of state transitions, (1) self-transition, (2) silence to any finger number k, and (3) any finger number k to silence, are allowed, and transition probabilities for other state transitions are set to “0”. The above condition is a constraint for keeping the finger number k unchanged during the period in which one note is sounded. The expected value of the probability distribution ρt, n, k calculated by the calculation of Equation (10) and Equation (11) is set as an observation probability for each latent state of the HMM. The performance analysis unit 30 uses the HMM described above to estimate a state series by dynamic programming such as the Viterbi algorithm. The performance analysis unit 30 generates a time series of the fingering data Q in accordance with a result of estimating the state series.
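The sketch below illustrates the constrained Viterbi decoding described above for a single pitch n: one silence state plus one state per finger number, with only self-transitions and transitions between silence and any finger allowed. The observation probabilities would be the expected values of the probability distribution ρ; the transition probability values are arbitrary assumptions.

    import numpy as np

    def viterbi_fingering(observation_probs, stay=0.9, switch=0.1):
        """observation_probs: array of shape (T, K+1); column 0 is the silence state and
        columns 1..K correspond to the finger numbers. Returns the most likely state sequence.
        Allowed transitions: self-transition, silence -> any finger, any finger -> silence."""
        T, S = observation_probs.shape
        trans = np.zeros((S, S))
        trans[0, 0] = stay
        trans[0, 1:] = switch / (S - 1)        # silence to any finger number k
        trans[1:, 0] = switch                  # any finger number k to silence
        np.fill_diagonal(trans[1:, 1:], stay)  # self-transition for each finger
        log_trans = np.log(trans + 1e-12)
        log_obs = np.log(observation_probs + 1e-12)

        delta = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        delta[0] = log_obs[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: from state i to state j
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_obs[t]

        path = np.zeros(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            path[t] = back[t + 1, path[t + 1]]
        return path  # 0 = silence, k >= 1 = finger number k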
According to the fifth embodiment, the fingering data Q is generated using the sound signal V and the image data D1. That is, the fingering data Q can be generated even in a situation where the performance data P cannot be obtained. In the fifth embodiment, since the pitch n and the finger number k are simultaneously estimated using the sound signal V and the image data D1, the fingering can be estimated with high accuracy while reducing the processing load as compared with the embodiment in which the pitch n and the finger number k are individually estimated. The fifth embodiment can also be applied to the second embodiment to the fourth embodiment.
As exemplified in each of the above embodiments, the projective transformation unit 314 generates a transformed image from the performance image G1. That is, the projective transformation unit 314 changes the image capturing condition of the performance image G1. The sixth embodiment is an image processing system 700 that uses the above function of changing the image capturing condition of the performance image G1. The performance analysis system 100 of the first embodiment to the fifth embodiment can also be expressed as the image processing system 700 when focusing on the processing on the performance image G1 by the projective transformation unit 314. However, in the sixth embodiment, the estimation of the fingering of the user is not essential.
The storage device 12 stores a plurality of pieces of the reference data Dref. Each of the plurality of pieces of the reference data Dref represents the reference image Gref obtained by capturing an image of the reference instrument, which is a keyboard of a standard keyboard instrument. The image capturing condition of the reference instrument is different for each reference image Gref (for each piece of the reference data Dref). Specifically, for example, one or more conditions of the image capturing range and the image capturing direction are different for each reference image Gref. The storage device 12 also stores the auxiliary data A for each piece of the reference data Dref.
The control device 11 implements the matrix generation unit 312, the projective transformation unit 314, and the display control unit 40 by executing the programs stored in the storage device 12. The matrix generation unit 312 selectively uses any one of the plurality of pieces of the reference data Dref to generate the transformation matrix W. The projective transformation unit 314 generates image data D3 of a transformed image G3 from the image data D1 of the performance image G1 by the projective transformation using the transformation matrix W. The display control unit 40 causes the display device 14 to display the transformed image G3 represented by the image data D3.
By operating the operation device 13, the user selects any one of a plurality of image capturing conditions corresponding to the different reference images Gref. The control device 11 (matrix generation unit 312) determines whether a selection of the image capturing condition is received from the user (S31). If the selection of the image capturing condition is received (S31: YES), the control device 11 (matrix generation unit 312) acquires the reference data Dref corresponding to the image capturing condition selected by the user from the plurality of pieces of the reference data Dref stored in the storage device 12 (S32). As used herein, the reference data Dref corresponding to the selected image capturing condition may also be referred to as “selected reference data Dref”. The selection of the image capturing condition by the user corresponds to an operation of selecting any one of the plurality of reference images Gref (reference data Dref) corresponding to different image capturing conditions.
The control device 11 (matrix generation unit 312) uses the selected reference data Dref to execute the same matrix generation processing as in the first embodiment (S33). Specifically, the control device 11 sets the initial matrix W0 by the initialization processing Sc1 using the selected reference data Dref. The control device 11 generates the transformation matrix W through the matrix updating processing Sc2 of iteratively updating the initial matrix W0 so that the keyboard image g1 of the performance image G1 approximates the reference image Gref of the selected reference data Dref. On the other hand, if the selection of the image capturing condition is not received (S31: NO), the selection of the reference data Dref (S32) and the matrix generation processing (S33) are not executed.
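A minimal sketch of the matrix generation processing, assuming that the iterative update of the initial matrix W0 maximizes the enhanced correlation coefficient (as in Aspect 6 below) via OpenCV 4.x; the function name, the iteration count, and the handling of the initialization processing Sc1 are illustrative assumptions.

```python
import cv2
import numpy as np

def matrix_generation_processing(performance_img, reference_img, initial_matrix,
                                 iterations=200, eps=1e-6):
    """Iteratively update the initial matrix W0 so that the keyboard image in the
    performance image approximates the reference image Gref (ECC criterion).
    `initial_matrix` is assumed to come from the initialization processing Sc1."""
    ref_gray = cv2.cvtColor(reference_img, cv2.COLOR_BGR2GRAY)
    perf_gray = cv2.cvtColor(performance_img, cv2.COLOR_BGR2GRAY)
    warp = initial_matrix.astype(np.float32)              # 3x3 homography (W0)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iterations, eps)
    # ECC maximization with a projective (homography) motion model
    _, warp = cv2.findTransformECC(ref_gray, perf_gray, warp,
                                   cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
    return warp   # transformation matrix W
```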
The control device 11 (projective transformation unit 314) generates the transformed image G3 by performing the projective transformation processing using the transformation matrix W on the performance image G1 (S34). The projective transformation processing is the same as in the first embodiment. As a result of the projective transformation processing, the image data D3 representing the transformed image G3 is generated. Specifically, the transformed image G3 corresponding to the same image capturing condition as the reference image Gref of the selected reference data Dref is generated from the performance image G1. That is, the transformed image G3 is an image obtained by transforming the image capturing condition of the performance image G1 into the same image capturing condition as the reference image Gref. As can be understood from the above description, according to the sixth embodiment, the transformed image G3 corresponding to the image capturing condition selected by the user is generated.
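The projective transformation processing (S34) then amounts to a single perspective warp of the performance image G1 with the transformation matrix W; the output size argument and, depending on how W was estimated, the need for the inverse-map flag are assumptions of this sketch.

```python
import cv2

def projective_transformation_processing(performance_img, transformation_matrix,
                                         reference_size):
    """Generate the transformed image G3 from the performance image G1.
    `reference_size` is (width, height) of the reference image Gref. If the
    matrix was estimated with cv2.findTransformECC as sketched above, use
    flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP instead."""
    return cv2.warpPerspective(performance_img, transformation_matrix,
                               reference_size, flags=cv2.INTER_LINEAR)
```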
The control device 11 (display control unit 40) causes the display device 14 to display the transformed image G3 generated by the projective transformation processing (S35). The control device 11 then determines whether an end condition is satisfied (S36). For example, when the user inputs an instruction to end the first image processing by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the end condition is not satisfied (S36: NO), the control device 11 returns the processing to step S31. That is, the generation of the transformation matrix W (S32 to S33), which is executed on the condition that the selection of the image capturing condition is received (S31: YES), and the generation and display of the transformed image G3 (S34 to S35) are executed again. On the other hand, if the end condition is satisfied (S36: YES), the control device 11 ends the first image processing.
As described above, in the sixth embodiment, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approximates the reference image Gref, and the projective transformation processing using the transformation matrix W is performed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be transformed into the transformed image G3 corresponding to the image capturing condition of the reference instrument in the reference image Gref.
In the sixth embodiment, any one of the plurality of pieces of the reference data Dref having different image capturing conditions is selectively used for the matrix generation processing. Therefore, the transformed image G3 corresponding to various image capturing conditions can be generated from the performance image G1 captured under a specific image capturing condition. Especially in the sixth embodiment, the reference data Dref corresponding to the image capturing condition selected by the user from the plurality of pieces of the reference data Dref is used for the matrix generation processing, so that the transformed image G3 corresponding to an image capturing condition desired by the user can be generated. By changing the image capturing condition of the performance image G1 as described above, it is possible to generate the transformed image G3 that can be used for various purposes. For example, by performing the first image processing of the sixth embodiment on each of a plurality of performance images G1 captured by a music teacher for his own performance, a plurality of transformed images G3 with a uniform image capturing condition can be generated as a teaching material for music lessons.
As exemplified in each of the above embodiments, the image extraction unit 311 extracts the specific area B including the keyboard image g1 and the finger image g2 from the performance image G1. The seventh embodiment is the image processing system 700 that uses the above function of extracting the specific area B of the performance image G1. The performance analysis system 100 of the first embodiment to the fifth embodiment can also be expressed as the image processing system 700 when focusing on the processing on the performance image G1 by the image extraction unit 311. However, in the seventh embodiment, the estimation of the fingering of the user is not essential.
The control device 11 functions as the image extraction unit 311 and the display control unit 40 by executing the programs stored in the storage device 12. The image extraction unit 311 generates the image data D2 representing the performance image G2 obtained by extracting a partial area from the performance image G1. Specifically, as in the first embodiment, the image extraction unit 311 executes the area estimation processing Sb1 of generating the image processing mask M and the area extraction processing Sb2 of applying the image processing mask M to the performance image G1. The display control unit 40 causes the display device 14 to display the performance image G2 represented by the image data D2.
In the first embodiment, the estimation model 51 is exemplified as a single unit. On the other hand, the estimation model 51 used in the area estimation processing Sb1 in the seventh embodiment includes a first model 511 and a second model 512. Each of the first model 511 and the second model 512 is implemented by a deep neural network such as a convolutional neural network or a recurrent neural network.
The first model 511 is a statistical model for generating a first mask indicating a first area of the performance image G1. The first area is an area including the keyboard image g1 in the performance image G1. The finger image g2 is not included in the first area. The first mask is, for example, a binary mask in which each element in the first area is set to the numerical value “1” and each element in an area other than the first area is set to the numerical value “0”. The image extraction unit 311 generates the first mask by inputting the image data D1 representing the performance image G1 to the first model 511. That is, the first model 511 is a trained model obtained by machine-learning a relation between the image data D1 and the first mask (first area).
The second model 512 is a statistical model for generating a second mask indicating a second area of the performance image G1. The second area is an area including the finger image g2 in the performance image G1. The keyboard image g1 is not included in the second area. The second mask is, for example, a binary mask in which each element in the second area is set to the numerical value “1” and each element in an area other than the second area is set to the numerical value “0”. The image extraction unit 311 generates the second mask by inputting the image data D1 representing the performance image G1 to the second model 512. That is, the second model 512 is a trained model obtained by machine-learning a relation between the image data D1 and the second mask (second area).
When the second image processing is started, the control device 11 (image extraction unit 311) executes the area estimation processing Sb1 (S41 to S43). The area estimation processing Sb1 of the seventh embodiment includes first estimation processing (S41), second estimation processing (S42), and area synthesis processing (S43).
The first estimation processing is processing of estimating the first area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the first model 511 to generate the first mask indicating the first area (S41). The second estimation processing is processing of estimating the second area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the second model 512 to generate the second mask indicating the second area (S42).
The area synthesis processing is processing of generating the image processing mask M indicating the specific area B including the first area and the second area. Specifically, the specific area B indicated by the image processing mask M corresponds to the sum of the first area and the second area. That is, the control device 11 generates the image processing mask M by synthesizing the first mask and the second mask (S43). As can be understood from the above description, the image processing mask M is a binary mask for extracting the specific area B including the keyboard image g1 and the finger image g2 from the performance image G1, as in the first embodiment.
The control device 11 (image extraction unit 311) uses the image processing mask M generated in the area estimation processing Sb1 to execute the area extraction processing Sb2 as in the first embodiment (S44). That is, the control device 11 extracts the specific area B from the performance image G1 represented by the image data D1 using the image processing mask M, thereby generating the image data D2 representing the performance image G2.
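The area synthesis processing (S43) and the area extraction processing (S44) can be sketched as a union of the two binary masks followed by masking of the performance image; the mask value convention and the zero fill for excluded pixels are assumptions of the sketch.

```python
import numpy as np

def synthesize_and_extract(performance_img, first_mask, second_mask):
    """Area synthesis: the image processing mask M is the union of the first
    area (keyboard image g1) and the second area (finger image g2).
    Area extraction: pixels outside the specific area B are set to zero."""
    mask_m = np.logical_or(first_mask > 0, second_mask > 0)   # S43
    extracted = performance_img.copy()
    extracted[~mask_m] = 0                                    # S44
    return mask_m.astype(np.uint8), extracted                 # basis of image data D2
```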
The control device 11 (display control unit 40) causes the display device 14 to display the performance image G2 generated by the area extraction processing Sb2 (S45). The control device 11 determines whether an end condition is satisfied (S46). For example, when the user inputs an instruction to end the second image processing by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the end condition is not satisfied (S46: NO), the control device 11 proceeds with the processing to the step S41. That is, the area estimation processing Sb1 (S41 to S43), the area extraction processing Sb2 (S44), and the display of the performance image G2 (S45) are executed. On the other hand, if the end condition is satisfied (S46: YES), the control device 11 ends the second image processing.
In the seventh embodiment, as in the first embodiment, the specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, the usability of the performance image G1 can be improved. Especially in the seventh embodiment, the specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate the performance image G2 in which the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers can be efficiently and visually confirmed.
According to the seventh embodiment, the first area of the performance image G1 including the keyboard image g1 is estimated by the first model 511, and the second area of the performance image G1 including the finger image g2 is estimated by the second model 512. Therefore, the specific area B including the keyboard image g1 and the finger image g2 can be extracted with high accuracy as compared with a configuration in which a single estimation model 51 collectively extracts both the keyboard image g1 and the finger image g2. Moreover, since the first model 511 and the second model 512 are each trained by separate machine learning, the processing load of the machine learning for each of the first model 511 and the second model 512 is reduced.
A configuration in which the image extraction unit 311 can switch between a first mode and a second mode is also assumed. The first mode is an operation mode of extracting both the keyboard image g1 and the finger image g2 from the performance image G1. That is, in the first mode, the image extraction unit 311 executes both the first estimation processing and the second estimation processing. Therefore, the image processing mask M indicating the specific area B is generated as in the seventh embodiment. That is, in the first mode, the specific area B including both the keyboard image g1 and the finger image g2 is extracted from the performance image G1.
The second mode is an operation mode of extracting the keyboard image g1 from the performance image G1. That is, in the second mode, the image extraction unit 311 executes the first estimation processing but does not execute the second estimation processing. That is, the first mask generated by the first estimation processing is determined as the image processing mask M applied to the area extraction processing Sb2. Therefore, in the second mode, the keyboard image g1 is extracted from the performance image G1.
As described above, according to the embodiment capable of switching between the first mode and the second mode, the target extracted from the performance image G1 can be switched easily, as in the sketch below. In the above description, the image extraction unit 311 executes the first estimation processing in the second mode. Alternatively, another embodiment is also assumed in which the image extraction unit 311 in the second mode executes the second estimation processing but does not execute the first estimation processing. In that embodiment, the finger image g2 is extracted from the performance image G1. As can be understood from the above description, the second mode is expressed as an operation mode in which only one of the first estimation processing and the second estimation processing is executed.
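A minimal sketch of the mode switching, assuming that each model is a callable returning a binary mask and that the mode is passed as a string; both conventions are illustrative and not fixed by the disclosure.

```python
import numpy as np

def area_estimation_with_modes(image_d1, first_model, second_model, mode="first"):
    """First mode: both estimation processings are executed and their masks are
    combined into the image processing mask M. Second mode: only the first
    estimation processing is executed, so only the keyboard image g1 is kept."""
    first_mask = first_model(image_d1)                  # first estimation processing
    if mode == "first":
        second_mask = second_model(image_d1)            # second estimation processing
        return np.logical_or(first_mask > 0, second_mask > 0)
    return first_mask > 0                               # second mode
```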
Specific modified aspects added to the above-exemplified aspects will be exemplified below. Two or more aspects freely selected from the following examples may be combined as appropriate within a mutually consistent range.
(1) In each of the above-described embodiments, the matrix generation processing (
In each of the above embodiments, the finger position estimation processing using the performance image G1 is exemplified, but the finger position estimation processing may be executed using the performance image G2 after the image extraction processing. That is, the position C[h, f] of each finger of the user may be estimated by analyzing the performance image G2. In each of the above embodiments, the projective transformation processing is executed for the performance image G1, but the projective transformation processing may be executed for the performance image G2 after the image extraction processing. That is, the transformed image may be generated by performing projective transformation on the performance image G2.
(2) In each of the above embodiments, the position c[h, f] of each finger of the user is transformed into the position C[h, f] in the X-Y coordinate system by projective transformation processing, but the finger position data F representing the position c[h, f] of each finger may be generated. That is, the projective transformation processing (projective transformation unit 314) for transforming the position c[h, f] into the position C[h, f] may be omitted.
(3) In the first embodiment to the fifth embodiment, the transformation matrix W generated immediately after the start of the performance analysis processing is used continuously in subsequent processing, but the transformation matrix W may be updated at an appropriate timing during the execution of the performance analysis processing. For example, the transformation matrix W may be updated when the position of the image capturing device 15 with respect to the keyboard instrument 200 is changed. Specifically, the transformation matrix W is updated when a change in the position of the image capturing device 15 is detected by analyzing the performance image G1, or when the user inputs an instruction indicating a change in the position of the image capturing device 15. As used herein, the change in the position may also be referred to as a "positional change".
Specifically, the matrix generation unit 312 generates a transformation matrix δ indicating the positional change (displacement) of the image capturing device 15. For example, a relation expressed by the following Equation (12) is assumed for a coordinate (x, y) in the performance image G (G1, G2) after the positional change.
The matrix generation unit 312 generates the transformation matrix δ so that, for a specific point, the coordinate x′/ε calculated by Equation (12) from the x-coordinate of the point after the positional change approximates or matches the x-coordinate of the corresponding position in the performance image G before the positional change, and the coordinate y′/ε calculated by Equation (12) from the y-coordinate of the point after the positional change approximates or matches the y-coordinate of the corresponding position in the performance image G before the positional change. Then, the matrix generation unit 312 generates, as the initial matrix W0, the product Wδ of the transformation matrix W before the positional change and the transformation matrix δ indicating the positional change, and updates the initial matrix W0 by the matrix updating processing Sc2 to generate the transformation matrix W.
In the above configuration, the transformation matrix W after the positional change is generated using the transformation matrix W calculated before the positional change and the transformation matrix δ indicating the positional change. Therefore, it is possible to generate the transformation matrix W that can specify the position C[h, f] of each finger with high accuracy while reducing the load of the matrix generation processing. In the above description, the first embodiment to the fifth embodiment are assumed, but in the sixth embodiment as well, the transformation matrix W may be updated at an appropriate timing during the execution of the first image processing.
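A sketch of this modification, under the assumption that Equation (12) has the usual homogeneous projective form (x′, y′, ε)ᵀ = δ·(x, y, 1)ᵀ; the helper names are illustrative.

```python
import numpy as np

def apply_delta(delta, x, y):
    """Map a coordinate using the assumed form of Equation (12)."""
    xp, yp, eps = delta @ np.array([x, y, 1.0])
    return xp / eps, yp / eps                  # (x'/eps, y'/eps)

def update_after_positional_change(prev_w, delta, matrix_updating_processing):
    """Use the product W*delta as the new initial matrix W0 and refine it with
    the matrix updating processing Sc2 (e.g. the ECC update sketched earlier)."""
    w0 = prev_w @ delta                        # initial matrix W0 = W * delta
    return matrix_updating_processing(w0)      # updated transformation matrix W
```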
(4) In each of the above-described embodiments, the keyboard instrument 200 including the keyboard 22 is illustrated, but the present disclosure can be applied to any type of musical instrument. For example, for any musical instrument that can be manually operated by the user, such as a stringed instrument, a wind instrument, or a percussion instrument, each of the above embodiments can be similarly applied. A typical example of the musical instrument is a type of musical instrument played by the user with fingers of one hand or both hands.
(5) The performance analysis system 100 may be implemented by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the performance data P generated by the keyboard instrument 200 connected to the information device and the image data D1 generated by the image capturing device 15 mounted on or connected to the information device are sent from the information device to the performance analysis system 100. The performance analysis system 100 generates the fingering data Q by executing the performance analysis processing on the performance data P and the image data D1 received from the information device, and sends the fingering data Q to the information device. Similarly, the image processing system 700 exemplified in the sixth embodiment or the seventh embodiment may also be implemented by a server device that communicates with an information device.
(6) As described above, the functions of the performance analysis system 100 according to the first embodiment to the fifth embodiment or the image processing system 700 according to the sixth embodiment to the seventh embodiment are implemented by cooperation of one or more processors constituting the control device 11 and programs stored in the storage device 12. The programs according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, and is preferably an optical recording medium (optical disc) such as a CD-ROM, and may include any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium. The non-transitory recording medium includes any recording medium other than transitory propagating signals, and does not exclude volatile recording media. In a configuration in which a distribution device distributes programs via a communication network, the storage device 12 that stores the programs in the distribution device corresponds to the above-described non-transitory recording medium.
For example, the following configurations can be understood from the embodiments described above.
A performance analysis method according to one aspect (Aspect 1) of the present disclosure includes: generating finger position data representing a position of each of a plurality of fingers of a user who plays a musical instrument by analyzing a performance image indicating the plurality of fingers of the user; and generating fingering data representing fingering in a performance by the user by using performance data representing the performance and the finger position data. In the above aspect, the fingering data is generated by using the finger position data generated by analyzing the performance image and the performance data representing the performance. Therefore, the fingering can be estimated with high accuracy compared with a configuration in which the fingering is estimated only from the performance data.
In a specific example (Aspect 2) of Aspect 1, the finger position data represents a position of each of fingers of a right hand and a position of each of fingers of a left hand of the user. In the above aspect, the fingering can be estimated while distinguishing between the right hand and the left hand of the user.
In a specific example (Aspect 3) of Aspect 2, the generating the finger position data includes: image analysis processing of estimating a position of each of fingers of a first hand of the user and a position of each of fingers of a second hand of the user by analyzing the performance image, and left-right determination processing of determining that, of the first hand and the second hand, a hand with a thumb positioned on a left side of a little finger is the right hand, and a hand with a thumb positioned on a right side of a little finger is the left hand. In the above aspect, the position of each of fingers of the user can be distinguished between the right hand and the left hand by simple processing using a relation between the position of the thumb and the position of the little finger.
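A minimal sketch of the left-right determination processing of Aspect 3; the dictionary keys and the convention that the x-coordinate increases toward the right of the performance image are assumptions for illustration.

```python
def left_right_determination(hand_a, hand_b):
    """Each argument maps finger names to x-coordinates estimated by the image
    analysis processing. Returns (left hand, right hand)."""
    def is_right_hand(hand):
        # A hand whose thumb lies to the left of its little finger is the right hand
        return hand["thumb_x"] < hand["little_x"]
    if is_right_hand(hand_a):
        return hand_b, hand_a
    return hand_a, hand_b
```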
In a specific example (Aspect 4) of any one of Aspects 1 to 3, the performance analysis method further includes: determining whether the musical instrument is played by the user in accordance with the performance data; and not generating the finger position data in a case in which the musical instrument is not played. In the above aspect, the generation of the finger position data is stopped in a case in which the musical instrument is not being played. Therefore, the processing load necessary for generating the fingering data can be reduced compared with a configuration in which the generation of the finger position data is continued regardless of whether the musical instrument is being played.
In a specific example (Aspect 5) of any one of Aspects 1 to 4, the musical instrument is a keyboard instrument including a keyboard, and the performance image includes an image of the plurality of fingers of the user and an image of the keyboard, and the generating the finger position data includes: finger position estimation processing of estimating a position of each of the plurality of fingers of the user by analyzing the performance image; matrix generation processing of generating a transformation matrix for applying a projective transformation to the performance image such that an image of the keyboard in the performance image approximates a reference image indicating a reference instrument; and projective transformation processing of performing a projective transformation using the transformation matrix on the position of each of the plurality of fingers estimated in the finger position estimation processing, and then generating the finger position data representing a position of each of the plurality of fingers after the transformation. In the above aspect, the position of each of the plurality of fingers estimated by the finger position estimation processing is transformed using the transformation matrix for the projective transformation that approximates the keyboard image to the reference image. That is, the position of each of the plurality of fingers is estimated based on the reference image. Therefore, the fingering can be estimated with high accuracy compared with a configuration in which the position of each of the plurality of fingers is not transformed to a position based on the reference image.
In a specific example (Aspect 6) of Aspect 5, the matrix generation processing includes: initialization processing of setting an initial matrix that is an initial setting of the transformation matrix; and matrix updating processing of iteratively updating the initial matrix so as to increase an enhanced correlation coefficient between the reference image and the image of the keyboard. The keyboard image contains repetitions of similar patterns arising from the arrangement of the plurality of keys. Consequently, the transformation matrix may not be estimated appropriately in a configuration using an image feature amount such as the Scale-Invariant Feature Transform (SIFT). According to the embodiment in which the initial matrix is iteratively updated so that the enhanced correlation coefficient (ECC) increases, an appropriate transformation matrix that can approximate the keyboard image to the reference image with high accuracy can be generated even when an image including such repeated similar patterns is targeted.
In a specific example (Aspect 7) of Aspect 6, in the initialization processing, the initial matrix is set such that a target area corresponding to an instruction from the user in the image of the keyboard in the performance image approximates an area corresponding to a specific pitch in the reference image. In the processing of iteratively updating the initial matrix so as to increase the enhanced correlation coefficient, the suitability of the initial matrix tends to affect the suitability of the final transformation matrix. According to the configuration in which the initial matrix is set in accordance with the target area designated by the instruction from the user, an appropriate transformation matrix that can approximate the keyboard image to the reference image with high accuracy can be generated.
In a specific example (Aspect 8) of any one of Aspects 5 to 7, the generating the finger position data includes image extraction processing of extracting a specific area including the image of the keyboard in the performance image, and the matrix generation processing is performed with the specific area of the performance image as a processing target. In the above aspect, the matrix generation processing is executed with a part of the performance image within a specific area including the keyboard image as the processing target. Therefore, as compared with a configuration in which the matrix generation processing is executed with the entire performance image including areas other than the keyboard image as a processing target, it is possible to generate an appropriate transformation matrix that can approximate the keyboard image in the performance image to the reference image with high accuracy.
In a specific example (Aspect 9) of any one of Aspects 1 to 8, the performance data is data specifying a pitch played by the user, and the generating the fingering data includes: probability calculation processing of calculating, for each of the plurality of fingers of the user, a probability that a position of a finger when the pitch specified by the performance data is played by the finger is a position of a finger represented by the finger position data; and fingering estimation processing of estimating, based on the probability calculated for each of the plurality of fingers, a finger that plays the pitch specified by the performance data. According to the above aspect, it is possible to appropriately estimate the fingering of the user.
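As an illustration of Aspect 9, the probability calculation processing can be sketched by scoring each finger's position from the finger position data against the position of the key for the pitch specified by the performance data; modeling this with an isotropic normal distribution and the value of sigma are assumptions not fixed by the aspect.

```python
import numpy as np

def finger_probabilities(key_position, finger_positions, sigma=10.0):
    """For each finger, score how plausible it is that the finger at the
    observed position is the one playing the key at `key_position`.
    finger_positions: (10, 2) array of x-y positions from the finger position data.
    key_position: (2,) x-y position of the key for the specified pitch."""
    diff = np.asarray(finger_positions, float) - np.asarray(key_position, float)
    sq_dist = np.sum(diff ** 2, axis=-1)
    prob = np.exp(-sq_dist / (2.0 * sigma ** 2))   # assumed normal model
    return prob / prob.sum()   # the fingering estimation can take the argmax
```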
In a specific example (Aspect 10) of any one of Aspects 1 to 8, the performance data is data specifying a pitch played by the user, the generating the fingering data includes: control data generation processing of generating control data for each of a plurality of pitches; probability calculation processing of calculating, for each of the plurality of fingers of the user, a probability that the pitch specified by the performance data is played by a finger by inputting the control data into a machine-learned estimation model; and fingering estimation processing of estimating, based on the probability calculated for each of the plurality of fingers, a finger that plays the pitch specified by the performance data, and the control data for each of the plurality of pitches includes: a corresponding pitch; and a mean and a variance of a position of a finger represented by the finger position data and a mean and a variance of a rate of change in the position of the finger, for each of the plurality of fingers of the user. According to the above aspect, the control data input to the estimation model includes the mean and the variance of the position and the mean and the variance of the rate of change in the position (that is, a velocity at which the position changes), for each of the plurality of fingers. Therefore, even when a plurality of fingers overlap each other due to, for example, finger crossing, the fingering of the user can be estimated with high accuracy.
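A sketch of the control data generation of Aspect 10, assuming the finger positions are collected over a short window of recent frames; the window length, the vector layout, and the array shapes are illustrative assumptions.

```python
import numpy as np

def make_control_data(pitch, position_history):
    """Build one control-data vector for a pitch.
    position_history: (T, 10, 2) array of recent finger positions
    (10 fingers, x-y coordinates); at least two frames are assumed."""
    pos = np.asarray(position_history, dtype=float)
    vel = np.diff(pos, axis=0)                    # rate of change of position
    features = [float(pitch)]                     # corresponding pitch
    for f in range(pos.shape[1]):                 # per finger
        features.extend(pos[:, f].mean(axis=0))   # mean of position (x, y)
        features.extend(pos[:, f].var(axis=0))    # variance of position
        features.extend(vel[:, f].mean(axis=0))   # mean of rate of change
        features.extend(vel[:, f].var(axis=0))    # variance of rate of change
    return np.asarray(features, dtype=np.float32) # input to the estimation model
```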
A performance analysis system according to one aspect (Aspect 11) of the present disclosure includes: a finger position data generation unit configured to generate finger position data representing a position of each of a plurality of fingers of a user who plays a musical instrument, by analyzing a performance image indicating the plurality of fingers of the user; and a fingering data generation unit configured to generate fingering data representing fingering in a performance by the user by using performance data representing the performance and the finger position data.
A program according to one aspect (Aspect 12) of the present disclosure causes a computer system to function as: a finger position data generation unit configured to generate finger position data representing a position of each of a plurality of fingers of a user who plays a musical instrument, by analyzing a performance image indicating the plurality of fingers of the user; and a fingering data generation unit configured to generate fingering data representing fingering in a performance by the user by using performance data representing the performance and the finger position data.
Number | Date | Country | Kind |
---|---|---|---|
2021-051179 | Mar 2021 | JP | national |
This is a continuation of International Application No. PCT/JP2022/009828 filed on Mar. 7, 2022, and claims priority from Japanese Patent Application No. 2021-051179 filed on Mar. 25, 2021, the entire content of which is incorporated herein by reference.
 | Number | Date | Country
---|---|---|---
Parent | PCT/JP22/09828 | Mar 2022 | US
Child | 18472387 | | US