The present invention relates to a technique of searching for a music track.
There is known in the art a technique of searching for a music track desired by a user from among a large number of music tracks recorded in a database. For example, Patent Document 1 discloses an incremental music track search device in which a music track that includes a note sequence corresponding to a note sequence designated by a user is searched for in a sequential manner each time a note is designated. Although Patent Document 2 and Non-Patent Document 1 do not relate to a search for a music track, each of these documents discloses a technique for searching sequence data for a part that is similar to a search query.
Patent Document 1: Japanese Patent Application Laid-Open Publication No. 2012-48619
Patent Document 2: Japanese Patent Application Laid-Open Publication No. 2008-134706
Non-Patent Document 1: SAKURAI Yasushi et al. “Stream Processing under the Dynamic Time Warping Distance”, The IEICE transactions on information and systems (Japanese edition), The Institute of Electronics, Information and Communication Engineers, 92(3), 338-350, Mar. 1, 2009
In the technology described in Patent Document 1, a music track that includes a note sequence that matches an input note sequence is obtained as a search result. A drawback of this technique is that an appropriate search result cannot be obtained in a case that an input is a singing voice, since the voice may not be an accurate representation of a desired music track. Neither Patent Document 2 nor Non-Patent Document 1 is directed to a music track search.
In view of the above, it is an object of the present invention to provide a technology for rapid searching of a desired music track based on input of a voice.
In one aspect, the present invention provides a computer-implemented music track search method including: symbolizing a change in pitch over time in a user's input voice; and acquiring results of partial sequence matching performed on a plurality of music tracks recorded in a database by using as a query a symbol sequence including the symbolized input voice, the partial sequence matching being based on an edit distance.
In another aspect, the present invention provides a music track search device including: a symbolizer configured to symbolize a change in pitch over time in a user's input voice; and an acquirer configured to acquire results of partial sequence matching performed on a plurality of music tracks recorded in a database by using as a query a symbol sequence including the symbolized input voice, the partial sequence matching being based on an edit distance.
In still another aspect, the present invention can also be understood as a program executable by a computer to execute the above music track search method, as well as a computer-readable recording medium storing such a program.
The voice inputter 11 accepts input of a user's voice. The symbolizer 12 symbolizes a change in pitch over time in the voice accepted by the voice inputter 11. The query generator 13 generates a search query that includes the input voice symbolized by the symbolizer 12.
The storage unit 14 stores a database in which information on a plurality of music tracks is recorded. The searcher 15 searches the database stored in the storage unit 14 for a music track that includes a part similar to the search query generated by the query generator 13. The searcher 15 employs a search algorithm that uses partial sequence matching based on an edit distance. The partial sequence matching involves identifying a part similar to the search query within an object with which the search query is to be matched (hereafter, “matching object”), which in this example is a music track. The part similar to the search query is referred to as a “similar section”. The modifier 17 modifies a search result obtained by the searcher 15 by use of a method that is different from the partial sequence matching based on the edit distance. This modification is performed for a prescribed number of music tracks, selected from the search result provided by the searcher 15 in descending order of degree of similarity. The modifier 17 modifies the search result based on onset time differences.
The outputter 16 outputs the search results of the searcher 15, and the modified search results of the modifier 17.
The memory 101 and the storage 102 are examples of non-transitory storage media. In the present description, the term “non-transitory” storage medium includes all types of computer-readable storage media, including volatile storage media, with the only exception being that of transitory, propagating signals.
The storage 102 stores an application program (hereafter, “client program”) that causes a computer device to function as a client device in the music track search service. The functions shown in
The memory 201 and the storage 202 are examples of non-transitory storage media.
The storage 202 stores a program (hereafter, “server program”) that causes a computer device to operate as a server device for the music track search service. The functions shown in
At step S6, the terminal device 10 requests the server device 20 to carry out a more detailed matching search. At step S7, the server device 20 modifies the search result based on onset time differences for a prescribed number of music tracks in descending order of degree of similarity, the degrees of similarity of the music tracks having been determined by use of partial sequence matching based on an edit distance. At step S8, the server device 20 transmits the modified search result. At step S9, the terminal device 10 displays the search result.
At step S11, the client program determines whether the pitch of the input voice is stable. The input voice is a singing voice of the user that is input via the microphone serving as the input device 103. The input singing voice of the user is likely to fluctuate in pitch for a variety of reasons, and thus become unstable. In a case that the pitch of the input voice satisfies a prescribed stability condition, the client program determines that the pitch of the input voice is stable. As the prescribed stability condition, there may be employed, for example, a condition that an index of pitch fluctuation falls below a set threshold. As the index of pitch fluctuation, there may be used, for example, the dispersion (variance) of pitches in an immediately preceding prescribed time period, or the difference between a maximum pitch value and a minimum pitch value in an immediately preceding prescribed time period. In a case that the pitch of the input voice is determined as being stable (S11: YES), the client program proceeds to step S12. In a case that the pitch of the input voice is determined as not being stable (S11: NO), the client program waits until the pitch becomes stable. The process at step S11 is executed by the voice inputter 11.
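A minimal sketch of such a stability check, assuming pitch is sampled at a fixed rate in semitone units and using the max–min fluctuation index; the window length and threshold values are illustrative, not taken from the embodiment:

```python
from collections import deque

class PitchStabilityChecker:
    """Judges whether recent pitch samples satisfy a stability condition (step S11)."""

    def __init__(self, window_size=20, max_fluctuation_semitones=0.5):
        # Both parameters are illustrative; the embodiment only requires that
        # an index of pitch fluctuation fall below a set threshold.
        self.window = deque(maxlen=window_size)
        self.max_fluctuation = max_fluctuation_semitones

    def add_sample(self, pitch_semitones: float) -> bool:
        """Adds one pitch sample; returns True when the recent pitch is stable."""
        self.window.append(pitch_semitones)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history yet
        # Index of fluctuation: difference between max and min in the window.
        return (max(self.window) - min(self.window)) < self.max_fluctuation
```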
At step S12, the client program quantifies the pitch. Specifically, a note that has a pitch range in which the pitch has been determined as being stable at step S11 is quantified at step S12. More specifically, the note is a single note that has a pitch range regarded as a single pitch. The client program stores the quantified pitch in the memory 101.
At step S13, the client program calculates a relative pitch difference between the newly quantified note and the note that was quantified immediately before the new note. The pitch difference ΔP is expressed as
ΔP=P[i]−P[i−1] (1)
where the pitch of the newly quantified note (the i-th note in the input voice) is expressed as P[i].
At step S14, the client program symbolizes the pitch difference ΔP. The pitch difference is expressed, for example, by appending a symbol (+ or −) representative of a direction of change to a numerical value based on intervals (relative pitch) in a twelve-tone equal temperament. The symbolized pitch difference ΔP[i] is expressed as S[i]. For example, in a case that P[i] and P[i−1] indicate the same pitch (unison), S[i]=±0. In a case that P[i] is higher than P[i−1] by a minor third, S[i]=+3. In a case that P[i] is lower than P[i−1] by a perfect fifth, S[i]=−7. The process at steps S12 to S14 is executed by the symbolizer 12.
At step S15, the client program generates a search query. The search query includes, in time sequence, pitch differences that have been detected up to a current time point from a start of voice input. When the i-th note is detected in the input voice, for example, the search query includes symbols indicative of (i−1) pitch differences from S[2] to S[i]. The process at step S15 is executed by the query generator 13.
In the flow in
Q(t3)=(+2, +1) (2)
as a sequence of symbolized pitch differences. The search query Q(t7) generated at time point t7 includes
Q(t7)=(+2, +1, ±0, −1, +1, −2) (3)
as a sequence of symbolized pitch differences.
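As an illustration of steps S13 to S15, the following is a minimal sketch, assuming quantized pitches arrive as MIDI-style semitone numbers; the helper names and the concrete pitch values are hypothetical:

```python
def symbolize_pitch_difference(p_prev: int, p_new: int) -> str:
    """Symbolizes the pitch difference between two quantized notes (step S14)."""
    delta = p_new - p_prev  # expression (1): ΔP = P[i] − P[i−1]
    if delta == 0:
        return "±0"           # unison
    return f"{delta:+d}"      # e.g. "+3" for a minor third up, "-7" for a fifth down

def build_search_query(pitches: list[int]) -> list[str]:
    """Generates a search query of (i−1) symbolized pitch differences (step S15)."""
    return [symbolize_pitch_difference(a, b) for a, b in zip(pitches, pitches[1:])]

# A pitch sequence that yields the Q(t7) of expression (3); the absolute
# pitches are illustrative, since only the differences are recorded.
print(build_search_query([60, 62, 63, 63, 62, 63, 61]))
# ['+2', '+1', '±0', '-1', '+1', '-2']
```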
The sequence of symbolized pitch differences does not include information on note lengths, i.e., the duration of each note; note durations are disregarded. Accordingly, whether a newly detected note corresponds in length to a sixteenth note, a half note, and so on has no influence on the sequence of pitch differences. The only information recorded is the pitch difference between one note and the immediately preceding note. Likewise, a rest between notes has no influence on the sequence of pitch differences. Accordingly, whether a note and an immediately following note are contiguous or are separated by a rest has no influence on the symbolized pitch differences.
Reference is again made to
Each time the client program generates a new search query, the client program transmits the newly generated search query to the server device 20 (Step S2). In the example shown in
Before describing operations in detail, the search algorithm will first be described in outline. As stated above, searching involves partial sequence matching based on an edit distance. Before the search algorithm of the present embodiment is described, partial sequence matching based on an edit distance will be outlined. As the edit distance, the Levenshtein distance is used. The Levenshtein distance indicates a degree by which two symbol sequences differ from each other, and is expressed as the least number of edit operations (insertion, deletion, or substitution of a character) required to change one symbol sequence into the other. A fuzzy search based on the Levenshtein distance, as compared to other methods, such as a method based on a regular expression or n-gram similarity, is better suited to searching for a music track based on a voice input, since the accuracy of the input voice is subject to fluctuation.
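For reference, a minimal sketch of the standard Levenshtein distance computed by dynamic programming, with uniform costs of 1 per edit operation as in the textbook definition (not the weighted variants discussed later):

```python
def levenshtein(a: str, b: str) -> int:
    """Least number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters of a reaches the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # substitution (or match)
                            prev[j] + 1,               # deletion from a
                            curr[j - 1] + 1))          # insertion into a
        prev = curr
    return prev[-1]

print(levenshtein("GABC", "ABC"))  # 1: delete the leading "G"
```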
In the i-th row, j-th column cell (hereinafter, “cell (j, i)”) of this matrix, it is assumed that a symbol sequence consists of a symbol sequence corresponding to the i-th and subsequent symbols of the search query appended to the end of the symbol sequence up to the j-th symbol of the matching object. Hereafter, for each cell such a symbol sequence will be referred to as a “target symbol sequence”. For example, the target symbol sequence in cell (1, 1) is a symbol sequence “GABC” that consists of a symbol sequence “ABC” corresponding to the first and subsequent symbols of the search query appended to the end of a symbol sequence “G” up to the first symbol of the matching object. The target symbol sequence in cell (6, 2) is a symbol sequence “GAHCDBBC” that consists of a symbol sequence “BC” corresponding to the second and subsequent symbols of the search query appended to the end of a symbol sequence “GAHCDB” up to the sixth symbol of the matching object. In
Next, the Levenshtein distance is calculated for the target symbol sequence in each cell in relation to the search query. For example, in cell (1, 1), since the target symbol sequence is obtained by entering “G” in front of the search query, the edit distance is “1”. In cell (6, 2), the target symbol sequence is obtained by entering “G” in front of the search query and also entering “HCDB” between the first character “A” and the second character “B” of the search query, as a result of which the edit distance is “5”. In
Generally, in applying the Levenshtein distance, when there is a match between symbol sequences, movement occurs in the matrix from the current cell to the lower right cell (downward and to the right); when an addition (insertion) is made in a symbol sequence, movement occurs from the current cell to the cell on its right (horizontally); and when a deletion is made in a symbol sequence, movement occurs to the cell directly below the current cell. Utilizing movement in matrix form in this way enables an optimal path for editing to be obtained (as indicated by the arrows in
In this regard, a method referred to as SPRING, which relates to Patent Document 2 and Non-Patent Document 1, is used in the present embodiment. In this method, the edit distance is set to zero over the entire first row of the matching matrix, so that a match may start at any position in the matching object; as described below, the least value in the last row then serves as the score, so that the match may also end at any position.
In a cell (j, i) in any of the second and following rows, an edit distance D(j, i) is calculated as follows.
D(j, i)=d(j, i)+min[D(j−1, i−1), D(j−1, i), D(j, i−1)] (4)
Here, d(j, i) stands for the Levenshtein distance between the target symbol sequence in the cell (j, i) and a symbol sequence that consists of the symbol sequence up to the (j−1)th symbol of the matching object appended to the front of the symbol sequence corresponding to the (i−1)th and subsequent symbols of the search query. For example, the target symbol sequence in cell (5, 3) is “GAHCDC”, while the symbol sequence that consists of the symbol sequence “GAHC” up to the fourth symbol of the matching object appended to the front of the symbol sequence “BC” corresponding to the second and subsequent symbols of the search query is “GAHCBC”; comparison between the two yields d(5, 3)=1. The function “min” returns the least one of its arguments. In other words, in the expression above, the second term on the right-hand side is the least value among the edit distances D in the diagonally upper left cell, the horizontally adjacent left cell, and the cell directly above the cell in question. For example, there stands D(5, 3)=d(5, 3)+min[D(4, 2), D(4, 3), D(5, 2)].
In the bottom row of the matching matrix (the fifth row in the example in
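The following is a simplified sketch of this style of partial sequence matching, written in the conventional SPRING-style form: zero cost along the first row so a match may start anywhere in the matching object, and the least value in the bottom row taken as the score. It assumes uniform edit costs and exact symbol comparison, so it illustrates the technique rather than reproducing the embodiment's expression (4) exactly:

```python
def partial_sequence_match(query: list[str], track: list[str]) -> tuple[int, int]:
    """SPRING-style matching: least edit distance of the query against any
    contiguous part of the track. Returns (score, end position of the match)."""
    n, m = len(query), len(track)
    # D[i][j]: best edit distance aligning query[:i] with a part of track ending at j.
    D = [[0] * (m + 1) for _ in range(n + 1)]  # first row all zeros: free start
    for i in range(1, n + 1):
        D[i][0] = i  # aligning query[:i] with nothing costs i deletions
        for j in range(1, m + 1):
            cost = 0 if query[i - 1] == track[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # substitution or match
                          D[i - 1][j] + 1,         # query symbol left unmatched
                          D[i][j - 1] + 1)         # track symbol left unmatched
    # Free end: the score is the least value in the bottom row.
    end = min(range(m + 1), key=lambda j: D[n][j])
    return D[n][end], end

score, end = partial_sequence_match(list("ABC"), list("GAHCDBBC"))
print(score, end)  # score 1: a part of the track differs from "ABC" by one edit
```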
In the flow chart of
At step S32, the server program identifies, in accordance with a prescribed order, one music track as the matching object from among the music tracks in the database stored in the storage unit 14. The database includes information on each music track, specifically, attribute information, such as an identifier of each music track, and music data for reproducing each music track (e.g., musical instrument digital interface (MIDI) data, uncompressed audio data, such as linear pulse-code modulation (PCM) data, or compressed audio data, such as, for example, data in MP3 format). The database also includes data in which the main melody of each music track is symbolized; for example, the melody of the main vocal section where the music track is a song.
At step S33, the server program calculates a matching matrix for the music track to be matched (specifically, the edit distance in each cell and the least distance (i.e., score) between the search query and the music track). The method of calculating a matching matrix is as described earlier. In calculating a matching matrix, the server program reads the database and uses data of the symbolized melody of the music track to be matched.
At step S34, the server program determines whether the score of the music track to be matched is smaller than a threshold. The threshold may be set in advance, for example. When the score is determined to be equal to or greater than the threshold (S34: NO), the server program deletes the calculated matching matrix from the memory 201 (step S35). When the score is determined to be smaller than the threshold (S34: YES), the server program proceeds to step S36.
At step S36, the server program stores in a result table the identifier and the score of the music track to be matched. The result table is a table that stores information on music tracks having high degrees of similarity (scores smaller than the threshold). The result table also includes information that specifies a similar section in each music track.
At step S37, the server program determines whether calculation of a matching matrix has been completed for all music tracks recorded in the database. When the server program determines that a music track exists for which a matching matrix has not yet been calculated (S37: NO), the server program proceeds to step S32. At step S32, the next music track becomes a new matching object, and the process from steps S33 to S36 is carried out for the new music track to be matched. When the server determines that calculation of a matching matrix has been completed for all music tracks (S37: YES), the server program proceeds to step S4. At step S4, the server program transmits the result table as a search result to the terminal device 10 having transmitted the search query.
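Putting steps S31 to S37 together, a sketch of the per-query search loop, reusing the partial_sequence_match function sketched above; the database layout and the threshold value are illustrative assumptions:

```python
def search_database(query: list[str], database: list[dict], threshold: int = 3) -> list[dict]:
    """Scores every track against the query and returns a result table of
    tracks whose score is below the threshold, best match first."""
    result_table = []
    for track in database:                        # step S32: pick the next track
        melody = track["symbolized_melody"]       # symbolized main melody from the DB
        score, end = partial_sequence_match(query, melody)  # step S33
        if score < threshold:                     # step S34
            result_table.append({
                "id": track["id"],
                "score": score,
                "similar_section_end": end,       # locates the similar section
            })
        # step S35 (discarding the matrix) is implicit: it simply goes out of scope
    result_table.sort(key=lambda r: r["score"])   # smaller score = more similar
    return result_table
```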
The diagram in
The method of displaying a search result is not limited to the example shown in
As described in the foregoing, the process from steps S1 to S5 is repeated, and therefore the search result is updated continuously for as long as input of the voice continues. In the initial period just after voice input starts, a search result is highly likely to be influenced by noise because the search query is short. However, as the search query lengthens with continuing input of the voice, noise is filtered out, enabling the candidate music tracks to be narrowed down.
When a condition for starting detailed matching is satisfied, the terminal device 10 requests the server device 20 to carry out more detailed matching, i.e., requests that the accuracy of the search result be increased (step S6). The condition for starting detailed matching may be, for example, termination of voice input, or input by the user of an explicit instruction for detailed matching. When the condition is satisfied, the terminal device 10 transmits a request for detailed matching (hereafter, “request for increased accuracy”). The request for increased accuracy includes: information indicating that the request is for detailed matching; a search query; information for specifying target music tracks; and information for specifying a similar section in each music track. The information for specifying the target music tracks includes identifiers of at least some of the music tracks included in the result table obtained at step S4. The phrase “at least some of the music tracks” means, for example, music tracks in the result table ranked from the one with the highest degree of similarity down to a prescribed rank; for example, from the first-ranked to the tenth-ranked music track.
The search query included in the request for increased accuracy is information that differs from the search query generated at steps S14 and S15, and includes information on a length of each note. Note length information includes, for example, information indicative of an onset time difference. By onset time difference is meant a time length from a time point at which input of one note starts to a time point at which input of a next note starts. Hereafter, when there is a need to distinguish the search query transmitted at step S6 from the search query generated at steps S14 and S15, the former will be referred to as the “second search query” and the latter will be referred to as the “first search query”. The second search query may be uncompressed audio data or compressed audio data indicative of a waveform of an input voice, or may be data in which an input voice is symbolized together with an onset time difference. The client program stores the input voice in the form of data and uses the stored data to generate the second search query. In the search using the first search query, a time length of a note is disregarded, whereas in the search using the second search query, music tracks are narrowed down by also taking into account a time length of a note.
The flowchart in
At step S72, the server program compares, within the music track to be matched, the similar section relative to the first search query with the second search query, and quantifies a degree of similarity between the two. In quantifying the degree of similarity, the onset time difference is taken into account. It is of note that, instead of an onset time difference, the time length of a voiced section of the input voice (i.e., the time length of a section in which a pitch is detected) may be symbolized in the second search query.
In the diagram in
The server program first calculates an onset time difference between the music track 1 and the search query. Here, a square of an onset time difference is obtained for each note, and the obtained squares for all notes in a similar section are added up. For example, the onset time difference ΔL(1) between the music track 1 and the search query is
Likewise, the onset time difference ΔL(2) between the music track 2 and the search query is
ΔL(2)=0.0 (7)
An onset time difference ΔL indicates that a music track is closely similar to the search query when the value of the onset time difference ΔL is small. In other words, this example shows that the music track 2 is more similar to the search query than the music track 1 (i.e., the degree of similarity to the music track 2 is higher than that to the music track 1). Thus, it can be said that the onset time difference ΔL is a second index value indicative of a degree (high or low) of similarity between a music track to be matched and the second search query. In contrast, it can be said that the score is a first index value indicative of a degree (high or low) of similarity between a music track to be matched and the first search query.
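A sketch of this second index value in its sum-of-squares form, assuming each sequence lists onset time differences in seconds (the time from the start of one note to the start of the next) and that the query notes and the similar-section notes are already in one-to-one correspondence; the concrete numbers are illustrative:

```python
def onset_difference_score(query_intervals: list[float],
                           track_intervals: list[float]) -> float:
    """Second index value ΔL: the sum over corresponding notes of the squared
    difference in onset time differences. Smaller means more similar."""
    return sum((q - t) ** 2 for q, t in zip(query_intervals, track_intervals))

query  = [0.50, 0.50, 0.75]   # sung onset intervals
track1 = [0.40, 0.70, 0.50]   # deviates from the singing
track2 = [0.50, 0.50, 0.75]   # matches the singing exactly

print(onset_difference_score(query, track1))  # 0.1125 -> less similar
print(onset_difference_score(query, track2))  # 0.0, as in expression (7)
```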
Reference is again made to
At step S74, the server program determines whether modification of scores has been completed for all music tracks to be matched that are specified in the request for increased accuracy. When the server program determines that there is a music track for which score modification has not yet been completed (S74: NO), the server program proceeds to step S71. At step S71, the server program specifies a new music track that is to be the matching object, and executes the process at steps S72 and S73 that follow. When the server program determines that score modification has been completed for all music tracks to be matched (S74: YES), the server program transmits a list of the modified scores to the terminal device 10 from which the request for increased accuracy was transmitted (step S8). The terminal device 10 displays the search result (step S9). Display of the result is carried out substantially in the same way as the display of a result at step S5, for example. The result may also be displayed together with information that indicates that the result is final (no more incremental search need be carried out).
Next, an example will be described in which the music track search system 1 is applied to a karaoke device. In the present example, a music track is searched for among karaoke pieces (music tracks) recorded in a database, with the input singing voice of a user serving as a search query. The music track identified by the search is played so as to follow the singing voice of the user. In other words, in this karaoke device, when the user starts singing a music track a cappella, a search is made for the music track that accords with the melody being sung, and the karaoke (accompaniment) is played so as to follow the singing of the user.
The karaoke device 50 includes a voice inputter 11, a symbolizer 12, a query generator 13, an outputter 16, an identifier 51, a communicator 52, and a player 53. The karaoke device 50 corresponds to the terminal device 10 in the music track search system 1 (i.e., to a music track search device). The voice inputter 11, the symbolizer 12, the query generator 13, and the outputter 16 are as described earlier. From the input singing voice of the user, the identifier 51 acquires a tempo and a key of the singing voice. The communicator 52 communicates with the server device 60. In the present example, the communicator 52 transmits to the server device 60 a search query generated by the query generator 13 and a request for finding one music track, and receives music data from the server device 60. The player 53 plays a music track according to the music data received from the server device 60. The player 53 includes, for example, a speaker and an amplifier.
The server device 60 includes the storage unit 14, the searcher 15, the modifier 17, and a communicator 61. The server device 60 corresponds to the server device 20 in the music track search system 1. The storage unit 14, the searcher 15, and the modifier 17 are as described earlier. The database stored in the storage unit 14 is a database for karaoke pieces. The communicator 61 communicates with the karaoke device 50. The communicator 61 in the present example transmits a search result and music data to the karaoke device 50.
At step S600, the karaoke device 50 selects one music track from among the music tracks obtained as the search result. The music track may be selected in accordance with an instruction input by the user, or may be automatically selected by the karaoke device 50 without explicit instruction from the user (e.g., the music track with the highest degree of similarity, i.e., the lowest score, may be selected automatically).
At step S700, the karaoke device 50 transmits to the server device 60 a request for the selected music track. The request includes an identifier that specifies the selected music track. The server device 60 transmits to the karaoke device 50 music data of the requested music track. At step S800, the karaoke device 50 receives the music data from the server device 60.
At step S900, the karaoke device 50 plays a karaoke piece in accordance with the received music data. The karaoke device 50 extracts a tempo and a key from the input singing voice at a freely-selected timing between step S100 and step S800, and plays the karaoke piece at that tempo and in that key. The karaoke device 50 plays the karaoke piece from a playback point (playback time point) that follows the singing voice of the user, i.e., a playback point identified within the selected karaoke piece based on the section similar to the search query. For example, in an ideal system, the karaoke device 50 plays the karaoke piece from the time point at which the similar section ends. Such a system assumes that the time from when the karaoke device 50 transmits the search query, followed by a request for transmission of the music data, to the server device 60 until receipt of the music data is complete (i.e., the time difference between the transmission of the search query and the reception of the music data) is almost zero. If the time difference is too large to be ignored, the karaoke device 50 plays the karaoke piece from a time point obtained by adding a predicted value of the time difference to the time point at which the similar section ends.
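As a small illustration of this playback-point calculation, a hedged sketch under the assumption that all times are in seconds and that the latency prediction is supplied from elsewhere; the function name is hypothetical:

```python
def playback_start_point(similar_section_end: float,
                         predicted_latency: float = 0.0) -> float:
    """Playback point within the karaoke piece that follows the user's singing:
    the end of the similar section, advanced by the predicted delay between
    sending the search query and finishing reception of the music data."""
    return similar_section_end + predicted_latency

# E.g., the similar section ends 12.4 s into the piece and the predicted
# query-to-music-data latency is 0.8 s, so playback starts at 13.2 s.
print(playback_start_point(12.4, 0.8))
```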
By use of the karaoke system 5, all effort required of a user to search for a desired music track from an enormous list can be eliminated. Moreover, by use of the karaoke system 5, a karaoke piece (accompaniment) is played so as to follow the a cappella singing voice of the user, and thus a new form of entertainment is provided.
It is also possible to terminate a search, for example, at a time point at which the user selects any one of a plurality of music tracks obtained as a search result. For example, the outputter 16 displays a list of a plurality of music tracks found in the search. Specifically, a list is displayed in which names of the plurality of music tracks are listed in ascending order of scores. The manner in which the music tracks are displayed (e.g., by color or size) may differ among the music tracks depending on their scores.
The user can thus select an intended music track from the list. The outputter 16 highlights the music track selected by the user. For example, the music track selected by the user is moved to the top of the list and is displayed in a manner differing from the manner(s) in which the other music tracks are displayed (e.g., in a different color). When a music track is selected in this manner, the search for a music track is terminated, and the search result at this point is deemed the final result. Specifically, generation and transmission of search queries are terminated when the music track is selected by the user, and from this point on, no further search for a music track is performed.
The present invention is not limited to the embodiment described above, and a variety of modifications are possible. Some modifications are described below. Two or more of the modifications below may be used in combination.
The method of calculating an edit distance is not limited to the example described in the embodiment. For example, the edit costs of insertion, deletion, and substitution need not be equal, and may be weighted. Specifically, the edit cost of substitution may differ depending on a pitch difference between before and after the substitution. For example, the edit cost may be set to decrease as the pitch difference between before and after the substitution decreases. When simple Levenshtein distances are used, a pitch difference is not taken into consideration, and the edit cost (i.e., the score) remains the same irrespective of whether the difference in pitch from the search query is a semitone, a fifth, etc. In this example, however, the edit cost decreases in proportion to a decrease in the pitch difference; hence the score value also decreases (the degree of similarity becomes higher) in proportion to a decrease in the pitch difference from the search query. Thus, the degree of similarity can be determined with greater accuracy. Edit costs may also differ for different types of edit operations. For example, the edit cost of deletion may be less than that of insertion.
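A sketch of one such weighting, assuming symbols are the signed semitone differences produced at step S14; the linear cost scale is an illustrative assumption, not taken from the embodiment:

```python
def substitution_cost(query_symbol: str, track_symbol: str) -> float:
    """Weighted substitution cost: smaller when the two pitch differences are
    close (e.g. "+2" vs "+1" costs less than "+2" vs "-7")."""
    q = int(query_symbol.replace("±", ""))  # "±0" -> 0, "+3" -> 3, "-7" -> -7
    t = int(track_symbol.replace("±", ""))
    if q == t:
        return 0.0
    # Illustrative scale: one semitone of divergence costs 1/12 of a full edit,
    # capped at 1.0 so a substitution never exceeds a plain insert/delete.
    return min(abs(q - t) / 12.0, 1.0)
```

In the partial_sequence_match sketch above, this function would replace the fixed 0/1 substitution cost (and the distances would then be floats rather than integers).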
In a mode in which different edit costs are set depending on pitch differences or types of edit operations, the edit costs may be determined in accordance with a history of past search queries. For example, it may be statistically apparent from past search queries that the pitch input for a specific part of a music track tends to be consistently lower than the pitch in the music track. In that case, edit costs are set lower when the pitch in the specific part of a search query is lower than the pitch in the specific part of the music track than when it is higher. Alternatively, where it is statistically apparent that a specific divergence in pitch tends to occur more readily when a pitch difference in a search query satisfies a specific condition (e.g., when there is a rise of an octave or more between one note and the next), edit costs are set in accordance with this tendency.
An event that triggers generation of a search query is not limited to detection of a new note in the input voice. For example, following passage of a prescribed time period from generation of a most recent search query during input of a voice, a next search query may be generated. Particularly, in an initial voice input period, a search query may be generated when an amount of data of the symbolized input voice exceeds a set threshold. Alternatively, a search query may be generated when a prescribed number of new pitch differences are detected in the input voice. As yet another alternative, a search query may be generated when input of the voice is terminated. In this case, an incremental search is not performed.
A search query for performing partial sequence matching based on an edit distance may include information on an onset time difference. In other words, the symbolizer 12 may symbolize a voice together with information on an onset time difference. The symbolizer 12 may also symbolize an actual pitch instead of a pitch difference. In this case, the searcher 15 converts transition of pitches included in the search query into transition of changes in pitch.
The method of symbolizing pitch differences is not limited to the example described in the embodiment. Symbolization may be carried out based on criteria that are not reliant on intervals in scales in a twelve-tone equal temperament or the like.
The method of increasing accuracy of a search result is not limited to the example described in the embodiment. Any method may be employed so long as the method uses information that is not used in partial sequence matching based on an edit distance.
A part of the functions of the music track search system 1 shown in
The timing at which the modifier 17 modifies a search result is not limited to the example described in the embodiment. For example, in the flowchart shown in
The hardware configuration of the music track search system 1 is not limited to the examples shown in
The method of calculating a degree of similarity at step S72 is not limited to the example described in the embodiment. The terminal device 10, when symbolizing an onset time difference within an input voice, may symbolize the input voice after extending its duration such that its length becomes equal to the length of the part of the music track to be matched that corresponds to the input voice (i.e., the length of the input voice may be standardized). In this method, even if there is a difference in tempo between music tracks, a degree of similarity between those music tracks can be determined based on differences in the distribution of notes, rests, etc., within a given frame (i.e., a bar). Moreover, the above embodiment employs, as an index of a degree of similarity, a sum of squares of onset time differences between notes in a search query and corresponding notes in a music track to be matched (expression (6)). Alternatively, there may be used a value obtained by averaging the absolute values of the onset time differences over the number of notes; averaging over the number of notes allows the onset time differences to be evaluated irrespective of the number of notes, as sketched below. Furthermore, instead of, or in addition to, onset time differences between notes in a search query and corresponding notes in a music track to be matched, there may be used differences in note lengths between a note in the search query and a corresponding note in the music track. If note lengths are to be used, rests also have to be taken into account.
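A sketch of that averaged alternative, under the same note-for-note alignment assumption as the sum-of-squares index sketched earlier:

```python
def mean_absolute_onset_difference(query_intervals: list[float],
                                   track_intervals: list[float]) -> float:
    """Alternative second index value: mean absolute onset time difference.
    Dividing by the number of notes makes scores comparable between queries
    of different lengths."""
    diffs = [abs(q - t) for q, t in zip(query_intervals, track_intervals)]
    return sum(diffs) / len(diffs)
```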
A section within an input voice in which a pitch is not detected may also be reflected in a search query Q. A section in which a pitch is not detected may be a section in which a pitch cannot be detected accurately due to insufficient volume or the like (a silent section), or a section in which a consonant having no harmonic structure is voiced (consonant section).
For example, in a case that a pitch is identical between a section “a” immediately preceding and a section “b” immediately following a silent section or a consonant section, the search query Q is configured to include: a symbol that represents a pitch difference between the section “a” and a section immediately preceding the section “a”; and a symbol that represents a pitch difference (i.e., zero) between the section “b” and the section “a”, for which a pitch is detected immediately before the section “b”. The silent section or the consonant section may be symbolized as a section in which a pitch is absent. Moreover, in the search query contained in the request for increased accuracy, the consonant section may be incorporated into a section immediately following the consonant section, with the immediately following section containing a vowel corresponding to the consonant, so that a time length (onset time difference) of the section may be determined.
The software configuration for providing the music track search service is not limited to the example described in the embodiment. The functions described in the embodiment may be provided not by a single program but by an assembly of a plurality of software components.
The program (e.g., the client program and the server program) for providing the music track search service may be provided by way of a recording medium, such as an optical disc, a magnetic disc, or a semiconductor memory, or may be downloaded via a communication network, such as the Internet.
The application example of the music track search system 1 is not limited to a karaoke system. For example, the music track search system may be applied to a music track search in a music distribution service provided via a network, or to a music track search in a music player.
An invention as set forth in the configurations described below is derived from the above description.
That is, a music track search method according to one aspect of the present invention includes: symbolizing a change in pitch over time in an input voice of a user; and acquiring a result of partial sequence matching performed on a plurality of music tracks recorded in a database by using as a query a symbol sequence including the symbolized input voice, with the partial sequence matching being performed based on an edit distance. In this aspect, a desired music track can be rapidly searched for, based on the input voice.
In a preferred mode, the symbolizing of the change in pitch over time in the input voice may include symbolizing the change as a difference in relative pitch over time. In this mode, the input voice is symbolized as relative pitch differences (e.g., differences in intervals in a twelve-tone equal temperament). Thus, even when pitches of notes in the input voice differ from those in a music track, it is possible to search for a music track that accords with a transition of pitches of notes that are sequential in a melody of the input voice.
Preferably, the symbolizing of the change in pitch over time in the input voice may include symbolizing the change over time while disregarding information on time lengths of notes in the input voice. In this mode, it is possible to search for a music track including pitches that match the pitches in the input voice even in a case that durations of notes in the input voice differ from those of corresponding notes in a music track.
In a preferred mode, the music track search method may be configured such that the symbolizing of the change in pitch over time in the input voice is repeated in conjunction with reception of the input voice, and the acquiring of the result of the partial sequence matching is repeated in conjunction with reception of the input voice, and further the method may include outputting the result repeatedly in conjunction with reception of the input voice. In this mode, symbolization of the input voice and acquisition of the result of the partial sequence matching are performed in conjunction with reception of the input voice, and the result also is output in conjunction with reception of the input voice. Thus, the search result can be updated following reception of the input voice, and the user can be informed of a search result by display of a music track that includes matching pitches, while inputting his/her singing voice.
In a preferred mode, in the partial sequence matching, an edit cost used when calculating the edit distance may be weighted depending on a difference between a pitch in the query and a pitch in each music track recorded in the database. In this mode, edit costs decrease in proportion to a decrease in pitch differences. Thus, a music track having smaller pitch differences from the query will have a smaller score value (higher degree of similarity). In this case, a degree of similarity can be determined more accurately.
In a preferred mode, the result of the partial sequence matching may include index values each of which indicates a degree of similarity to the query for a corresponding one of the plurality of music tracks, and the method may further include: modifying the result of the partial sequence matching, for a prescribed number of music tracks with the highest similarity, within the result, in order from a high degree of similarity to a low degree of similarity as indicated by the index values, with the result being modified based on a difference between a time length of a note included in the query and a time length of a note that corresponds to the note in the query in each of the prescribed number of music tracks. In this mode, time lengths of notes are taken into account in addition to changes in pitch over time. Thus, accuracy in a search result can be enhanced.
The present invention can also be understood as a music track search device that executes the music track search method as set forth in the above modes, or as a program that causes a computer to execute the music track search method, or as a computer-readable recording medium that stores the program. Substantially the same effects as those described above are attained by the music track search device, the program, or the recording medium. As described earlier, the music track search device may be realized by the terminal device 10 or the server device 20, or may be realized by these devices working in coordination with each other.
This application is a Continuation Application of PCT Application No. PCT/JP2016/077041, filed Sep. 14, 2016, and is based on and claims priority from Japanese Patent Application No. 2015-192967, filed Sep. 30, 2015, the entire contents of each of which are incorporated herein by reference.