The invention relates generally to capture and/or processing of vocal performances and, in particular, to techniques suitable for use in portable device implementations of pitch correcting vocal capture.
The installed base of mobile phones and other portable computing devices grows in sheer number and computational power each day. Hyper-ubiquitous and deeply entrenched in the lifestyles of people around the world, they transcend nearly every cultural and economic barrier. Computationally, the mobile phones of today offer speed and storage capabilities comparable to desktop computers from less than ten years ago, rendering them surprisingly suitable for real-time sound synthesis and other musical applications. Partly as a result, some modern mobile phones, such as the iPhone™ handheld digital device, available from Apple Inc., support audio and video playback quite capably.
Like traditional acoustic instruments, mobile phones can be intimate sound producing devices. However, by comparison to most traditional instruments, they are somewhat limited in acoustic bandwidth and power. Nonetheless, despite these disadvantages, mobile phones do have the advantages of ubiquity, strength in numbers, and ultramobility, making it feasible (at least in theory) to bring together artists for jam sessions, rehearsals, and even performance almost anywhere, anytime. The field of mobile music has been explored in several developing bodies of research. See generally, G. Wang, Designing Smule's iPhone Ocarina, presented at the 2009 International Conference on New Interfaces for Musical Expression (NIME), Pittsburgh (June 2009). Moreover, recent experience with applications such as the Smule Ocarina™ and Smule Leaf Trombone: World Stage™ has shown that advanced digital acoustic techniques may be delivered in ways that provide a compelling user experience.
As digital acoustic researchers seek to transition their innovations to commercial applications deployable to modern handheld devices such as the iPhone® handheld and other platforms operable within the real-world constraints imposed by processor, memory and other limited computational resources thereof and/or within communications bandwidth and transmission latency constraints typical of wireless networks, significant practical challenges present themselves. Improved techniques and functional capabilities are desired.
It has been discovered that, despite many practical limitations imposed by mobile device platforms and application execution environments, vocal musical performances may be captured and continuously pitch-corrected for mixing and rendering with backing tracks in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured on mobile devices in the context of a karaoke-style presentation of lyrics in correspondence with audible renderings of a backing track. Such performances can be pitch-corrected in real-time at the mobile device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, pad-type computer or netbook) in accord with pitch correction settings. In some cases, pitch correction settings code a particular key or scale for the vocal performance or for portions thereof. In some cases, pitch correction settings include a score-coded melody and/or harmony sequence supplied with, or for association with, the lyrics and backing tracks. Harmony notes or chords may be coded as explicit targets or relative to the score coded melody or even actual pitches sounded by a vocalist, if desired.
In these ways, user performances (typically those of amateur vocalists) can be significantly improved in tonal quality and the user can be provided with immediate and encouraging feedback. Typically, feedback includes both the pitch-corrected vocals themselves and visual reinforcement (during vocal capture) when the user/vocalist is “hitting” the (or a) correct note. In general, “correct” notes are those notes that are consistent with a key and which correspond to a score-coded melody or harmony expected in accord with a particular point in the performance. That said, in a cappella modes without an operant score, to facilitate ad-libbing off score, or with certain pitch correction settings disabled, pitches sounded in a given vocal performance may optionally be corrected solely to the nearest notes of a particular key or scale (e.g., C major, C minor, E flat major, etc.).
In addition to melody cues, score-coded harmony note sets allow the mobile device to also generate pitch-shifted harmonies from the user/vocalist's own vocal performance. Unlike static harmonies, these pitch-shifted harmonies follow the user/vocalist's own vocal performance, including embellishments, timbre and other subtle aspects of the actual performance, but guided by a score coded selection (typically time varying) of those portions of the performance at which to include harmonies and particular harmony notes or chords (typically coded as offsets to target notes of the melody) to which the user/vocalist's own vocal performance may be pitch-shifted as a harmony. The result, when audibly rendered concurrent with vocal capture or perhaps even more dramatically on playback as a stereo imaged rendering of the user's pitch corrected vocals mixed with pitch shifted harmonies and high quality backing track, can provide a truly compelling user experience.
In some exploitations of techniques described herein, we determine from our score the note (in a current scale or key) that is closest to that sounded by the user/vocalist. Pitch shifting computational techniques are then used to synthesize either the other portions of the desired score-coded chord as pitch-shifted variants of the captured vocals (even if the user/vocalist is intentionally singing a harmony) or a harmonically correct set of notes based on the pitch of the captured vocals. Notably, a user/vocalist can be off by an octave (male vs. female), or can choose to sing a harmony, or can exhibit little skill (e.g., if routinely off key), and appropriate harmonies will still be generated using the key/score/chord information to make a chord that sounds good in that context.
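By way of illustration only, the following Python sketch (with hypothetical function names and a simplified MIDI-note score representation) shows one way the nearest score-coded target might be selected, octave displacement tolerated, and pitch-shift ratios derived for the remaining chord tones:

```python
# Minimal sketch: choose the nearest score-coded chord tone to the sounded
# pitch, then derive pitch-shift ratios that move the corrected vocal onto
# the remaining chord tones. Names and representation are illustrative.

def nearest_target(sounded_midi, chord_midi_notes):
    """Score-coded note closest to the sounded pitch; octave displacement
    (e.g., male vs. female voice) is folded away rather than penalized."""
    def distance(target):
        d = abs(sounded_midi - target) % 12   # ignore octave displacement
        return min(d, 12 - d)                 # wrap within the octave
    return min(chord_midi_notes, key=distance)

def harmony_shift_ratios(sounded_midi, chord_midi_notes):
    """Ratios by which to pitch-shift the captured vocal to sound the
    other chord tones, preserving the vocalist's own timbre."""
    root = nearest_target(sounded_midi, chord_midi_notes)
    return [2.0 ** ((note - root) / 12.0)
            for note in chord_midi_notes if note != root]

# Example: vocalist sounds ~E4 (MIDI 64) against a C-major triad.
print(nearest_target(64, [60, 64, 67]))        # -> 64 (snapped to E)
print(harmony_shift_ratios(64, [60, 64, 67]))  # ratios toward C and G
```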
Based on the compelling and transformative nature of the pitch-corrected vocals and score-coded harmony mixes, user/vocalists typically overcome an otherwise natural shyness or angst associated with sharing their vocal performances. Instead, even mere amateurs are encouraged to share with friends and family or to collaborate and contribute vocal performances as part of virtual “glee clubs.” In some implementations, these interactions are facilitated through social network- and/or eMail-mediated sharing of performances and invitations to join in a group performance. Using uploaded vocals captured at clients such as the aforementioned portable computing devices, a content server (or service) can mediate such virtual glee clubs by manipulating and mixing the uploaded vocal performances of multiple contributing vocalists. Depending on the goals and implementation of a particular system, uploads may include pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Virtual glee clubs can be mediated in any of a variety of ways. For example, in some implementations, a first user's vocal performance, typically captured against a backing track at a portable computing device and pitch-corrected in accord with score-coded melody and/or harmony cues, is supplied to other potential vocal performers. The supplied pitch-corrected vocal performance is mixed with backing instrumentals/vocals and forms the backing track for capture of a second user's vocals. Often, successive vocal contributors are geographically separated and may be unknown (at least a priori) to each other, yet the intimacy of the vocals together with the collaborative experience itself tends to minimize this separation. As successive vocal performances are captured (e.g., at respective portable computing devices) and accreted as part of the virtual glee club, the backing track against which respective vocals are captured may evolve to include previously captured vocals of other “members.”
Depending on the goals and implementation of a particular system (or depending on settings for a particular virtual glee club), prominence of particular vocals (particularly on playback) may be adapted for individual contributing performers. For example, in an accreted performance supplied as an audio encoding to a third contributing vocal performer, that third performer's vocals may be presented more prominently than other vocals (e.g., those of first, second and fourth contributors); whereas, when an audio encoding of the same accreted performance is supplied to another contributor, say the first vocal performer, that first performer's vocal contribution may be presented more prominently.
In general, any of a variety of prominence indicia may be employed. For example, in some systems or situations, overall amplitudes of respective vocals of the mix may be altered to provide the desired prominence. In some systems or situations, amplitude of spatially differentiated channels (e.g., left and right channels of a stereo field) for individual vocals (or even phase relations thereamongst) may be manipulated to alter the apparent positions of respective vocalists. Accordingly, more prominently featured vocals may appear in a more central position of a stereo field, while less prominently featured vocals may be panned right- or left-of-center. In some systems or situations, slotting of individual vocal performances into particular lead melody or harmony positions may also be used to manipulate prominence. Upload of dry (i.e., uncorrected) vocals may facilitate vocalist-centric pitch-shifting (at the content server) of a particular contributor's vocals (again, based on score-coded melodies and harmonies) into the desired position of a musical harmony or chord. In this way, various audio encodings of the same accreted performance may feature the various performers in respective melody and harmony positions. In short, whether by manipulation of amplitude, spatialization and/or melody/harmony slotting of particular vocals, each individual performer may optionally be afforded a position of prominence in their own audio encodings of the glee club's performance.
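As one purely illustrative possibility, the amplitude-based approach might be sketched as follows; gain values and array handling are assumptions rather than particulars of any deployed system:

```python
import numpy as np

# Sketch of amplitude-based prominence (assumes equal-length mono NumPy
# arrays; gains are illustrative, not from the source).
def prominence_mix(vocals, featured_index, featured_gain=1.0, other_gain=0.5):
    mix = np.zeros_like(vocals[0])
    for i, v in enumerate(vocals):
        mix += (featured_gain if i == featured_index else other_gain) * v
    return mix

# Three contributors; the encoding prepared for contributor 1 features
# that contributor's own vocals, per the scheme described above.
takes = [np.random.randn(1024) for _ in range(3)]  # stand-in vocal audio
encoding_for_contributor_1 = prominence_mix(takes, featured_index=1)
```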
In some cases, captivating visual animations and/or facilities for listener comment and ranking, as well as glee club formation or accretion logic are provided in association with an audible rendering of a vocal performance (e.g., that captured and pitch-corrected at another similarly configured mobile device) mixed with backing instrumentals and/or vocals. Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still other locations and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts in ways that are suggestive of a performance or endorsement emanating from a particular geographic locale on a user manipulable globe. In this way, implementations of the described functionality can transform otherwise mundane mobile devices into social instruments that foster a unique sense of global connectivity, collaboration and community.
Accordingly, techniques have been developed for capture, pitch correction and audible rendering of vocal performances on handheld or other portable devices using signal processing techniques and data flows suitable given the somewhat limited capabilities of such devices and in ways that facilitate efficient encoding and communication of such captured performances via ubiquitous, though typically bandwidth-constrained, wireless networks. The developed techniques facilitate the capture, pitch correction, harmonization and encoding of vocal performances for mixing with additional captured vocals, pitch-shifted harmonies and backing instrumentals and/or vocal tracks as well as the subsequent rendering of mixed performances on remote devices.
In some embodiments of the present invention, a method includes using a portable computing device for vocal performance capture, the portable computing device having a display, a microphone interface and a communications interface. Responsive to a user selection, via the communications interface, a vocal score temporally synchronizable with a corresponding backing track and lyrics is retrieved, the vocal score encoding (i) a sequence of notes for a vocal melody and (ii) at least a first set of harmony notes for at least some portions of the vocal melody. At the portable computing device, the backing track is audibly rendered and corresponding portions of the lyrics are concurrently presented on the display in temporal correspondence therewith. At the portable computing device, a vocal performance of the user is captured and pitch corrected in accord with the score-encoded vocal melody to produce a first version of the user's vocal performance. At the portable computing device, at least some portions of the user's captured vocal performance are pitch shifted in accord with the score-encoded harmony notes to produce at least a second version of the user's vocal performance. The audible rendering at the portable computing device is in real-time correspondence with the user's vocal performance and mixes either or both of first and second versions of the user's vocal performance with the backing track.
In some embodiments, the method further includes mixing at least the first and second versions of the user's vocal performance with the backing track, wherein the resulting mixed performance includes both pitch corrected vocal melody and accompanying pitch shifted vocal harmony versions of the user's vocal performance. In some cases, for at least some portions of the vocal melody, the vocal score encodes a second set of harmony notes; and the audibly rendered mix includes a third version of the user's vocal performance as an additional pitch corrected vocal harmony.
In some cases, the pitch correcting and pitch shifting are based on continuous time-domain estimation of pitch for the user's captured vocal performance. In some cases, the continuous time-domain pitch estimation includes computing, for a current block of a sampled signal corresponding to the user's captured vocal performance, a lag-domain periodogram. In some cases, the lag-domain periodogram computation includes, for an analysis window of the sampled signal, at least one of: evaluations of an average magnitude difference function (AMDF) for a range of lags; and evaluations of an autocorrelation function for a range of lags.
In some embodiments, the method further includes transmitting from the portable computing device to a remote content server via the communications interface, an audio encoding of one or more of (i) the captured vocal performance of the user, (ii) a pitch corrected vocal melody or harmony version of the user's vocal performance, and (iii) the mixed performance including both pitch corrected vocal melody and accompanying pitch corrected vocal harmony versions of the user's vocal performance.
In some embodiments, the method further includes evaluating throughout the user's vocal performance whether the user's current vocals more closely correspond to the score-encoded vocal melody or to a score-encoded harmony; and based on the evaluation, synthesizing either remaining portions of a score-coded chord as pitch-shifted variants of the captured vocal performance or a harmonically correct set of notes rooted on corrected pitch of the user's vocal performance.
In some embodiments, the method further includes, responsive to the user selection, also retrieving the backing track via the data communications interface. In some cases, the backing track resides in storage local to the portable computing device, and the retrieving identifies the vocal score temporally synchronizable with the corresponding backing track and lyrics using an identifier ascertainable from the locally stored backing track.
In some cases, the backing track includes either or both of instrumentals and backing vocals and is rendered in multiple versions; and the version of the backing track audibly rendered in correspondence with the lyrics is a monophonic scratch version, and the version of the backing track mixed with pitch-corrected vocal melody and harmony versions of the user's vocal performance is a polyphonic version of higher quality or fidelity than the scratch version. In some cases, the vocal score further encodes the backing track and the lyrics. In some cases, the vocal score further encodes one or more keys in which respective portions of the vocals are to be performed.
In some cases, the portable computing device is selected from the group of: a mobile phone; a personal digital assistant; a laptop computer, notebook computer, tablet computer or netbook.
In some embodiments, the method further includes audibly rendering a second mixed performance at the portable computing device, wherein the second mixed performance includes an encoding of a pitch corrected vocal performance captured and pitch corrected at a second remote device and mixed with the backing track.
In some embodiments, the method further includes geocoding the transmitted audio encoding; and displaying a geographic origin for, and in correspondence with audible rendering of, a third mixed performance of a pitch corrected vocal performance captured and pitch corrected at a third remote device and mixed with the backing track, the third mixed performance received via the communications interface directly or indirectly from the third remote device. In some cases, the display of geographic origin is by display animation suggestive of a performance emanating from a particular location on a globe. In some cases, the method further includes capturing and conveying back to the remote server one or more of (i) listener comment on and (ii) ranking of the third mixed performance for inclusion as metadata in association with subsequent supply and rendering thereof.
In some cases, the backing track encodes a background instrumental performance. In some cases, the backing track further encodes one or more accompanying vocal performances.
In some embodiments in accordance with the present invention, a portable computing device includes a display; a microphone interface; an audio transducer interface; a data communications interface; user interface code executable on the portable computing device to capture user interface gestures selective for a backing track and to initiate retrieval of at least a vocal score corresponding thereto, the vocal score encoding (i) a sequence of notes for a vocal melody and (ii) at least a first set of harmony notes for at least some portions of the vocal melody; the user interface code further executable to capture user interface gestures to initiate (i) audible rendering of the backing track, (ii) concurrent presentation of lyrics on the display and (iii) capture of the user's vocal performance using the microphone interface; pitch correction code executable on the portable computing device to, concurrent with said audible rendering, continuously pitch correct the user's vocal performance in accord with the score-encoded vocal melody to produce a first version of the user's vocal performance; the pitch correction code further executable on the portable computing device to, concurrent with said audible rendering, continuously pitch shift at least some portions of the user's vocal performance in accord with the score-encoded harmony notes to produce at least a second version of the user's vocal performance; and a rendering pipeline executable to mix at least the first and second versions of the user's vocal performance with the backing track, such that the resulting mixed performance includes the user's own vocal performance captured in correspondence with the lyrics and backing track, but pitch-corrected and harmonized in accord with the retrieved vocal score.
In some cases, the rendering pipeline is executable to mix either or both of first and second versions of the user's vocal performance with the backing track and render a resulting mixed performance via the audio transducer interface in real-time correspondence with the user's vocal performance. In some cases, the pitch correction code includes a time-domain implementation of pitch estimation. In some cases, the time-domain implementation of pitch estimation includes code executable to compute, for a current block of a sampled signal corresponding to the user's captured vocal performance, a lag-domain periodogram. In some cases, the lag-domain periodogram computation includes, for an analysis window of the sampled signal, at least one of evaluations of an average magnitude difference function (AMDF) for a range of lags and evaluations of an autocorrelation function for a range of lags.
In some embodiments, the portable computing device further includes code executable thereon (i) to evaluate throughout the user's vocal performance whether the user's current vocals more closely correspond to the score-encoded vocal melody or to a score-encoded harmony and (ii) based on the evaluation, to synthesize either remaining portions of a score-coded chord as pitch-shifted variants of the captured vocal performance or a harmonically correct set of notes rooted on corrected pitch of the user's vocal performance.
In some embodiments, the portable computing device further includes local storage, wherein the initiated retrieval includes checking instances, if any, of the vocal score information in the local storage against instances available from a remote server and retrieving from the remote server if instances in local storage are unavailable or out-of-date. In some cases, the user interface code is further executable to initiate retrieval of either or both of the backing track and corresponding lyrics.
In some embodiments in accordance with the present invention, a computer program product is encoded in one or more media and includes instructions executable on a processor of the portable computing device to cause the portable computing device to: retrieve via a communications interface, a vocal score temporally synchronizable with a corresponding backing track and lyrics, the vocal score encoding (i) a sequence of notes for a vocal melody and (ii) at least a first set of harmony notes for at least some portions of the vocal melody; audibly render the backing track and present in temporal correspondence therewith corresponding portions of the lyrics on a display of the portable computing device; capture and pitch correct a vocal performance of the user in accord with the score-encoded vocal melody to produce a first version of the user's vocal performance; pitch shift at least some portions of the user's captured vocal performance in accord with the score-encoded harmony notes to produce at least a second version of the user's vocal performance, wherein the audible rendering is in real-time correspondence with the user's vocal performance and mixes either or both of first and second versions of the user's vocal performance with the backing track.
In some cases, the instructions encoded therein are executable on the processor of the portable computing device to further cause the portable computing device to: mix at least the first and second versions of the user's vocal performance with the backing track, wherein the resulting mixed performance includes both pitch corrected vocal melody and accompanying pitch shifted vocal harmony versions of the user's vocal performance.
In some cases, the pitch correcting and pitch shifting are implemented using a first subset of the instructions executable on the processor of the portable computing device to provide continuous time-domain estimation of pitch for the user's captured vocal performance. In some cases, the continuous time-domain pitch estimation provided by execution of the first subset of the instructions includes computing a lag-domain periodogram for respective blocks of a sampled signal corresponding to the user's captured vocal performance.
These and other embodiments in accordance with the present invention(s) will be understood with reference to the description and appended claims which follow.
The present invention is illustrated by way of example and not limitation with reference to the accompanying figures, in which like references generally indicate similar elements or features.
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or prominence of some of the illustrated elements or features may be exaggerated relative to other elements or features in an effort to help to improve understanding of embodiments of the present invention.
Techniques have been developed to facilitate the capture, pitch correction, harmonization, encoding and audible rendering of vocal performances on handheld or other portable computing devices. Building on these techniques, mixes that include such vocal performances can be prepared for audible rendering on targets that include these handheld or portable computing devices as well as desktops, workstations, gaming stations and even telephony targets. Implementations of the described techniques employ signal processing techniques and allocations of system functionality that are suitable given the generally limited capabilities of such handheld or portable computing devices and that facilitate efficient encoding and communication of the pitch-corrected vocal performances (or precursors or derivatives thereof) via wireless and/or wired bandwidth-limited networks for rendering on portable computing devices or other targets.
Pitch detection and correction of a user's vocal performance are performed continuously and in real-time with respect to the audible rendering of the backing track at the handheld or portable computing device. In this way, pitch-corrected vocals may be mixed with the audible rendering to overlay (in real-time) the very instrumentals and/or vocals of the backing track against which the user's vocal performance is captured. In some implementations, pitch detection builds on time-domain techniques that employ an average magnitude difference function (AMDF) or autocorrelation together with zero-crossing and/or peak picking to identify differences between the pitch of a captured vocal signal and score-coded target pitches. Based on detected differences, pitch correction based on pitch synchronous overlap add (PSOLA) and/or linear predictive coding (LPC) techniques allows captured vocals to be pitch-shifted in real-time to “correct” notes in accord with pitch correction settings that code score-coded melody targets and harmonies. Frequency-domain techniques, such as FFT peak picking for pitch detection and phase vocoding for pitch shifting, may be used in some implementations, particularly when off-line processing is employed or computational facilities are substantially in excess of those typical of current generation mobile devices. Pitch detection and shifting (e.g., for pitch correction, harmonies and/or preparation of composite multi-vocalist, virtual glee club mixes) may also be performed in a post-processing mode.
In general, “correct” notes are those notes that are consistent with a specified key or scale or which, in some embodiments, correspond to a score-coded melody (or harmony) expected in accord with a particular point in the performance. That said, a cappella modes without an operant score (or modes that allow a user to dynamically vary pitch correction settings of an existing score during vocal capture) may be provided in some implementations to facilitate ad-libbing. For example, user interface gestures captured at the mobile phone (or other portable computing device) may, for particular lyrics, allow the user to (i) switch off (and on) use of score-coded note targets, (ii) dynamically switch back and forth between melody and harmony note sets as operant pitch correction settings and/or (iii) selectively fall back (at gesture selected points in the vocal capture) to settings that cause sounded pitches to be corrected solely to nearest notes of a particular key or scale (e.g., C major, C minor, E flat major, etc.). In short, user interface gesture capture and dynamically variable pitch correction settings can provide a Freestyle mode for advanced users.
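As an illustrative sketch of the fallback described in (iii), the following Python fragment snaps a detected pitch to the nearest note of a selected key or scale; the scale encoding (semitone offsets from a tonic) is an assumption for illustration:

```python
import math

A4_HZ = 440.0

# Sketch: correct a detected pitch (Hz) to the nearest note of a key or
# scale, as in the score-less fallback described above.
SCALES = {
    "C major": (0, 2, 4, 5, 7, 9, 11),
    "C minor": (0, 2, 3, 5, 7, 8, 10),
}

def snap_to_scale(freq_hz, scale="C major", tonic_midi=60):
    midi = 69.0 + 12.0 * math.log2(freq_hz / A4_HZ)  # Hz -> fractional MIDI
    candidates = [tonic_midi + 12 * octave + offset   # scale notes nearby
                  for octave in range(-3, 4)
                  for offset in SCALES[scale]]
    target = min(candidates, key=lambda n: abs(n - midi))
    return A4_HZ * 2.0 ** ((target - 69) / 12.0)      # MIDI -> Hz

print(round(snap_to_scale(450.0), 2))  # ~440.0 (A4 is the nearest C-major note)
```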
In some cases, pitch correction settings may be selected to distort the captured vocal performance in accord with a desired effect, such as with pitch correction effects popularized by a particular musical performance or particular artist. In some embodiments, pitch correction may be based on techniques that computationally simplify autocorrelation calculations as applied to a variable window of samples from a captured vocal signal, such as with plug-in implementations of Auto-Tune® technology popularized by, and available from, Antares Audio Technologies.
Based on the compelling and transformative nature of the pitch-corrected vocals, user/vocalists typically overcome an otherwise natural shyness or angst associated with sharing their vocal performances. Instead, even mere amateurs are encouraged to share with friends and family or to collaborate and contribute vocal performances as part of an affinity group. In some implementations, these interactions are facilitated through social network- and/or eMail-mediated sharing of performances and invitations to join in a group performance or virtual glee club. Using uploaded vocals captured at clients such as the aforementioned portable computing devices, a content server (or service) can mediate such affinity groups by manipulating and mixing the uploaded vocal performances of multiple contributing vocalists. Depending on the goals and implementation of a particular system, uploads may include pitch-corrected vocal performances, dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
In many cases, first and second encodings (often of differing quality or fidelity) of the same underlying audio source material may be employed. For example, use of first and second encodings of a backing track (e.g., one at the handheld or other portable computing device at which vocals are captured, and one at the content server) can allow the respective encodings to be adapted to data transfer bandwidth constraints or to needs at the particular device/platform at which they are employed. In some embodiments, a first encoding of the backing track audibly rendered at a handheld or other portable computing device as an audio backdrop to vocal capture may be of lesser quality or fidelity than a second encoding of that same backing track used at the content server to prepare the mixed performance for audible rendering. In this way, high quality mixed audio content may be provided while limiting data bandwidth requirements to a handheld device used for capture and pitch correction of a vocal performance.
Notwithstanding the foregoing, backing track encodings employed at the portable computing device may, in some cases, be of equivalent or even better quality/fidelity than those at the content server. For example, in embodiments or situations in which a suitable encoding of the backing track already exists at the mobile phone (or other portable computing device), such as from a music library resident thereon or based on prior download from the content server, download data bandwidth requirements may be quite low. Lyrics, timing information and applicable pitch correction settings may be retrieved for association with the existing backing track using any of a variety of identifiers ascertainable, e.g., from audio metadata, track title, an associated thumbnail or even fingerprinting techniques applied to the audio, if desired.
Karaoke-Style Vocal Performance Capture
Although embodiments of the present invention are not necessarily limited thereto, mobile phone-hosted, pitch-corrected, karaoke-style, vocal capture provides a useful descriptive context. For example, in some embodiments such as illustrated in
User vocals 103 are captured at handheld 101, pitch-corrected continuously and in real-time (again at the handheld) and audibly rendered (see 104, mixed with the backing track) to provide the user with an improved tonal quality rendition of his/her own vocal performance. Pitch correction is typically based on score-coded note sets or cues (e.g., pitch and harmony cues 105), which provide continuous pitch-correction algorithms with performance synchronized sequences of target notes in a current key or scale. In addition to performance synchronized melody targets, score-coded harmony note sequences (or sets) provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user's own captured vocals. In some cases, pitch correction settings may be characteristic of a particular artist such as the artist that performed vocals associated with the particular backing track.
In the illustrated embodiment, backing audio (here, one or more instrumental and/or vocal tracks), lyrics and timing information and pitch/harmony cues are all supplied (or demand updated) from one or more content servers or hosted service platforms (here, content server 110). For a given song and performance, such as “Can't Fight This Feeling,” several versions of the background track may be stored, e.g., on the content server. For example, in some implementations or deployments, versions may include:

an uncompressed stereo wav format backing track;

an uncompressed mono wav format backing track; and

a compressed mono m4a format backing track.
In addition, lyrics, melody and harmony track note sets and related timing and control information may be encapsulated as a score coded in an appropriate container or object (e.g., in a Musical Instrument Digital Interface, MIDI, or JavaScript Object Notation, json, type format) for supply together with the backing track(s). Using such information, handheld 101 may display lyrics and even visual cues related to target notes, harmonies and currently detected vocal pitch in correspondence with an audible performance of the backing track(s) so as to facilitate a karaoke-style vocal performance by a user.
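By way of illustration only, such a score container might resemble the following sketch (rendered here as a Python literal for readability; cf. the feeling.json example below); all field names and timing values are assumptions for purposes of illustration rather than particulars of an actual format:

```python
# Hypothetical shape of a score container; every field name and value
# here is an illustrative assumption, not the actual wire format.
score = {
    "title": "Can't Fight This Feeling",
    "backing_track": "feeling.m4a",
    "lyrics": [   # display text with timing (seconds)
        {"text": "I can't fight this feeling any longer",
         "start": 12.4, "end": 17.0},
    ],
    "melody": [   # score-coded target notes: MIDI number, start, duration
        {"note": 64, "start": 12.4, "dur": 0.5},
        {"note": 66, "start": 12.9, "dur": 0.5},
    ],
    "harmony": [  # offsets in scale steps relative to the melody track
        {"offsets": [2, -3], "start": 24.8, "end": 31.2},
    ],
}
```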
Thus, if an aspiring vocalist selects on the handheld device “Can't Fight This Feeling” as originally popularized by the group REO Speedwagon, feeling.json and feeling.m4a may be downloaded from the content server (if not already available or cached based on prior download) and, in turn, used to provide background music, synchronized lyrics and, in some situations or embodiments, score-coded note tracks for continuous, real-time pitch-correction shifts while the user sings. Optionally, at least for certain embodiments or genres, harmony note tracks may be score coded for harmony shifts to captured vocals. Typically, a captured pitch-corrected (possibly harmonized) vocal performance is saved locally on the handheld device as one or more wav files and is subsequently compressed (e.g., using lossless Apple Lossless Encoder, ALE, or lossy Advanced Audio Coding, AAC, or vorbis codec) and encoded for upload (106) to content server 110 as an MPEG-4 audio, m4a, or ogg container file. MPEG-4 is an international standard for the coded representation and transmission of digital multimedia content for the Internet, mobile networks and advanced broadcast applications. OGG is an open standard container format often used in association with the vorbis audio format specification and codec for lossy audio compression. Other suitable codecs, compression techniques, coding formats and/or containers may be employed if desired.
Depending on the implementation, encodings of dry vocal and/or pitch-corrected vocals may be uploaded (106) to content server 110. In general, such vocals (encoded, e.g., as wav, m4a, ogg/vorbis content or otherwise), whether already pitch-corrected or pitch-corrected at content server 110, can then be mixed (111), e.g., with backing audio and other captured (and possibly pitch-shifted) vocal performances, to produce files or streams of quality or coding characteristics selected in accord with capabilities or limitations of a particular target (e.g., handheld 120) or network. For example, pitch-corrected vocals can be mixed with both the stereo and mono wav files to produce streams of differing quality. In some cases, a high quality stereo version can be produced for web playback and a lower quality mono version for streaming to devices such as the handheld device itself.
As described elsewhere herein, performances of multiple vocalists may be accreted in a virtual glee club performance. In some embodiments, one set of vocals (for example, in the illustration of
Score-Coded Harmony Generation
Synthetic harmonization techniques have been employed in voice processing systems for some time (see e.g., U.S. Pat. No. 5,231,671 to Gibson and Bertsch, describing a method for analyzing a vocal input and producing harmony signals that are combined with the voice input to produce a multivoice signal). Nonetheless, such systems are typically based on statically-coded harmony note relations and may fail to generate harmonies that are pleasing given less than ideal tonal characteristics of an input captured from an amateur vocalist or in the presence of improvisation. Accordingly, some design goals for the harmonization system described herein involve development of techniques that sound good despite wide variations in what a particular user/vocalist chooses to sing.
As will be apparent to persons of ordinary skill in the art, it is generally desirable to limit feedback loops from transducer(s) 202 to microphone 201 (e.g., through the use of head- or earphones). Indeed, while much of the illustrative description herein builds upon features and capabilities that are familiar in mobile phone contexts and, in particular, relative to the Apple iPhone handheld, even portable computing devices without built-in microphone capabilities may act as a platform for vocal capture with continuous, real-time pitch correction and harmonization if headphone/microphone jacks are provided. The Apple iPod Touch handheld and the Apple iPad tablet are two such examples.
Both pitch correction and added harmonies are chosen to correspond to a score 207, which, in the illustrated configuration, is wirelessly communicated (261) to the device (e.g., from content server 110 to an iPhone handheld 101 or other portable computing device, recall
In some embodiments of techniques described herein, we determine from our score the note (in a current scale or key) that is closest to that sounded by the user/vocalist. While this closest note may typically be a main pitch corresponding to the score-coded vocal melody, it need not be. Indeed, in some cases, the user/vocalist may intend to sing harmony and sounded notes may more closely approximate a harmony track. In either case, pitch corrector 252 and/or harmony generator 255 may synthesize the other portions of the desired score-coded chord by generating appropriate pitch-shifted versions of the captured vocals (even if the user/vocalist is intentionally singing a harmony). One or more of the resulting pitch-shifted versions may be optionally combined (254) or aggregated for mix (253) with the audibly-rendered backing track and/or wirelessly communicated (262) to content server 110 or a remote device (e.g., handheld 120). In some cases, a user/vocalist can be off by an octave (male vs. female) or may simply exhibit little skill as a vocalist (e.g., sounding notes that are routinely well off key), and the pitch corrector 252 and harmony generator 255 will use the key/score/chord information to make a chord that sounds good in that context. In a cappella modes (or for portions of a backing track for which note targets are not score-coded), captured vocals may be pitch-corrected to a nearest note in the current key or to a harmonically correct set of notes based on pitch of the captured vocals.
In some embodiments, a weighting function and rules are used to decide what notes should be “sung” by the harmonies generated as pitch-shifted variants of the captured vocals. The primary features considered are content of the score and what a user is singing. In the score, for those portions of a song where harmonies are desired, score 207 defines a set of notes either based on a chord or a set of notes from which (during a current performance window) all harmonies will choose. The score may also define intervals away from what the user is singing to guide where the harmonies should go.
So, if you wanted two harmonies, score 207 could specify (for a given temporal position vis-a-vis backing track 209 and lyrics 208) relative harmony offsets as +2 and −3, in which case harmony generator 255 would choose harmony notes around a major third above and a perfect fourth below the main melody (as pitch-corrected from actual captured vocals by pitch corrector 252 as described elsewhere herein). In this case, if the user/vocalist were singing the root of the chord (i.e., close enough to be pitch-corrected to the score-coded melody), these notes would sound great and result in a major triad of “voices” exhibiting the timbre and other unique qualities of the user's own vocal performance. The result for a user/vocalist is a harmony generator that produces harmonies which follow his/her voice and give the impression that harmonies are “singing” with him/her rather than being statically scored.
In some cases, such as if the third above the pitch actually sung by the user/vocalist is not in the current key or chord, this could sound bad. Accordingly, in some embodiments, the aforementioned weighting functions or rules may restrict harmonies to notes in a specified note set. A simple weighting function may choose the closest note set to the note sung and apply a score-coded offset. Rules or heuristics can be used to eliminate or at least reduce the incidence of bad harmonies. For example, in some embodiments, one such rule disallows harmonies to sing notes less than 3 semitones (a minor third) away from what the user/vocalist is singing.
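A simple weighting-and-rules selection might be sketched as follows; function names and the clamping behavior at note-set edges are assumptions for illustration:

```python
# Sketch: apply a score-coded offset within the current chord's note set,
# then reject candidates closer than a minor third to the sung note.
MIN_INTERVAL = 3  # semitones (a minor third)

def choose_harmony(sung_midi, chord_set, offset_steps):
    """chord_set: sorted MIDI notes currently allowed by the chord track;
    offset_steps: score-coded offset in chord-set steps (e.g., +2, -3)."""
    # nearest chord-set note to what the vocalist actually sang
    i = min(range(len(chord_set)), key=lambda j: abs(chord_set[j] - sung_midi))
    candidate = chord_set[min(max(i + offset_steps, 0), len(chord_set) - 1)]
    if abs(candidate - sung_midi) < MIN_INTERVAL:
        return None            # rule: too close to the sung note -- drop it
    return candidate

# C-major chord tones across two octaves; vocalist sings ~E4 (MIDI 64).
chord = [55, 60, 64, 67, 72, 76]
print(choose_harmony(64, chord, +2))   # 72 (C5, above the sung note)
print(choose_harmony(64, chord, -3))   # 55 (G3, well below the sung note)
```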
Although persons of ordinary skill in the art will recognize that any of a variety of score-coding frameworks may be employed, exemplary implementations described herein build on extensions to widely-used and standardized musical instrument digital interface (MIDI) data formats. Building on that framework, scores may be coded as a set of tracks represented in a MIDI file, data structure or container including, in some implementations or deployments:

a control track;

one or more lyrics tracks;

a pitch (melody) track;

one or more harmony tracks; and

a chord track.
Turning specifically to control track features, in some embodiments, the following text markers may be supported:
Chord track events, in some embodiments, include the following text markers that notate a root and quality (e.g., C min7 or Ab maj) and allow a note set to be defined. Although desired harmonies are set in the harmony track(s), if the user's pitch differs from the scored pitch, relative offsets may be maintained by proximity to notes that are in the current chord. As used relative to a chord track of the score, the term “chord” will be understood to mean a set of available pitches, since chord track events need not encode standard chords in the usual sense. These and other score-coded pitch correction settings may be employed in furtherance of the inventive techniques described herein.
Additional Effects
Further effects may be provided in addition to the above-described generation of pitch-shifted harmonies in accord with score codings and the user/vocalist's own captured vocals. For example, in some embodiments, a slight pan (i.e., an adjustment to left and right channels to create apparent spatialization) of the harmony voices is employed to make the synthetic harmonies appear more distinct from the main voice, which is pitch corrected to melody. When using only a single channel, all of the harmonized voices can have a tendency to blend with each other and the main voice. By panning, implementations can provide significant psychoacoustic separation. Typically, the desired spatialization can be provided by adjusting amplitude of respective left and right channels. For example, in some embodiments, even a coarse spatial resolution pan may be employed, e.g.,
Left signal = x*pan; and

Right signal = x*(1.0 − pan),

where 0.0 ≤ pan ≤ 1.0. In some embodiments, finer resolution and even phase adjustments may be made to pull perception toward the left or right.
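Applying the pan law above directly, the following sketch spreads harmony voices off-center around a centered main voice; gains and pan positions are illustrative assumptions:

```python
import numpy as np

# Sketch applying the pan law above (Left = x*pan, Right = x*(1-pan)).
def pan_voice(x, pan):
    """x: mono samples; pan in [0.0, 1.0]; returns (left, right)."""
    assert 0.0 <= pan <= 1.0
    return x * pan, x * (1.0 - pan)

def spatialize(main, harmonies):
    left, right = pan_voice(main, 0.5)          # main voice dead center
    pans = np.linspace(0.25, 0.75, num=len(harmonies))
    for h, p in zip(harmonies, pans):           # harmonies spread off-center
        hl, hr = pan_voice(h, p)
        left, right = left + hl, right + hr
    return left, right
```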
In some embodiments, temporal delays may be added for harmonies (based either on static or score-coded delay). In this way, a user/vocalist may sing a line and a bit later a harmony voice would sing back the captured vocals, but transposed to a new pitch or key in accord with previously described score-coded harmonies. Based on the description herein, persons of skill in the art will appreciate these and other variations on the described techniques that may be employed to afford greater or lesser prominence to a particular set (or version) of vocals.
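A minimal sketch of such a delayed harmony entrance, assuming a statically coded delay and audio held in NumPy arrays:

```python
import numpy as np

# Sketch: the harmony voice repeats the captured (pitch-shifted) vocal a
# fixed interval later. Delay and gain values are illustrative assumptions.
def delayed_harmony(harmony, delay_samples):
    return np.concatenate([np.zeros(delay_samples), harmony])

def mix_delayed(main, harmony, delay_samples, harmony_gain=0.7):
    d = delayed_harmony(harmony_gain * harmony, delay_samples)
    out = np.zeros(max(len(main), len(d)))
    out[:len(main)] += main
    out[:len(d)] += d        # echo-like entrance of the transposed voice
    return out
```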
Computational Techniques for Pitch Detection, Correction and Shifts
As will be appreciated by persons of ordinary skill in the art having benefit of the present description, pitch-detection and correction techniques may be employed both for correction of a captured vocal signal to a target pitch or note and for generation of harmonies as pitch-shifted variants of a captured vocal signal.
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks (e.g., decoder(s) 352, digital-to-analog (D/A) converter 351, capture 353 and encoder 355) of a software executable to provide signal processing flows 350 illustrated in
Building then on any of a variety of suitable implementations of the foregoing signal processing constructs, we turn to pitch detection and correction/shifting techniques that may be employed in the various embodiments described herein, including in furtherance of the pitch correction, harmony generation and combined pitch correction/harmonization blocks (252, 255 and 354) illustrated in
As will be appreciated by persons of ordinary skill in the art, pitch-detection and pitch-correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature picking, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accord with the present invention. The present description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable in various designs or implementations in accord with the present description; rather, we summarize certain techniques that have proved workable in implementations (such as mobile device applications) that contend with CPU-limited computational platforms.
Accordingly, in view of the above and without limitation, certain exemplary embodiments operate as follows:
For example, given an approximately periodic signal

a b c d e f g h a.1 b.1 c.1 d.1 e.1 f.1 g.1 h.1 a.2 b.2 c.2 . . .

with samples {a, b, c, . . . } and indices 0, 1, 2, . . . (wherein the .1 symbology represents deviations from periodicity), if we wanted to jump back or forward somewhere, we might pick the positive going c-d transitions at indices 2 and 10 and, instead of just jumping, ramp:

(1*c+0*c.1), (d*7/8+(d.1)*1/8), (e*6/8+(e.1)*2/8) . . .

until we reached (0*c+1*c.1) at index 10/18, having jumped forward a period (8 indices) but made the aperiodicity less evident at the edit point. It is pitch synchronous because we do it at 8 samples, the closest period to what we can detect. Note that the cross-fade is a linear/triangular overlap-add, but (more generally) may employ complementary cosine, 1-cosine, or other functions as desired.
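Persons of skill in the art will appreciate that the splice just described can be expressed compactly in code. The following Python sketch (simplified to a single forward jump, and assuming the sample buffer is long enough) implements the linear/triangular overlap-add ramp described above:

```python
import numpy as np

# Sketch of the pitch-synchronous splice: jump forward one detected period
# while linearly cross-fading so the edit point stays inaudible. A
# simplified illustration, not a full PSOLA resynthesis; assumes
# len(x) >= edit_index + 2 * period.
def psola_splice(x, edit_index, period):
    """Blend x[edit_index : edit_index+period] with the same region one
    period later, then continue from the later region."""
    n = np.arange(period)
    fade_out = 1.0 - n / period        # 1, 7/8, 6/8, ... for period 8
    fade_in = n / period               # 0, 1/8, 2/8, ...
    blended = (fade_out * x[edit_index:edit_index + period]
               + fade_in * x[edit_index + period:edit_index + 2 * period])
    return np.concatenate([x[:edit_index], blended,
                           x[edit_index + 2 * period:]])
```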
As will be appreciated by persons of skill in the art, AMDF calculations are but one time-domain computational technique suitable for measuring periodicity of a signal. More generally, the term lag-domain periodogram describes a function that takes as input a time-domain function or series of discrete time samples x(n) of a signal, and compares that function or signal to itself at a series of delays (i.e., in the lag-domain) to measure periodicity of the original function x. This is done at lags of interest. Therefore, relative to the techniques described herein, examples of suitable lag-domain periodogram computations for pitch detection include subtracting, for a current block, the captured vocal input signal x(n) from a lagged version of same (a difference function), or taking the absolute value of that subtraction (AMDF), or multiplying the signal by its delayed version and summing the values (autocorrelation).
AMDF will show valleys at periods that correspond to frequency components of the input signal, while autocorrelation will show peaks. If the signal is non-periodic (e.g., noise), periodograms will show no clear peaks or valleys, except at the zero lag position. Mathematically,
AMDF(k) = Σ_n |x(n) − x(n−k)|

autocorrelation(k) = Σ_n x(n)·x(n−k).
For implementations described herein, AMDF-based lag-domain periodogram calculations can be efficiently performed even using computational facilities of current-generation mobile devices. Nonetheless, based on the description herein, persons of skill in the art will appreciate implementations that build on any of a variety of pitch detection techniques that may now be, or in the future become, computationally tractable on a given target device or platform.
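For concreteness, the following Python sketch implements lag-domain periodogram pitch detection per the autocorrelation formulation above (AMDF valleys may be used equivalently); the lag bounds, spanning roughly 80-1000 Hz at a 44.1 kHz sampling rate, and the test tone are assumptions for illustration:

```python
import numpy as np

# Sketch: detect the fundamental period of one block of captured vocal
# samples by locating the autocorrelation peak across candidate lags.
def detect_period(block, min_lag=44, max_lag=550):
    lags = np.arange(min_lag, max_lag)
    autocorr = np.array([np.dot(block[k:], block[:-k]) for k in lags])
    return lags[np.argmax(autocorr)]   # peak lag ~= fundamental period

def pitch_hz(block, sample_rate=44100.0):
    return sample_rate / detect_period(block)

# A 220 Hz test tone should be detected at roughly 220 Hz.
t = np.arange(2048) / 44100.0
print(round(pitch_hz(np.sin(2 * np.pi * 220.0 * t)), 1))  # ~220.5
```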
Accretion of Vocal Performances into Virtual Glee Club
Once a vocal performance is captured at the handheld device, the captured vocal performance audio (typically pitch corrected) is compressed using an audio codec (e.g., an Advanced Audio Coding (AAC) or ogg/vorbis codec) and uploaded to a content server.
In general, the resulting master may, in turn, be encoded using an appropriate codec (e.g., an AAC codec) at various bit rates and/or with selected vocals afforded prominence to produce compressed audio files which are suitable for streaming back to the capturing handheld device (and/or other remote devices) and for streaming/playback via the web. In general, relative to capabilities of commonly deployed wireless networks, it can be desirable from an audio data bandwidth perspective to limit the uploaded data to that necessary to represent the vocal performance, while mixing when and where needed. In some cases, data streamed for playback or for use as a second (or Nth) generation backing track may separately encode vocal tracks for mix with a first generation backing track at an audible rendering target. In general, vocal and/or backing track audio exchange between the handheld device and content server may be adapted to the quality and capabilities of an available data communications channel.
Relative to certain social network constructs that, in some embodiments of the present invention, facilitate formation of virtual glee clubs and/or interactions amongst members or potential members thereof, additional or alternative mixes may be desirable. For example, in some embodiments, an accretion of pitch-corrected vocals captured from an initial, or prior, contributor may form the basis of a backing track used in a subsequent vocal capture from another user/vocalist (e.g., at another handheld device). Accordingly, where supply and use of backing tracks is illustrated and described herein, it will be understood that vocals that have been captured and pitch-corrected (and possibly, though not typically, harmonized) may themselves be mixed to produce a “backing track” used to motivate, guide or frame subsequent vocal capture.
In general, additional vocalists may be invited to sing a particular part (e.g., tenor, part B in duet, etc.) or simply to sing, whereupon content server 110 may pitch shift and place their captured vocals into one or more positions within a virtual glee club. Although mixed vocals may be included in such a backing track, it will be understood that because the illustrated and described systems separately capture and pitch-correct individual vocal performances, the content server (e.g., content server 110) is in position to manipulate (112) mixes in ways that further objectives of a virtual glee club or accommodate sensibilities of its members.
For example, in some embodiments of the present invention, alternative mixes of three different contributing vocalists may be presented in a variety of ways. Mixes provided to (or for) a first contributor may feature that first contributor's vocals more prominently than those of the other two. Likewise, mixes provided to (or for) a second contributor may feature that second contributor's vocals more prominently than those of the other two. Likewise, with the third contributor. In general, content server 110 may alter the mixes to make one vocal performance more prominent than others by manipulating overall amplitude of the various captured and pitch-corrected vocals therein. In mixes supplied in some embodiments, manipulation of respective amplitudes for spatially differentiated channels (e.g., left and right channels) or even phase relations amongst such channels may be used to pan less prominent vocals left or right of more prominent vocals.
Furthermore, in some embodiments, uploaded dry vocals 106 may be pitch corrected and shifted at content server 110 (e.g., based on pitch and harmony cues 105, previously described relative to pitch correction and harmony generation at the handheld 101) to afford the desired prominence. Thus, as an example,
Adaptation of the previously-described signal processing techniques (for pitch detection and shifting to produce pitch-corrected and harmonized vocal performances at computationally-limited handheld device platforms) for execution at content server 110 will be understood by persons of ordinary skill in the art. Indeed, given the significantly expanded computational facilities available to typical implementations or deployments of a web- or cloud-based content service platform, persons of ordinary skill in the art having benefit of the present description will appreciate an even wider range of computationally tractable techniques that may be employed.
World Stage
Although much of the description herein has focused on vocal performance capture, pitch correction and use of respective first and second encodings of a backing track relative to capture and mix of a user's own vocal performances, it will be understood that facilities for audible rendering of remotely captured performances of others may be provided in some situations or embodiments. In such situations or embodiments, vocal performance capture occurs at another device and after a corresponding encoding of the captured (and typically pitch-corrected) vocal performance is received at a present device, it is audibly rendered in association with a visual display animation suggestive of the vocal performance emanating from a particular location on a globe.
When a user executes the handheld application and accesses this play (or listener) mode, a world stage is presented. More specifically, a network connection is made to content server 110 reporting the handheld's current network connectivity status and playback preference (e.g., random global, top loved, my performances, etc.). Based on these parameters, content server 110 selects a performance (e.g., a pitch-corrected vocal performance such as may have been captured at handheld device instance 101 or 301) and transmits metadata associated therewith. In some implementations, the metadata includes a uniform resource locator (URL) that allows handheld 120 to retrieve the actual audio stream (high quality or low quality depending on the size of the pipe), as well as additional information such as geocoded (using GPS) location of the vocal performance capture (including geocodes for additional vocal performances included as harmonies or backup vocals) and attributes of other listeners who have loved, tagged or left comments for the particular performance. In some embodiments, listener feedback is itself geocoded. During playback, the user may tag the performance and leave his own feedback or comments for a subsequent listener and/or for the original vocal performer. Once a performance is tagged, a relationship may be established between the performer and the listener. In some cases, the listener may be allowed to filter for additional performances by the same performer and the server is also able to more intelligently provide “random” new performances for the user to listen to based on an evaluation of user preferences.
Although not specifically illustrated in the snapshot, it will be appreciated that geocoded listener feedback indications are, or may optionally be, presented on the globe (e.g., as stars or “thumbs up” or the like) at positions to suggest, consistent with the geocoded metadata, respective geographic locations from which the corresponding listener feedback was transmitted. It will be further appreciated that, in some embodiments, the visual display animation is interactive and subject to viewpoint manipulation in correspondence with user interface gestures captured at a touch screen display of handheld 120. For example, in some embodiments, travel of a finger or stylus across a displayed image of the globe in the visual display animation causes the globe to rotate around an axis generally orthogonal to the direction of finger or stylus travel. Both the visual display animation suggestive of the vocal performance emanating from a particular location on a globe and the listener feedback indications are presented in such an interactive, rotating globe user interface presentation at positions consistent with their respective geotags.
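One way to realize the described gesture-to-rotation mapping is sketched below; the screen coordinate convention (z toward the viewer) and the sensitivity constant are assumptions of the sketch.

```python
# Sketch: a drag of (dx, dy) pixels rotates the globe about the
# screen-plane axis orthogonal to the direction of travel.
import numpy as np

def rotation_for_drag(dx, dy, radians_per_pixel=0.005):
    """Return a 3x3 rotation matrix for a finger/stylus drag."""
    axis = np.array([-dy, dx, 0.0])   # in-plane axis orthogonal to travel
    norm = np.linalg.norm(axis)
    if norm == 0.0:
        return np.eye(3)              # no travel, no rotation
    x, y, z = axis / norm
    angle = norm * radians_per_pixel
    c, s = np.cos(angle), np.sin(angle)
    # Rodrigues' rotation formula: R = I + sin(a)K + (1 - cos(a))K^2.
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)
```

A purely horizontal drag thus yields rotation about the vertical axis (spinning the globe east to west), while a vertical drag tilts the globe about the horizontal axis, consistent with the behavior described above.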
An Exemplary Mobile Device
Summarizing briefly, mobile device 400 includes a display 402 that can be sensitive to haptic and/or tactile contact with a user. Touch-sensitive display 402 can support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree and/or position of each touch point. Such processing facilitates gestures and interactions with multiple fingers, chording, and other interactions. Of course, other touch-sensitive display technologies can also be used, e.g., a display in which contact is made using a stylus or other pointing device.
Typically, mobile device 400 presents a graphical user interface on the touch-sensitive display 402, providing the user access to various system objects and conveying information to the user. In some implementations, the graphical user interface can include one or more display objects 404, 406. In the example shown, the display objects 404, 406 are graphic representations of system objects. Examples of system objects include device functions, applications, windows, files, alerts, events, or other identifiable system objects. In some embodiments of the present invention, applications, when executed, provide at least some of the digital acoustic functionality described herein.
Typically, the mobile device 400 supports network connectivity including, for example, both mobile radio and wireless internetworking functionality to enable the user to travel with the mobile device 400 and its associated network-enabled functions. In some cases, the mobile device 400 can interact with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, mobile device 400 can be configured to interact with peers or a base station for one or more devices. As such, mobile device 400 may grant or deny network access to other wireless devices.
Mobile device 400 includes a variety of input/output (I/O) devices, sensors and transducers. For example, a speaker 460 and a microphone 462 are typically included to facilitate audio functions, such as the capture of vocal performances and audible rendering of backing tracks and mixed pitch-corrected vocal performances as described elsewhere herein. In some embodiments of the present invention, speaker 460 and microphone 462 may provide appropriate transducers for techniques described herein. An external speaker port 464 can be included to facilitate hands-free voice functionalities, such as speaker phone functions. An audio jack 466 can also be included for use of headphones and/or a microphone. In some embodiments, an external speaker and/or microphone may be used as a transducer for the techniques described herein.
Other sensors can also be used or provided. A proximity sensor 468 can be included to facilitate the detection of user positioning of mobile device 400. In some implementations, an ambient light sensor 470 can be utilized to facilitate adjusting brightness of the touch-sensitive display 402. An accelerometer 472 can be utilized to detect movement of mobile device 400, as indicated by the directional arrow 474. Accordingly, display objects and/or media can be presented according to a detected orientation, e.g., portrait or landscape. In some implementations, mobile device 400 may include circuitry and sensors for supporting a location determining capability, such as that provided by the global positioning system (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)) to facilitate geocodings described herein. Mobile device 400 can also include a camera lens and sensor 480. In some implementations, the camera lens and sensor 480 can be located on the back surface of the mobile device 400. The camera can capture still images and/or video for association with captured pitch-corrected vocals.
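As a rough illustration of orientation-dependent presentation, the following sketch maps an accelerometer's gravity vector to a display orientation. Axis signs and the threshold are platform-dependent assumptions of the sketch.

```python
# Sketch: choose portrait or landscape presentation from gravity
# components (in units of g) along the device's short (x) and long (y)
# axes. Conventions here follow one common handset layout and may
# differ by platform.
def detect_orientation(ax, ay, threshold=0.6):
    if abs(ay) > threshold and abs(ay) >= abs(ax):
        return "portrait" if ay < 0 else "portrait_upside_down"
    if abs(ax) > threshold:
        return "landscape_right" if ax < 0 else "landscape_left"
    return "flat"  # device roughly flat; retain previous orientation
```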
Mobile device 400 can also include one or more wireless communication subsystems, such as an 802.11b/g communication device, and/or a Bluetooth™ communication device 488. Other communication protocols can also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), code division multiple access (CDMA), global system for mobile communications (GSM), Enhanced Data GSM Environment (EDGE), etc. A port device 490, e.g., a Universal Serial Bus (USB) port, or a docking port, or some other wired port connection, can be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data. Port device 490 may also allow mobile device 400 to synchronize with a host device using one or more protocols such as, for example, TCP/IP, HTTP, UDP, or any other known protocol.
While the invention(s) is (are) described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, while pitch-corrected vocal performances captured in accord with a karaoke-style interface have been described, other variations will be appreciated. Furthermore, while certain illustrative signal processing techniques have been described in the context of certain illustrative applications, persons of ordinary skill in the art will recognize that it is straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handheld, mobile or portable computing device, or content server platform) to perform methods described herein. In general, a machine-readable medium can include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computer, computational facilities of a mobile device or portable computing device, etc.) as well as tangible storage incident to transmission of the information. A machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., disks and/or tape storage); optical storage medium (e.g., CD-ROM, DVD, etc.); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of medium suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
In general, plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).
The present application is a continuation of U.S. application Ser. No. 14/517,647, filed Oct. 17, 2014, which in turn claims the benefit of U.S. Non-Provisional application Ser. No. 13/085,413, filed Apr. 12, 2011, which in turn claims the benefit of U.S. Provisional Application No. 61/323,348, filed Apr. 12, 2010, and which is also a continuation-in-part of U.S. application Ser. No. 12/876,132, filed Sep. 4, 2010, entitled “CONTINUOUS SCORE CODED PITCH CORRECTION,” and naming Salazar, Fiebrink, Wang, Ljungström, Smith and Cook as inventors, which in turn claims priority of U.S. Provisional Application No. 61/323,348, filed Apr. 12, 2010. Each of the foregoing applications is incorporated herein by reference. In addition, the present application is related to the following co-pending applications, each filed on even date herewith: (1) U.S. application Ser. No. 13/085,414, entitled “COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS” and naming Cook, Lazier, Lieber and Kirk as inventors; and (2) U.S. application Ser. No. 13/085,415, entitled “COMPUTATIONAL TECHNIQUES FOR CONTINUOUS PITCH CORRECTION AND HARMONY GENERATION” and naming Cook, Lazier, and Lieber as inventors. Each of the aforementioned co-pending applications is incorporated by reference herein.
Number | Name | Date | Kind |
---|---|---|---|
4688464 | Gibson et al. | Aug 1987 | A |
5231671 | Gibson et al. | Jul 1993 | A |
5301259 | Gibson et al. | Apr 1994 | A |
5477003 | Muraki et al. | Dec 1995 | A |
5641927 | Pawate | Jun 1997 | A |
5719346 | Yoshida et al. | Feb 1998 | A |
5753845 | Nagata et al. | May 1998 | A |
5811708 | Matsumoto | Sep 1998 | A |
5817965 | Matsumoto | Oct 1998 | A |
5889223 | Matsumoto | Mar 1999 | A |
5902950 | Kato et al. | May 1999 | A |
5939654 | Anada | Aug 1999 | A |
5966687 | Ojard | Oct 1999 | A |
5974154 | Nagata et al. | Oct 1999 | A |
6121531 | Kato | Sep 2000 | A |
6300553 | Kumamoto | Oct 2001 | B2 |
6307140 | Iwamoto et al. | Oct 2001 | B1 |
6336092 | Gibson et al. | Jan 2002 | B1 |
6353174 | Schmidt et al. | Mar 2002 | B1 |
6369311 | Iwamoto | Apr 2002 | B1 |
6535269 | Sherman et al. | Mar 2003 | B2 |
6643372 | Ford et al. | Nov 2003 | B2 |
6653545 | Redmann et al. | Nov 2003 | B2 |
6657114 | Iwamoto et al. | Dec 2003 | B2 |
6661496 | Sherman et al. | Dec 2003 | B2 |
6751439 | Tice et al. | Jun 2004 | B2 |
6816833 | Iwamoto | Nov 2004 | B1 |
6898637 | Curtin | May 2005 | B2 |
6917912 | Chang et al. | Jul 2005 | B2 |
6928261 | Hasegawa et al. | Aug 2005 | B2 |
6971882 | Kumar et al. | Dec 2005 | B1 |
6975995 | Kim | Dec 2005 | B2 |
7003496 | Ishii et al. | Feb 2006 | B2 |
7068596 | Mou | Jun 2006 | B1 |
7096080 | Asada et al. | Aug 2006 | B2 |
7102072 | Kitayama | Sep 2006 | B2 |
7129408 | Uehara et al. | Oct 2006 | B2 |
7164075 | Tada et al. | Jan 2007 | B2 |
7164076 | McHale et al. | Jan 2007 | B2 |
7294776 | Tohgi et al. | Nov 2007 | B2 |
7297858 | Paepcke | Nov 2007 | B2 |
7483957 | Sako et al. | Jan 2009 | B2 |
7606709 | Yoshioka et al. | Oct 2009 | B2 |
7806759 | McHale et al. | Oct 2010 | B2 |
7825321 | Bloom et al. | Nov 2010 | B2 |
7853342 | Redmann | Dec 2010 | B2 |
7899389 | Mangum | Mar 2011 | B2 |
7928310 | Georges et al. | Apr 2011 | B2 |
7974838 | Lukin et al. | Jul 2011 | B1 |
7989689 | Sitrick et al. | Aug 2011 | B2 |
8290769 | Taub et al. | Oct 2012 | B2 |
8315396 | Schreiner et al. | Nov 2012 | B2 |
8772621 | Wang et al. | Jul 2014 | B2 |
8868411 | Cook et al. | Oct 2014 | B2 |
8983829 | Cook et al. | Mar 2015 | B2 |
8996364 | Cook et al. | Mar 2015 | B2 |
9082380 | Hamilton et al. | Jul 2015 | B1 |
20010013270 | Kumamoto et al. | Aug 2001 | A1 |
20010037196 | Iwamoto | Nov 2001 | A1 |
20020004191 | Tice et al. | Jan 2002 | A1 |
20020032728 | Sako et al. | Mar 2002 | A1 |
20020051119 | Sherman et al. | May 2002 | A1 |
20020056117 | Hasegawa et al. | May 2002 | A1 |
20020177994 | Chang et al. | Nov 2002 | A1 |
20030014262 | Kim | Jan 2003 | A1 |
20030117531 | Rovner et al. | Jun 2003 | A1 |
20030164924 | Sherman et al. | Sep 2003 | A1 |
20040159215 | Tohgi et al. | Aug 2004 | A1 |
20040221710 | Kitayama | Nov 2004 | A1 |
20040263664 | Aratani et al. | Dec 2004 | A1 |
20050123887 | Joung et al. | Jun 2005 | A1 |
20050182504 | Bailey et al. | Aug 2005 | A1 |
20050252362 | McHale et al. | Nov 2005 | A1 |
20060165240 | Bloom et al. | Jul 2006 | A1 |
20060206582 | Finn | Sep 2006 | A1 |
20070028750 | Darcie et al. | Feb 2007 | A1 |
20070065794 | Mangum | Mar 2007 | A1 |
20070098368 | Carley et al. | May 2007 | A1 |
20070150082 | Yang et al. | Jun 2007 | A1 |
20070245881 | Egozy et al. | Oct 2007 | A1 |
20070245882 | Odenwald | Oct 2007 | A1 |
20070250323 | Dimkovic et al. | Oct 2007 | A1 |
20070260690 | Coleman | Nov 2007 | A1 |
20070287141 | Milner | Dec 2007 | A1 |
20070294374 | Tamori | Dec 2007 | A1 |
20080033585 | Zopf | Feb 2008 | A1 |
20080105109 | Li et al. | May 2008 | A1 |
20080156178 | Georges et al. | Jul 2008 | A1 |
20080184870 | Toivola | Aug 2008 | A1 |
20080190271 | Taub et al. | Aug 2008 | A1 |
20080312914 | Rajendran et al. | Dec 2008 | A1 |
20090003659 | Forstall et al. | Jan 2009 | A1 |
20090038467 | Brennan | Feb 2009 | A1 |
20090106429 | Siegal et al. | Apr 2009 | A1 |
20090107320 | Willacy et al. | Apr 2009 | A1 |
20090164034 | Cohen | Jun 2009 | A1 |
20090165634 | Mahowald | Jul 2009 | A1 |
20090317783 | Noguchi | Dec 2009 | A1 |
20100087240 | Egozy et al. | Apr 2010 | A1 |
20100126331 | Golovkin et al. | May 2010 | A1 |
20100142926 | Coleman | Jun 2010 | A1 |
20100192753 | Gao et al. | Aug 2010 | A1 |
20100203491 | Yoon | Aug 2010 | A1 |
20100255827 | Jordan | Oct 2010 | A1 |
20100326256 | Emmerson | Dec 2010 | A1 |
20110126103 | Cohen | May 2011 | A1 |
20110144981 | Salazar et al. | Jun 2011 | A1 |
20110144982 | Salazar et al. | Jun 2011 | A1 |
20110144983 | Salazar et al. | Jun 2011 | A1 |
20110203444 | Yamauchi | Aug 2011 | A1 |
Number | Date | Country |
---|---|---|
2493470 | Feb 2013 | GB |
WO2009003347 | Jan 2009 | WO |
Entry |
---|
Gaye, L. et al., “Mobile music technology: Report on an emerging community,” Proceedings of the International Conference on New Interfaces for Musical Expression, pp. 22-25, Paris, France, 2006. |
G. Wang et al., “MoPhO: Do Mobile Phones Dream of Electric Orchestras?” In Proceedings of the International Computer Music Conference, Belfast, Aug. 2008. |
Jason Snell, “Best 3D Touch Apps for the iPhone 6s and 6s Plus,” Nov. 6, 2015 (retrieved Sep. 26, 2016), Tom's Guide, pp. 1-15, http://www.tomsguide.com/. |
Wang, Ge, “Designing Smule's iPhone Ocarina,” New Interfaces for Musical Expression (NIME09), Jun. 3-6, 2009, Pittsburgh, PA, 5 pages. |
International Search Report and Written Opinion mailed in International Application No. PCT/US1060135 dated Feb. 8, 2011, 17 pages. |
“Auto-Tune: Intonation Correcting Plug-In.” User's Manual. Antares Audio Technologies. 2000. Print. p. 1-52. |
Antares, “Auto-Tune Real Time Auto-Tune Vocal Effect and Pitch Correcting Plug-In”, Antares Audio Technologies, 2008. |
Ananthapadmanabha, Tirupattur V. et al. “Epoch Extraction from Linear Prediction Residual for Identification of Closed Glottis Interval.” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27:4. Aug. 1979. Print. p. 309-319. |
Atal, Bishnu S. “The History of Linear Prediction.” IEEE Signal Processing Magazine. vol. 154, Mar. 2006. Print. p. 154-161. |
Baran, Tom, Autotalent v0.2, Digital Signal Processing Group, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, http://web.mit.edu/tbaran/www/autotalent.html, Jan. 31, 2011. |
Baran, Tom. “Autotalent v0.2: Pop Music in a Can!” Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. May 22, 2011. Web. <http://web.mit.edu/tbaran/www/autotalent.html>. Accessed Jul. 5, 2011. p. 1-5. |
Cheng, M.J. “Some Comparisons Among Several Pitch Detection Algorithms.” Bell Laboratories. Murray Hill, NJ. 1976. p. 332-335. |
Clark, Don. “MuseAmi Hopes to Take Music Automation to New Level.” The Wall Street Journal, Digits, Technology News and Insights, Mar. 19, 2010 Web. Accessed Jul. 6, 2011 <http://blogs.wsj.com/digits/2010/03/19/museami-hopes-to-takes-music-automation-to-new-level/>. |
Conneally, Tim. “The Age of Egregious Auto-tuning: 1998-2009.” Tech Gear News—Betanews. Jun. 15, 2009. Web. <http://www.betanews.com/article/the-age-of-egregious-autotuning-19982009/1245090927>. Accessed Dec. 10, 2009. |
Gerhard, David. “Pitch Extraction and Fundamental Frequency: History and Current Techniques.” Department of Computer Science, University of Regina, Saskatchewan, Canada. Nov. 2003. Print. p. 1-22. |
International Search Report mailed in International Application No. PCT/US2011/032185 dated Aug. 17, 2011, 6 pages. |
Johnson, Joel. “Glee on iPhone More than Good—It's Fabulous.” Apr. 15, 2010. Web. <http://gizmodo.com/5518067/glee-on-iphone-more-than-goodits-fabulous>. Accessed Jun. 28, 2011. p. 1-3. |
Bristow-Johnson, Robert. “A Detailed Analysis of a Time-Domain Formant Corrected Pitch Shifting Algorithm” AES: An Audio Engineering Society Preprint. Oct. 1993. Print. 24 pages. |
Kuhn, William. “A Real-Time Pitch Recognition Algorithm for Music Applications.” Computer Music Journal, vol. 14, No. 3, Fall 1990, Massachusetts Institute of Technology, Print. p. 60-71. |
Kumparak, Greg. “Gleeks Rejoice! Smule Packs Fox's Glee Into a Fantastic iPhone Application” MobileCrunch. Apr. 15, 2010. Web. Accessed Jun. 28, 2011 <http://www.mobilecrunch.com/2010/04/15/gleeks-rejoice-smule-packs-foxs-glee-into-a-fantastic-iphone-app/>. |
Lent, Keith. “An Efficient Method for Pitch Shifting Digitally Sampled Sounds.” Departments of Music and Electrical Engineering, University of Texas at Austin. Computer Music Journal, vol. 13:4, Winter 1989, Massachusetts Institute of Technology. Print. p. 65-71. |
McGonegal, Carol A. et al. “A Semiautomatic Pitch Detector (SAPD).” Bell Laboratories. Murray Hill, NJ. May 19, 1975. Print. p. 570-574. |
Rabiner, Lawrence R. “On the Use of Autocorrelation Analysis for Pitch Detection.” IEEE Transactions on Acoustics, Speech, and Signal Processing. vol. Assp-25:1, Feb. 1977. Print. p. 24-33. |
Shaffer, H. and Ross, M. and Cohen, A. “AMDF Pitch Extractor.” 85th Meeting Acoustical Society of America. vol. 54:1, Apr. 13, 1973. Print. p. 340. |
Trueman, Daniel. et al. “PLOrk: the Princeton Laptop Orchestra, Year 1.” Music Department, Princeton University. 2009. Print. 10 pages. |
Wortham, Jenna. “Unleash Your Inner Gleek on the iPad.” Bits, The New York Times. Apr. 15, 2010. Web. <http://bits.blogs.nytimes.com/2010/04/15/unleash-your-inner-gleek-on-the-ipad/>. Accessed Jun. 28, 2011. p. 1-2. |
Ying, Goangshiuan S. et al. “A Probabilistic Approach to AMDF Pitch Detection.” School of Electrical and Computer Engineering, Purdue University. 1996. Web. <http://purcell.ecn.purdue.edu/~speechg>. Accessed Jul. 5, 2011. 5 pages. |
Examination Report issued in Canadian Application No. 2796241, dated Dec. 20, 2017, 4 pages. |
Number | Date | Country | |
---|---|---|---|
20180204584 A1 | Jul 2018 | US |
Number | Date | Country | |
---|---|---|---|
61323348 | Apr 2010 | US | |
61323348 | Apr 2010 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14517647 | Oct 2014 | US |
Child | 15849194 | US | |
Parent | 13085413 | Apr 2011 | US |
Child | 14517647 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12876132 | Sep 2010 | US |
Child | 13085413 | US |