This application claims priority to Korean Patent Application No. 10-2021-0183129, filed on Dec. 20, 2021, in the Korean Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.
Example embodiments of the present disclosure relate to apparatuses and methods for classifying speakers by using an acoustic sensor.
Acoustic sensors, which are mounted in household appliances, image display devices, virtual reality devices, augmented reality devices, artificial intelligence speakers, and the like to detect a direction from which sounds are coming and recognize voices, are used in an increasing number of areas. Recently, a directional acoustic sensor that detects sound by converting a mechanical movement caused by a pressure difference into an electrical signal has been developed.
One or more example embodiments provide apparatuses and methods for classifying speakers by using an acoustic sensor.
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of example embodiments of the disclosure.
According to an aspect of an example embodiment, there is provided a speaker classifying apparatus including an acoustic sensor, and a processor configured to obtain a first direction of a sound source within an error range of −5 degrees to +5 degrees based on a first output signal output from the acoustic sensor, recognize a speech of a first speaker in the first direction, obtain a second direction of the sound source within the error range of −5 degrees to +5 degrees based on a second output signal output after the first output signal, and recognize a speech of a second speaker in the second direction based on the second direction being different from the first direction.
The processor may be further configured to recognize a change of a speaker based on the first direction or the second direction being maintained or changed with respect to continuous output signals.
The processor may be further configured to register the first speaker and a recognized voice of the first speaker based on the speech of the first speaker being recognized.
The processor may be further configured to determine a similarity between a voice corresponding to the second output signal and a registered voice of the first speaker.
The processor may be further configured to recognize a speech of a second speaker in the second direction based on the second direction being different from the first direction and the similarity being less than a first threshold.
The processor may be further configured to recognize the speech of the first speaker based on the similarity being greater than a second threshold value.
The processor may be further configured to recognize voices respectively corresponding to the speech of the first speaker and the speech of the second speaker, and classify the recognized voices based on speakers.
The acoustic sensor may include at least one directional acoustic sensor.
The acoustic sensor may include a non-directional acoustic sensor and a plurality of directional acoustic sensors.
The non-directional acoustic sensor may be provided at a center of the speaker classifying apparatus, and the plurality of directional acoustic sensors may be provided adjacent to the non-directional acoustic sensor.
The first direction and the second direction may be estimated to be different from each other based on a number and arrangement of the plurality of directional acoustic sensors.
A directional shape of output signals of the plurality of directional acoustic sensors may include a figure-of-8 shape regardless of a frequency of a sound source.
According to another aspect of an example embodiment, there is provided a minutes taking apparatus using an acoustic sensor, the minutes taking apparatus including an acoustic sensor, and a processor configured to obtain a first direction of a sound source within an error range of −5 degrees to +5 degrees based on a first output signal output from the acoustic sensor and recognize a speech of a first speaker in the first direction, obtain a second direction of the sound source within the error range of −5 degrees to +5 degrees based on a second output signal output after the first output signal, and when the second direction is different from the first direction, recognize a speech of a second speaker in the second direction, and recognize voices respectively corresponding to the speech of the first speaker and the speech of the second speaker and take minutes by converting the recognized voices into text.
The processor may be further configured to recognize a change of a speaker based on the first direction or the second direction being maintained or changed with respect to continuous output signals.
The processor may be further configured to determine a similarity between a recognized voice of the first speaker and a voice of the second output signal.
The processor may be further configured to recognize the second output signal as the speech of the first speaker when the similarity is greater than a threshold value, and recognize the second output signal as the speech of the second speaker when the similarity is less than the threshold value.
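As a non-limiting illustration of the threshold comparison described above, the following minimal sketch attributes the second output signal to the first speaker or to the second speaker. The representation of a voice as a numeric feature vector, the use of cosine similarity, the threshold value of 0.8, and the function names are all assumptions made for illustration and are not specified in the present disclosure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute_speech(registered_voice, new_voice, threshold=0.8):
    """Attribute new_voice to the first or the second speaker.

    registered_voice: hypothetical feature vector of the first speaker.
    new_voice: hypothetical feature vector of the second output signal.
    threshold: assumed value; the disclosure does not fix a number.
    """
    similarity = cosine_similarity(registered_voice, new_voice)
    if similarity > threshold:
        return "first speaker"
    if similarity < threshold:
        return "second speaker"
    # Equality is not addressed in the text; treated here as undetermined.
    return "undetermined"
```

For example, under these assumptions, `attribute_speech([1.0, 0.0], [0.0, 1.0])` would attribute the second output signal to the second speaker because the two vectors are orthogonal.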
According to another aspect of an example embodiment, there is provided a speaker classifying method using an acoustic sensor, the speaker classifying method including obtaining a first direction of a sound source within an error range from −5 degrees to +5 degrees based on a first output signal output from the acoustic sensor, recognizing a speech of a first speaker in the first direction, obtaining a second direction of the sound source within the error range from −5 degrees to +5 degrees based on a second output signal output after the first output signal, and recognizing, based on the second direction being different from the first direction, a speech of a second speaker in the second direction.
According to another aspect of an example embodiment, there is provided a minutes taking method using an acoustic sensor, the minutes taking method including obtaining a first direction of a sound source within an error range from −5 degrees to +5 degrees based on a first output signal output from the acoustic sensor, recognizing a speech of a first speaker in the first direction, obtaining a second direction of the sound source within the error range from −5 degrees to +5 degrees based on a second output signal output after the first output signal, recognizing a speech of a second speaker in the second direction based on the second direction being different from the first direction, recognizing voices respectively corresponding to the speech of the first speaker and the speech of the second speaker, and taking minutes by converting the recognized voices into text.
An electronic device may include the speaker classifying apparatus.
An electronic device may include the minutes taking apparatus.
The above and/or other aspects, features, and advantages of example embodiments of the disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the example embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the example embodiments are merely described below, by referring to the figures, to explain aspects. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. For example, the expression, “at least one of a, b, and c,” should be understood as including only a, only b, only c, both a and b, both a and c, both b and c, or all of a, b, and c.
The terms used in the example embodiments below are those general terms currently widely used in the art in consideration of functions in regard to the present embodiments, but the terms may vary according to the intention of those of ordinary skill in the art, precedents, or new technology in the art. Also, specified terms may be selected arbitrarily, and in this case, the detailed meaning thereof will be described in the detailed description of the relevant example embodiment. Thus, the terms used in the example embodiments should be understood not as simple names but based on the meaning of the terms and the overall description of the embodiments.
It will also be understood that when an element is referred to as being “on” or “above” another element, the element may be in direct contact with the other element or other intervening elements may be present. The singular forms include the plural forms unless the context clearly indicates otherwise.
In the description of the example embodiments, when a portion “connects” or is “connected” to another portion, the portion contacts or is connected to the other portion not only directly but also electrically through at least one of other portions interposed therebetween.
Herein, the terms such as “comprise” or “include” should not be construed as necessarily including various elements or processes described in the specification, and it should be construed that some of the elements or the processes may not be included, or additional elements or processes may be further included.
In the description of the example embodiments, terms including ordinal numbers such as “first”, “second”, etc. are used to describe various elements but the elements should not be defined by these terms. The terms are used only for distinguishing one element from another element.
In the example embodiments, an acoustic sensor may be a microphone, and may refer to an apparatus that receives a sound wave, which is a wave in air, and converts the sound wave into an electrical signal.
In the example embodiments, an acoustic sensor assembly may be used to indicate a device including a processor for controlling an acoustic sensor or a microphone, and calculating or obtaining necessary functions. In addition, the acoustic sensor assembly may refer to an apparatus for classifying speakers or an apparatus for taking minutes of a meeting by using the acoustic sensor according to an example embodiment.
The example embodiments relate to an acoustic sensor assembly, and detailed descriptions of matters widely known to those of ordinary skill in the art to which the following embodiments belong are omitted.
In the example embodiments, speaker classification may be recognizing a plurality of speakers by using directivity information or directions of speeches.
In the example embodiments, taking minutes may refer to recognizing a plurality of speakers by using directivity information or directions of speeches of the speakers, distinguishing between the speeches of the speakers, recognizing the voices of the respective speakers, and converting the voices into text.
Description of the following example embodiments should not be construed as limiting or defining the scope of the present disclosure, and details that are easily derivable by one of ordinary skill in the art to which the present disclosure pertains are construed as being in the scope of the embodiments. Hereinafter, example embodiments that are just for illustration are described in detail with reference to the attached drawings.
Referring to
The plurality of resonators 102 may be arranged in the cavity 105 of the support 101 in a certain form. The resonators 102 may be arranged two-dimensionally without overlapping each other. As illustrated in
The resonators 102 may be provided to sense, for example, acoustic frequencies of different bands. For example, the resonators 102 may be provided to have different center frequencies or resonance frequencies. To this end, the resonators 102 may be provided to have different dimensions from each other. For example, the resonators 102 may be provided to have different lengths, widths or thicknesses from each other.
Dimensions, such as widths or thicknesses of the resonators 102, may be set by considering a desired resonance frequency with respect to the resonators 102. For example, the resonators 102 may have dimensions, such as a width from about several μm to several hundreds of μm, a thickness of several μm or less, and a length of about several mm or less. The resonators 102 having fine sizes may be manufactured by a micro electro mechanical system (MEMS) process.
Hereinafter, an efficient structure and operation of a speaker classifying apparatus and a minutes taking apparatus according to the present disclosure are described in detail with reference to the drawings.
Referring to
The non-directional acoustic sensor 42 may sense sound in all directions surrounding the non-directional acoustic sensor 42. The non-directional acoustic sensor 42 may have directivity for uniformly sensing sound in all directions. For example, the directivity for uniformly sensing sound in all directions may be omni-directional or non-directional.
The sound sensed using the non-directional acoustic sensor 42 may be output as a same output signal from the non-directional acoustic sensor 42, regardless of a direction in which the sound is input. Accordingly, a sound source reproduced based on the output signal of the non-directional acoustic sensor 42 may not include information on directions.
A directivity of an acoustic sensor may be expressed using a directional pattern, and the directional pattern may refer to a pattern indicating a direction in which an acoustic sensor may receive a sound source.
A directional pattern may be illustrated to identify sensitivity of an acoustic sensor according to a direction in which sound is transmitted based on a 360° space surrounding the acoustic sensor having the directional pattern. For example, a directional pattern of the non-directional acoustic sensor 42 may be illustrated in a circle to indicate that the non-directional acoustic sensor 42 has the same sensitivity to sounds transmitted 360° omni-directionally. A specific application of the directional pattern of the non-directional acoustic sensor 42 will be described later with reference to
Each of the plurality of directional acoustic sensors 43a, 43b, 43n may have a same configuration as the directional acoustic sensor 10 illustrated in
The plurality of directional acoustic sensors 43a, 43b, 43n may be arranged adjacent to and to surround the non-directional acoustic sensor 42. The number and arrangement of directional acoustic sensors 43a, 43b, 43n will be described later in detail with reference to
The processor 41 controls the overall operation of the apparatus 4 and performs signal processing. The processor 41 may select at least one of output signals of acoustic sensors having different directivities, thereby calculating an acoustic signal having a same directivity as those of the non-directional acoustic sensor 42 and the plurality of directional acoustic sensors 43a, 43b, 43n. An acoustic signal having a directional pattern of an acoustic sensor corresponding to an output signal selected by the processor 41 may be calculated based on the output signal selected by the processor 41. For example, the selected output signal may be identical to the acoustic signal. The processor 41 may adjust directivity by selecting a directional pattern of the apparatus 4 as a directional pattern of an acoustic sensor corresponding to the selected output signal, and may reduce or loudly sense sound transmitted in a certain direction according to situations.
An acoustic signal refers to a signal including information about directivity, like output signals of the non-directional acoustic sensor 42 and the plurality of directional acoustic sensors 43a, 43b, 43n, and some of the output signals may be selected and determined as acoustic signals or may be newly calculated based on calculation of some of the output signals. A directional pattern of an acoustic signal may be in a same shape as directional patterns of the non-directional acoustic sensor 42 and the plurality of directional acoustic sensors 43a, 43b, 43n or in a different shape, and have a same or different directivity. For example, there is no limitation on a directional pattern or directivity of an acoustic signal.
The processor 41 may obtain output signals of the non-directional acoustic sensor 42 and/or the plurality of directional acoustic sensors 43a, 43b, 43n, and may calculate an acoustic signal having a different directivity from those of the non-directional acoustic sensor 42 and the plurality of directional acoustic sensors 43a, 43b, 43n included in the apparatus 4 by selectively combining the obtained output signals. For example, the processor 41 may calculate an acoustic signal having a different directional pattern from directional patterns of the non-directional acoustic sensor 42 and the plurality of directional acoustic sensors 43a, 43b, 43n. The processor 41 may calculate an acoustic signal having a directional pattern oriented toward a front of a directional acoustic sensor (e.g., 43a), depending on the situation.
The processor 41 may calculate or obtain an acoustic signal by calculating at least one of a sum of and a difference between certain ratios of an output signal of the non-directional acoustic sensor 42 and output signals of the plurality of directional acoustic sensors 43a, 43b, 43n.
The processor 41 may obtain sound around the apparatus 4 by using an acoustic signal. The processor 41 may obtain ambient sound by distinguishing a direction of a sound transmitted to the apparatus 4 by using an acoustic signal. For example, when the processor 41 records a sound source transmitted from the right side of the apparatus 4 and provides the recorded sound source to a user, the user may hear the sound source as if the sound source is coming from the right side of the user. When the processor 41 records a sound source circling the apparatus 4 and provides the recorded sound source to the user, the user may hear the sound source as if the sound source is circling the user.
The processor 41 may obtain a first direction of a sound source within an error range of −5 degrees to +5 degrees based on a first output signal output from an acoustic sensor, and recognize a speech of a first speaker in the first direction, and obtain a second direction of the sound source within the error range of −5 degrees to +5 degrees based on a second output signal output after the first output signal, and when the second direction is different from the first direction, the processor 41 may recognize a speech of a second speaker in the second direction. Here, a criterion for determining whether the first direction is different from the second direction may be whether the second direction falls outside a range of ±5 degrees around the first direction. For example, when the first direction is 30 degrees, and the second direction is 36 degrees, it may be determined that the first direction is different from the second direction. However, the criterion for determining whether detected directions are the same or different is not limited thereto, and may be appropriately defined according to applications and specifications of an apparatus.
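The direction comparison described above may be sketched, purely for illustration, as follows. The function names are hypothetical, and the wrap-around handling at 0/360 degrees is an assumption that the description does not address; the 5 degree tolerance is taken from the example above and may be defined differently according to the application.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_different_direction(first_deg, second_deg, tolerance_deg=5.0):
    """True when the second direction lies outside +/- tolerance of the first."""
    return angular_difference(first_deg, second_deg) > tolerance_deg
```

With these assumptions, a first direction of 30 degrees and a second direction of 36 degrees are determined to be different, whereas 30 degrees and 34 degrees are treated as the same direction.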
In addition, the processor 41 may obtain a first direction of a sound source within an error range of −5 degrees to +5 degrees based on a first output signal output from an acoustic sensor, and recognize a speech of a first speaker in the first direction, and obtain a second direction of the sound source within the error range of −5 degrees to +5 degrees based on a second output signal output after the first output signal. When the second direction is different from the first direction, the processor 41 may recognize a speech of a second speaker in the second direction, and may take minutes by recognizing voices respectively corresponding to the speech of the first speaker and the speech of the second speaker, and converting recognized voices into text.
The processor 41 may estimate a direction of a sound source by using various algorithms according to the number and arrangement of directional acoustic sensors.
The processor 41 may include a single processor core (single-core) or a plurality of processor cores (multi-core). The processor 41 may process or execute programs and/or data stored in a memory. In some example embodiments, the processor 41 may control a function of the apparatus 4 by executing programs stored in a memory. The processor 41 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or the like.
The processor 41 may detect a direction of a sound source by using various methods. The method of adjusting directivity, by a directional acoustic sensor, may be referred to as time difference of arrival (TDOA).
However, the above method is based on the assumption that there is a difference in times that sound reaches each acoustic sensor. Therefore, there may be a restriction on setting a distance between acoustic sensors as the distance needs to be set by considering a wavelength of an audible frequency band. The restriction on setting a distance between acoustic sensors may also limit providing a compact size of a device performing the above method. In particular, as a low frequency has a longer wavelength, to distinguish a sound of a low frequency, a distance between acoustic sensors needs to be relatively broad and a signal-to-noise ratio (SNR) of each acoustic sensor needs to be relatively high. Moreover, as phases differ according to frequency bands of sound sensed by each acoustic sensor in the TDOA, the phases may have to be compensated for with respect to each frequency band. In order to compensate for the phase of each frequency, a complex signal processing process of applying an appropriate weight to each frequency may be necessary in the method described above.
In addition, to estimate a direction of a sound source by using TDOA, a signal in an array of a plurality of non-directional microphones is frequently used. A time delay between signals obtained by each microphone may be calculated, and a direction from which a sound source came is estimated based on the time delay. However, the accuracy of the direction estimation is dependent on the size of the array (distance between the microphones) and the time delay.
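For illustration only, a TDOA-based estimate for a pair of non-directional microphones may be sketched as below. The delay is taken as the lag that maximizes the cross-correlation of the two signals, and the angle follows from the far-field relation sin(θ) = c·τ/d. The speed of sound of 343 m/s, the far-field assumption, the measurement of the angle from the broadside of the microphone pair, and all function names are assumptions and are not taken from the present disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Lag (in samples) of sig_b relative to sig_a that maximizes cross-correlation.

    A positive result means sig_b is a delayed copy of sig_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_from_delay(delay_samples, sample_rate_hz, mic_distance_m):
    """Far-field arrival angle in degrees, measured from broadside of the pair."""
    tau = delay_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND * tau / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))
```

Under these assumptions, the dependence on array size noted above becomes visible: at a 48 kHz sampling rate and a 3 cm spacing, a delay of a single sample already corresponds to roughly 14 degrees, so the angular resolution is coarse unless the spacing is increased or the delay is interpolated.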
Another method is to estimate a direction of a sound source based on an intensity difference. This method uses a difference between intensities or levels measured by each microphone to estimate a direction. The direction from which a sound source came may be determined based on the magnitude of a signal measured in a time domain. As a difference in signal magnitude between the microphones is used, gain calibration needs to be done very accurately, and a large number of microphones may be needed to improve performance.
When using the TDOA-based direction estimation method, the principle of generating a difference in phases between the microphones for each frequency of a sound source according to the size of the microphone array is utilized. Therefore, the size of the array and a wavelength of a sound source to be estimated have a physical relationship, and the size of the array determines the direction estimation performance.
A method of utilizing a time difference or intensity difference between microphones requires a large number of microphones and an increased array size in order to improve the direction estimation performance. In addition, in the time difference-based estimation method, a digital signal processing device is required to calculate different time delays and phase differences for each frequency, and the performance of the device may also be a factor that limits the direction estimation performance.
In addition, as a direction estimation method using an acoustic sensor, a direction estimation algorithm using a directional/non-directional microphone array may be used. For example, by using a channel module including one non-directional microphone and a plurality of, or at least two, directional microphones, a direction of a sound source coming from 360 degrees omni-directionally is detected. In an example embodiment, by utilizing the fact that a directional shape of a directional microphone is figure-of-8, regardless of frequency, a direction of a sound source may be estimated based on power of the sound source. Therefore, the direction of the sound source may be estimated by using an array having a small size, for example, an array within 3 cm, and with a relatively high accuracy, and voice separation based on spatial information may also be performed.
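One way such a power-based estimate could work is sketched below as a non-limiting illustration; it is not presented as the algorithm of the example embodiment. The sketch assumes one non-directional microphone and two ideal figure-of-8 microphones oriented at 0 degrees and 90 degrees, with no noise. Because the assumed directional response cos(θ − orientation) does not depend on frequency, correlating each directional output with the non-directional output yields quantities proportional to cos θ and sin θ. All function names are hypothetical.

```python
import math


def simulate_outputs(source, direction_deg):
    """Ideal, noise-free outputs for a source arriving from direction_deg.

    Returns (omni, fig8_x, fig8_y), where fig8_x faces 0 degrees and
    fig8_y faces 90 degrees (assumed arrangement).
    """
    theta = math.radians(direction_deg)
    omni = list(source)
    fig8_x = [s * math.cos(theta) for s in source]
    fig8_y = [s * math.sin(theta) for s in source]
    return omni, fig8_x, fig8_y


def estimate_direction(omni, fig8_x, fig8_y):
    """Direction in degrees (0 to 360) from cross-power with the omni signal."""
    px = sum(o * d for o, d in zip(omni, fig8_x))  # proportional to cos(theta)
    py = sum(o * d for o, d in zip(omni, fig8_y))  # proportional to sin(theta)
    return math.degrees(math.atan2(py, px)) % 360.0
```

In this sketch, the sign of each cross-power term resolves the front/back ambiguity of the figure-of-8 pattern, and no inter-microphone time delay is used, which is consistent with the small array size mentioned above.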
In an example embodiment, a direction of a speaker or a sound source may be detected through an acoustic sensor, for example, a non-directional acoustic sensor, a directional acoustic sensor, or a combination of a non-directional acoustic sensor and a plurality of directional acoustic sensors. Here, the detected direction may be detected with accuracy having an error range of −5 degrees to +5 degrees. Hereinafter, direction detection based on a directional acoustic sensor or a combination of a non-directional acoustic sensor and a directional acoustic sensor and generation of an output signal having directivity are described, but embodiments are not limited thereto, and other various direction detection methods may also be applied.
The processor 41 may calculate an acoustic signal having a directional pattern oriented toward the front direction of the directional acoustic sensor 10 (e.g., +z direction of
The non-directional acoustic sensor 42 is oriented in all directions, and thus, there may be no difference in output signals regardless of a direction in which sound is transmitted. However, for convenience of description below, the front direction of the directional acoustic sensor 10 will be assumed to be identical to a front direction of the non-directional acoustic sensor 42.
For example, the processor 41 may calculate an acoustic signal having a uni-directional pattern 83 by calculating a sum of 1:1 ratios of an output signal of the non-directional acoustic sensor 42 and an output signal of the directional acoustic sensor 10. The uni-directional pattern 83 may have a directivity facing the front of the directional acoustic sensor 10. However, the uni-directional pattern 83 may include a directional pattern covering a broader range to the left and the right, compared to a front portion of the bi-directional pattern 81. For example, the uni-directional pattern 83 may include a cardioid directional pattern.
The directional acoustic sensor 10 may include the bi-directional pattern 81, and the non-directional acoustic sensor 42 may include the omni-directional pattern 82. The directional acoustic sensor 10 may sense a sound that is in-phase with a phase of a sound sensed by the non-directional acoustic sensor 42 from a front direction of the bi-directional pattern 81 (e.g., +z direction of
Referring to
Referring back to
For example, the processor 41 may calculate an acoustic signal having a uni-directional pattern 84 by calculating a difference between 1:1 ratios of an output signal of the non-directional acoustic sensor 42 and an output signal of the directional acoustic sensor 10. Opposite to the uni-directional pattern 83 of
While a method of calculating an acoustic signal having a uni-directional pattern by calculating a sum of or a difference between an output of the directional acoustic sensor 10 and an output of the non-directional acoustic sensor 42 is described above, this is merely an example, and the control of directivity is not limited to the method described above.
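The sum and difference operations described above may be illustrated with idealized directional responses. The sketch below assumes that the non-directional acoustic sensor has a response of 1 in every direction and that the directional acoustic sensor has a signed response of cos θ, where θ = 0 is the front direction; these idealized responses and the function names are assumptions for illustration only.

```python
import math


def omni_response(theta_deg):
    """Idealized omni-directional response: equal sensitivity in all directions."""
    return 1.0


def figure8_response(theta_deg):
    """Idealized bi-directional response: in-phase at the front, anti-phase at the rear."""
    return math.cos(math.radians(theta_deg))


def combined_response(theta_deg, sign=+1):
    """1:1 sum (sign=+1) or difference (sign=-1) of the two responses."""
    return omni_response(theta_deg) + sign * figure8_response(theta_deg)
```

Under these assumptions, the sum gives 1 + cos θ, which is largest at the front and vanishes at the rear (a cardioid facing the front), while the difference gives 1 − cos θ, which is largest at the rear and vanishes at the front.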
The processor 41 may calculate an acoustic signal having a new bi-directional pattern differing from bi-directivity of respective directional acoustic sensors by selecting only a non-directional pattern, or selecting only a bi-directional pattern of a directional acoustic sensor oriented toward a certain direction, or calculating output signals of directional acoustic sensors, according to situations.
Example embodiments relate to speaker classification for classifying speakers by using an acoustic sensor and to taking minutes based on the same. According to related art, in order to automatically take minutes, a method of recording the entire meeting and performing speaker diarization to perform speaker verification on each speech is used. Various methods from general principal components analysis (PCA) to deep learning methods are used. In the method according to related art, when there is a recording of the entire meeting, speeches may be segmented by finding disconnections in the speeches through the speaker diarization technique, and the speeches may be classified for each speaker through the speaker verification technique.
The method according to related art involves processing data after acquiring all data, and thus has a security risk. From the standpoint of providing a service, data is sent to a cloud for computation to reduce deviations between devices, guarantee performance, and protect the service provider's own algorithm. For this reason, security-conscious companies and users may be reluctant to send their minutes to a server of another company. In addition, even when an algorithm is made lightweight and applied in an on-device form, the algorithm still runs as an additional component, and thus, the overall system becomes heavy. Finally, the algorithm according to related art has a problem in that the number of participants needs to be specified by a human.
To address the problems of taking minutes according to related art described above, the example embodiments provide a method of automatically classifying speakers by using directivity information or direction information of an acoustic sensor and enabling minutes to be taken in real time based on the classification.
Referring to
The speech detection unit 1000 detects that a voice has started to arrive while the surroundings of the acoustic sensor are in a state of silence.
The direction detection unit 1010 detects a direction from which a voice is coming, by using directivity information or direction information of the acoustic sensor. Here, the direction may be detected based on directivity information of an output signal output from the acoustic sensor. As described above, for direction detection by an acoustic sensor, a TDOA-based direction estimation technique, a direction estimation technique using a combination of a non-directional acoustic sensor and a plurality of directional acoustic sensors, and the like, may be used, but embodiments are not limited thereto.
The speaker recognition unit 1020 classifies speakers by labeling directions.
Referring to
When a voice corresponding to the first output signal is input, a direction of the first output signal, for example, 30 degrees, is detected, and the detected direction, 30 degrees, is registered as Speaker 1 (SPK1). In a next signal, it is determined that the voice of Speaker 1 is input from the 30 degree-direction. When a direction of a third output signal is changed (1110), that is, when a 90 degree-direction is detected in the third output signal, Speaker 2 (SPK 2) is registered. When a direction of a fourth output signal is still 90 degrees, it is determined that the voice of Speaker 2 is input. When a direction of a fifth output signal is changed (1120), and the fifth output signal is in the 30 degrees-direction, it is determined that the voice of Speaker 1 is input again. When a direction of a sixth output signal is changed (1130), and the sixth output signal is detected in a 180 degrees-direction, Speaker 3 (SPK 3) is registered. When a direction of a seventh output signal is still 180 degrees, it is determined that the voice of Speaker 3 is input. When a direction of an eighth output signal is changed (1140), and the eighth output signal is in the 30 degrees-direction, it is determined that the voice of Speaker 1 is input again.
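The registration sequence described above may be sketched as follows, as a non-limiting illustration. Each detected direction is compared against the directions already registered; a direction within the tolerance of a registered direction is attributed to that speaker, and otherwise a new speaker is registered. The function names, the 5 degree tolerance, and the wrap-around handling are assumptions for illustration.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def label_speakers(directions_deg, tolerance_deg=5.0):
    """Label each detected direction as SPK1, SPK2, ... in order of registration."""
    registered = []  # registered[i] is the direction of speaker SPK(i+1)
    labels = []
    for direction in directions_deg:
        for index, known in enumerate(registered):
            if angular_difference(direction, known) <= tolerance_deg:
                labels.append(f"SPK{index + 1}")
                break
        else:
            # No registered direction matches: register a new speaker.
            registered.append(direction)
            labels.append(f"SPK{len(registered)}")
    return labels
```

For the eight output signals of the example above, `label_speakers([30, 30, 90, 90, 30, 180, 180, 30])` yields SPK1, SPK1, SPK2, SPK2, SPK1, SPK3, SPK3, SPK1, without the number of participants being specified in advance.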
In an example embodiment, speakers may be distinguished by using only directivity information of an acoustic sensor, and it is possible to classify the speakers without undergoing a complicated calculation or post-processing at the server's end. Therefore, embodiments of the present disclosure may be more effectively applied to searching for a certain sound or a certain person's voice.
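The direction-labeling sequence described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `angular_distance` and `label_speakers` are hypothetical, and the 5-degree tolerance is assumed to mirror the error range of −5 degrees to +5 degrees.

```python
from typing import List

TOLERANCE_DEG = 5.0  # assumed to mirror the -5 to +5 degree error range


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def label_speakers(directions: List[float]) -> List[str]:
    """Assign a speaker label to each detected direction, in input order."""
    registered: List[float] = []  # direction at which each speaker was registered
    labels: List[str] = []
    for direction in directions:
        for index, known in enumerate(registered):
            if angular_distance(direction, known) <= TOLERANCE_DEG:
                labels.append(f"SPK{index + 1}")  # known direction: existing speaker
                break
        else:
            registered.append(direction)  # new direction: register a new speaker
            labels.append(f"SPK{len(registered)}")
    return labels


# Directions of the eight output signals in the example above.
print(label_speakers([30, 30, 90, 90, 30, 180, 180, 30]))
```

With the eight directions of the example, the sketch yields SPK1, SPK1, SPK2, SPK2, SPK1, SPK3, SPK3, SPK1, using only comparisons of angles.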
Referring to
The voice recognition unit 1030 recognizes a voice with respect to an output signal output from the acoustic sensor. Here, as described with reference to
The voice recognition unit 1030 may perform three steps, namely pre-processing, pattern recognition, and post-processing, in order to receive a voice signal and output the same in the form of a sentence. Through pre-processing, noise is removed and features are extracted from the voice signal. Through pattern recognition, the features are recognized in the form of elements necessary to construct a sentence. Through post-processing, the elements are combined and expressed in the form of sentences.
The pre-processing (transformation and feature extraction) process is a process of extracting features in a time domain and a frequency domain from a voice signal, as in auditory systems. The pre-processing process plays the role of the cochlea in the auditory system and includes extracting information about periodicity and synchronization of voice signals.
In the pattern recognition (resultant value calculation) process, phonemes, syllables, and words, which are elements necessary to construct a sentence, are recognized based on the features obtained through pre-processing of a voice signal. To this end, a variety of template (for example, dictionary)-based algorithms that draw on phonetics, phonology, phonological arrangement theory, and prosodic requirements may be used. For example, a pattern recognition process may include an approach through dynamic programming (dynamic time warping (DTW)), an approach through probability estimation (hidden Markov model (HMM)), an approach through inference using artificial intelligence, an approach through pattern classification, and the like.
The post-processing (language processing, sentence restoration) process includes restoring a sentence by reconstructing phonemes, syllables, and words that are results of pattern recognition. To this end, syntax, semantics, and morphology are used. To construct a sentence, rules-based and statistics-based models are used. According to a syntactic model, sentences are constructed by limiting the types of words that can come after each word, and according to a statistical model, sentences are recognized by considering the probability of occurrence of each word given the N words before it.
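As an illustration of the dynamic programming approach named above, the following is a minimal sketch of dynamic time warping used for template matching. It is not the disclosed implementation: the one-dimensional feature sequences, the template dictionary, and the function names `dtw_distance` and `recognize` are assumptions made for the example.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances while b repeats an element
                cost[i][j - 1],      # b advances while a repeats an element
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(features: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(features, templates[name]))
```

Because the alignment may stretch or compress either sequence, the same word spoken at different speeds can still match its template closely.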
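The statistical model described above may be illustrated with a bigram model, that is, the case in which one preceding word is considered. This is a minimal sketch under assumptions: the training sentences, the start marker `<s>`, and the function names `train_bigram` and `bigram_probability` are hypothetical, and no smoothing is applied.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count how often each word follows each preceding word."""
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for words in sentences:
        padded = ["<s>"] + words  # "<s>" marks the start of a sentence
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
    return context_counts, pair_counts


def bigram_probability(prev: str, word: str,
                       context_counts: Counter, pair_counts: Counter) -> float:
    """Estimate P(word | prev) as a relative frequency; 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]
```

A recognizer could use such probabilities to prefer, among acoustically similar candidates, the word that is more likely to follow the words already recognized.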
The text conversion unit 1040 converts recognized voice into text to take minutes. The text conversion unit 1040 may be a speech-to-text (STT) module. In addition, text may be output together with labeling for each speaker recognized by the speaker recognition unit 1020 or may be output together with time information, to be suitable for minutes.
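The output of text together with speaker labels and time information may be sketched as follows. The record layout `Utterance`, the timestamp format, and the function names are assumptions made for illustration; the text itself is assumed to come from a separate speech-to-text step that is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start_s: float  # start time in seconds from the beginning of the recording
    speaker: str    # label from the speaker recognition step, e.g. "SPK1"
    text: str       # text produced by the speech-to-text step


def format_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS, discarding fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_minutes(utterances: List[Utterance]) -> str:
    """Render one line of minutes per utterance, with time and speaker label."""
    return "\n".join(
        f"[{format_timestamp(u.start_s)}] {u.speaker}: {u.text}" for u in utterances
    )
```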
Referring to
In operation 1204, when the speaker is not changed, in operation 1212, it is determined whether the speech has ended. When the speech has ended, the method proceeds to operation 1206 to perform speaker recognition, voice recognition, and minutes taking.
Because directivity information may be obtained through the acoustic sensor in the minutes taking method according to the example embodiment, the positions of the persons who are speaking may be known, and speaker diarization and speaker classification may be performed based on those positions. For example, the problem of the related art may be addressed by asking “Is the speaker changed?” Speakers may be distinguished from each other while recording is conducted in real time. Thus, the security risk of recording everything and performing post-processing on a server, as in the related art, may be avoided. In addition, there is no need to run separate algorithms such as speaker diarization and speaker verification, which is an advantage in terms of computation and complexity.
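The two questions that drive the method, whether the speaker has changed and whether the speech has ended, may be sketched as a segmentation loop. This is an illustrative sketch only: the frame representation (a detected direction, or `None` for silence, paired with an audio chunk identifier), the 5-degree tolerance, and the function names are assumptions, not the disclosed operations.

```python
from typing import Iterable, List, Optional, Tuple

TOLERANCE_DEG = 5.0  # assumed tolerance for treating two directions as the same


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def segment_by_direction(
    frames: Iterable[Tuple[Optional[float], str]]
) -> List[Tuple[float, List[str]]]:
    """Group consecutive frames into (direction, chunks) segments.

    A segment is closed when the direction changes (speaker changed) or
    when a frame carries no direction (speech ended).
    """
    segments: List[Tuple[float, List[str]]] = []
    current_direction: Optional[float] = None
    current_chunks: List[str] = []
    for direction, chunk in frames:
        speech_ended = direction is None
        speaker_changed = (
            not speech_ended
            and current_direction is not None
            and angular_distance(direction, current_direction) > TOLERANCE_DEG
        )
        if (speech_ended or speaker_changed) and current_chunks:
            segments.append((current_direction, current_chunks))
            current_chunks = []
        if speech_ended:
            current_direction = None
        else:
            if not current_chunks:
                current_direction = direction  # first frame of a new segment
            current_chunks.append(chunk)
    if current_chunks:
        segments.append((current_direction, current_chunks))
    return segments
```

Each closed segment could then be handed to speaker recognition, voice recognition, and minutes taking as one unit, without a separate diarization pass over the whole recording.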
Referring to
When a direction of the third output signal is changed (1610), a second speaker (SPK2) is registered. Here, since a similarity between the third output signal and the first or second output signal of the first speaker is 68%, it can be confirmed that the speaker is changed. When the fourth output signal is input, its similarity is 93% with respect to the second speaker and 67% with respect to the first speaker.
When a direction of the fifth output signal is changed (1620), the direction of the fifth output signal is the same as that of the first output signal. Moreover, the fifth output signal has a similarity of 93% with respect to the first speaker and a similarity of 61% with respect to the second speaker.
When a direction of the sixth output signal is changed (1630), and the direction of the sixth output signal is a new direction different from the directions of the first speaker and the second speaker, a third speaker (SPK3) is registered. A similarity between the sixth output signal and the first speaker is 73%, and a similarity with the second speaker is 62%. A direction of the seventh output signal is not changed, and its similarity is 89% with the third speaker, 57% with the second speaker, and 62% with the first speaker. Therefore, it may be determined that the seventh output signal is that of a voice of the third speaker.
When a direction of the eighth output signal is changed (1640), the eighth output signal is in the same direction as the first speaker, and a similarity thereof with the first speaker is 91%, a similarity thereof with the third speaker is 71%, and a similarity thereof with the second speaker is 60%.
In an example embodiment, when the voice of a series of meetings is recorded, not only may speakers be classified, but the similarity between the voices of the speakers may also be determined, thereby increasing the accuracy of speaker classification.
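A similarity check of the kind described above may be sketched as follows. The description does not state how similarity is computed, so the use of cosine similarity between voice feature vectors is an assumption, as are the function names and the threshold of 0.80, which is chosen only because it lies between the same-speaker values (89% to 93%) and the different-speaker values (57% to 73%) of the example.

```python
import math
from typing import Dict, Optional, Sequence

SIMILARITY_THRESHOLD = 0.80  # assumed; lies between the example's two value ranges


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(features: Sequence[float],
               profiles: Dict[str, Sequence[float]]) -> Optional[str]:
    """Return the registered speaker most similar to the features.

    Returns None when no registered speaker reaches the threshold, which
    suggests that a new speaker should be registered.
    """
    best_label: Optional[str] = None
    best_score = SIMILARITY_THRESHOLD
    for label, profile in profiles.items():
        score = cosine_similarity(features, profile)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

In such a scheme, the detected direction would propose a speaker label and the similarity check would confirm it, for example when a speaker has moved or two speakers sit close together.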
Referring to
Referring to
The speaker classifying apparatus or the minutes taking apparatus described above may be used in various electronic devices. The electronic devices may include, for example, a smartphone, a portable phone, a mobile phone, a personal digital assistant (PDA), a laptop, a PC, various portable devices, home appliances, security cameras, medical cameras, automobiles, and Internet of Things (IoT) devices, or other mobile or non-mobile computing devices, and are not limited thereto.
The electronic devices may further include an AP, and may control a plurality of hardware or software components by driving an operating system or an application program through the processor, and may perform various data processing and computation. The processor may further include a GPU and/or an image signal processor.
Referring to
By executing software (e.g., a program ED40), the processor ED20 may control one or a plurality of other components (hardware, software components, etc.) of the electronic device ED01 connected to the processor ED20 and may perform various data processing or computation. As part of data processing or computation, the processor ED20 may load commands and/or data received from other components (a sensor module ED76, a communication module ED90, etc.), into a volatile memory ED32, and process the commands and/or data stored in the volatile memory ED32, and store resultant data in a nonvolatile memory ED34. The processor ED20 may include a main processor ED21 (a CPU, an AP, etc.) and an auxiliary processor ED23 (a graphics processing unit, an image signal processor, a sensor hub processor, communication processor, etc.) that may be operated independently of or together with the main processor ED21. The auxiliary processor ED23 may use less power than the main processor ED21 and may perform a specialized function.
The auxiliary processor ED23 may be configured to control functions and/or states related to some of the components of the electronic device ED01 (the display device ED60, the sensor module ED76, the communication module ED90, etc.) by replacing the main processor ED21 while the main processor ED21 is in an inactive state (sleep state), or together with the main processor ED21 when the main processor ED21 is in an active state (application execution state). The auxiliary processor ED23 (an image signal processor, a communication processor, etc.) may be implemented as a portion of other functionally related components (the camera module ED80, the communication module ED90, etc.).
The memory ED30 may store various data required by the components of the electronic device ED01 (the processor ED20, the sensor module ED76, etc.). The data may include, for example, input data and/or output data for software (e.g., the program ED40) and instructions related thereto. The memory ED30 may include a volatile memory ED32 and/or a nonvolatile memory ED34. The nonvolatile memory ED34 may include an internal memory ED36 fixedly mounted in the electronic device ED01 and a removable external memory ED38.
The program ED40 may be stored as software in the memory ED30, and may include an operating system ED42, middleware ED44, and/or an application ED46.
The input device ED50 may receive a command and/or data to be used in a component of the electronic device ED01 (e.g., the processor ED20) from the outside of the electronic device ED01 (a user, etc.). The input device ED50 may include a microphone, a mouse, a keyboard, and/or a digital pen (e.g., a stylus pen).
The sound output device ED55 may output a sound signal to the outside of the electronic device ED01. The sound output device ED55 may include a speaker and/or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback, and the receiver may be used to receive incoming calls. The receiver may be integrated as a portion of the speaker or may be implemented as an independent separate device.
The display device ED60 may visually provide information to the outside of the electronic device ED01. The display device ED60 may include a display, a hologram device, or a projector and a control circuit for controlling these devices. The display device ED60 may include touch circuitry configured to sense a touch, and/or sensor circuitry configured to measure intensity of a force generated by the touch (e.g., a pressure sensor).
The audio module ED70 may convert sound into an electrical signal, or conversely, convert an electrical signal into a sound. The audio module ED70 may obtain a sound through the input device ED50 or output sound through a speaker and/or a headphone of other electronic devices (the electronic device ED02, etc.) directly or wirelessly connected to the sound output device ED55 and/or the electronic device ED01. The audio module ED70 may include a speaker classifying apparatus or a minutes taking apparatus according to an embodiment.
The sensor module ED76 may detect an operating state of the electronic device ED01 (power, temperature, etc.), or an external environmental state (user status, etc.), and generate an electrical signal and/or data corresponding to the sensed state value. The sensor module ED76 may include a gesture sensor, a gyro sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a grip sensor, a proximity sensor, a color sensor, an infrared (IR) sensor, a biometric sensor, a temperature sensor, a humidity sensor, and/or an illuminance sensor.
The interface ED77 may support one or a plurality of designated protocols that may be used to directly or wirelessly connect the electronic device ED01 to another electronic device (e.g., the electronic device ED02). The interface ED77 may include a High Definition Multimedia Interface (HDMI), a Universal Serial Bus (USB) interface, a Secure Digital (SD) card interface, and/or an audio interface.
A connection terminal ED78 may include a connector through which the electronic device ED01 may be physically connected to another electronic device (e.g., the electronic device ED02). The connection terminal ED78 may include an HDMI connector, a USB connector, an SD card connector, and/or an audio connector (e.g., a headphone connector).
The haptic module ED79 may convert an electrical signal into a mechanical stimulus (vibration, movement, etc.) or an electrical stimulus that the user may perceive through tactile or kinesthetic sense. The haptic module ED79 may include a motor, a piezoelectric element, and/or an electrical stimulation device.
The camera module ED80 may capture a still image or record a moving picture. The camera module ED80 may additionally include a lens assembly, image signal processors, and/or flash units. A lens assembly included in the camera module ED80 may collect light emitted from a subject, which is an object of image capturing.
The power management module ED88 may manage power supplied to the electronic device ED01. The power management module ED88 may be implemented as a portion of a power management integrated circuit (PMIC).
The battery ED89 may supply power to components of the electronic device ED01. The battery ED89 may include a non-rechargeable primary cell, a rechargeable secondary cell, and/or a fuel cell.
The communication module ED90 may support establishment of a direct (wired) communication channel and/or a wireless communication channel between the electronic device ED01 and other electronic devices (the electronic device ED02, the electronic device ED04, the server ED08, etc.) and communication through the established communication channel. The communication module ED90 may include one or a plurality of communication processors that operate independently of the processor ED20 (e.g., an AP) and support direct communication and/or wireless communication. The communication module ED90 may include a wireless communication module ED92 (a cellular communication module, a short-range wireless communication module, a global navigation satellite system (GNSS) communication module, etc.) and/or a wired communication module ED94 (a local area network (LAN) communication module, a power line communication module, etc.). Among these communication modules, a corresponding communication module may communicate with other electronic devices through a first network ED98 (a short-range communication network such as Bluetooth, WiFi Direct, or Infrared Data Association (IrDA)) or a second network ED99 (a telecommunication network such as a cellular network, the Internet, or a computer network (LAN, WAN, etc.)). These various types of communication modules may be integrated into a single component (a single chip, etc.) or implemented as a plurality of components (multiple chips) that are separate from each other. The wireless communication module ED92 may confirm and authenticate the electronic device ED01 in a communication network, such as the first network ED98 and/or the second network ED99, by using subscriber information (e.g., International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module ED96.
The antenna module ED97 may transmit or receive signals and/or power to or from the outside (e.g., other electronic devices). An antenna may include a radiator including a conductive pattern formed on a substrate (e.g., a printed circuit board (PCB)). The antenna module ED97 may include one or a plurality of antennas. When a plurality of antennas are included, an antenna suitable for a communication method used in a communication network, such as the first network ED98 and/or the second network ED99 may be selected by the communication module ED90 from among the plurality of antennas. A signal and/or power may be transmitted or received between the communication module ED90 and another electronic device through the selected antenna. In addition to the antenna, other components (e.g., a radio frequency integrated circuit (RFIC)) may be included as a portion of the antenna module ED97.
Some of the components may be connected to each other through a communication method between peripheral devices (e.g., a bus, general purpose input and output (GPIO), serial peripheral interface (SPI), mobile industry processor interface (MIPI)) and exchange signals (e.g., command, data, etc.).
A command or data may be transmitted or received between the electronic device ED01 and the external electronic device ED04 through the server ED08 connected to the second network ED99. The other electronic devices ED02 and ED04 may be of the same type as or a different type from that of the electronic device ED01. All or some of operations performed by the electronic device ED01 may be executed in one or a plurality of devices among the other electronic devices ED02, ED04, and ED08. For example, when the electronic device ED01 is to perform a function or service, instead of executing the function or service by itself, a request for performing a portion or all of the function or service may be made to one or a plurality of other electronic devices. One or a plurality of other electronic devices receiving the request may execute an additional function or service related to the request, and transmit a result of the execution to the electronic device ED01. To this end, cloud computing, distributed computing, and/or client-server computing technology may be used.
As various electronic devices include the speaker classifying apparatus or the minutes taking apparatus according to an example embodiment, sound may be obtained by using a certain directional pattern with respect to a certain direction, a direction of transmitted sound may be detected, or sound around the electronic device may be obtained with spatial awareness. For example, when a first user and a second user have a conversation by using an electronic device as a medium, the electronic device may detect a direction in which each user is located, or sense only the voice of the first user by using a directional pattern oriented toward the first user, or sense only the voice of the second user by using a directional pattern oriented toward the second user, or simultaneously sense the voices of both users by distinguishing directions from which each user's voice is heard.
A speaker classifying apparatus or a minutes taking apparatus mounted on an electronic device has uniform sensitivity to various frequencies of sensed sound, and it is easy to manufacture the speaker classifying apparatus or the minutes taking apparatus having a compact size as there is no restriction on distances between respective acoustic sensors. Also, the degree of freedom of operation of the apparatuses is relatively high because various directional patterns may be selected and combined according to a location of a direction estimating apparatus or the conditions of the surroundings. In addition, only simple operations such as a sum or a difference are used to control the direction estimating apparatus, and thus computational resources may be used efficiently.
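The remark that only a sum or a difference is needed may be illustrated with idealized directional responses. The following sketch is an assumption made for illustration and is not taken from the disclosure: it models a non-directional sensor with a constant response and a directional sensor with a figure-eight (cosine) response, and the function names are hypothetical.

```python
import math


def omni_response(angle_deg: float) -> float:
    """Idealized non-directional response: equal sensitivity at every angle."""
    return 1.0


def figure_eight_response(angle_deg: float) -> float:
    """Idealized directional response: cosine of the arrival angle."""
    return math.cos(math.radians(angle_deg))


def front_pattern(angle_deg: float) -> float:
    """Sum of the two responses: a cardioid oriented toward 0 degrees."""
    return 0.5 * (omni_response(angle_deg) + figure_eight_response(angle_deg))


def rear_pattern(angle_deg: float) -> float:
    """Difference of the two responses: a cardioid oriented toward 180 degrees."""
    return 0.5 * (omni_response(angle_deg) - figure_eight_response(angle_deg))
```

Under these assumptions, the sum is most sensitive at 0 degrees and rejects sound from 180 degrees, while the difference does the opposite, so two users on opposite sides of a device could be picked up separately with one addition and one subtraction.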
The speaker classifying apparatus or the minutes taking apparatus according to the example embodiments may be a microphone module 1800 provided in a mobile phone or smartphone illustrated in
In addition, the speaker classifying apparatus or the minutes taking apparatus may be a microphone module 2000 provided in a robot illustrated in
Although the speaker classifying apparatus or minutes taking apparatus described above and an electronic device including the same have been described with reference to the example embodiment illustrated in the drawings, this is merely an example, and it will be understood by those of ordinary skill in the art that various modifications and equivalent other embodiments may be made. Therefore, the disclosed example embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present disclosure is defined not by the detailed description of the present disclosure but by the appended claims, and all differences within the scope will be construed as being included in the present disclosure.
The example embodiments described above can be written as computer programs and can be implemented in general-use digital computers that execute the programs using a computer-readable recording medium. Also, data structures used in the example embodiments described above may be written to the computer-readable recording medium using various means. Examples of the computer-readable recording medium include magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.), optical recording media (e.g., CD-ROMs, or DVDs), and storage media such as carrier waves (e.g., transmission through the Internet).
It should be understood that example embodiments described herein should be considered in a descriptive sense only and not for purposes of limitation. Descriptions of features or aspects within each example embodiment should typically be considered as available for other similar features or aspects in other embodiments. While example embodiments have been described with reference to the figures, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope as defined by the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2021-0183129 | Dec 2021 | KR | national |