The present invention relates to an optimization method, and in particular to an optimization method adopted in a speech/user recognition system.
In this era in which the network prevails (especially given the prosperity of the Internet), massive numbers of trade processes and entertainment activities are already delivered to people via the network to provide daily services. However, most World Wide Web users are limited to manipulating input/output devices based on non-voice-commanded equipment, such as the mouse, the keyboard, the touch panel, the trackball, the printer, and the monitor. Because such equipment is not in compliance with the human nature of communicating by voice/speech, an approach possessing the advantage of fine convenience, the development of communication between the Internet and humans encounters quite a few bottlenecks in practice. Therefore, scientists and engineers have started to develop the speech/user recognition system as the interface adopted in communications between humans and computer machines, which enables the interactive behavior occurring on the Internet to better gratify the need for humanization.
In recent years, the rapid development of the speech/user recognition system and of telecommunications has rendered the application of the relevant techniques more widespread, rather than narrowly limited to a single personal computer. In various types of speech/user recognition systems, the user is allowed to input speech via different devices at different locations. The inputted speech is transferred to the central processing system, and after the central processing system performs the recognition, the corresponding response is returned to the user in an adequate manner (e.g., as text, as a picture, or as voice).
Regarding the speech/user recognition technique, the processing of speech feature extraction is considerably critical. Accurate recognition results are obtained from the comparison between the characteristics analyzed from the processed feature signal and those set up by the predetermined module.
Please refer to
The speech feature extraction processing of the conventional speech/user recognition system depends heavily on the capability of the central processing unit connected to the recognition engine, and the required transfer time depends on the network bandwidth. Because the speech/user recognition system was not popular in the past, overloads of the central processing unit and the network did not happen frequently. However, with the wide-spreading applications of the system and the massively increased number of users, the loads on the central processing unit and the network have become more and more demanding, so that numerous users in the queue spend excessive time waiting for the return of the recognition result. Hence, the requirement of a real-time response to the user cannot be satisfied.
Presently, there are two methods for solving the aforementioned problems. The first method is that the calculation is shared between the server end and the client end (e.g., a PDA, a set-top box, etc.). Basically, in the first method, the amount of calculation respectively loaded is predetermined according to the processing capabilities of the server end and the client end. However, this method lacks a function for dynamically adjusting the load, and thus the client cannot take over more of the calculation to cut down the waiting time when the load suddenly increases. Once the number of input devices increases, the waiting time at each client end rises correspondingly. Thus it is impossible to efficiently solve the problem of excessive waiting time arising from massive inputs.
The second method is to readjust the efficiency of the feature computation at each stage when overloading occurs; that is, the accuracy of the feature is forsaken to acquire a faster calculating time. Though the second method dynamically adjusts the load and cuts down the waiting time, the correctness of recognizing the speech/user is degraded.
For overcoming the mentioned drawbacks of the prior art, a novel method for optimizing the load of the speech/user recognition system is provided.
According to an aspect of the present application, a method for optimizing a load of a speech/user recognition system is provided. The speech/user recognition system includes a server end, a client end and a network, and the method is achieved by performing N stages of computations on a speech feature of a speech, where N is a positive integer and an i selected from 1 to N represents the ith stage speech feature. The method includes the steps of: (a) providing a real-time factor Ta(i) for each stage i of the speech feature at the client end, where Ta(i) is the average computation time for computing the ith stage speech feature at the client end with respect to one second of input speech; (b) providing a real-time factor Tb(i) for each stage i of the speech feature at the server end, where Tb(i) is the average computation time for computing the ith stage speech feature at the server end with respect to one second of input speech; (c) providing a load c of the server end and a load d of the network; (d) deciding an n in the range from 1 to N for minimizing a recognition time Toutput of the speech; (e) inputting the speech to be recognized within a time Tinput; (f) performing a computation from the first stage speech feature to the nth stage speech feature of the speech at the client end, while performing a computation from the (n+1)th stage speech feature to the Nth stage speech feature of the speech at the server end; and (g) repeating the steps (e) to (f).
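The split-point selection in steps (c) to (f) above can be sketched in Python. This is a minimal illustrative sketch, not the claimed implementation: the function names, the list representations of Ta(i) and Tb(i), and the per-stage feature rate Dn are assumptions made for illustration, where Dn[i-1] is treated as the data size of the stage-i feature per second of input, so the stage-n feature of a Tinput-second utterance has size Dn[n-1]·Tinput.

```python
def estimated_output_time(n, Ta, Tb, Dn, K, c, d, t_input):
    """Estimate the output time Toutput when stages 1..n run at the
    client and stages n+1..N run at the server (cf. the text):
    client time + server time + feature transfer time + result return time."""
    client_time = t_input * sum(Ta[:n])        # stages 1..n at the client
    server_time = c * t_input * sum(Tb[n:])    # stages n+1..N at the server, scaled by load c
    transfer_time = Dn[n - 1] * t_input / d    # stage-n feature over the network with load d
    return_time = K / d                        # returning the recognition result
    return client_time + server_time + transfer_time + return_time

def best_split(Ta, Tb, Dn, K, c, d, t_input):
    """Step (d): pick the n in 1..N that minimizes the estimated Toutput."""
    N = len(Ta)
    return min(range(1, N + 1),
               key=lambda n: estimated_output_time(n, Ta, Tb, Dn, K, c, d, t_input))
```

Note that a heavier server load c pushes the minimum toward a larger n, i.e., more stages are computed at the client, which is the dynamic load-sharing behavior the method aims for.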
Preferably, the step (c) further includes the steps of: (c1) inputting a first speech to be recognized within a first input time Tinput1, wherein an accomplishment of the first speech recognition takes a first output time Toutput1; and (c2) inputting a second speech to be recognized within a second input time Tinput2, wherein an accomplishment of the second speech recognition takes a second output time Toutput2.
Preferably, the data size of the first speech feature of stage n is Dn(Tinput1).
Preferably, a time for the first speech feature of stage n being transferred via the network is Dn(Tinput1)/d.
Preferably, the data size of second speech feature of stage n is Dn(Tinput2).
Preferably, a time for the second speech feature of stage n being transferred via the network is Dn(Tinput2)/d.
Preferably, the data size of the speech feature of stage n is Dn(Tinput).
Preferably, a time for the speech feature of stage n being transferred via the network is Dn(Tinput)/d.
Preferably, a transmitting time for a recognition result via the network is K/d.
Preferably, the step (c1) further includes the steps of: (c11) providing an n1 in the range from 1 to N; and (c12) performing a computation from the first stage speech feature to the n1th stage speech feature of the first speech at the client end, while performing a computation from the (n1+1)th stage speech feature to the Nth stage speech feature of the first speech at the server end.
Preferably, a computation time for the computation from the first stage speech feature to the n1th stage speech feature of the first speech at the client end is Tinput1·Σ_{i=1}^{n1} Ta(i).
Preferably, a computation time for the computation from the (n1+1)th stage speech feature to the Nth stage speech feature of the first speech at the server end is c·Tinput1·Σ_{i=n1+1}^{N} Tb(i).
Preferably, a computation time for computing the total N stages of the speech feature of the first speech is Tinput1·Σ_{i=1}^{n1} Ta(i) + c·Tinput1·Σ_{i=n1+1}^{N} Tb(i).
Preferably, the first output time is a summation of the computation time for computing the total N stages of the speech feature of the first speech, the time for transferring the first speech feature via the network, and the time for returning a recognition result via the network, and equals Tinput1·Σ_{i=1}^{n1} Ta(i) + c·Tinput1·Σ_{i=n1+1}^{N} Tb(i) + Dn1(Tinput1)/d + K/d.
Preferably, the step (c2) further includes the steps of: (c21) providing an n2 in the range from 1 to N; and (c22) performing a computation from the first stage speech feature to the n2th stage speech feature of the second speech at the client end, while performing a computation from the (n2+1)th stage speech feature to the Nth stage speech feature of the second speech at the server end.
Preferably, a computation time for the computation from the first stage speech feature to the n2th stage speech feature of the second speech at the client end is Tinput2·Σ_{i=1}^{n2} Ta(i).
Preferably, a computation time for the computation from the (n2+1)th stage speech feature to the Nth stage speech feature of the second speech at the server end is c·Tinput2·Σ_{i=n2+1}^{N} Tb(i).
Preferably, a computation time for computing the total N stages of the speech feature of the second speech is Tinput2·Σ_{i=1}^{n2} Ta(i) + c·Tinput2·Σ_{i=n2+1}^{N} Tb(i).
Preferably, the second output time is a summation of the computation time for computing the total N stages of the speech feature of the second speech, the time for transferring the second speech feature via the network, and the time for returning a recognition result via the network, and equals Tinput2·Σ_{i=1}^{n2} Ta(i) + c·Tinput2·Σ_{i=n2+1}^{N} Tb(i) + Dn2(Tinput2)/d + K/d.
Preferably, the computation time for recognizing the speech is the summation of the computation time for computing the total N stages of speech features of the speech, the time for transferring the speech feature via the network, and the time for returning a recognition result via the network, and equals Tinput·Σ_{i=1}^{n} Ta(i) + c·Tinput·Σ_{i=n+1}^{N} Tb(i) + Dn(Tinput)/d + K/d.
According to another aspect of the present application, a method for optimizing a recording frame-synchronized speech feature computation in a system comprising a server end, a client end and a network is provided. The method is achieved by performing N stages of computations on a speech feature of a speech having N′ frames, where N and N′ are positive integers, an i selected from the range from 1 to N represents the ith stage speech feature, and an n′ selected from the range from 1 to N′ represents the n′th frame. The method comprises the steps of: (a) providing a specific n in the range from 1 to N; (b) inputting said speech for an input time (Tinput), wherein a computation from the first stage speech feature to the nth stage speech feature of each frame of the speech is performed at the client end, and a computation from the (n+1)th stage speech feature to the Nth stage speech feature of each frame of the speech is performed at the server end; (c) after the step (b) is carried out, when the computation of n′ frames is achieved and the speech feature computation of the (n′+1)th frame is achieved up to the nth stage, modifying the n in a specific manner according to an n1 to minimize a computation time for recognizing the speech; and (d) performing a computation from the first stage speech feature to the nth stage speech feature of the respective remaining frames at the client end according to the n modified in the step (c), while performing a computation from the (n+1)th stage speech feature to the Nth stage speech feature of the respective remaining frames at the server end.
Preferably, the method is used in a recording frame-synchronized speech feature computation system.
Preferably, in the step (b) the recording frame-synchronized speech feature computation system performs the speech feature extraction synchronously with the speech recording.
Preferably, in the step (c) the computation of the n′ frames is achieved by the recording frame-synchronized speech feature computation system.
Preferably, n in the step (a) is obtained according to the method as recited in claim 1.
Preferably, a factor Ta(i) is the average computation time for computing the ith stage speech feature at the client end with respect to the input speech.
Preferably, a factor Tb(i) is the average computation time for computing the ith stage speech feature at the server end with respect to the input speech.
Preferably, a computation time for the computation from the first stage speech feature to the nth stage speech feature of the speech at the client end is Tinput·Σ_{i=1}^{n} Ta(i).
Preferably, a computation time for the computation from the (n+1)th stage speech feature to the Nth stage speech feature of said speech at the server end is c·Tinput·Σ_{i=n+1}^{N} Tb(i).
Preferably, a computation time for computing the total N stages of the speech feature of the speech is Tinput·Σ_{i=1}^{n} Ta(i) + c·Tinput·Σ_{i=n+1}^{N} Tb(i).
Preferably, the data size of speech feature of stage n is Dn(Tinput).
Preferably, a time for the speech feature of stage n being transferred via the network is Dn(Tinput)/d.
Preferably, a transmitting time for a recognition result being returned by the network is K/d.
Preferably, the specific manner in the step (c) is as follows: (c1) if the n1 is smaller than the n, a first equation is used for obtaining the modified n; and (c2) if the n1 is greater than the n, a second equation is used for obtaining the modified n, wherein c is a load of the server end and d is a load of the network.
Preferably, the c and the d are obtained according to the method as recited in claim 1.
According to another aspect of the present application, a method for optimizing a load of a speech/user recognition system including a server end, a client end and a network is provided, wherein a recognition is achieved by performing plural stages of computations on a speech feature of a speech having an inputting time. The method includes the steps of: (a) providing a real-time factor Ta(i) for computing a respective stage i speech feature at the client end; (b) providing a real-time factor for a respective stage i speech feature at the server end; (c) providing a load of the server end and a load of the network; (d) obtaining a specific amount according to the load of the server end and the load of the network to minimize a computation time for recognizing the speech; and (e) determining the computations at the client end and the server end according to the specific amount and performing the plural stages of computations on the speech features of the speech.
Preferably, the step (c) further includes the steps of: (c1) inputting a first speech to be recognized during a first input time, where an accomplishment of a recognition of the first speech takes a first output time; (c2) inputting a second speech to be recognized during a second input time, where an accomplishment of a recognition of the second speech takes a second output time; and (c3) estimating the load of the server end and the load of the network according to the first and second output times of the steps (c1) and (c2).
Preferably, the computation time for computing all stages of the speech feature at the client end is directly proportional to the inputting time.
Preferably, the computation time for computing all stages of the speech feature at the server end is directly proportional to the inputting time.
Preferably, the speech includes a data size.
Preferably, a time for transferring the speech feature via the network is a ratio of the data size to the load of the network.
Preferably, a time for computing all the speech features is a summation of the respective times for computing the speech feature at the client end and at the server end.
Preferably, an output time of the speech is a summation of the computation time for computing all the speech features, the time for transmitting the speech feature via the network, and the time for transmitting a recognition result via the network.
According to another aspect of the present application, a method for optimizing a recording frame-synchronized speech feature computation in a system comprising a server end, a client end and a network is provided, wherein a recognition of a speech is achieved by performing plural stages of computations on a speech feature of the speech having plural frames. The method includes the steps of: (a) providing a specific amount; (b) inputting the speech for an input time; (c) after the step (b) is carried out, when a part of the plural frames has not been computed and only part of the plural stages of computations for the speech feature of a first frame among the frames not yet computed has been performed, modifying the specific amount in a specific manner to minimize a computation time for recognizing the speech; and (d) distributing the respective loads of the server end and the client end according to the specific amount modified in the step (c) and then performing computations for the frames not yet computed to achieve the recognition.
Preferably, the method is used in a recording frame-synchronized speech feature computation system.
Preferably, the recording frame-synchronized speech feature computation system synchronously performs the speech feature computations, wherein the system distributes the respective computations at the client end and the server end according to the specific amount.
Preferably, the specific amount in the step (a) is obtained according to the method as recited in claim 1.
Preferably, a computation time for computing one of the plural stages of computations at the client end is directly proportional to the input time.
Preferably, a computation time for computing one of the plural stages of computations at the server end is directly proportional to the input time.
Preferably, the speech includes a data size.
Preferably, a time for transmitting the speech feature via the network is a ratio of the data size to the load of the network.
Preferably, a time for all plural stages of computations is the summation of a time for computing the speech feature at the client end and a time for computing the speech feature at the server end.
Preferably, an output time of the speech recognition is the summation of a time for computing the speech feature, a time for transmitting the speech features via the network, and a time for transmitting a recognition result via the network.
The present invention will now be described more specifically with reference to the following embodiments. It is to be noted that the following descriptions of preferred embodiments of this invention are presented herein for the purpose of illustration and description only; they are not intended to be exhaustive or to limit the invention to the precise form disclosed.
Please refer to
In practice, the current loads of the server and the network in the step B are obtained via the following procedure. In the beginning, a first speech is inputted for recognition, and an input time Tinput1 of the first speech and an output time Toutput1 for accomplishing the recognition of the first speech and returning the recognition result are measured. Next, a second speech is inputted for recognition, and an input time Tinput2 of the second speech and an output time Toutput2 for accomplishing the recognition of the second speech and returning the recognition result are measured. Then the measured input times (Tinput1, Tinput2) and output times (Toutput1, Toutput2) are substituted into the following Equation (1) to form joint equations, from which the present load c of the server and the load d of the network are respectively acquired:

Toutput = Tinput·Σ_{i=1}^{n} Ta(i) + c·Tinput·Σ_{i=n+1}^{N} Tb(i) + Dn(Tinput)/d + K/d,   (1)
wherein N represents the total number of stages of the speech feature computation, c represents the present load of the server, and d represents the present load of the network; Tinput·Σ_{i=1}^{n} Ta(i) represents the computation time for computing the speech feature from the first stage to the nth stage at the client end; c·Tinput·Σ_{i=n+1}^{N} Tb(i) represents the computation time for computing the speech feature from the (n+1)th stage to the Nth stage at the server end; Dn(Tinput) represents the data size of the speech feature of stage n; Dn(Tinput)/d represents the transmitting time for transmitting the speech feature via the network having a load d; K represents the size of the returned result; K/d represents the returning time for returning the speech recognition result via the network having a load d, which is regarded as a constant because the variation of the size of the recognition result is usually slight; and Toutput represents the output time for accomplishing a recognition, which is a summation of the computation time for computing the speech feature at the client end, the computation time for computing the speech feature at the server end, the transmitting time for transmitting the speech feature via the network, and the returning time for returning the speech recognition result via the network. Besides, in the step C, the value n for minimizing the output time is obtained according to the following Equation (2), i.e., n is chosen as the value among 1 to N that minimizes Toutput:

n = arg min_{1≤n≤N} [ Tinput·Σ_{i=1}^{n} Ta(i) + c·Tinput·Σ_{i=n+1}^{N} Tb(i) + Dn(Tinput)/d + K/d ].   (2)
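The joint equations formed from Equation (1) can be solved for c and d as follows. This is a hedged sketch: the measurement-tuple layout and the helper name are hypothetical, but the algebra follows the equation as stated, since each probe utterance yields one equation that is linear in the unknowns c and 1/d.

```python
def estimate_loads(m1, m2):
    """Estimate the server load c and network load d from two probe
    recognitions. Each measurement m is a tuple
        (t_input, t_output, client_sum, server_sum, bytes_transferred)
    where client_sum is the sum of Ta(i) over the stages run at the client,
    server_sum is the sum of Tb(i) over the stages run at the server, and
    bytes_transferred is D_n(t_input) + K (feature plus returned result).
    Equation (1) then reads:
        t_output = t_input*client_sum + c*(t_input*server_sum)
                   + (1/d)*bytes_transferred."""
    rows = []
    for t_in, t_out, client_sum, server_sum, data in (m1, m2):
        # Move the known client-side time to the right-hand side:
        # rhs = c*(t_in*server_sum) + (1/d)*data
        rows.append((t_in * server_sum, data, t_out - t_in * client_sum))
    (a1, b1, r1), (a2, b2, r2) = rows
    det = a1 * b2 - a2 * b1            # Cramer's rule on the 2x2 linear system
    c = (r1 * b2 - r2 * b1) / det
    inv_d = (a1 * r2 - a2 * r1) / det
    return c, 1.0 / inv_d
```

Because the system is linear in c and 1/d, the two probes only need to differ in split point or utterance length enough to make the two equations independent (nonzero determinant).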
The present application re-estimates the loads of the server end and the network at fixed intervals, depending on the practical situation, so as to estimate a new value n and thereby optimize the next entire recognition time. Furthermore, if the variation of the load of the server end is slight, the load of the server end is obtained from the previous response. The server end then broadcasts the load estimated for the next interval at each fixed time, the load of the network is estimated at each practical estimating time, and the value n needed for the next time is estimated accordingly. Besides, before enough relevant information is collected, a value n is estimated based on experience, until enough relevant information is collected for estimating the loads of the network and the server end.
Please refer to
wherein N represents the total number of stages of the speech feature computation, c represents the present load of the server, and d represents the present load of the network; the first term represents the time for the remaining computations of the speech feature, distributed respectively to the client end and to the server end according to the modified value n; the second term represents the time for the remaining computations of the speech feature of the (n′+1)th frame, distributed respectively to the client end and to the server end according to the modified value n; Dn(Tinput) represents the data size of the speech feature of stage n; Dn(Tinput)/d represents the transmitting time for transmitting the speech feature via the network having a load d; K represents the size of the returned recognition result; and K/d represents the returning time for returning the recognition result via the network having a load d, which can be regarded as a constant because the variation of the size of the recognition result is usually slight. In the step C, if the value n1 is greater than or equal to the value n provided in the step B, the value n is modified according to the following Equation (4) for minimizing the entire recognition time (Toutput):
wherein N represents the total number of stages of the speech feature computation, c represents the present load of the server, and d represents the present load of the network; the first term represents the time for the remaining computations of the speech feature, distributed respectively to the client end and to the server end according to the modified value n; the second term represents the computing time for the remaining computations of the speech feature of the (n′+1)th frame, which in this case are completely accomplished at the server end; Dn(Tinput) represents the data size of the speech feature of stage n; Dn(Tinput)/d represents the transmitting time for transmitting the speech feature of stage n via the network having a load d; K represents the size of the returned result; and K/d represents the returning time for returning the recognition result via the network having a load d, which can be regarded as a constant because the variation of the size of the recognition result is usually slight.
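The frame-synchronized flow described above (compute stages 1 to n of each frame at the client, the remaining stages at the server, and revise n mid-utterance) could be organized as in the following sketch. The stage functions and the re_split callback are hypothetical placeholders; in the described method the revision happens once after n′ frames via Equations (3)/(4), which are not reproduced verbatim here, so the callback simply stands in for whatever rule supplies the modified n.

```python
def recognize_frames(frames, n, re_split, client_stage, server_stage, N):
    """Process each frame's N feature stages with split point n:
    stages 1..n at the client, stages n+1..N at the server. After each
    completed frame, re_split may supply a modified n so that the split
    tracks the changing server/network loads."""
    for frame in frames:
        x = frame
        for stage in range(1, n + 1):
            x = client_stage(stage, x)      # client-side feature stages
        for stage in range(n + 1, N + 1):
            x = server_stage(stage, x)      # server-side feature stages
        n = re_split(n)                     # dynamic adjustment between frames
    return n
```

As a design note, keeping the split decision per frame means the system never stalls recording: frames already captured keep flowing through whichever end the current n assigns them to.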
To comprehensively sum up the aforementioned, the present invention substantially provides a method for dynamically optimizing the load of the speech/user recognition system with novelty, inventiveness, and utility. The load of the client end is dynamically adjusted, via estimating the loads of the server end and the network, so as to share the work of the server end, which enables the waiting time at each client end and the entire recognition time to be the shortest.
While the invention has been described in terms of what are presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, which are to be accorded the broadest interpretation so as to encompass all such modifications and similar structures. Accordingly, the invention is not limited by the disclosure, but instead its scope is to be determined entirely by reference to the following claims.
Number | Date | Country | Kind |
---|---|---|---|
093139222 | Dec 2004 | TW | national |