The present invention relates to a reproducing method, an apparatus and a computer-readable recording medium for naturally reproducing image information and voice information received through a network having a heavy traffic load.
In recent years, network systems have spread in which an image is taken by a network camera and transmitted to a computer system through a network such as the Internet. However, such a network system can acquire only the image information under the control of the computer system, not the surrounding voice information. Thus, a network camera (hereinafter called the "voice mapping type network camera") has been developed which can perform not only image communication but also voice communication by mounting a speaker and a microphone.
On the other hand, the taken image is processed into an image of a desired angle and zoom by controlling the pan, tilt and zoom of the camera through the (not-shown) camera control unit. The browser (i.e., the program for displaying screen information) of the computer system 2 displays the portal screen, showing the image and the control bar on the monitor, when it receives the portal screen displaying information through the network 3. When the user controls the pan, tilt and zoom with the control bar, the JAVA (registered trademark) applet or the like transmits an IP packet containing data of a control quantity from the communication control unit 13 to the voice mapping type network camera 1. In this voice mapping type network camera 1, a control unit 9 extracts the data from that IP packet and transmits the control quantity to the camera control unit, so that the (not-shown) pan motor, the (not-shown) tilt motor and the (not-shown) linear actuator are driven to change the taking direction and the zoom of the camera 10.
Next, in connection with the voice communication, the voice inputted from a microphone 17 is subjected to AD conversion and compression by a voice transmission processor 15, and the resulting voice transmission data is sent through the communication control unit 13 and the network 3 to the computer system 2. The computer system 2 processes the received voice transmission data and outputs a voice from a speaker 28. Likewise, the voice inputted from a microphone 27 of the computer system 2 is processed by the computer system 2 and transmitted as voice reception data through the network 3 to the voice mapping type network camera 1. In the voice mapping type network camera 1, the received voice reception data is transferred through the communication control unit 13 to a voice reception processor 14, in which the data is decompressed, DA-converted and outputted to a speaker 18.
When the voice mapping type network camera 1 transmits the image and the voice to the computer system 2, a time stamp is generally attached to the individual image and voice data; that is, the transmission is made by adding synchronizing information in the form of time information (see JP-A-9-27871, for example). Both the voice and image data are given the synchronizing information under time control, and the data are reproduced on the reception side using that information, so that the voice and image are outputted synchronously. Here, the voice has a determined data length, but the output time of the image data is not determined. When the network has a heavy traffic load, therefore, the terminal device finds it difficult to transmit all the image data and the voice data, and thins the data. As a result, the image and the voice are partially cut, interrupting the voice. The interrupted voice is hard to listen to, seriously degrading the information transmission.
Besides the time stamp method, there also exists a method in which synchronization is achieved by adding a frame number to the image data and the voice data. However, the time stamp or the frame number has to be added individually to the image data and the voice data, so the configuration is complicated, which raises the cost. Moreover, when the network has a heavy traffic load, it is difficult for the terminal device to transmit all the image data and the voice data, and as a result the voice is interrupted.
There has also been proposed a multimedia multiplex transmission device which does not cut the voice but creates a multiplex signal efficiently when the voice signal is a voiceless sound (JP-A-2001-16263). This device is provided with a voice signal buffer unit and a voiceless sound detecting unit, and the voice signal buffer unit stores the voice-encoded signal temporarily. When the voice signal caught by an external microphone is detected to be voiceless, the write of the data is enabled if the input signal from the voiceless sound detecting unit is at a low level, but disabled if it is at a high level. Thus, the time area of the multiplex signal assigned to the voice signal is not uselessly assigned to the video-encoded signals. When a voiced sound becomes voiceless in this processing, a period longer than necessary is taken for the change from the low level to the high level; when a voiceless sound changes into a voiced sound, the level changes instantly from high to low. These characteristics prevent the voice from being broken at the head and end of a word.
When the voice mapping type network camera of JP-A-9-27871 transmits image and voice data, synchronization is achieved by adding synchronizing information such as time information to each piece of image and voice data, or by adding a frame number to each piece. Under a heavy traffic load of the network, however, those synchronizing methods find it difficult to transmit all the image data and the voice data. If a delay occurs, the data have to be thinned, so that the reproduced image and voice are partially cut and interrupted. Moreover, these techniques merely thin the data on the transmission side; they offer no solution for the reception side, which suffers the influences of the traffic fluctuations. If the traffic load is heavy, the packets of the voice data are delayed, so that the voice delay in the voice buffer of the computer system increases rather than decreases.
On the other hand, the multimedia multiplex transmission device of JP-A-2001-16263 includes a voice signal buffer unit and a voiced/voiceless detection unit. When the voice signal detected by the external microphone is voiceless, the device does not cut the voice but inhibits the data write, so that it can create the multiplex signal efficiently. In that case, the area of the multiplex signal assigned to the voiceless sound signal is assigned instead to the video encoding signal. Therefore, this technique does not solve the problem of the computer device on the reception side either, and the problem described above remains unsolved when the traffic load is heavy.
In view of the above problems of the related art, therefore, the invention has an object to provide an apparatus, a method for reproducing image information and voice information, and a computer-readable recording medium, which can utilize a buffer effectively even when there is much no-sound data or when packets are delayed.
In order to achieve the above object, according to the present invention, there is provided a terminal which temporarily stores voice information received through a network in a voice reception buffer, decodes the voice information outputted from the voice reception buffer, and outputs a voice after DA conversion. This terminal includes a buffer control unit for controlling the input/output of the voice information to and from the voice reception buffer, and a reception buffer level determining unit for deciding that the voice information in the voice reception buffer is no-data or no-sound when it stays at or below a predetermined peak value continuously for a predetermined time period, and that it is a voiced sound when it exceeds the peak value. The terminal is mainly characterized in that the voice information determined as no-data or no-sound is discarded by the buffer control unit, and in that the remaining voice information is compacted and outputted to a voice processing unit.
According to the apparatus, the method for reproducing image information and voice information, and the computer-readable recording medium of the invention, even if the voice delay increases, the delay is improved by discarding the voiceless portions.
The above objects and advantages of the present invention will become more apparent by describing in detail preferred exemplary embodiments thereof with reference to the accompanying drawings, wherein:
In order to achieve the above object, according to the invention, there is provided a method for reproducing and outputting image information and voice information by receiving the image information and the voice information from a camera through a network, comprising: storing said voice information; deciding that said voice information is no-data or no-sound when it is lower than a predetermined threshold, and that it is a voiced sound when it is higher than the predetermined threshold; and discarding the voice information decided as no-data or no-sound, and compacting the remaining voice information.
The voice information decided as no-data or no-sound in the voice reception buffer is discarded, and the remaining voice information is compacted and outputted as a voice. As a result, the voice reception buffer can be utilized effectively, and the voice is neither delayed from the image nor cut. Thus, the method of the invention is hardly influenced by traffic fluctuations.
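The discard-and-compact operation described above can be illustrated by the following sketch. This is not code from the specification: the function name, the sample-level representation and the parameters are hypothetical, and the decision here uses a simple per-sample amplitude threshold with a minimum run length standing in for the "predetermined time period".

```python
# Illustrative sketch (hypothetical, not from the specification): discard
# no-data/no-sound runs from buffered PCM samples and compact the remainder.

def compact_buffer(samples, threshold, min_run):
    """Drop runs of at least `min_run` consecutive samples whose absolute
    value is at or below `threshold`; keep everything else in order."""
    out = []
    run = []  # current run of candidate no-sound samples
    for s in samples:
        if abs(s) <= threshold:
            run.append(s)
        else:
            # A short quiet run is kept as part of the voice; a long one
            # is discarded, and later data advances into the freed area.
            if len(run) < min_run:
                out.extend(run)
            run = []
            out.append(s)
    if len(run) < min_run:
        out.extend(run)
    return out
```

With this sketch, a long stretch of near-zero samples disappears from the buffer while short pauses inside speech survive, which mirrors how compaction shortens the effective buffering length without cutting voiced portions.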
A network camera, a program and a network system according to the first embodiment of the invention will be described in the following. As to
In
Numeral 11 designates a codec unit for compressing and decompressing the data to be transmitted and received. Numeral 12 designates an image processor for compressing the image signals taken by the camera 10. Numeral 13 designates a communication control unit for performing protocol processing on the image data compressed by the image processor 12 and for transmitting the processed data. Here, this protocol processing means processing such as that of the TCP/IP protocol or the IEEE 802.3 protocol of Ethernet (registered trademark).
Numeral 14 designates a voice reception processor for decoding the voice reception data (PCM data) received by the voice mapping type network camera 1. Numeral 14a designates a DA converter for converting the output of the voice reception processor 14, a digital signal, into an analog signal. Numeral 15 designates a voice transmission processor for encoding the voice inputted to the voice mapping type network camera 1. Numeral 15a designates an AD converter for converting the output of a voice input adjusting circuit 17a (described hereinafter), an analog signal, into a digital signal. Numeral 16 designates a buffer unit of the voice mapping type network camera 1. Numeral 16a designates an image buffer of the buffer unit 16 for the image data, such as JPEG or MPEG data, compressed by the image processor 12. Numeral 16b designates a voice transmission buffer of the buffer unit 16 for the PCM data encoded by the voice transmission processor 15. Numeral 16c designates a FIFO (First-In First-Out) voice reception buffer of the buffer unit 16 for buffering the PCM data transmitted from the computer system 2 via the network 3.
This voice reception buffer 16c temporarily buffers a large quantity of transmitted voice reception data in accordance with the relationship between the processing ability and the processing quantity. When the traffic load rises, the arriving data decreases due to the delay of packets, so the processing itself would seem to pose no problem. In fact, however, a problem occurs in that the time band in which data cannot be fetched continues, so that no-data areas are mixed into the data of the voice reception buffer 16c. Specifically, the first-in data is continuously outputted, but the data of delayed packets is not written into the memory elements configuring the voice reception buffer 16c; that is, the memory elements are not charged. When this no-data state is transferred to the voice reception processor 14, the voice reception processor 14 has to perform meaningless processing. In the first embodiment, therefore, such no-data areas and the intrinsic no-sound state of a low sound volume are detected and discarded. The no-data and the no-sound will be referred to together as the "no-data/no-sound".
Next, in
In
Next, numeral 19e designates a buffer control unit for controlling the write action and the output action of the PCM data in and from the voice reception buffer 16c. Numeral 19f designates a reception buffer level decision unit for deciding whether or not the level of the data corresponds to the no-data/no-sound; and numeral 19g designates a timer unit for counting whether or not the state of no-data/no-sound continues for a predetermined time period. In the first embodiment, when the no-data/no-sound continues for the predetermined time period, the buffer control unit 19e discards (or erases the charge of) the entire data of that period and controls the buffer to eliminate the no-data/no-sound area by advancing the subsequent data into the discarded area. The reception buffer level decision unit 19f is set with a threshold for distinguishing the voice from the no-data/no-sound. The reception buffer level decision unit 19f decides the no-data/no-sound when the level is at or below the threshold, and informs the buffer control unit 19e of the decision. In the first embodiment, the no-data/no-sound is decided when the detected level is at or below the threshold for 365 ms, but a proper set value may be adopted for the time duration. In response to this notice, the buffer control unit 19e causes the timer unit 19g to count a predetermined time, so as to decide whether or not the no-data/no-sound continues. When the timer unit 19g counts out, it is decided that the no-data/no-sound has occurred. Moreover, numeral 19h designates a setting unit for setting the aforementioned threshold. Next in
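The interplay of the level decision and the timer unit can be sketched as follows. This is a hypothetical illustration, not the specification's implementation: the class and field names are invented, and the 5 ms measurement granularity is an assumption; only the 365 ms hold duration comes from the text.

```python
# Hypothetical sketch: the level decision unit reports low-level frames and
# a millisecond counter plays the role of the timer unit (19g / 25c).

FRAME_MS = 5    # assumed granularity of one level measurement
HOLD_MS = 365   # hold duration from the text before declaring no-data/no-sound

class NoSoundTimer:
    def __init__(self, threshold):
        self.threshold = threshold
        self.elapsed_ms = 0

    def update(self, level):
        """Feed one level measurement; return True once the level has stayed
        at or below the threshold for HOLD_MS continuously."""
        if level <= self.threshold:
            self.elapsed_ms += FRAME_MS
        else:
            self.elapsed_ms = 0  # any voiced frame restarts the timer
        return self.elapsed_ms >= HOLD_MS
```

The point of the timer is that a momentary dip below the threshold never triggers a discard by itself; only a continuous low-level state of the set duration does.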
The configuration of the computer system 2 will be described with reference to
Moreover: numeral 25a designates a buffer control unit for controlling the write action and the output action of the PCM data to and from the voice reception buffer 23a; numeral 25b designates a reception buffer level decision unit for deciding whether or not the level is equivalent to the no-data/no-sound; and numeral 25c designates a timer unit for counting whether or not the state of no-data/no-sound continues for a predetermined time period. Moreover, numeral 25d designates a screen displaying information generation unit for creating a no-sound erasure setting screen 56 (as referred to
On the other hand, numeral 26 designates the terminal side communication processing unit, which is realized by a program, such as an ActiveX control or a JAVA (registered trademark) program, downloaded by the file transfer unit 193 of the voice mapping type network camera 1. Numeral 27 designates the microphone, numeral 27a designates a voice input adjusting circuit, numeral 28 designates the speaker, numeral 28a designates a voice output adjusting circuit, numeral 29 designates a display unit, and numeral 30 designates a motor.
With reference to
Subsequently in
Subsequently, the actions to discard the no-data/no-sound at the voice reception buffer 23a of the computer system 2 will be described in detail with reference to
The buffer control unit 25a shown in
Here, the graph of
When a predetermined quantity of data is stored in the voice reception buffer 23a, the buffer control unit 25a discards the no-data/no-sound data and sequentially compacts and outputs the voice data. The actions of the voice reception buffer 23a at this time will be described with reference to
The areas M and N thus decided as the no-data/no-sound are discarded (or discharged) by the buffer control unit 25a, and the areas A, B and C are sequentially compacted. This state is shown in the two lower diagrams of
However, it is not always best to make the decision of the no-data/no-sound with the constant thresholds L and H. Specifically, when the buffering data length of the voice reception buffer 23a is short, the threshold H is lowered to increase the data treated as voice. When the buffering data length is large, the threshold L and the threshold H are raised to reduce the data treated as voice. These operations are preferred for causing no delay. With these decisions, moreover, the no-data area is always at or below the threshold L. Thus, even when the threshold L and the threshold H are varied, it is possible to eliminate the influences of the fluctuations in the traffic load of the network 3.
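The role of the two thresholds L and H is that of a hysteresis: a segment is promoted to voice only above H and demoted to no-sound only at or below L, so brief dips in level do not cut word endings. The following is a minimal sketch of that decision under assumed names; it is an illustration, not the patent's circuit.

```python
# Minimal hysteresis sketch (assumed, not from the specification): promote to
# "voice" only above the high threshold, demote only at or below the low one.

def classify(levels, low, high):
    """Label each level 'voice' or 'silence' with hysteresis (low < high)."""
    state = "silence"
    labels = []
    for lv in levels:
        if state == "silence" and lv > high:
            state = "voice"
        elif state == "voice" and lv <= low:
            state = "silence"
        labels.append(state)
    return labels
```

Note how a level between the two thresholds keeps whatever state preceded it, which is the property that protects the tail of a voiced sound from premature cutting.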
In
The threshold L and the threshold H are raised in proportion to the buffering data length as that length increases. The reason is as follows. When the buffer capacity is large, the quantity of received PCM data is frequently large in proportion. If the range for deciding the no-data/no-sound is widened by raising the threshold L and the threshold H (the threshold levels), the quantity of operations to be processed by the voice processing unit 25 can be decreased. If the threshold H is set to −9 dB and the threshold L to −12 dB for a buffering data length of 400 ms, they can preferably be raised by 3 dB at every 100 ms from 400 ms to 1,000 ms, so that the threshold H takes +9 dB and the threshold L takes +6 dB at 1,000 ms. The threshold L and the threshold H are thus varied, keeping a difference of 3 dB between them, in steps for every 100 ms of the buffering data length.
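The numerical example above can be checked with a short helper. The function name is invented, and the clamping behaviour outside the 400 ms to 1,000 ms range is an assumption; the endpoint values and the 3 dB per 100 ms step come directly from the text.

```python
# Numerical illustration of the example in the text: H = -9 dB and L = -12 dB
# at a buffering data length of 400 ms, both raised 3 dB per 100 ms up to
# 1,000 ms. Clamping outside that range is an assumption.

def thresholds_db(buffer_ms):
    steps = (min(max(buffer_ms, 400), 1000) - 400) // 100
    high = -9 + 3 * steps
    low = -12 + 3 * steps
    return low, high
```

At 1,000 ms this yields L = +6 dB and H = +9 dB, matching the figures given above, with the 3 dB gap between L and H preserved at every step.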
The description thus far has been directed mainly to the discard setting process and the erasing action of the no-sound data at the voice reception buffer 23a of the computer system 2. In particular, the description has dealt with the computer system 2, which forms the voice reception buffer 23a by receiving a program such as JAVA (registered trademark) applets transmitted from the voice mapping type network camera 1 and configures the terminal side communication processing unit 26 to communicate, but the invention should not be limited thereto. Moreover, the discard setting process and the erasing actions of the no-sound data in the voice reception buffer 16c of the voice mapping type network camera 1 are similar in all respects, and their detailed description is omitted to avoid overlap. Here, the voice processing unit 25 of the computer system 2 performs the function of the voice reception processor 14 at the voice receiving time and the function of the voice transmission processor 15 at the voice transmitting time. In the computer system 2, moreover, the client receives the portal screen and displays the no-sound erasure setting screen 56 for inputting the settings; in the case of the voice mapping type network camera 1, however, the manager makes the settings from a maintenance terminal.
Subsequently, the flow for discarding the no-data/no-sound data between the network camera and the computer system according to the first embodiment of the invention will be described.
The reception buffer level decision unit 25b discards the voice data in the area of no-data/no-sound (at step 3) and compacts the gaps between the voice areas sequentially (at step 4). The voice data is inputted to the voice processing unit 25 and converted into voice digital signals (i.e., the PCM signals) (at step 5), and the analog signals are outputted (at step 6) from the speaker 28 through the DA converter.
Thus, the voice reception buffer 23a varies the buffering data length and varies the threshold levels according to the quantity of the voice data stored. As a result, the quantity of processing of the voice processing unit 25 can be reduced according to the traffic state at the voice communication time. Even with much no-data/no-sound data or with a packet delay, the buffer is used effectively, and the voice is neither delayed nor influenced by the traffic load.
Further, the voice data is determined as the no-data or the no-sound when the absolute value of the amplitude information of the sound data is equal to or lower than a predetermined value for a predetermined time, and is determined as the voiced sound when that absolute value is higher than the predetermined value. Therefore, the determination can be conducted with a small quantity of processing.
Further, the voice data is determined as the no-data or the no-sound when the integrated value of the square power of the sound data over a predetermined time is equal to or lower than a predetermined value, and is determined as the voiced sound when that integrated value is higher than the predetermined value. Therefore, a precise determination can be performed.
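The two criteria just described can be sketched side by side. The function names, the window representation and the absence of normalisation are assumptions for illustration only: the cheap test inspects the absolute amplitude, while the more precise one integrates the square power over the window.

```python
# Hypothetical sketch of the two decision criteria described above.

def is_silent_by_amplitude(window, limit):
    # No-data/no-sound if every sample's absolute amplitude is within the limit.
    return max(abs(s) for s in window) <= limit

def is_silent_by_energy(window, limit):
    # No-data/no-sound if the integrated square power over the window is
    # within the limit.
    return sum(s * s for s in window) <= limit
```

The amplitude test costs one comparison per sample, whereas the energy test smooths out isolated spikes, which is why the text associates it with a more precise determination.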
Further, the voice data is determined as the no-data or the no-sound when the absolute value of the amplitude information of the sound data is equal to or lower than a first value for a predetermined time, and is determined as the voiced sound when that absolute value is higher than a second value. Because the magnitude of the voice data, evaluated precisely by an averaging process, and the change from a voiced sound to a no-sound are determined with different threshold values, a precise determination can be achieved with a small quantity of processing.
Further, the voice data is determined as the no-data or the no-sound when the integrated value of the square power of the voice data over a predetermined time is equal to or lower than the first value, and is determined as the voiced sound when that integrated value is higher than the second value. Because the magnitude of the voice data, evaluated precisely by an averaging process, and the change from a voiced sound to a no-sound are determined with different threshold values, a more precise determination can be achieved.
Further, since the second value is set higher than the first value by a wide margin, the last data of the voiced sound is prevented from being cut excessively. Also, when the sound data transitions to a voiced sound, it passes through an area determined as the no-data or the no-sound; therefore, even if the second value is rather high, the determination does not become erroneous.
Further, when a predetermined quantity of the voice data is stored in the voice reception buffer, the determining process determines whether the sound data is the no-data/no-sound or the voiced sound, and the discarding process discards the voice data determined as the no-data or the no-sound. The inside of the voice reception buffer is thus rearranged whenever the predetermined quantity of the voice data is stored therein, so that the rearranged sound data can be transmitted normally.
Numeral 301 designates a camera chip containing a CPU and its peripheral circuits. Numeral 302 designates a flash ROM that stores the program and data for the actions of the camera chip 301. Numeral 303 designates a working S-DRAM for the camera chip 301. Numeral 304 designates a CCD/CMOS chip for converting a taken image into electric signals. Numeral 305 designates an audio PCM chip for inputting/outputting voice signals. Numeral 306 designates a LANPHY chip serving as an electric interface for the physical connection with a LAN. Numeral 307 designates a motor drive chip for driving the motors that move the camera within its taking range, i.e., a Tilt motor 308 and a Pan motor 309. There are also a microphone for voice input and a speaker for voice output, although not shown.
The camera chip 301 is configured by a CPU 301-1; a JPEG converter 301-2 for converting a taken image, in the form of electric signals, into an image of the JPEG format; a G.726 converter 301-3 for conversion into the voice data format for the network communication; an MMU (Memory Management Unit) 301-4; a GPIO (General Purpose Input/Output); and a LAN (Local Area Network) 301-6.
This hardware configuration diagram of
Moreover, it is possible to realize: the flash ROM 302 with MX29LV320; the S-DRAM 303 with MT48CM16; the Audio PCM chip 305 with AK2308; the LANPHY chip 306 with ICS1893; the CCD chip 304 with the combination of ICX098, MN5400 and HV7131; and the motor drive chip 307 with LB1937.
With these configurations, there can be realized a camera which can output image information to the network and output an uninterrupted voice even under dense communication traffic, by inputting the voice information from the communication terminal and deciding the magnitude of the received voice information.
The invention can be applied to network systems for image transmission and voice communication using the voice mapping type network camera.
| Number | Date | Country | Kind |
| --- | --- | --- | --- |
| P2004-191148 | Jun 2004 | JP | national |