The present disclosure relates to text processing techniques, and more particularly to extracting an organization phrase from sample texts and segmenting a text based on the organization phrase.
Text-to-Speech techniques can transcribe a text sentence into audio signals. For example, in a navigation application (e.g., a DiDi app), a text sentence, such as traffic conditions, addresses, or the like, may be presented to a user by voice.
To be read in a natural way, a piece of text (e.g., a sentence) must be segmented properly before being transcribed into audio signals. Generally, each of the phrases that are included in a sentence contains one or more words. Consistent with this disclosure, a word can be a word in a Latin-alphabet language such as English, French, Spanish, etc., or a character of an Asian language such as Chinese, Korean, Japanese, etc. These words or characters may be segmented into phrases in a plurality of possible combinations.
A text sentence may contain address information or a Point of Interest (POI), which may be referred to as an “organization phrase.” For example, in a text sentence “China-Singapore Industrial Park is 30 kilometers away” for navigation, “Industrial Park” is an organization phrase. Based on the organization phrase, the above sentence may be segmented as “China-Singapore/Industrial Park/is/30 kilometers away.” Thus, the organization phrase may be used to facilitate a proper segmentation of the text sentence.
Embodiments of the disclosure provide improved systems and methods for extracting an organization phrase and segmenting a text based on the organization phrase.
An aspect of the disclosure provides a method for segmenting a text. The method may include identifying, by a processor, a candidate phrase shared by a plurality of sample texts; determining, by the processor, an evaluation score for the candidate phrase; identifying, by the processor, the candidate phrase as an organization phrase when the evaluation score meets a predetermined criterion; and segmenting the text based on the organization phrase.
Another aspect of the disclosure provides a system for segmenting a text. The system may include a communication interface configured for receiving a plurality of sample texts; a memory; and a processor configured for identifying a candidate phrase shared by the plurality of sample texts; determining an evaluation score for the candidate phrase; identifying the candidate phrase as an organization phrase when the evaluation score meets a predetermined criterion; and segmenting the text based on the organization phrase.
Yet another aspect of the disclosure provides a non-transitory computer-readable medium that stores a set of instructions that, when executed by at least one processor of an electronic device, cause the electronic device to perform a method for generating a list of organization word entries. The method may include identifying a candidate phrase shared by a plurality of sample texts; determining an evaluation score for the candidate phrase; identifying the candidate phrase as an organization phrase when the evaluation score meets a predetermined criterion; and segmenting the text based on the organization phrase.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
An aspect of the disclosure is directed to a system for segmenting a text. For example,
System 100 may be a general server or a proprietary device for processing text information in a sentence. As shown in
Communication interface 102 may be configured to receive one or more sample texts 116. In some embodiments, sample texts 116 may include address information to identify a location, such as a road, a building, a park, or the like.
Memory 114 may be configured to store one or more sample texts 116. Memory 114 may be implemented as any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, or a magnetic or optical disk.
Consistent with embodiments of the disclosure, candidate phrase determination unit 106 may determine a candidate phrase based on received sample texts 116. For example, a plurality of sample texts may include “Beijing Industrial Park”, “Shanghai Industrial Park”, “Silicon Valley Industrial Park”, “China-Singapore Industrial Park”, and “Beijing New Industrial Park”. Candidate phrase determination unit 106 may compare the plurality of sample texts, and determine a phrase shared among sample texts 116 (e.g., “Industrial Park”) as the candidate phrase. In the above sample texts, the candidate phrase is at the end of each sample text.
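By way of a non-limiting illustration, the comparison of sample texts described above might be sketched as follows. The sketch assumes whitespace-delimited words and counts trailing word sequences; the function name `shared_suffix_candidates` and the parameters `min_count` and `max_words` are hypothetical and do not appear in the disclosure.

```python
from collections import Counter


def shared_suffix_candidates(sample_texts, min_count=2, max_words=3):
    """Count trailing word sequences and keep those shared by several texts.

    Assumption: words are separated by whitespace. A character-based
    language would need a different tokenization step.
    """
    counts = Counter()
    for text in sample_texts:
        words = text.split()
        # Only consider suffixes shorter than the whole text, so that a
        # non-empty prefix (e.g., "Beijing") always remains.
        for n in range(1, min(max_words, len(words) - 1) + 1):
            counts[" ".join(words[-n:])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


samples = [
    "Beijing Industrial Park",
    "Shanghai Industrial Park",
    "Silicon Valley Industrial Park",
    "China-Singapore Industrial Park",
    "Beijing New Industrial Park",
]
candidates = shared_suffix_candidates(samples)
```

In this sketch both “Park” and “Industrial Park” are returned as candidates, since each ends all five sample texts; deciding which candidate is actually an organization phrase is left to the evaluation score discussed below.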
Evaluation unit 108 may then determine an evaluation score for the candidate phrase. The evaluation score indicates how likely the candidate phrase is to be an organization phrase. In some embodiments, the evaluation score may be determined based on whether the candidate phrase is associated with a proper segmentation path. That is, when a segmentation path that treats the candidate phrase as an organization phrase yields a more favorable evaluation score, it is an indication that the candidate phrase is indeed an organization phrase.
In a non-limiting example, evaluation unit 108 may generate a second segmentation path that is different from a first segmentation path including a segment corresponding to the candidate phrase, and determine whether the second segmentation path is a proper segmentation path. If the second segmentation path is less likely to be a proper segmentation path, the first segmentation path is, by contrast, more likely to be a proper segmentation path. Thus, the candidate phrase is more likely to be an organization phrase.
Consistent with the disclosure, evaluation unit 108 may identify a reference phrase associated with the candidate phrase for each sample text, and determine a first number of sample texts that contain the reference phrase. The reference phrase may be associated with an improper segmentation of the sample text. For example, in a sample text “Camden High Street”, “High Street” may be determined as a candidate phrase, and evaluation unit 108 needs to determine whether the segmentation based on the candidate phrase is reasonable. To do that, evaluation unit 108 may generate an alternative segmentation, such as “Camden High/Street.” Based on this alternative segmentation, evaluation unit 108 may determine “Camden High” as a reference phrase, and determine a total number T of sample texts that contain “Camden High.” Then, evaluation unit 108 may segment each sample text into segments, and determine a second number of sample texts that contain a segment corresponding to the reference phrase. With reference to the above example, evaluation unit 108 may segment each sample text into segments using a language model, and determine a number M of sample texts that contain a segment associated with “Camden High.” The language model can generate a segmentation path according to natural language rules. That is, in each of the M sample texts, “Camden High” is treated as a segment. As discussed above, the segmentation including “Camden High” as a segment is an improper segmentation. Thus, a segmentation failure rate p may be determined based on the numbers T and M, and may be calculated according to the equation below.
p=M×M/T
According to the above discussion, a reference phrase (e.g., “Camden High”) corresponds to an improper segmentation, and p therefore reflects how often that improper segmentation occurs. When the number M of sample texts that contain a segment associated with the reference phrase is small, the value of p is small, which indicates that the segmentation including the candidate phrase is more likely to be a proper segmentation because few competing segmentations exist. For example, the sample text “Camden/High Street” may have a segmentation failure rate p of 0.4, the sample text “Shanxi/South Road” may have a segmentation failure rate p of 0.3, while “Luo/Nan Road” may have a segmentation failure rate p of 17.2.
It is contemplated that the above language model may segment a text according to natural language rules, and that the language model can be trained for a designated language, such as English, Chinese, Japanese, or the like.
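The counting of T and M and the equation p=M×M/T may be illustrated with the following minimal sketch. It is not the disclosed implementation: the language model is replaced by a hypothetical lookup table of precomputed segmentations, the sample texts are invented for illustration, and containment is tested by simple substring matching.

```python
def segmentation_failure_rate(reference_phrase, sample_texts, segment):
    """Return p = M * M / T for one reference phrase.

    T: number of sample texts that contain the reference phrase.
    M: number of those texts in which the segmenter emits the reference
       phrase as a segment of its own.
    `segment` is any callable mapping a text to a list of segments; here
    it stands in for the language model, which is an assumption.
    """
    containing = [t for t in sample_texts if reference_phrase in t]
    T = len(containing)
    if T == 0:
        return 0.0  # the phrase never occurs, so nothing can be mis-segmented
    M = sum(1 for t in containing if reference_phrase in segment(t))
    return M * M / T


# Hypothetical segmentations standing in for language-model output.
toy_segmentations = {
    "Camden High Street": ["Camden", "High Street"],
    "Camden High School": ["Camden High", "School"],
    "Camden High Road": ["Camden", "High Road"],
    "Kentish Town Road": ["Kentish Town", "Road"],
}


def toy_segment(text):
    return toy_segmentations[text]


p = segmentation_failure_rate("Camden High", list(toy_segmentations), toy_segment)
```

With these invented inputs, T is 3 and M is 1, so p is 1/3. Because M is squared, p is not bounded by 1, which is consistent with example values such as 17.2 given above.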
Based on the segmentation failure rates calculated for each sample text, evaluation unit 108 may determine the evaluation score by averaging the segmentation failure rates of the respective sample texts. The respective sample texts may each include a segment associated with the candidate phrase. For example, “High Street” may have an evaluation score S of 0.988, and “Zhuang Street” may have an evaluation score S of 5.731. The individual scores may be aggregated in any suitable way to derive the evaluation score. For example, instead of a straight average of the individual scores, the evaluation score may be a weighted average of the individual scores, and the weights may correspond to how frequently the associated sample text is used. For example, if “China-Singapore Industrial Park” is frequently used in a navigation app (e.g., the DiDi app), the individual score for the candidate phrase “Industrial Park” generated based on this text may be assigned a greater weight.
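The straight and weighted averages described above might be computed as in the following sketch. The failure rates and usage weights shown are hypothetical values chosen only to make the arithmetic easy to follow; the function name `evaluation_score` is likewise an assumption.

```python
def evaluation_score(failure_rates, weights=None):
    """Aggregate per-text segmentation failure rates into one score.

    With no weights this is a straight average; with weights (e.g., how
    often each sample text is used) it is a weighted average.
    """
    if not failure_rates:
        raise ValueError("at least one failure rate is required")
    if weights is None:
        weights = [1.0] * len(failure_rates)
    if len(weights) != len(failure_rates):
        raise ValueError("weights and failure_rates must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(failure_rates, weights)) / total


# Hypothetical failure rates for three sample texts sharing one candidate.
rates = [0.5, 1.5, 1.0]
straight = evaluation_score(rates)              # (0.5 + 1.5 + 1.0) / 3
weighted = evaluation_score(rates, [2, 1, 1])   # first text used twice as often
```

Here the straight average is 1.0 and the weighted average is 0.875, because the more frequently used text has the lowest failure rate and pulls the score down.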
Organization phrase determination unit 110 may identify the candidate phrase as an organization phrase when the evaluation score meets a predetermined criterion. In some embodiments, the candidate phrase may be identified as an organization phrase when the evaluation score is less than a threshold. For example, the threshold may be predetermined as “1”. With reference to the above examples of “High Street” and “Zhuang Street”, “High Street” having the evaluation score S of 0.988 may be identified as an organization phrase.
Organization phrase determination unit 110 may further generate a list of organization phrases, and rank the list of organization phrases in ascending order of the respective evaluation scores. The list may be stored in memory 114 and used in further processing. In some embodiments, the list may be automatically or manually reviewed to remove one or more phrases that are known to be non-organization phrases.
Segmentation unit 112 may further segment a text based on the organization phrase. For example, when more than one segmentation path is generated for one text using the language model, segmentation unit 112 may select a segmentation path including an organization phrase as a segment, and segment the text accordingly. Alternatively, the language model may be trained to automatically treat an organization phrase as a segment.
System 100 can extract organization phrases from sample texts, and the extracted organization phrases may be further used to segment a text before the text is transcribed into audio signals.
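As a non-limiting sketch, the thresholding and ranking described above could look like the following, using the example scores and the example threshold of 1 given above. The function name `select_organization_phrases` is hypothetical.

```python
def select_organization_phrases(scores, threshold=1.0):
    """Keep candidates scoring below the threshold, ranked by ascending score."""
    accepted = [(phrase, s) for phrase, s in scores.items() if s < threshold]
    return sorted(accepted, key=lambda item: item[1])


scores = {"High Street": 0.988, "Zhuang Street": 5.731}
organization_phrases = select_organization_phrases(scores)
```

With these inputs only “High Street” is retained; “Zhuang Street” is rejected because 5.731 is not less than the threshold.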
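The path selection described above may be sketched as follows. The sketch assumes the candidate segmentation paths are already available and ordered by the language model's preference; the function name `choose_segmentation` and the fallback to the first path are illustrative assumptions, not part of the disclosure.

```python
def choose_segmentation(paths, organization_phrases):
    """Pick the first path that contains an organization phrase as a segment.

    `paths` is a list of candidate segmentations, each a list of segments,
    assumed to be ordered by the language model's own preference. If no
    path contains an organization phrase, the first path is returned.
    """
    if not paths:
        raise ValueError("at least one segmentation path is required")
    for path in paths:
        if any(segment in organization_phrases for segment in path):
            return path
    return paths[0]


paths = [
    ["China-Singapore Industrial", "Park", "is", "30 kilometers away"],
    ["China-Singapore", "Industrial Park", "is", "30 kilometers away"],
]
chosen = choose_segmentation(paths, {"Industrial Park"})
```

Here the second path is chosen because it keeps “Industrial Park” together as a single segment.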
Another aspect of the disclosure is directed to a method for segmenting a text. For example,
In step S202, the segmentation device may identify a candidate phrase shared by a plurality of sample texts. The plurality of sample texts may be compared to determine the candidate phrase. In some embodiments, the candidate phrase is at the end of each sample text.
In step S204, the segmentation device may determine an evaluation score for the candidate phrase. The evaluation score may be determined based on multiple alternative segmentation paths of the text. At least one of the segmentation paths includes the candidate phrase as a segment.
As shown in
Then, in step S306, the segmentation device may segment each sample text into segments and determine a second number of sample texts that contain the reference phrase as a segment. In some embodiments, the sample text may be segmented using a language model. In step S308, the segmentation device may determine a segmentation failure rate based on the first and second numbers.
In step S310, the segmentation device may determine the evaluation score by aggregating (such as averaging) the segmentation failure rates of the respective sample texts. The respective sample texts may each include a segment associated with the candidate phrase.
With reference back to
In step S208, the segmentation device may segment the text based on the organization phrase. For example, the segmentation may include the organization phrase as a segment.
Yet another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed segmentation system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.
The present application is a continuation of International Application No. PCT/CN2017/095335 filed on Jul. 31, 2017, designating the United States of America. The entire contents of the above-referenced application are incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2017/095335 | Jul 2017 | US |
| Child | 16749959 | | US |