The invention relates to the field of data processing systems. In particular, the invention relates to an improved method and system for analyzing character strings in order to improve the legibility of words.
Many organizations rely on accurate and reliable information for a variety of different purposes. Organizations also rely on consumers of information to read and interpret the information correctly. For example, in the health care industry one of the main factors that can contribute to the incorrect administration of a medication is the confusion that can arise with ‘look-alike’ and ‘sound-alike’ names of medication. Examples are Hydroxyzine and Hydralazine, or Bupropion and Buspirone. This could have disastrous consequences if, for example, a medical professional reads the name of the medication incorrectly or administers an incorrect dosage to a patient.
Many people also suffer from dyslexia. Dyslexia is a condition that impairs a person's ability to read. If not identified early enough in a child's development, dyslexia can inhibit a child's educational progress and destroy their confidence. Rather than relying on traditional teaching methods to help a child, it may be more helpful to display text of the written word in a format that would make it much easier for a person to read.
Therefore, there is a need in the art to find a way in which to represent words in a form which allows words to be read easily, accurately, and quickly and in a form that allows words to be more legible.
Viewed from a first aspect, the present invention provides a method for analyzing a character string, the method comprising: analyzing a character string to determine one or more characters of the character string; determining from a dictionary source, an alternative character string to the analyzed character string; comparing the analyzed character string with the alternative character string to determine a weighting factor for each of the characters of the analyzed character string relative to a positional arrangement of the characters in the alternative character string; and for each determined weighting factor, generating for each character in the analyzed character string a corresponding character of particular size and height as determined by the weighting factor.
The present invention advantageously provides a method in which it is possible to detect characters of a word that have a higher or lower significance when compared to other characters of the word. By determining characters of a word that have a greater or lower significance, each individual character can be displayed in differing font sizes (depending on whether the character is of a greater or lesser significance). Advantageously, by following this method words can be printed or displayed in an optimal manner not just to improve legibility but also to save real estate on a computer display or in a printed form.
The present invention provides a method wherein an additional vowel reduction weighting factor is applied to the analyzed character string in order to generate a corresponding character string of a particular size.
The present invention provides a method wherein the vowel weighting factor decreases or increases the size of the corresponding character string.
The present invention provides a method wherein the weighting factor increases or decreases the size of the corresponding character.
The present invention provides a method further comprising formatting each of the corresponding characters based on their respective assigned weighting factors for collectively displaying each of the corresponding characters as a word having characters of differing sizes.
The present invention provides a method wherein the formatting further comprises formatting each of the corresponding characters along a horizontal alignment whereby a horizontal alignment takes place in an upper quartile of each of the corresponding characters.
Viewed from another aspect, the present invention provides an apparatus for performing a method for analyzing a character string, the method comprising: analyzing a character string to determine one or more characters of the character string; determining from a dictionary source, an alternative character string to the analyzed character string; comparing the analyzed character string with the alternative character string to determine a weighting factor for each of the characters of the analyzed character string relative to a positional arrangement of the characters in the alternative character string; and for each determined weighting factor, generating for each character in the analyzed character string a corresponding character of particular size and height as determined by the weighting factor.
The present invention provides an apparatus wherein an additional vowel reduction weighting factor is applied to the analyzed character string in order to generate a corresponding character string of a particular size.
The present invention provides an apparatus wherein the vowel weighting factor decreases the size of the corresponding character string.
The present invention provides an apparatus wherein the weighting factor increases the size of the corresponding character.
The present invention provides an apparatus wherein each of the corresponding characters is formatted based on its respective assigned weighting factor for collectively displaying the corresponding characters as a word having characters of differing sizes.
The present invention provides an apparatus wherein the formatting further comprises formatting each of the corresponding characters along a horizontal alignment whereby a horizontal alignment takes place in the upper quartile of each of the corresponding characters.
Viewed from another aspect, the present invention provides computer program code stored on a computer readable medium for performing a method for analyzing a character string, when loaded into a computer system and executed, the method comprising: analyzing a character string to determine one or more characters of the character string; determining from a dictionary source, an alternative character string to the analyzed character string; comparing the analyzed character string with the alternative character string to determine a weighting factor for each of the characters of the analyzed character string relative to a positional arrangement of the characters in the alternative character string; and for each determined weighting factor, generating for each character in the analyzed character string a corresponding character of particular size and height as determined by the weighting factor.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.
Typically, the data processing system 100 comprises some form of storage 120 in which to either store data locally on the data processing system 100 or via external storage 145, main memory for loading and running a natural language analyzing application 200, input system 125 for receiving data for analysis by the natural language analyzing application 200 and a display 130 for viewing an output of the natural language analyzing application 200. The input system 125 can take the form of a keyboard, mouse, scanner, optical character recognition system, etc.
The data processing system 100 may either be operable as a server or a client device 150. When operating as a server, client devices 150 are able to communicate with the server over a network 140. Client devices 150 can send requests to the natural language analyzing application 200 located on the server and subsequently receive responses from the natural language analyzing application 200.
When the natural language analyzing application 200 is operable on a client device 150, the client device 150 is operating in a standalone mode.
The text receiver component 210 receives data for analyzing from various forms of data sources (step 500). A data source may comprise data from a word processor document, an email, or any other form of structured or non structured data. The text receiver component 210 communicates the data to a text analytics engine 205 for processing.
A word segmentation component 300 receives the data from the text receiver component 210 and begins by determining the language of the data of the document. The word segmentation component 300 then proceeds to identify words by interfacing with a language-specific instruction set that determines how to identify words from a continuous stream of data (step 505). The language-specific instruction set also comprises rules that determine word boundaries, for example, for languages in which words are not typically separated by spaces. Thus, the language-specific instruction set can deal with a number of different types of languages.
Each of the identified words is added to a queue for further processing. For each identified word in the queue, the parsing component 310 identifies the number of characters of the word and stores this information in memory.
The parsing component 310 also identifies the first character and the last character of the word and stores this information in memory (step 510). The first and the last characters of a word are important because, with reference to
However, in order to obtain a number of alternative spelling suggestions from the data dictionary 215, a substitution component 320 parses the word stored in the queue and substitutes each identified character, after the first character and up to the last character of the word, with another character of an alphabet (step 515). For example, taking the word configuration, the substitution component 320 begins at the letter ‘o’ and substitutes the letter ‘o’ with another letter of the alphabet (e.g., ‘a’) and then communicates with the data dictionary 215 to determine if there is a word such as ‘canfiguration’. If no such word exists, the substitution component 320 substitutes the character ‘o’ for another letter in the alphabet to find one or more alternative spelling suggestions.
Then, the substitution component 320 proceeds to the next character in the word and substitutes the letter ‘n’ with another character of the alphabet and so on to derive a number of alternative spelling suggestions. This process is continued until a number of alternative spelling suggestions are obtained.
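The substitution pass described above can be sketched roughly as follows. This is only an illustrative sketch, not the substitution component 320 itself: the function name, the toy dictionary, and the choice of a lowercase Latin alphabet are assumptions, and the sketch keeps both the first and the last character fixed. It replaces a single interior character at a time, so it finds only words that differ from the original in one position.

```python
import string


def substitution_candidates(word, dictionary, alphabet=string.ascii_lowercase):
    """Return dictionary words that differ from `word` in one interior character."""
    found = []
    # Index 0 and the final index are skipped: the first and last
    # characters are kept fixed in this sketch (an assumption).
    for i in range(1, len(word) - 1):
        for letter in alphabet:
            if letter == word[i]:
                continue  # substituting a character with itself is pointless
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


# Hypothetical toy dictionary standing in for the data dictionary 215.
toy_dictionary = {"cat", "cot", "cut", "cab", "bat"}
print(substitution_candidates("cat", toy_dictionary))
```

With the toy dictionary, only 'cot' and 'cut' are returned for 'cat', because 'cab' and 'bat' differ in the last and first character respectively.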
A person skilled in the art will realize that any form of substitution algorithm can be used in order to identify a number of alternative spelling suggestions. A substitution algorithm may also ‘learn’ that particular words always have a number of alternative spelling suggestions, ‘look-alikes’ or similar words, and these words can be stored in an alternative dictionary source for retrieval when the particular word is subsequently analyzed. This has the advantage of faster access and retrieval times.
As an example, the alternative spelling suggestions for the word ‘configuration’ may be as follows:
conflagration
consideration
confabulation
confederation
communication
communization
The alternative spelling suggestions are stored in an array (step 525, step 530) and are communicated to a comparison component 315 for processing.
The comparison component 315 retrieves the alternative spelling suggestions (example 1) from the array (step 600) and compares the alternative spelling suggestions with the original word (step 605). For example, the word ‘configuration’ is compared with the word ‘conflagration’.
If any alternative spelling suggestions comprise a greater or smaller number of characters than the original word, these alternative spelling suggestions are disregarded (step 610). For example, the word ‘configuration’ comprises thirteen characters, so an alternative spelling suggestion comprising fewer than or more than thirteen characters would be disregarded.
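The length check of step 610 amounts to a simple filter. The sketch below is a minimal illustration; the function name and the extra word 'confirmation' (twelve characters) are assumptions added only to show a suggestion being discarded.

```python
def same_length_only(original, suggestions):
    """Keep only suggestions with exactly as many characters as the original."""
    return [s for s in suggestions if len(s) == len(original)]


suggestions = [
    "conflagration", "consideration", "confabulation",
    "confederation", "communication", "communization",
    "confirmation",  # hypothetical extra entry, shorter than the original
]
kept = same_length_only("configuration", suggestions)
print(kept)
```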
Next, the comparison component 315 identifies the first character of the original word (configuration) and communicates with a weighting component 305 in order to assign a calculation/weighting factor (step 615). The comparison component 315 also identifies the last character of the word and the weighting component 305 assigns a weighting factor to this character (step 620).
In an embodiment, the weighting factor assigned to the first and the last characters of the word is a weighting factor that represents the importance or significance of the first and the last characters of the word. As has already been shown with reference to
The comparison component 315 compares each of the characters (after the first character and up to the last character identified in the word) with each of the characters of each of the alternative spelling suggestions in order to determine whether this character appears in any of the alternative spelling suggestions comparative to their positional arrangement (i.e., ‘high information density area’ or ‘low information density area’) in the alternative spelling suggestions.
If a determination is made that a character does not appear in any of the alternative spelling suggestions the comparison component 315 assigns a weighting factor which is indicative of the character's higher importance or significance when compared to the alternative spellings (step 625). Thus characters having a higher significance may be formatted in a larger font size compared to other characters of the words having a detected lower importance.
If a determination is made that a character of a word does appear in one or more alternative spelling suggestions, the analysis component 205 assigns a weighting factor which is indicative of the relative importance (greater or lower) of the character when compared to the one or more alternative spelling suggestions (step 630). Each assigned value is written to a record (step 635) associated with the analyzed word and stored in a database and communicated to the text formatter component 220 (step 640).
The above algorithm can be understood as follows:
For all other characters between the first and the last characters assign a value which reflects the relative importance or significance of the character when compared to the corresponding characters in the alternative spelling suggestions, wherein significance=1−(cnt/(cnt+cnttot))*0.8, wherein ‘cnt’ is the number of occurrences of the character across alternative spelling suggestions and ‘cnttot’ is the total number of alternative spelling suggestions.
A person skilled in the art will realize that the weighting factor of ‘0.8’ is an example weighting factor and that other values may be used without departing from the scope of the invention. The weighting factor can be modified to give different results in the resulting output string depending on the environment in which the invention is utilized.
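The formula above can be expressed as a short sketch. Several points here are assumptions rather than statements from the description: ‘cnt’ is taken to be the number of alternative spelling suggestions that carry the same character at the same position, the first and the last characters are fixed at 1.0, and the function names are hypothetical. Because the description leaves these details open, the sketch is not expected to reproduce the example values listed in this document, which may reflect additional rules.

```python
def significance(cnt, cnttot, factor=0.8):
    """significance = 1 - (cnt / (cnt + cnttot)) * factor, as stated in the text."""
    if cnt + cnttot == 0:
        return 1.0  # no suggestions at all: treat the character as fully significant
    return 1 - (cnt / (cnt + cnttot)) * factor


def positional_counts(word, suggestions):
    """For each position, count suggestions with the same character there (assumed reading of 'cnt')."""
    return [
        sum(1 for s in suggestions if i < len(s) and s[i] == ch)
        for i, ch in enumerate(word)
    ]


def interior_weights(word, suggestions, factor=0.8):
    """Weights per character; first and last are set to 1.0 (an assumption)."""
    counts = positional_counts(word, suggestions)
    total = len(suggestions)
    weights = [1.0] * len(word)
    for i in range(1, len(word) - 1):
        weights[i] = significance(counts[i], total, factor)
    return weights


alternatives = [
    "conflagration", "consideration", "confabulation",
    "confederation", "communication", "communization",
]
weights = interior_weights("configuration", alternatives)
for ch, w in zip("configuration", weights):
    print(f"({ch}) {w:.2f}")
```

Under these assumptions a character that never occurs at its position in any suggestion keeps a weight of 1.0, while a character shared by every suggestion drops to 1 − 0.5 × 0.8 = 0.6.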
Further, for an original word comprising only one character in common with an alternative spelling suggestion, this character is assigned a weighting factor of 0.85. Further for any analyzed words comprising only two characters in common with the original word, these characters are assigned a weighting factor of 0.85.
In another embodiment, after the weighting factor has been applied, a vowel reduction factor may be applied in order to decrease the size of vowels, thus reducing the size of the vowels further.
A person skilled in the art will realize that other variations of common characters and weighting factors can be used without departing from the scope of the invention.
For example, using the word ‘configuration’, the resulting weighting factors assigned after comparison with each of the alternative spelling suggestions (conflagration, consideration, confabulation, confederation, communication and communization) are as illustrated in Example 2:
(c) 1.00
(o) 0.72
(n) 0.88
(f) 0.88
(i) 0.81
(g) 0.96
(u) 0.79
(r) 0.90
(a) 0.72
(t) 0.85
(i) 0.72
(o) 0.72
(n) 1.00
In an embodiment the vowel reduction factor may be a value of 0.85. However a person skilled in the art will realize that the weighting factor of ‘0.85’ is an example vowel reduction weighting factor and that other values may be used without departing from the scope of the invention.
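A vowel reduction step of this kind might look like the sketch below. The description does not say which characters count as vowels or whether the first and last characters are exempt, so the English vowel set and the decision to scale every vowel are assumptions, as is the function name.

```python
VOWELS = set("aeiou")  # assumed vowel set for English text


def apply_vowel_reduction(word, weights, reduction=0.85):
    """Multiply the weight of each vowel by the vowel reduction factor."""
    return [
        w * reduction if ch.lower() in VOWELS else w
        for ch, w in zip(word, weights)
    ]


reduced = apply_vowel_reduction("cot", [1.0, 0.85, 1.0])
print([round(w, 4) for w in reduced])
```

For instance, a vowel that already carries a weighting factor of 0.85 is reduced to 0.85 × 0.85 = 0.7225, while the consonants are left unchanged.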
Once each character of the word has been analyzed, a weighting factor record is generated and associated with the word.
The associated record can be stored in a database for further reference. Example 3, below, is an example of an associated record which is stored for later use.
cn13 conflagration
cn13 consideration
cn13 confabulation
cn13 confederation
cn13 communication
cn13 communization
The key ‘cn13’ indicates that the words start with ‘c’, end with ‘n’ and have 13 characters. Then, when any subsequent word is detected that begins with the character ‘c’, ends with the character ‘n’ and has 13 characters, a lookup is performed in the records database to locate the record associated with the analyzed word.
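The record key can be derived directly from a word, as in the following sketch. The function names and the in-memory dictionary standing in for the records database are assumptions made for illustration, and words are assumed to be non-empty.

```python
from collections import defaultdict


def record_key(word):
    """Build a key from the first character, the last character and the length."""
    return f"{word[0]}{word[-1]}{len(word)}"


def build_records(words):
    """Group words under their record key (a stand-in for the records database)."""
    records = defaultdict(list)
    for w in words:
        records[record_key(w)].append(w)
    return dict(records)


records = build_records([
    "conflagration", "consideration", "confabulation",
    "confederation", "communication", "communization",
])
# A later word with the same first letter, last letter and length hits the same record.
print(record_key("configuration"), records.get(record_key("configuration")))
```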
The analyzed word and the generated record are transmitted to the text formatter 220 for further processing.
With reference to
The text formatter 220 receives the analyzed word from the comparison component 315 and analyses the generated record (steps 640, 700). The text formatter 220 communicates with a logic component 325 and the logic component 325 reads each weighting assigned to each character of the word and using rules determines the height, size and shape of each of the characters (step 715).
For example, the generated record for the word ‘configuration’ may be as follows:
(c) 1.00
(o) 0.72
(n) 0.88
(f) 0.88
(i) 0.81
(g) 0.96
(u) 0.79
(r) 0.90
(a) 0.72
(t) 0.85
(i) 0.72
(o) 0.72
(n) 1.00
The logic component 325 comprises rules which dictate how a character should be formatted and displayed in relation to its assigned weighting factor. In the example above, the characters ‘c’ and ‘n’ are displayed at full size, for example, font size 10. However, the characters ‘o’, ‘a’ and ‘i’ will be displayed at the smallest size relative to the other characters (step 725).
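One simple rule the logic component 325 could apply is to scale a base font size linearly by the weighting factor. The description only states that rules determine the height, size and shape of each character, so the linear scaling, the rounding to one decimal place and the function name below are assumptions.

```python
def font_sizes(weights, base_size=10.0):
    """Scale the base font size by each weighting factor (assumed linear rule)."""
    return [round(base_size * w, 1) for w in weights]


record = [1.00, 0.72, 0.88, 0.88, 0.81, 0.96, 0.79,
          0.90, 0.72, 0.85, 0.72, 0.72, 1.00]
for ch, size in zip("configuration", font_sizes(record)):
    print(f"{ch}: {size}pt")
```

With a base size of 10, the characters weighted 1.00 stay at 10.0 and those weighted 0.72 are rendered at 7.2.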
Thus the logic component 325 formats the word ‘configuration’ and the output engine 225 generates the output (step 730) shown in
In another embodiment the text formatter 220 formats the output text such that the characters are displayed along a horizontal alignment whereby a horizontal alignment takes place in the upper quartile of each character. An example is shown in
The above process is completed for each word stored in the queue, until the entire text of the document has been formatted.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In an embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
The invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer device or any instruction execution system. For the purposes of this description, a computer usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk read only memory (CD-ROM), compact disk read/write (CD-R/W), and DVD.
Improvements and modifications can be made to the foregoing without departing from the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10193605 | Dec 2010 | EP | regional |
Number | Date | Country | |
---|---|---|---|
20120141031 A1 | Jun 2012 | US |