The present invention relates generally to storage and transmission of computer data, and, more particularly, to methods of and systems for encoding and decoding small amounts of text data.
Text data compression is widely used to send very large files between computers on a network. The compression is most commonly accomplished through pattern recognition techniques which identify repeated patterns within the text data and build a translation dictionary in which various smaller sets of characters are substituted for each such pattern to thereby encode the text using less data. When transmitted, the encoded text is accompanied by the translation dictionary since the dictionary is necessary to decode the text after it is received. But, for two very good reasons, only large amounts of text data are compressed before transmission.
One reason has to do with the dearth—or even the absence—of patterns in small amounts of text data. In general, the longer the text string, the more patterns are repeated in that string.
But there is another transmission issue which discourages compression of any but quite sizable amounts of text: the translation dictionary that maps recognized repeating patterns to abbreviated representation is unique to each compressed file and therefore must be sent along with the compressed text if the text is to be decoded upon reception. Thus, conventional text compression is only cost-effective if the amount of data reduced by replacing recognized repeating patterns with abbreviated representations is sufficient to justify transmission of the dictionary that maps those patterns to their respective representations along with the abbreviated text data. This is certainly not true for most small text messages.
The consequence of the inability of conventional compression techniques to efficiently compress small texts, together with the need to send the translation dictionary along with the text, is that many common transmissions of text, including most e-mail and cell-phone texting (SMS, Short Messaging Service, messages) as well as Web page textual content, are not compressed. But, considering the daily network volume of such text, compression of these smaller text files would significantly reduce the volume of Internet traffic and would reduce the amount of storage space needed at the short message centers that ‘store and forward’ text messages over mobile phone networks. The reduced size of short text files would also reduce the amount of storage space used on the various personal and corporate computer storage media.
In accordance with the present invention, text is encoded using a scheme which, in the preferred embodiment, uses a predetermined dictionary not unique to the compressed text to substitute codes of one or more characters for words and phrases, thereby obviating transmission of the dictionary along with transmitted encoded text. In particular, the predetermined dictionary is created independently of any particular body of text. Shorter codes, including codes of a single character, are used to represent words and phrases most frequently used generally, while the generally least frequently used words and phrases are represented by longer codes. The substitution of predetermined codes for words and phrases provides substantial compression of the text data and provides significant privacy as the original text is not readily discernible from the encoded text without access to the dictionary. In effect, the dictionary can be considered a multi-megabyte encryption key.
Frequency of usage is determined generally, across a population of representative text and not from any particular body of text. As a result, the predetermined dictionary can be shared by a sender and a receiver and thereafter used to encode/decode many bodies of text traveling therebetween.
The codes of the predetermined dictionary are made of one or more text characters such that the message, once encoded, continues to be a legitimate text message. The encoded message can therefore travel through any data transport medium through which a conventional text message can travel.
During encoding of a subject body of text, words or phrases not represented in the predetermined dictionary are copied in original form into the encoded message. Any such word or phrase that can be confused with a code, e.g., is no longer than the longest code, is flagged to indicate that it is not a code. For example, the word can be prefixed with a predetermined flag such as an apostrophe. The predetermined flag is not used as an initial character of a code, thereby making all codes distinguishable from flagged words. In decoding, the flag is recognized as such and is removed from the word.
Better compression and obfuscation are achieved by recognizing and omitting common whitespace patterns. For example, a single space character can be implicit between every code of an encoded message. Adjacent codes are distinguished from one another by a marker portion of the code at one end. Such a marker can be a code character selected from a subset of code characters designated as marker characters.
In accordance with the present invention, text data is encoded and decoded by using a predetermined dictionary 116 (
Briefly, text is encoded by replacement of phrases thereof with representative codes from dictionary 116. Since the codes are generally shorter than the represented phrases, such encoding results in compression of the text. Conversely, decoding the message by replacing codes in the encoded message with phrases represented by the respective codes results in decompression and restoration of the text.
Dictionary 116 is predetermined in that dictionary 116 does not depend upon the particular text being encoded; that is, dictionary 116 is known before any given message to be encoded by use of dictionary 116 is known. Dictionary 116 is designed to represent, with much shorter codes, phrases commonly used across all text likely to be compressed. Since dictionary 116 is predetermined and not constructed from the text to be encoded, there is no need to transmit dictionary 116 along with the encoded text. As a result, short messages that could not be adequately compressed to justify adding a dictionary to the data payload can now be effectively and significantly compressed.
As used herein, a “word” is any string of word characters delimited by non-word characters. Designation of characters as word characters or non-word characters is somewhat arbitrary in that the encoding and decoding methods described herein do not rely on any specific characters being in either set, so long as the two sets are mutually exclusive. As used herein, a “phrase” is a collection of one or more words delimited by one or more non-word characters; thus, a single word can be a “phrase” as defined herein.
It is not necessary that phrases represented in dictionary 116 are English phrases or even phrases of words recognizable as such to human readers. For example, common domain names used in links that can be frequently included in text messages can be recognized by the system described herein as a “phrase.” For example, in the embodiment described more completely below, non-word characters include periods and forward slashes. As a result, a common portion of a Web site URL can be recognized as a phrase. The URL “http://tinyurl.com/abc123” includes a relatively common leading phrase, namely, “http://tinyurl.com”: “http:” as the first word, followed by “//” as whitespace (a string of one or more non-word characters), followed by “tinyurl” as a second word, followed by yet more whitespace (“.”), ending with the word, “com”, and finally delimited from the phrase that follows by a “/” non-word character.
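The word/whitespace split of the URL example can be sketched in Python as follows. This is only an illustration under a stated assumption: the actual word-character set is defined by tables not reproduced here, so the sketch treats ASCII letters, digits, and the colon as word characters and everything else as non-word characters. The names `WORD_RUN` and `split_words_and_whitespace` are hypothetical.

```python
import re

# Assumption for illustration only: word characters are ASCII letters,
# digits, and the colon; every other character is a non-word character.
WORD_RUN = re.compile(r"[A-Za-z0-9:]+")


def split_words_and_whitespace(text):
    """Split text into alternating runs tagged 'word' or 'whitespace'."""
    tokens = []
    pos = 0
    for match in WORD_RUN.finditer(text):
        if match.start() > pos:
            # Non-word characters between two words form "whitespace".
            tokens.append(("whitespace", text[pos:match.start()]))
        tokens.append(("word", match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("whitespace", text[pos:]))
    return tokens


tokens = split_words_and_whitespace("http://tinyurl.com/abc123")
for kind, value in tokens:
    print(kind, repr(value))
```

Under this assumed character set, the first five tokens together make up the leading phrase “http://tinyurl.com”, and the “/” that follows delimits it from the next phrase.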
During encoding, phrases are replaced by their associated codes as represented in dictionary 116. Phrases of the subject text not found in dictionary 116 are not represented by a code, but are instead included in the compressed text data in their original form. Phrases that are short enough to be confused with or otherwise capable of being confused with a code representing a compressed phrase are distinguished as such by the insertion during encoding of a specified character, designated as a quotation flag and not used in the codes or, in alternative embodiments, just not used as first character of a code. Any such quotation flag is removed during decoding as described in greater detail below.
The characters used as code characters are characters from the character set used in the particular text data to be encoded and decoded. Typically, the character set can be selected from character sets used on mobile phone networks and the Internet. Generally, any character set can be used. The entirety of the particular character set used is divided into word characters and non-word characters. Codes are constructed from one or more word characters except for a few word characters that are reserved as flags. Because non-word characters are not used in codes, non-word characters remain an effective delimiter of words, phrases, and codes. In some embodiments, codes can include flags as word characters so long as the flag is not the first character of the code. In this illustrative embodiment, flags are included as prefixes and can therefore serve as second or subsequent characters of codes.
Since the same encoding translation dictionary—e.g., dictionary 116—is used both for encoding and decoding of all text, any computer device on which both the encoding translation dictionary and the encoding/decoding logic are resident can decode any message received from another computer device encoded with the same encoding translation dictionary and the same encoding/decoding logic without requiring transmission of dictionary 116 along with the message.
The encoding/decoding process described more completely herein reduces text data of almost any size and is especially useful in reducing the size of small amounts of text data, including those commonly seen in SMS messages, instant messages, e-mail, and Web text. Even text messages of only a single word can often be compressed by a substantial amount using the encoding techniques described herein.
Before describing the encoding and decoding of textual messages in accordance with the present invention, some elements of a computer 100 (
CPU 108 and memory 106 are connected to one another through a conventional interconnect 110, which is a bus in this illustrative embodiment and which connects CPU 108 and memory 106 to one or more input devices 102 and/or output devices 104 and network access circuitry 122. Input devices 102 can include, for example, a keyboard, a keypad, a touch-sensitive screen, a mouse, and a microphone. Output devices 104 can include a display, such as a liquid crystal display (LCD), and one or more loudspeakers. Network access circuitry 122 sends and receives text data through a wide area network such as the Internet and/or mobile device data networks.
A number of components of computer 100 are stored in memory 106. In particular, text entry logic 112, encoding logic 118, and decoding logic 120 are each all or part of one or more computer processes executing within CPU 108 from memory 106 in this illustrative embodiment but can also be implemented using digital logic circuitry. As used herein, “logic” refers to (i) logic implemented as computer instructions and/or data within one or more computer processes and/or (ii) logic implemented in electronic circuitry. Character images 114 and dictionary 116 are data stored persistently in memory 106. In this illustrative embodiment, character images 114 and dictionary 116 are each organized as a respective database.
An encoding translation dictionary used for text transmission, e.g., dictionary 116, can be constructed for any of many different character sets. In this illustrative embodiment, computer 100 is intended to send brief text messages through SMS networks and/or the Internet. Accordingly, the most useful character sets are those commonly used in transmission of text on mobile phones and the Internet.
The ASCII character set is a subset of the default character set GSM 03.38 used for transmission of text on mobile phone networks in Europe and North America and in parts of Africa, Asia, and the Pacific Islands. Any encoding which uses only characters from the character set GSM 03.38 or a subset of character set GSM 03.38 will be accurately transmitted wherever GSM 03.38 is the character set used for text transmission of an encoded file. In a preferred embodiment, eighty-five (85) displayable ASCII characters, a subset of GSM 03.38, are used as potential word characters. Other embodiments can use different character sets.
In this illustrative embodiment, encoding logic 118, decoding logic 120, and dictionary 116 share a categorization of every character that can appear in text to be compressed/restored as (i) a word character, (ii) a non-word character, or (iii) a flag character. Flag characters are word characters but are excluded from use as the first character of a code.
All characters that can be included in text to be compressed that are not listed in Table A above or Table B below are considered non-word characters.
In this illustrative embodiment, codes used to represent phrases are made from one or more word characters. Dictionary 116 maps these codes to phrases represented by the respective codes. As used herein, a dictionary is a computer-readable data structure that maps individual data elements to equivalent respective data elements. In this embodiment, codes are individual data elements and the equivalent respective data elements are those phrases represented by the respective codes.
These eighty-five (85) single-byte ASCII characters are used (i) as single-character codes to encode the most frequently used phrases, (ii) in groups of two to form two-character codes to encode somewhat less frequently used phrases, and (iii) in groups of three to form three-character codes to encode even less frequently used phrases.
Using the eighty-five (85) word characters listed above, eighty-five (85) unique single-character codes can be used to represent eighty-five (85) phrases; 7,225 unique two-character codes can be used to represent 7,225 additional phrases; and 614,125 unique three-character codes can be used to represent 614,125 additional phrases. In embedded system embodiments, such as in mobile telephony devices, it may be desirable to limit the size of dictionary 116. Accordingly, dictionary 116 can be limited to codes with a maximum length of two characters, to codes with a maximum length of three characters, or to a maximum number of entries as illustrative examples. In the latter instance, dictionary 116 can be limited to at most 40,000 three-character codes, for example. Where resources permit, larger numbers of codes represented in dictionary 116 tend to provide better rates of encoding. It should be appreciated that codes of four (4) or more characters in length can also be used to store even greater numbers of entries within dictionary 116.
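The code counts above follow directly from the size of the word-character alphabet, as the following short Python sketch shows. It mirrors the simple counting used in the text and, as an assumption, ignores any reduction caused by reserving flag characters from the first position of a code. The function names are illustrative only.

```python
ALPHABET_SIZE = 85  # number of word characters stated in the text


def codes_of_length(n, alphabet_size=ALPHABET_SIZE):
    """Number of distinct codes that are exactly n characters long."""
    return alphabet_size ** n


def total_codes(max_length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct codes of length 1 through max_length."""
    return sum(codes_of_length(n, alphabet_size) for n in range(1, max_length + 1))


for n in (1, 2, 3):
    print(n, codes_of_length(n), total_codes(n))
```

Limiting codes to two characters therefore caps the dictionary at 85 + 7,225 = 7,310 entries, while allowing three characters raises the cap to 621,435.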
In this illustrative example, a mobile telephone 202 (
An overview of text encoding and decoding according to the present invention is shown in logic flow diagram 300 (
It should be appreciated that, since encoded message 404 includes only characters that can be used in conventional SMS messages, encoded message 404 can travel through network 406 and short message center 408 without requiring any modification to network 406 or short message center 408. In tests using codes with no more than two characters (only about 7,300 codes representing only about 7,300 respective phrases expected to appear frequently in messages generally), SMS messages have been compressed at ratios of about 1.7:1. As a result, on average, message 402 can be 70% longer than the conventional maximum message length for SMS. In addition, SMS traffic through network 406 and short message center 408 is reduced by approximately 41%. In embodiments which permit larger code sets and dictionary sizes, even greater resource savings are possible.
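The two percentages quoted above both follow from the 1.7:1 compression ratio. A minimal Python sketch of the arithmetic, with illustrative variable names:

```python
compression_ratio = 1.7  # original size : encoded size, as reported in the text

# A message may grow by (ratio - 1) before its encoded form hits the limit.
extra_message_length = compression_ratio - 1

# Encoded traffic is 1/ratio of the original, so the saving is 1 - 1/ratio.
traffic_reduction = 1 - 1 / compression_ratio

print(f"message can be about {extra_message_length:.0%} longer")
print(f"traffic reduced by about {traffic_reduction:.0%}")
```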
The intended recipient is a mobile telephony device 420 (
At this point, decoded message 412 is stored in the intended recipient as any conventional SMS message is stored once received. In step 318 (
The encoding and decoding of the message “nothing could be finer than to meet you in the diner” serves as an illustrative example of text message 402. Step 308 is shown in greater detail as logic flow diagram 308 (
In step 502, encoding logic 118 (
In step 504 (
Loop step 506 and next step 518 define a loop in which encoding logic 118 performs steps 508-516 until no characters of text message 402 remain to be processed.
In step 508, encoding logic 118 finds the longest phrase at the beginning of text message 402 (
In test step 510, encoding logic 118 determines whether any code was found for a phrase at the beginning of text message 402. If so, encoding logic 118 appends that code to encoded message 404 and removes the corresponding phrase from the beginning of text message 402 in step 512 (
Conversely, if encoding logic 118 determines in test step 510 that no code of dictionary 116 represents any phrase at the beginning of text message 402, encoding logic 118 moves a single word from the beginning of text message 402 to the end of encoded message 404 in step 514. It is possible that the single word is a legitimate code. For example, given that codes are strings of one, two, or three word characters in this illustrative embodiment, any word that is not longer than three characters could be a legitimate code. In such a case, encoding logic 118 prepends a quotation flag to the word in encoded message 404 to distinguish the word from a code. For example, if dictionary 116 contains no code for “In” and text message 402 includes the word “In”, encoding logic 118 prepends a quotation flag (an apostrophe in this illustrative embodiment) to the word as appended to encoded message 404, i.e., “'In”.
After either step 512 (
Processing then transfers through next step 518 (
Step 508, in which a code for the longest of a number of phrases at the beginning of text message 402 is retrieved from dictionary 116, is shown in greater detail as logic flow diagram 508 (
Using the example text message, the phrases would be “nothing”, “nothing could”, “nothing could be”, “nothing could be finer”, and “nothing could be finer than”. Encoding logic 118 preserves all whitespace embedded in the phrases. For example, if there were two spaces between “nothing” and “could”, encoding logic 118 includes both spaces between those words in the various phrases.
Loop step 604 (
In test step 606, encoding logic 118 requests retrieval from dictionary 116 of a code representing the particular phrase being processed in the current iteration of the loop of steps 604-610, which is sometimes referred to as “the subject phrase” in the context of logic flow diagram 508. If a code is successfully retrieved from dictionary 116, logic flow diagram 508 returns the retrieved code in step 608 and that code is processed by encoding logic 118 in step 512 (
Conversely, if no code is successfully retrieved from dictionary 116 in test step 606, processing by encoding logic 118 transfers through next step 610 to loop step 604 in which the next longest phrase collected in step 602 is processed according to steps 606-608 in the manner described above.
Once all phrases collected by encoding logic 118 have been processed according to the loop of steps 604-610 and no iterations thereof cause early termination through step 608, processing transfers to step 612. In step 612, encoding logic 118 has determined that none of the phrases collected in step 602 are represented in dictionary 116 and therefore returns the shortest of the collected phrases, e.g., a single word in this illustrative embodiment, as the text to be appended to encoded text 404.
It should be appreciated that, by trying to maximize the length of phrases replaced by codes of dictionary 116, greater encoding ratios are realized. To use this illustrative example, it is preferable to replace “nothing could be” with a single code than “nothing” if “nothing could be” and “nothing” are both found in dictionary 116 as phrases that can be represented with a code.
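The longest-phrase-first encoding loop described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary contents and codes are invented, word characters are assumed to be ASCII letters and digits, candidate phrases are limited to five words as in the example, and the names `DICTIONARY`, `candidate_phrases`, and `encode` are hypothetical.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")  # assumed word characters
MAX_WORDS = 5        # longest candidate phrase, as in the example
MAX_CODE_LEN = 3     # longest code in the illustrative embodiment
QUOTE_FLAG = "'"     # quotation flag (apostrophe)

# Hypothetical dictionary entries, invented for illustration.
DICTIONARY = {
    "nothing could be": "Ng",
    "than": "t",
    "to meet you": "mY",
    "in the": "i",
}


def candidate_phrases(text):
    """Phrases of 1..MAX_WORDS words at the start of text, longest first."""
    phrases = []
    pos = 0
    for _ in range(MAX_WORDS):
        match = WORD.match(text, pos)
        if match is None:
            break
        phrases.append(text[:match.end()])  # embedded whitespace is kept
        pos = match.end()
        while pos < len(text) and not WORD.match(text, pos):
            pos += 1  # skip the whitespace before the next word
    return phrases[::-1]


def encode(text):
    out = []
    pos = 0
    while pos < len(text):
        if WORD.match(text, pos) is None:
            out.append(text[pos])  # whitespace is copied unchanged
            pos += 1
            continue
        phrases = candidate_phrases(text[pos:])
        for phrase in phrases:
            code = DICTIONARY.get(phrase)
            if code is not None:
                out.append(code)
                pos += len(phrase)
                break
        else:
            word = phrases[-1]  # no phrase matched: emit a single word
            if len(word) <= MAX_CODE_LEN:
                out.append(QUOTE_FLAG)  # word could be mistaken for a code
            out.append(word)
            pos += len(word)
    return "".join(out)


encoded = encode("nothing could be finer than to meet you in the diner")
print(encoded)
```

With these invented entries, “nothing could be” is replaced by one code even though shorter prefixes might also be present, which is the preference for longer phrases noted above.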
In this illustrative embodiment, encoding logic 118 ensures that every character of text message 402 is represented in encoded message 404. This includes superfluous whitespace and character case and misspellings. To preserve these characteristics of text message 402, phrases represented in dictionary 116 are case-specific and whitespace-specific. As an example, consider the example text message, “Hi.  My name is ‘Jim.’” In this illustrative example, spaces, periods, and apostrophes are non-word characters and therefore are considered “whitespace” by encoding logic 118. “Hi” would not be matched by “hi” and, to be represented in dictionary 116, would require a separate entry for “Hi” in dictionary 116 in this illustrative embodiment. Similarly, the phrase “Hi.  My” would require an entry in dictionary 116 that matches case and includes exactly a period followed by two spaces between “Hi” and “My”.
There are a number of variations that can ameliorate this problem of message variations, one of which is illustrated as logic flow diagram 605 (
Loop step 702 (
In test step 704, encoding logic 118 determines whether the particular flag pattern being processed in the current iteration of the loop of steps 702-710, which is sometimes referred to in the context of logic flow diagram 605 as “the subject flag pattern,” matches the subject phrase. If not, processing by encoding logic transfers through next step 710 to loop step 702 and encoding logic 118 processes the next flag pattern.
Conversely, if the subject flag pattern matches the subject phrase, processing by encoding logic 118 transfers to step 706. In step 706, encoding logic 118 canonicalizes the subject phrase. In both the initial capitals and the all capitals flag patterns, the canonical form of the phrase is all lowercase. The phrase as canonicalized is used in test step 606 when retrieving a matching code from dictionary 116.
In step 708, encoding logic 118 asserts the flag of the subject flag pattern. Step 608 (
If no flag pattern matches the subject phrase, processing by encoding logic 118 according to logic flow diagram 605 neither modifies the subject phrase nor asserts any flag as neither step 706 nor step 708 is performed for the subject phrase.
Thus, with little added payload of the occasional flag character, a single entry in dictionary 116 can represent a number of variations of phrases. For example, consider that the code, “Ng”, represents “nothing could be” in dictionary 116. The flagged code, “_Ng”, represents “Nothing Could Be”, and the flagged code, “^Ng”, represents “NOTHING COULD BE”.
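The flag-pattern canonicalization on the encoding side can be sketched in Python as follows. The sketch assumes “_” marks initial capitals and “^” marks all capitals, as in the example codes, and uses an invented one-entry dictionary; the names `FLAG_PATTERNS`, `canonicalize`, and `encode_phrase` are hypothetical.

```python
import re

DICTIONARY = {"nothing could be": "Ng"}  # hypothetical entry for illustration


def _initial_caps(text):
    """Capitalize the first letter of every word and lowercase the rest."""
    return re.sub(r"[A-Za-z]+", lambda m: m.group().capitalize(), text)


# Each flag pattern pairs a flag character with a test on the phrase.
FLAG_PATTERNS = [
    ("_", lambda p: p != p.lower() and p == _initial_caps(p.lower())),
    ("^", lambda p: p != p.lower() and p == p.upper()),
]


def canonicalize(phrase):
    """Return (flag, canonical phrase); flag is '' when no pattern matches."""
    for flag, matches in FLAG_PATTERNS:
        if matches(phrase):
            return flag, phrase.lower()
    return "", phrase


def encode_phrase(phrase):
    """Return the (possibly flagged) code for phrase, or None if absent."""
    flag, canonical = canonicalize(phrase)
    code = DICTIONARY.get(canonical)
    return None if code is None else flag + code


for phrase in ("nothing could be", "Nothing Could Be", "NOTHING COULD BE"):
    print(phrase, "->", encode_phrase(phrase))
```

A mixed-case variant such as “Nothing could be” matches neither pattern and so is looked up unchanged, which fails with this one-entry dictionary.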
Another variation that can ameliorate this problem of message variations is canonicalization of whitespace. Consider the example in which text message 402 includes two spaces between “nothing” and “could”. In this illustrative alternative embodiment, once encoding logic 118 has determined that “nothing  could be” (with two spaces between “nothing” and “could”) is not represented within dictionary 116, encoding logic 118 recognizes the double space characters within the phrase and searches dictionary 116 for the same phrase with only single space characters between words. In this example, encoding logic 118 finds such a phrase with whitespace therein so canonicalized. Encoding logic 118 assumes that the phrase found in dictionary 116 is the phrase intended by the author of text message 402 (
When decoding logic 120 decodes a message encoded in this manner, the double space characters are not restored between “nothing” and “could.” Accordingly, this form of text compression is lossy. However, this very limited sort of lossiness in text compression can be acceptable in some contexts, particularly informal contexts such as text messaging between mobile telephony devices.
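A minimal Python sketch of this lossy whitespace fallback is shown below. The dictionary entry is invented and the function name `lookup_with_canonical_whitespace` is hypothetical; only runs of space characters are collapsed, which is an assumption about how narrowly the canonicalization applies.

```python
import re

DICTIONARY = {"nothing could be": "Ng"}  # hypothetical entry for illustration


def lookup_with_canonical_whitespace(phrase, dictionary=DICTIONARY):
    """Try an exact lookup first, then retry with runs of spaces collapsed."""
    code = dictionary.get(phrase)
    if code is None:
        # Lossy step: the original spacing cannot be restored on decoding.
        code = dictionary.get(re.sub(r" {2,}", " ", phrase))
    return code


print(lookup_with_canonical_whitespace("nothing  could be"))
```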
As described above, decoding logic 120 (
In step 802, decoding logic 120 initializes decoded message 412 to be an empty text string. In addition, decoding logic 120 makes a disposable copy of encoded message 410 if encoded message 410 is to be preserved. Alternatively, decoding logic 120 can use pointers to simulate removal of characters from encoded message 410.
In step 804, (
Loop step 806 (
In test step 808 (
In step 810 (
Conversely, if the first word of encoded message 410 is not a code, processing by decoding logic 120 transfers from test step 808 (
After either step 810 (
Processing transfers through next step 816 (
Upon completion of processing of encoded message 410 according to the loop of steps 806-816 (
To properly decode codes prefixed with flags in the manner described above with respect to logic flow diagram 605 (
Loop step 902 (
In test step 904 (
In step 906 (
Continuing in the examples above, processing of the flagged code, “_Ng”, by decoding logic 120 according to logic flow diagram 809 results in recognition by decoding logic 120 of “_” as an initial capital flag in test step 904; retrieval of “nothing could be” from dictionary 116 using the code, “Ng”, in step 906; and restoration of the initial capitalization in step 908 to reconstruct “Nothing Could Be” as the represented text.
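Decoding of flagged codes and quoted words can be sketched in Python as follows. The code table is invented, word characters are assumed to be ASCII letters and digits plus the three flag characters, and the names `CODES`, `RESTORE`, `decode_token`, and `decode` are hypothetical.

```python
import re

CODES = {"Ng": "nothing could be", "t": "than"}  # hypothetical entries
QUOTE_FLAG = "'"
# Assumed flags: "_" restores initial capitals, "^" restores all capitals.
RESTORE = {"_": str.title, "^": str.upper}
# A token is an optional leading flag followed by letters and digits.
TOKEN = re.compile(r"['_^]?[A-Za-z0-9]+")


def decode_token(token):
    if token[0] == QUOTE_FLAG:
        return token[1:]  # quoted literal word: drop the flag
    if token[0] in RESTORE:
        phrase = CODES.get(token[1:])
        if phrase is None:
            return token  # unknown code: leave untouched
        return RESTORE[token[0]](phrase)
    return CODES.get(token, token)  # plain code, or an unflagged literal word


def decode(message):
    """Replace each token in place; whitespace between tokens is preserved."""
    return TOKEN.sub(lambda m: decode_token(m.group()), message)


print(decode("_Ng finer t 'be"))
```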
As described above, whitespace (any non-word characters) that is not embedded within a phrase is not encoded and is, instead, included in encoded messages 404 (
Improved compression rates can be realized in some embodiments by run-length encoding whitespace. In particular, long strings of non-word characters tend to consist of long runs of a single, repeated non-word character. As a result, run-length encoding can be an effective tool in mitigating the otherwise incompressible nature of whitespace in the techniques described herein.
Run-length encoding is well-known and is not described herein except in the context of an illustrative embodiment for run-length encoding whitespace by encoding logic 118 and decoding logic 120.
First, it should be appreciated that there is no need to run-length encode whitespace within a phrase already represented in dictionary 116. Suppose, for example, that “wait.....for.....just.....one.....minute” appeared so frequently in text messages that the phrase is represented in dictionary 116 and associated with a code of 1-3 characters in length. That code would represent the entirety of the phrase, including the four (4) strings of five (5) periods. Accordingly, there would be virtually no incentive to use run-length encoding within phrases stored in dictionary 116. One possible exception might be to reduce the size of dictionary 116 itself by compressing phrases stored therein. However, strings of repeated characters tend to appear in text so rarely as to be unlikely to significantly reduce the size of dictionary 116.
Thus, excluding whitespace embedded in encoded phrases, whitespace is handled by encoding logic 118 only in steps 504 (
Steps 504 (
Run-length encoding by encoding logic 118 in step 1004 deviates from conventional run-length encoding. For example, encoding logic 118 excludes at least one non-word character at the end of the whitespace from run-length encoding such that the trailing non-word character delimits the next word in text message 402. Consider the example text, “Wait......20minutes.” The six (6) periods could be run-length encoded as “.6” but that would result in “Wait.620minutes.” But, since numerals are word characters, it would not be entirely clear whether that should be decoded as six (6) periods followed by “20minutes”, sixty-two (62) periods followed by “0minutes”, or six hundred and twenty (620) periods followed by “minutes.” By contrast, “Wait.5.20minutes.”, in which “.5” encodes five (5) periods and the sixth, trailing period is left unencoded as a delimiter, is more easily recognizable as the first interpretation.
However, such is not the end of the ambiguity. A message like “Wait.5.minutes.” can be the result of run-length encoding the periods of “Wait......minutes.” or can be the literal text “Wait.5.minutes.” to which no run-length encoding was applied. Visible punctuation is used in these examples to assist the reader in following the examples where counting non-visible non-word characters (e.g., a space character) would be a challenge.
To remove such ambiguity, encoding logic 118 treats a word that includes only numerals as one that requires a quotation flag prefix. Accordingly, encoding “Wait.5.minutes.” would result in the word, “5”, being prefixed with an apostrophe quotation flag whereas encoding “Wait......minutes.” would result in the run-length encoded six (6) periods being represented as “.5.”, i.e., without the apostrophe quotation flag prefix on “5”.
In addition, there is no size reduction in run-length encoding a string of fewer than four (4) repeated non-word characters. For example, “.” could not be run-length encoded as there is no additional non-word character to follow the run-length encoded whitespace; “..” would require an additional character to run-length encode as “.1.”; and “...” would require the same number of characters to run-length encode as “.2.”. In addition, “.0.” would be meaningless as a run-length encoded string in this embodiment. Accordingly, the words “0”, “1”, and “2” would require no quotation flag as they would not appear in run-length encoded whitespace.
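The run-length scheme for whitespace can be sketched in Python as follows. This is an illustration under stated assumptions: word characters are taken to be ASCII letters, digits, and the apostrophe quotation flag; only runs of four or more identical non-word characters are encoded; and, for simplicity, every numeral-only word is flagged, including “0”, “1”, and “2”, which the text notes could be left unflagged. The function names are hypothetical.

```python
import re

QUOTE_FLAG = "'"
# Four or more repetitions of one non-word character.
RUN = re.compile(r"([^A-Za-z0-9'])\1{3,}")
# A word made only of numerals (not part of a longer word, not yet flagged).
NUMERAL_WORD = re.compile(r"(?<![A-Za-z0-9'])\d+(?![A-Za-z0-9])")
# An encoded run: character, count, and the same character as delimiter.
ENCODED_RUN = re.compile(r"([^A-Za-z0-9'])(\d+)\1")
FLAGGED_NUMERAL = re.compile(r"(?<![A-Za-z0-9'])'(\d+)(?![A-Za-z0-9])")


def encode_runs(text):
    # Flag literal numeral-only words first so they cannot look like counts.
    flagged = NUMERAL_WORD.sub(lambda m: QUOTE_FLAG + m.group(), text)
    # A run of N characters becomes: character, N - 1, trailing character.
    return RUN.sub(
        lambda m: f"{m.group(1)}{len(m.group()) - 1}{m.group(1)}", flagged
    )


def decode_runs(text):
    expanded = ENCODED_RUN.sub(
        lambda m: m.group(1) * (int(m.group(2)) + 1), text
    )
    return FLAGGED_NUMERAL.sub(r"\1", expanded)  # drop the quotation flags


for sample in ("Wait......20minutes.", "Wait.5.minutes."):
    print(sample, "->", encode_runs(sample))
```

Because unflagged numerals between two identical non-word characters can only be counts, the decoder can tell the two readings of “Wait.5.minutes.” apart.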
Steps 804 (
In this illustrative messaging embodiment, dictionary 116 is populated using a training set 1230 (
This population of dictionary 116 is performed using dictionary optimization logic 1212 which is generally not needed in the encoding and decoding of messages in the manner described above. Accordingly, optimization logic 1212 is shown to be included in a different computer system 1200, such as a computer used in the development and implementation of encoding logic 118 and decoding logic 120.
Most of the components of computer 1200 are directly analogous to components of computer 100 (
Logic flow diagram 1300 (
In the example given above with respect to logic flow diagram 308 (
Once encoding logic 1218 has encoded and compressed the text messages of training set 1230, dictionary 1216 contains usage statistics for all phrases represented in dictionary 1216 and unfound phrases database 1228 contains usage statistics for all phrases searched for without success in dictionary 1216.
In step 1304 (
This expected relative size reduction is based on the size reduction realized for each substitution of the subject phrase with a code representing it. This difference is sometimes referred to as a “single-use reduction” and takes into consideration the use of quotation flags if necessary and the length of the code. For example, a single-use reduction for “be” if represented by a single-character code is two (2): three (3) (the length of “be” prefixed with a quotation flag) less one (1) (the length of the single-character code). Similarly, the single-use reduction for “nothing could be” if represented by a two-character code is fourteen (14): the length of “nothing could be” (16) less the length of the two-character code (2).
To determine a phrase's expected relative size reduction, the phrase's single-use reduction is multiplied by the number of times the phrase appeared in the text messages of training set 1230.
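The single-use reduction and expected relative size reduction can be computed as in the following Python sketch. It assumes that a phrase needs a quotation flag only when it is a single word no longer than the longest code (three characters here); the usage count in the example is invented, and the function names are hypothetical.

```python
MAX_CODE_LEN = 3  # longest code in the illustrative embodiment


def needs_quote_flag(phrase):
    """Assumed rule: a lone word short enough to be mistaken for a code."""
    return phrase.isalnum() and len(phrase) <= MAX_CODE_LEN


def single_use_reduction(phrase, code_length):
    """Characters saved each time the phrase is replaced by its code."""
    literal_length = len(phrase) + (1 if needs_quote_flag(phrase) else 0)
    return literal_length - code_length


def expected_reduction(phrase, code_length, usage_count):
    """Single-use reduction multiplied by the number of uses observed."""
    return single_use_reduction(phrase, code_length) * usage_count


print(single_use_reduction("be", 1))                # flagged 'be (3) - 1
print(single_use_reduction("nothing could be", 2))  # 16 - 2
print(expected_reduction("be", 1, 100))             # hypothetical count
```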
In step 1306 (
After step 1306, dictionary 1216 includes in its limited number of entries those phrases most likely to provide greatest rates of data encoding when used to encode messages of a type modeled by training set 1230. This population of dictionary 1216 can be repeated as new statistics become available or can be repeated as training set 1230 is updated to periodically fine-tune dictionary 1216.
The entries of dictionary 1216, less the statistics, are included in dictionary 116 (
It should be appreciated that dictionary optimization logic 1212 determines expected relative size reduction in a way that favors greatest encoding ratios over large numbers of text messages. In particular, some very long phrases are used just frequently enough to represent greater aggregate data reduction than far more frequently used short phrases. As a result, text messages encoded in the manner described above with dictionaries populated in this manner may often be compressed only slightly or not at all, while other messages are compressed to a much larger extent and often enough to reduce overall data sizes of messages in aggregate.
In other embodiments, it may be preferable to maximize reduction of each message such that senders can include more information in each message despite a hard limit on the maximum size of a message. In such embodiments, other expected relative size reductions, or “value” within an encoding model, of each phrase can be determined and compared for determining which phrases are included in the limited number of entries in dictionary 1216.
In such embodiments, expected relative size reduction is not linear with respect to usage but can instead grow as a power of usage, for example. In one embodiment, expected relative size reduction is determined as the single-use reduction multiplied by the usage frequency of the subject phrase, with the usage frequency raised to a power greater than one (1.3, for example). To increase the effect of usage frequency of a phrase relative to the phrase's single-use reduction, higher exponents are used. And, conversely, to increase the effect of a phrase's single-use reduction relative to the phrase's usage frequency, lower exponents are used.
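A minimal sketch of this weighting follows, with the exponent applied to the usage frequency. The function name and the sample figures (a long phrase saving 14 characters and used 10 times, a short phrase saving 2 characters and used 60 times) are hypothetical and chosen only to show how the ranking of two phrases can change with the exponent.

```python
def weighted_value(single_use: int, usage: float, exponent: float = 1.3) -> float:
    """Single-use reduction multiplied by usage frequency raised to `exponent`."""
    return single_use * usage ** exponent


# Hypothetical candidates as (single-use reduction, usage frequency).
long_phrase = (14, 10)
short_phrase = (2, 60)

for exponent in (1.0, 1.3):
    long_value = weighted_value(*long_phrase, exponent)
    short_value = weighted_value(*short_phrase, exponent)
    winner = "long" if long_value > short_value else "short"
    print(f"exponent={exponent}: long={long_value:.1f} short={short_value:.1f} -> {winner}")
```

With an exponent of 1.0 the long, rarely used phrase ranks higher in this example; with 1.3 the short, frequently used phrase overtakes it.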
As described above, dictionary 116 does not include usage statistics in the illustrative embodiment. In other embodiments, dictionary 116 does include such usage statistics maintained by encoding logic 118 in the manner described with respect to encoding logic 1218, except that encoding logic 118 also records the total number of messages encoded for normalization of usage statistics relative to other instances of encoding logic 118. In such an embodiment, encoding logic 118 is configured to periodically report usage statistics to dictionary optimization logic 1212 for subsequent use in improving dictionary 1216 in the manner described above with respect to steps 1304 and 1306.
Even more efficient compression can be realized by recognizing that most whitespace between words and phrases in text messages consists of a single space character and making such a space character merely implicit in encoded text. This embodiment is represented by logic flow diagrams 308B (
To start, word characters are divided into mutually exclusive sets of initial code characters and subsequent code characters. Initial code characters can only be the first character of a code and subsequent code characters can only be a second or subsequent character of a code. Generally, in this embodiment, the total number of codes that can be represented with a given maximum number of characters is maximized when word characters are nearly evenly divided between initial code characters and subsequent code characters.
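The effect of the split on the number of available codes can be explored with a short sketch. The counting model here is an assumption: a code is one initial code character followed by zero or more subsequent code characters, and every such string is counted as usable. The figure of 62 word characters (letters and digits) is hypothetical.

```python
def code_count(n_initial: int, n_subsequent: int, max_len: int) -> int:
    """Number of codes of length 1..max_len: one initial character
    followed by zero or more subsequent characters."""
    return sum(n_initial * n_subsequent ** (length - 1)
               for length in range(1, max_len + 1))


def best_split(total_chars: int, max_len: int) -> int:
    """Number of initial code characters that maximizes code_count."""
    return max(range(1, total_chars),
               key=lambda n: code_count(n, total_chars - n, max_len))


TOTAL = 62  # hypothetical: 26 lowercase + 26 uppercase + 10 digits
n = best_split(TOTAL, max_len=2)
print(n, TOTAL - n, code_count(n, TOTAL - n, 2))
```

With a maximum of two characters, the count under this model peaks at an even split; the same function can be used to check where the peak falls for other maximum lengths.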
Since only about half of all word characters are used in this embodiment as initial code characters, only about half as many single-character codes are available relative to embodiments such as those described above in which whitespace is preserved between codes. The number of available 2- and 3-character codes is likewise dramatically reduced. However, since much of the whitespace between codes can be omitted from encoded text, 2-character codes occupy as much of encoded text as single-character codes in embodiments in which the single-space character between codes is preserved. Thus, it is currently believed that the embodiment described in conjunction with
When space characters between codes are omitted, the start of a code is recognized as an initial code character that is optionally preceded by a flag. Accordingly, flags are excluded from the set of subsequent code characters. However, flags that apply to unencoded phrases and not to codes (such as the quotation flag) can be included in the set of subsequent code characters.
Logic flow diagram 308B (
In step 1402 (
If the leading whitespace is not a single space character, processing transfers to step 516 in which the leading whitespace is moved to encoded message 404 in the manner described above. Thus, any whitespace other than a single space character is not omitted between codes. Conversely, if the leading whitespace is a single space character, processing transfers to step 1406.
In step 1406, encoding logic 118 (
Thus, after processing of a code that represents a phrase of text message 402, a single space character separating the code from the following phrase is not immediately copied to encoded text 404 but is instead remembered for subsequent processing. If the next phrase is represented by a code, processing of that phrase includes steps 512, 1402, 1404, and 1406, and the single space character is omitted from encoded text 404. The result is that contiguous codes are not separated by single space characters. Such separation is implicit only.
When a phrase of message text 402 is not represented by a code, processing transfers from test step 510 to step 1408. In step 1408, encoding logic 118 (
After step 1406, processing transfers to step 514, and encoding logic 118 (
The result is that, in encoded text 404, adjacent codes for phrases that were separated by a single space character in message text 402 are represented contiguously. The adjacent codes are separated from any unencoded text preceding or following the codes by any whitespace found in message text 402, including single space characters.
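The whitespace rule summarized above can be sketched as follows. This is a simplified illustration rather than the flow of the logic flow diagram: it assumes the message has already been segmented into phrases, each paired with its leading whitespace, and it ignores flags entirely. The dictionary is a hypothetical toy; only the association of “nothing could be” with “Ng” comes from the text.

```python
# Hypothetical toy dictionary; only "nothing could be" -> "Ng" appears in the text.
TOY_DICTIONARY = {"nothing could be": "Ng", "later": "L", "see you": "Sy"}


def encode_segments(segments):
    """Encode (leading_whitespace, phrase) pairs.

    A single space between two adjacent codes is omitted; all other
    whitespace is copied to the output unchanged.
    """
    out = []
    previous_was_code = False
    for whitespace, phrase in segments:
        code = TOY_DICTIONARY.get(phrase)
        if code is not None:
            # Keep the whitespace unless it is exactly one space after a code.
            if not (previous_was_code and whitespace == " "):
                out.append(whitespace)
            out.append(code)
            previous_was_code = True
        else:
            # Unencoded text keeps whatever whitespace preceded it.
            out.append(whitespace)
            out.append(phrase)
            previous_was_code = False
    return "".join(out)


message = [("", "nothing could be"), (" ", "later"), ("  ", "see you"), (" ", "Bob")]
print(encode_segments(message))  # NgL  Sy Bob
```

In the sample, the single space between “Ng” and “L” is implicit, while the double space before “Sy” and the single space before the unencoded word are preserved.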
Logic flow diagram 316B (
In test step 1508, encoding logic 118 (
If the first word of encoded text 410 is not a string of one or more contiguous codes, processing by encoding logic 118 (
Conversely, if the first word of encoded text 410 is a string of one or more contiguous codes, processing transfers from test step 1508 to step 1510. In step 1510, encoding logic 118 (
Thus, omitting implicit single-space whitespace between adjacent codes achieves better compression ratios and further obfuscates text messages. It should be appreciated that the predetermined initial code characters represent a marker of one end of the code. While this marker is described herein to be at the beginning of a code, it should be appreciated that the marker could be at the end of a token such that a token is zero or more subsequent code characters followed by an initial code character and can be recognized as such during decoding. In addition, the marker is not limited to a single character of a predetermined set of code characters. Predetermined sequences of two or more code characters can be used as markers. Such markers are distinguishable from non-marker portions of codes if the predetermined sequences used as markers are not used in non-marker portions of codes.
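The way initial code characters mark the start of each code during decoding can be sketched as follows. The character sets are assumptions made for illustration (uppercase letters as initial code characters, lowercase letters as subsequent code characters), and flags are not handled.

```python
import re
import string

# Hypothetical split of word characters into two mutually exclusive sets.
INITIAL = string.ascii_uppercase
SUBSEQUENT = string.ascii_lowercase

# One code: an initial code character followed by zero or more subsequent ones.
CODE_RE = re.compile(f"[{INITIAL}][{SUBSEQUENT}]*")
# A word that consists entirely of one or more contiguous codes.
RUN_RE = re.compile(f"(?:[{INITIAL}][{SUBSEQUENT}]*)+")


def split_codes(word: str):
    """Return the codes in `word`, or None if it is not a run of codes."""
    if not RUN_RE.fullmatch(word):
        return None
    return CODE_RE.findall(word)


print(split_codes("NgLSy"))  # ['Ng', 'L', 'Sy']
print(split_codes("hello"))  # None
```

A decoder along these lines would look up each returned code and rejoin the resulting phrases with the implicit single space.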
Phrases stored in dictionary 116 are generally independent of the respectively associated codes, so long as the code-phrase associations are consistent between encoders and decoders of the same messages. In the example noted above, “nothing could be” is associated with the code “Ng” in dictionary 116. In another embodiment, some other code, e.g., “Gn”, can be associated with “nothing could be” in dictionary 116. Exploitation of this feature can be used to provide a significant degree of privacy.
It should be observed that, since most of the text of encoded messages 404 and 410 is represented by codes that bear no substantive relation to the represented text, encoded messages 404 and 410 are difficult (if not impossible) for human readers to parse and understand. However, it is possible that some portions of encoded messages 404 and 410 are quoted, unencoded words. But, with the great majority of encoded messages 404 and 410 being codes, a substantial degree of privacy is provided even with a dictionary of modest size.
If a group of human users would like an even greater degree of privacy from the rest of the world, they can use a larger dictionary or replace a universally used dictionary 116 with an analogous dictionary in which the codes associated with respective phrases have been randomly shuffled. Such a dictionary would allow encoding and decoding of messages within the group using this dictionary; however, messages encoded using dictionary 116 could not be decoded with this replacement dictionary, and messages encoded using this replacement dictionary could not be decoded using dictionary 116. Messaging using the shuffled dictionary is restricted to those using the shuffled dictionary.
Privacy can also be provided on an individual user basis.
Encoding logic 118 (
Shuffle key 1608 determines which respective codes of user-specific dictionary 1616 correspond to each code of dictionary 116. In one embodiment, shuffle key 1608 provides a complete mapping of the codes. In an alternative embodiment, shuffle key 1608 is a seed for a pseudo-random number generator which shuffles the codes of dictionary 116 in a deterministic, pseudo-random manner.
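The seed-based alternative can be sketched as follows. The function names and sample codes are hypothetical, and Python's general-purpose `random` module stands in for whatever generator an implementation would actually use; it is not designed for security purposes.

```python
import random


def build_shuffle(codes, shuffle_key):
    """Map each shared-dictionary code to a user-specific code.

    The same key always yields the same permutation, so the mapping can be
    rebuilt from the key alone instead of being stored in full.
    """
    ordered = sorted(codes)  # fixed starting order, independent of input order
    shuffled = ordered[:]
    random.Random(shuffle_key).shuffle(shuffled)
    return dict(zip(ordered, shuffled))


def invert(mapping):
    """Reverse mapping, from user-specific code back to shared code."""
    return {user_code: code for code, user_code in mapping.items()}


codes = ["Ng", "Gn", "L", "Sy"]  # hypothetical sample of shared-dictionary codes
to_user = build_shuffle(codes, shuffle_key=1608)
from_user = invert(to_user)

# Every code round-trips through the user-specific mapping.
print(all(from_user[to_user[c]] == c for c in codes))  # True
```

When encoding, the code found in the shared dictionary would be passed through the forward mapping; when decoding, a received code would be passed through the inverse mapping before lookup.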
In encoding a message for the user represented by user record 1604, encoding logic 108—in step 608 (FIG. 6)—returns a user-specific code to which the code found in step 606 maps in code shuffler 1602 (
In decoding a message from the same user, decoding logic (
In another embodiment, the phrases in dictionary 116 are each preceded by a space, as though each phrase began not with a letter or number, but with a space. Storing the phrases in the dictionary in this way means that there will be no spaces preceding codes in the encoded text, since each code exactly replaces the phrase which it represents, including the first character which, in the predetermined dictionary of this alternative embodiment, is a space character. As a result, it is neither necessary to exclude the space preceding a code, nor, on decoding, to restore the space. Alternatively, phrases in dictionary 116 include a trailing space character to similarly include inter-phrase space characters in codes of the respective phrases. It also should be understood that the usefulness of the invention is not restricted to sending files, but extends to compressing them for more compact file storage and obscuring them for privacy.
The above description is illustrative only and is not limiting. The present invention is defined solely by the claims which follow and their full range of equivalents. It is intended that the following appended claims be interpreted as including all such alterations, modifications, permutations, and substitute equivalents as fall within the true spirit and scope of the present invention.
This Application claims priority of U.S. Provisional Patent Application Ser. No. 61/453,842 filed Mar. 17, 2011 entitled “Encoding and Decoding of Small Amounts of Text” by Robert B. O'Dell and James D. Ivey and is a continuation-in-part of U.S. patent application Ser. No. 12/715,244 filed Mar. 1, 2010 by Robert B. O'Dell and James D. Ivey and entitled “Using The Encoding Of Words And Groups Of Words To Compress Computer Text Files”, which in turn claims priority of U.S. Provisional Patent Application Ser. No. 61/280,683 filed Nov. 7, 2009 entitled “Using a Standard Encoding/Decoding Dictionary to Compress Computer Text Files” by Robert B. O'Dell and of U.S. Provisional Patent Application Ser. No. 61/284,634 filed Dec. 29, 2009 entitled “Using the Encoding and Decoding of Words and Groups of Words to Compress Computer Files” by Robert B. O'Dell.
Number | Date | Country
---|---|---
61453842 | Mar 2011 | US
61280683 | Nov 2009 | US
61284634 | Dec 2009 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12715244 | Mar 2010 | US
Child | 13418278 | | US