The invention is generally related to organizing and managing data in a database, and more particularly, to treating character strings comprising a plurality of ASCII characters as a numerical value to improve processing efficiency.
Computerized database systems have long been used and their basic concepts are well known. In general, database systems are designed to organize, store and retrieve content in such a way that the content in the database is useful. For example, the content is typically represented as data that may be searched, sorted, organized and/or combined with other data. To a large extent, the usefulness of a particular database system is dependent on an ability to efficiently access the data (and hence the content represented by the data) in the database system.
One drawback of many conventional database systems is that much of the content in such database systems is represented by data in the form of character strings (e.g., ASCII, etc.). Character strings are inherently inefficient. As is generally known, a character string comprises a number of individual characters organized as a string data structure or an array data structure. Processing of the character string, at least at a processor level, is on a character-by-character basis. In other words, if the processing of character strings includes comparing one character string to another character string, such comparing is done character-by-character until a difference is identified or until an end of string is identified. If all the characters in the two character strings are the same and the two character strings have a same length, the two character strings are deemed the same.
In addition to the inefficiencies associated with processing content stored as data in the form of character strings, databases themselves have become extremely large and unwieldy, many comprising thousands or tens of thousands of terabytes of data. Processing such data in a conventional manner consumes an inordinate amount of time, making many tasks virtually impossible to accomplish.
What is needed is a system and method for organizing data in a database system to overcome these and other associated problems.
Various implementations of the invention organize data in a database. In some implementations of the invention, the database or data table is split into a plurality of original field files, wherein the data table comprises a plurality of rows and a plurality of columns, wherein each of the plurality of rows comprises a data record of the data table, wherein each of the plurality of columns comprises a data field of the data table, wherein each of the plurality of original field files comprises a column split from the plurality of columns, wherein each of the plurality of original field files comprises a number of data values (i.e., data entries) corresponding to a number of the plurality of rows. In some implementations of the invention, a sort order for the data values in a corresponding one of the plurality of original field files is determined and an array of sorted indices for the corresponding one of the plurality of original field files is generated by sorting an array of indices based on the sort order, wherein each index in the array of indices points to a data value in the corresponding one of the plurality of original field files. In some implementations of the invention, an array of sorted indices is generated for each of the plurality of original field files. In some implementations of the invention, each original field file and its corresponding array of sorted indices are stored together and subsequently used to process content of the data table.
In some implementations of the invention, a binary search for a search term may be conducted against any one or all of the plurality of original field files using the associated array of sorted indices to determine which index points to a data value in the one of the plurality of original field files that matches the search term.
In some implementations of the invention, once the index is determined, the index may be used to retrieve data related to the search term across one or all of the plurality of original field files. In such implementations, such retrieved data corresponds to a row of the data table. In some implementations of the invention, the index may be used to retrieve an actual data record from the data table (or database) as would be appreciated.
In some implementations of the invention, one or more of the original field files comprises a plurality of character strings, where each character string comprises a plurality of characters. In these implementations of the invention, the plurality of characters in each character string are collectively treated as a single integer value (as opposed to a number of individual characters). In other words, these character strings are read, written, compared, or otherwise processed as a single integer value.
These and other implementations of the invention are described in further detail below in connection with the accompanying drawings.
Various implementations of the invention are directed towards systems and methods for organizing data in a database system. Various implementations of the invention are described below with respect to various database applications, where large amounts of content in the form of data is compiled, stored, manipulated, and/or analyzed to determine various relationships present in the content.
In some implementations of the invention, a database system is used to store content in the form of data records that include data associated with accounts receivable. In such implementations, a company may collect content, in the form of data, relating to various persons, businesses and/or accounts from one or more sources. The sources may include, for example, credit card companies, financial institutions, banks, retail and wholesale businesses, and other such sources. While each of these sources may provide data relating to various accounts, each source may provide data representing different information based on its own needs. Furthermore, this data may be organized in entirely different ways. For example, a wholesale distributor may have accounts receivable data corresponding to business accounts. Such data may be organized by account numbers, with each data record having data fields identifying an account number, a business name associated with that account number, an address of that business, and an amount owed on the account. A retail company may have data records representing similar information but based on accounts corresponding to individuals as well as businesses.
Various implementations of the invention may use data, including different types of data, from a wide variety of sources. For example, scientific institutions may provide scientific data with respect to various areas of research. Industrial companies may provide industrial data with respect to raw materials, manufacturing, production, and/or supply. Courts or other types of legal institutions may provide legal data with respect to legal status, judgments, bankruptcy, and/or liens. Social media companies may provide business intelligence based on user interaction. Security companies or government entities may provide security intelligence based on accessed or monitored communications.
U.S. Pat. No. 6,424,969 to Bjorn Gruenwald, entitled “System and Method for Organizing Data” (the “'969 Patent”), the entirety of which is incorporated herein by reference, describes a system that converts a character string (e.g., ASCII character string) to a numerical value in a number system, where the number system has a radix at least equal to a number of different symbols (e.g., characters) in the character string. For typical character strings that are restricted to the characters ‘0’-‘9’ and ‘A’-‘Z’ (referred to as “alphanumeric characters”), a base-40 number system was utilized. In the '969 Patent, each of the alphanumeric characters was assigned a numeric value, represented as either a hexadecimal value or decimal value in accordance with Table 1:
Using Table 1 and assigning appropriate symbols (or digits, as described in the '969 Patent) to the numbers in base-40, the character string “JOHN”, which comprises the four alphanumeric characters “J”, “O”, “H”, and “N”, would be represented in a base-40 number system as the base-40 number ‘JOHN’ having a decimal equivalent of 1,255,103 (i.e., 19*40^3 + 24*40^2 + 17*40^1 + 23*40^0, where alphanumeric character “J” has a decimal value of 19, alphanumeric character “O” has a decimal value of 24, alphanumeric character “H” has a decimal value of 17, and alphanumeric character “N” has a decimal value of 23).
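By way of illustration only, the base-40 arithmetic above can be sketched as follows. Because Table 1 is not reproduced here, the sketch assumes that ‘0’-‘9’ map to 0-9 and ‘A’-‘Z’ map to 10-35, an assignment that agrees with the four values stated for “J”, “O”, “H”, and “N”; the names DIGITS, VALUE, and base40_value are hypothetical.

```python
import string

# Assumed digit assignment (Table 1 is not reproduced in this text):
# '0'-'9' -> 0-9 and 'A'-'Z' -> 10-35, which matches J=19, O=24, H=17, N=23.
DIGITS = string.digits + string.ascii_uppercase
VALUE = {ch: i for i, ch in enumerate(DIGITS)}

def base40_value(s):
    """Interpret an alphanumeric string as a base-40 number (Horner's rule)."""
    total = 0
    for ch in s:
        total = total * 40 + VALUE[ch]
    return total

print(base40_value("JOHN"))  # 1255103
```

Horner's rule is used here only to avoid computing explicit powers of 40; it yields the same sum as the expanded expression.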
Rather than converting a character string to a numeric value in a particular number system, various implementations of the invention simply treat the character string as an integer data type (e.g., integer, long integer, double integer, bigint, word, double word, quadword, or other integer data type) based in part on the number of characters in the character string. For example, in ASCII, each character comprises 8 bits. Processors with 64-bit registers and data buses can accommodate 8 ASCII characters in their registers as a typical integer, 16 ASCII characters as a typical double integer, etc.; processors with 128-bit registers and data buses can accommodate 16 ASCII characters in their registers as a typical integer, 32 ASCII characters as a typical double integer, etc. Hence, rather than treating, for example, a character string “New York” as eight (8) ASCII characters each eight (8) bits wide and processing this string on a character-by-character basis, the entire character string may be treated as a single 64-bit integer (by whatever such an integer data type is referenced in the corresponding programming language) according to various implementations of the invention. Depending on how the character string is stored in memory, the eight bytes comprising the character string may be read as a single 64-bit integer straight out of memory; or the eight bytes comprising the character string may be read consecutively byte-by-byte (or other unit less than the full 64-bit integer) out of memory; or individual eight bits may be added to a data register (or loaded into the data register for the first eight bits), followed by an eight bit shift left of the register (or the equivalent) to accommodate each additional character in the character string as would be appreciated. Other mechanisms may also be used to load the eight characters into the 64-bit integer as would be appreciated. 
Loading the entire character string “New York” as a single integer results in such an integer having a hexadecimal value of 4E657720596F726B. In this manner, the character string “New York” may be treated as a single numeric value and processed accordingly. These implementations of the invention do not require any selection of a number system or conversion of the character string into such number system.
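As a minimal sketch of the loading mechanisms described above, the following compares a shift-and-accumulate load with a direct read of the eight bytes as one integer. The function name load_as_int is hypothetical, and the sketch assumes the first character occupies the most significant byte.

```python
def load_as_int(s):
    """Accumulate bytes into one integer, shifting left 8 bits per character."""
    value = 0
    for b in s:
        value = (value << 8) | b
    return value

text = b"New York"
shifted = load_as_int(text)                     # byte-by-byte with shifts
direct = int.from_bytes(text, byteorder="big")  # all eight bytes at once

print(f"{shifted:016X}")  # 4E657720596F726B
print(shifted == direct)  # True
```

Once loaded, two such values can be tested for equality with a single integer comparison rather than eight character comparisons.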
One benefit of various implementations of the invention is that the character string “New York”, once treated as a single integer (or numeric value as discussed in the '969 Patent), may be compared in a single instruction cycle to other character strings for purposes of determining equivalency with such character strings rather than eight byte-wise comparisons as typically required by conventional character string comparisons. Other benefits also exist as would be appreciated. Once the character string is treated as a single integer, the processing of the integer may be conducted in accordance with the principles described in the '969 Patent.
Various implementations of the invention apply as discussed above, regardless as to the mechanism by which a given processor stores character strings or integers in memory. For example, various implementations of the invention apply regardless of whether data is stored following a “big endian” or “little endian” protocol—in either case, each unique character string will also have a unique integer value.
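The uniqueness property noted above can be checked with a short sketch such as the following, which interprets a few hypothetical eight-byte strings under both byte orders. The sample strings are illustrative assumptions only.

```python
# Hypothetical eight-byte sample values (space padded where needed).
strings = [b"New York", b"New Yorl", b"Boston  ", b"Chicago "]

big = [int.from_bytes(s, "big") for s in strings]
little = [int.from_bytes(s, "little") for s in strings]

# Distinct strings yield distinct integers under either byte order.
print(len(set(big)) == len(strings))     # True
print(len(set(little)) == len(strings))  # True
```

One caveat, offered as an observation rather than a statement of the described implementations: only the big-endian interpretation orders the integers the same way as a byte-wise comparison of the strings, so a sort order derived from little-endian integers may differ from alphabetical order.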
In some implementations of the invention, each original field file 510 may be stored in memory or on a storage device with a row file 520 as a 2-by-n array (where n is the number of records in data table 410). In some implementations of the invention, each original field file 510 may be stored in memory or on a storage device with a row file 520 as two 1-by-n arrays. In some implementations of the invention, each original field file 510 may be stored in memory or on a storage device as a 1-by-n array; in such implementations, the row number or index is referred to herein as an implied row number or implied index simply based on a position of each individual data value in original field file 510. As would be appreciated, an implied row number or implied index is associated with each of the data values that correspond to a given data record 420 of data table 410 across all data fields 430.
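The 1-by-n arrangement with implied row numbers may be sketched as follows. The three-record table, its field contents, and the helper get_record are hypothetical and serve only to show that a list position can act as the implied index across all field files.

```python
# Hypothetical data table: each tuple is one data record (row).
table = [
    ("1001", "ACME", "Boston"),
    ("1002", "GLOBEX", "New York"),
    ("1003", "INITECH", "Austin"),
]

# Split into one field file (a 1-by-n array) per column.
# The position of a value in its list is its implied row number.
field_files = [list(column) for column in zip(*table)]

def get_record(row):
    """Reassemble a data record from every field file at the implied row."""
    return tuple(field_file[row] for field_file in field_files)

print(field_files[2])  # ['Boston', 'New York', 'Austin']
print(get_record(1))   # ('1002', 'GLOBEX', 'New York')
```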
According to various implementations of the invention, an array of sorted indices (or sorted row numbers) is generated by determining a sort order for original field file 610 via a second file that initially comprises row numbers 620 (implied or otherwise). The row numbers in this second file point to character strings in original field file 610. According to various implementations of the invention, these character strings are treated as integers and the sort order is determined based on the values of these integers. This sort order is applied to the second file, thereby rearranging the row numbers of the second file and generating an array of sorted row numbers (sometimes referred to generically as an array of sorted indices) while leaving original field file 610 as is. The sorted row numbers (as indices) indirectly impute the sort order onto original field file 610 as would be appreciated. In other words, rather than sorting original field file 610 by its data values, the sort order is applied to pointers pointing to such data values to provide the proper sort order of the data values.
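A minimal sketch of generating the array of sorted row numbers follows. The sample field file is hypothetical, and the sketch assumes eight-byte, space-padded values interpreted with the first character in the most significant byte.

```python
# Hypothetical original field file; list position is the implied row number.
field_file = [b"New York", b"Boston  ", b"Chicago ", b"Austin  "]

# Treat each character string as a single integer.
keys = [int.from_bytes(value, "big") for value in field_file]

# Sort the row numbers by integer value; the field file itself is not reordered.
sorted_rows = sorted(range(len(field_file)), key=keys.__getitem__)

print(sorted_rows)                           # [3, 1, 2, 0]
print([field_file[r] for r in sorted_rows])  # values in sorted order
```

Reading the field file through sorted_rows yields the values in sorted order while the original file remains as is.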
Storing data as data tables 410 in conventional files on a storage device (e.g., disc) and moving such data to RAM each time for processing is extremely time consuming, particularly considering the sizes of today's databases, and further subjects the data to corruption. In addition, these conventional mechanisms require significant hardware resources and software overhead to process the data and even more to protect it. Often, such conventional mechanisms require normalization and/or cumbersome conventional index tables with keys and foreign keys (not to be confused with the arrays of indices of various implementations of the invention). As would be appreciated, the installations supporting such databases consume valuable real estate and electrical power.
Various implementations of the invention solve these and other problems associated with conventional databases. According to various implementations of the invention, original field file 610 and an array of sorted indices are stored for each of data fields 430 (e.g., columns) of data table 410 and these become the basic units of data table 410 on which various processing occurs. Depending on which implementation of the invention is used to store the character strings of original field file 610 in memory 700, the array of sorted indices comprises either sorted rows 810, sorted addresses 1210, or sorted addresses 1510.
As described above, original field file 610 is itself not sorted; rather the array of sorted indices is sorted based on the data values in original field file 610. This allows the integrity of the data values in original field file 610 to be maintained. However, in some implementations of the invention, a copy of original field file 610 may be made and this copy sorted to generate a sorted field file (not otherwise illustrated). In such implementations of the invention, original field file 610, sorted field file, and the array of sorted indices may be stored for each of data fields 430 of data table 410 and these become the basic units of data table 410 on which various processing occurs.
More particularly, a search term may be located in original field files 610 by using a binary search of original field files 610 via the appropriate array of sorted indices. A binary search operates by locating a data value at a middle data entry of a sorted file, determining whether that data value is less than or greater than the search term, and then bisecting either the lower or upper half of the sorted file until the search term is located or no further bisecting of the sorted file is possible. In various implementations of the invention, a binary search is configured to bisect the array of sorted indices, retrieve the underlying data from original field file 610 to which the index at the bisected location points, and compare the underlying data to the search term to determine whether the search term is located or if not, which remaining portion of the array of sorted indices to further bisect. Binary searches are well understood and provide an extremely efficient mechanism for locating search terms in an array of sorted data. This coupled with treating character strings as single integer values greatly enhances the speed of a search of even the largest of databases. For example, for an original field file comprising 4.3 billion data entries (i.e., rows), a maximum number of 32 jumps (i.e., iterations) of a binary search are required to determine whether or not the search term exists in the original field file and for an original field file comprising 18 million trillion data entries, a maximum number of 64 jumps are required. In accordance with various implementations of the invention, modern processors may accomplish such binary searches in less than 25 ns and 50 ns, respectively.
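The binary search over the array of sorted indices may be sketched as follows. The names to_int and search and the sample data are hypothetical; the sketch assumes fixed-width eight-byte values and a previously generated array of sorted row numbers.

```python
def to_int(s):
    """Treat a character string as a single integer."""
    return int.from_bytes(s, "big")

def search(field_file, sorted_rows, term):
    """Bisect sorted_rows; return the matching row number, or None if absent."""
    target = to_int(term)
    lo, hi = 0, len(sorted_rows) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        row = sorted_rows[mid]           # index at the bisected location
        value = to_int(field_file[row])  # underlying data it points to
        if value == target:
            return row
        if value < target:
            lo = mid + 1                 # continue in the upper half
        else:
            hi = mid - 1                 # continue in the lower half
    return None

field_file = [b"New York", b"Boston  ", b"Chicago ", b"Austin  "]
sorted_rows = sorted(range(len(field_file)), key=lambda r: to_int(field_file[r]))

print(search(field_file, sorted_rows, b"Chicago "))  # 2
print(search(field_file, sorted_rows, b"Denver  "))  # None
```

The iteration counts given above follow from the sizes involved: 2**32 is 4,294,967,296 (about 4.3 billion) and 2**64 is about 1.8 x 10^19 (18 million trillion), and each iteration halves the remaining range.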
According to various implementations of the invention, original field files 610 and arrays of sorted indices are stored on disks, and in some implementations, without use of conventional file systems (i.e., operating system file systems). Instead, various implementations of the invention address disks directly by reading and/or writing sectors of the disk (e.g., 512 bytes) as would be appreciated. Doing so prevents, among other things, disk fragmentation and data deterioration.
While various implementations of the invention are described using the example of an eight character ASCII string for a 64-bit data register, the invention applies to strings of other lengths and data registers of other sizes. In order to accommodate ASCII strings smaller than a given data register, some implementations of the invention may use padding (e.g., space padding on the string or zero or null padding on the integer) as would be appreciated. As would be appreciated, such padding may be accomplished through programming or automatically, through use of modern processor op codes. In order to accommodate larger ASCII strings, some implementations of the invention may use larger integer data types (e.g., double word or double integer, quad word or quad integer, etc.) as would be appreciated; doing so will still result in each unique ASCII string having a unique integer value. Further, while various implementations of the invention are described using the example of ASCII characters, the invention applies to other types of character codes such as, but not limited to, ANSI, Unicode, UTF-8, UTF-16, UTF-32, Kanji, or other character codes or symbol codes as would be appreciated. Further, while various implementations of the invention are described in reference to character strings representing content, the invention applies to other types of character strings representing, for example, encrypted data, where the underlying encoding scheme may be time variable.
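Padding a shorter string to the register width may be sketched as follows. The function pad_to_int, its default width of eight bytes, and the choice of trailing padding are assumptions made for illustration.

```python
def pad_to_int(s, width=8, fill=b" "):
    """Pad s on the right to width bytes, then treat it as one integer."""
    if len(s) > width:
        raise ValueError("string is longer than the integer width")
    return int.from_bytes(s.ljust(width, fill), "big")

space_padded = pad_to_int(b"Boston")               # same as b"Boston  "
null_padded = pad_to_int(b"Boston", fill=b"\x00")  # zero/null padding

print(f"{space_padded:016X}")  # 426F73746F6E2020
print(f"{null_padded:016X}")   # 426F73746F6E0000
```

Under this sketch, a string that already ends in the padding character becomes indistinguishable from its shorter counterpart, so the padding character would be chosen with the expected data in mind.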
While the invention has been described herein in terms of various implementations, it is not so limited and is limited only by the scope of the following claims, as would be apparent to one skilled in the art. These and other implementations of the invention will become apparent upon consideration of the disclosure provided above. In addition, various components and features described with respect to one implementation of the invention may be used in other implementations as well.
This Application claims priority to U.S. Provisional Application No. 62/162,628, which was filed on May 15, 2015, and entitled “System and Method for Organizing Data.” The foregoing application is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62162628 | May 2015 | US