Information is stored on various data systems for convenient access at a later time. However, the information is often stored in differing formats, even when similar systems are used. Also, many databases are user-created, which even further compounds the diversity of storage formats. Often, many types of data are all stored in a relatively large, but sparsely populated, database table. Such data storage can be relatively difficult to query because, for example, multiple tables can have shared or disparate column headings and can have list items that are often left empty. The various approaches often complicate the process of searching for desired data that is stored amongst different types of data.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.
Data can be stored and searched for in information systems using a list for representing the stored data. A list is typically a collection of items (e.g., rows in a table) which have properties (e.g., columns in a table). Some list examples include a set of personal contacts with properties (such as name, address, company), a set of parts with properties (such as cost or size), and a set of documents with properties (such as last modified time or author).
Database tables are commonly used for storing such data. For some applications, it is often necessary to create a single wide database table that is quite often only sparsely populated to store information (rather than by using separate tables to store the information). This is especially useful for generating large numbers of lists that can be defined because many database servers typically support many items in a table, rather than many different tables having few items. Such tables are often referred to as being a sparse database design, because many of the cells in the database are not populated.
A separate table can be created for storing a copy of the important data from a sparse data base table. The separate table comprises Name Value Pairs (NVP) that can be used to efficiently query the important data by using the database indices. Efficient querying can be accomplished by creating a separate table for a sparse table, indexing the separate table using NVPs; and creating multiple tables in a collation order in order to search data that is sorted in accordance with user conventions in various locales.
These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive. Among other things, the various embodiments described herein may be embodied as methods, devices, or a combination thereof. Likewise, the various embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The disclosure herein is, therefore, not to be taken in a limiting sense.
As briefly described above, embodiments are directed to dynamic computation of identity-based attributes. With reference to
Computing device 100 may have additional features or functionality. For example, computing device 100 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Computing device 100 also contains communication connections 116 that allow the device to communicate with other computing devices 118, such as over a network. Networks include local area networks and wide area networks, as well as other large scale networks including, but not limited to, intranets and extranets. Communication connection 116 is one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
In accordance with the discussion above, computing device 100, system memory 104, processor 102, and related peripherals can be used to implement sparse data table indexing engine 120. Sparse data table indexing 120 in an embodiment can be used to efficiently query data within sparse data tables (described below).
People use data base programs for storing data so that the data can be conveniently searched and retrieved at a later time. However, many such programs have been developed that require special training to use. Accordingly, easy-to-use data base programs have been written that use relatively unstructured data that is not stored in a uniform manner, such as in a sparse data base design. For example, a sparse data base typically stores list items as cells in a list that is arranged in rows and columns.
For example, the list items are usually grouped in columns (which each column is used to specify a type of an element, such as a “name,” “address,” or “phone number”) and rows (where each row is used to identify a person or thing, such as personal data, a mail list, job status, and the like). Multiple tables can have shared or disparate column headings, and can have list items that are often left empty.
However, such easy-to-use designs can be difficult to efficiently query by value, difficult to efficiently query across multiple sets of data (lists), and to efficiently query by value in a database design in accordance with user's conventions in various locales.
Table 214 has a list identifier of “Parts.” As illustrated, table 214 comprises three columns: an identifier (for identifying a parts item, which does not necessarily have to be unique), a part name, and a description. The table is populated with two items, having identifiers of “1” and “2.”
Design 220 is a sparse data table design. For example, design 220 comprises eight columns: a list identifier (for identifying a list), an item identifier (which does not necessarily have to be unique), a contact name (such as a person's name), a contact phone number, a parts description, a part name, a parts description and a physical address.
The table is populated with four items, two each from tables 212 and 214. For example, items “1” and “2” from table 212 have been included, as well as items “1” and “2” from table 214 have been included. It can be seen that various cells remain unpopulated, which is a characteristic of sparse data table designs. Moreover, it can be seen that as more unrelated (or partially related) data is added, the unpopulated cells occur even more frequently (which is often due to lack of commonality in column types).
In some cases, the actual schema of the data to be stored can be user-defined and/or dynamically instantiated in the application. Thus, the initial table design might be fixed, but the actual values stored in each column could vary based on a user's scenario. A user can use Name Values Pairs (NVPs) for specifying what type of data from the lists can be used to create indexes.
For example, which columns hold which data in a sparse data table design is typically determined by the list (within the overall table) to which the data belongs. One row (from a first list) might use “Integer1” for the size of the item, and another row (from a second list) might also use “Integer1” for the cost of an item. The schema that is being used would be typically determined by consulting which particular list hosts a particular item.
In some cases, the actual schema of the data to be stored can be user-defined and/or dynamically instantiated in the application. Thus, the initial table design might be fixed, but the actual values stored in each column could vary based on a user's scenario. A user can use Name Values Pairs (NVPs) for specifying what type of data from the lists can be used to create indexes.
In some cases, the actual schema of the data to be stored might be user defined and dynamic in the application. The initial table design would be fixed, but the actual values stored in each column might vary based on the user scenario.
Data from different lists (such as from tables 112 and 114) can be stored in a more compact form by sharing columns having compatible data types (such as integer, string, date, and the like). For example, a column having a data type of integer can be used to hold a list number. In similar fashion, a column having a data type of string can be used to hold string data such as contact name, part name, job description, part description, phone number, address and the like.
Trying to query across this data can be difficult since no one column contains data that is aligned to a schema of a particular list. There may also be many lists in the table that are not relevant to the query, or that contain no items that are relevant to the query. Also, data is often stored in a de-normalized fashion, such that a logical “item” has data spread out in different locations (in separate tables, for example). Two schemas may define this separation in different ways, which require queries of different forms. Such data can be efficiently queried by first limiting (or otherwise qualifying) the total number of lists queried, and then aligning the data being queried.
Sparse database designs can be scalable to support an infrastructure for users to store various lists, and allow the users to customize the properties on each list (such as name, phone, address, and the like. After a sparse database design has been populated with data, users often want to define rich queries over the data.
An index can be added to a database table to make queries more efficient. However, simply indexing all columns is not usually sufficient because most database servers do not perform efficiently with relatively large amounts of indexes. The software vendor cannot normally optimize the indices because the vendor does not usually have beforehand knowledge of which properties are to be queried. The vendor does not usually have beforehand knowledge of which properties are to be queried because the properties are defined by users that subsequently purchase the software.
Further complicating the design is that one column in the table could actually store data from different lists, so a single column does not always contain related data (e.g., data from the same list). Having columns with unrelated data usually makes it difficult to use the typical indexing mechanisms that conventional databases provide.
Querying is difficult because indexing each additional column can cause substantial decreases in performance. Moreover, because properties (e.g., columns) are user defined, the system doesn't know beforehand which properties would be important for optimizing queries. Because all lists typically share the same table, a column does not always contain the same property for all rows.
Creating a separate fixed table allows using a database index (based on user definitions) over a consistent set of properties. Although there is an added cost of keeping the table up-to-date with a copy of the data, the index allows the benefit of “rich” queries (such as described below with respect to
Some of the lists that are stored in the same sparse data table could have the same property type (such as “Description” in the example table 410). Even if same property type is stored in separate columns in the sparse table, the same field identifier can be used as an index to find values associated with the field identifier. Thus a search using a field identifier can retrieve data from different lists when, for example, two different lists in a sparse data table share common properties.
Table 510 provides data in a consistent form, which allows customization of the way the user-supplied data and properties is indexed. The way the data is indexed can be customized in accordance with a particular application. For example, the application can use the indices of the NVP table to perform fast queries in one list, fast queries across lists, and to efficiently locate items in the NVP table.
When performing fast queries in one list, the List ID, Field ID, Value, and Item ID can be used as indices to search rows (index entry groups) of the NVP table. Thus searches using these keys can efficiently find a set of items which have a particular value. For example, a search for finding the Item ID using the keys “List ID=Contacts, Field ID=Contact Name, Value=Dustin” would result in the return of the value of “1” for the Item ID. The Item ID can be used to efficiently retrieve other data associated with the key set by, for example, searching the sparse data table for the Item ID associated with the specified contact list.
To perform fast queries across lists, the Field ID and Value can be used as an index to search the NVP table. The index can be queried for items with a particular value across multiple different lists. For example, a search for finding the Item ID and List ID using the key “Description=Multi” would return the result of “Contacts, 3” and “Parts, 3,” which are the third entries of the Contacts list and the Parts list, respectively.
To efficiently locate items in the NVP, the list ID and Item ID can be used as an index. Thus rows in the NVP table that would need to be updated can be easily found when an item is changed in the sparse data table (which can be used to keep the NVP synchronized with the sparse table). For example, a search for finding the rows where “Item ID=1” and “List ID=Contacts” would return the results of “Contact Name, Dustin,” and “Description, PM.”
An additional problematic situation can occur when supporting multiple collations (sort orders). For example, most database servers have the inability to create an index on data that does not share the same kind of collation. The big sparse data table can be used to store data from multiple lists, but it is also possible that the data being stored by users is in different languages or is from different locales where grammatical conventions differ. Accordingly, the way in which the data is sorted could differ in response to, for example, a locale (such as a geographical location).
To create a database index that can work across different collations, multiple copies of the NVP table can be created such that each created NVP table is associated with a different collation. For example, for a sparse data table containing French and Latin values (which may have differing alphabets), a French NVP table and a Latin NVP table would be created. In a similar fashion, a table having (for example) contact information in English and Japanese could have an English NVP table and a Japanese table for allowing data from multiple collations to be presented.
It can be seen that NVP tables tend to be even more beneficial when querying larger sparse data tables. When querying fairly small sparse data tables (such as under 2000 entries, for example), the performance cost of maintaining NVP tables might be greater than the savings provided by using the NVP tables. An efficiency threshold can be set such that when a sparse data table grows above a certain threshold, sparse data table indexing using NVP tables can be enabled. The NVP indexing feature could also be turned on in response to user queries of specific types that would benefit from the NVP indexing feature.
The above specification, examples and data provide a complete description of the manufacture and use of embodiments of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
This utility patent application claims the benefit under 35 United States Code § 119(e) of U.S. Provisional Patent Application No. 60/859,165 filed on Nov. 14, 2006, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
60859165 | Nov 2006 | US |