Information is stored on various data systems for convenient access at a later time. However, the information is often stored in differing formats, even when similar systems are used. Also, many databases are user-created, which even further compounds the diversity of storage formats. Often, many types of data are stored all in a relatively large, but sparsely populated, database table. Other mechanisms of storing data include storing data in multiple tables each having a unique schema. The various approaches complicate the process of searching for desired data that is stored amongst different types of data.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.
Data can be stored and searched for in information systems using a list for representing the stored data. A list is typically a collection of items (e.g., rows in a table) which have properties (e.g., columns in a table). Some list examples include a set of personal contacts with properties (such as name, address, company), a set of parts with properties (such as cost or size), and a set of documents with properties (such as last modified time or author).
Database tables are commonly used for storing such data. For some applications, it is often necessary to create a single wide database table that is quite often only sparsely populated to store information (rather than by using separate tables to store the information). This is especially useful for generating large numbers of lists that can be defined because many database servers typically support many items in a table, rather than many different tables having few items. Such tables are often referred to as being a sparse database design, because many of the cells in the database are not populated.
Efficient querying across disparate schemas can be implemented by initially limiting the total number of lists and the total number of items queried and by using a mechanism for aligning data during the query. Querying across disparate data (e.g., data that is stored in accordance with disparate schemas) can comprise removing lists that are not applicable, defining a data alignment for the lists being searched, and executing the query.
These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive. Among other things, the various embodiments described herein may be embodied as methods, devices, or a combination thereof. Likewise, the various embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The disclosure herein is, therefore, not to be taken in a limiting sense.
As briefly described above, embodiments are directed to dynamic computation of identity-based attributes. With reference to
Computing device 100 may have additional features or functionality. For example, computing device 100 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Computing device 100 also contains communication connections 116 that allow the device to communicate with other computing devices 118, such as over a network. Networks include local area networks and wide area networks, as well as other large scale networks including, but not limited to, intranets and extranets. Communication connection 116 is one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
In accordance with the discussion above, computing device 100, system memory 104, processor 102, and related peripherals can be used to implement disparate data query engine 120. Disparate data querying engine 120 in an embodiment can be used to efficiently query data within sparse data tables (described below).
The disparate data query engine can query across disparate schemas by examining a query to determine lists that would be implicated by the search. The list of determined list can be used to exclude lists (and associated items) from the lists that are to be searched. An alignment table can be used for aligning data to implement the query.
Querying across disparate data (e.g., data that is stored in accordance with disparate schemas) can comprise removing lists that are not applicable, defining a data alignment for the lists being searched, and executing the query using the alignment table. The efficiency of the search is enhanced because data not implicated by the query is not searched.
Table 214 has a list identifier of“Parts.” As illustrated, table 214 comprises three columns: an identifier (for identifying a parts item, which does not necessarily have to be unique), a part name, and a description. The table is populated with two items, having identifiers of “1” and “2.”
Design 220 is a sparse data table design. For example, design 220 comprises eight columns: a list identifier (for identifying a list), an item identifier (which does not necessarily have to be unique), a contact name (such as a person's name), a contact phone number, a parts description, a part name, a parts description and a physical address.
The table is populated with four items, two each from tables 212 and 214. For example, items “1” and “2” from table 212 have been included, as well as items “1” and “2” from table 214 have been included. It can be seen that various cells remain unpopulated, which is a characteristic of sparse data table designs. Moreover, it can be seen that as more unrelated (or partially related) data is added, the unpopulated cells occur even more frequently (which is often due to lack of commonality in column types).
In some cases, the actual schema of the data to be stored can be user-defined and/or dynamically instantiated in the application. Thus, the initial table design might be fixed, but the actual values stored in each column could vary based on a user's scenario. A user can use Name Values Pairs (NVPs) for specifying what type of data from the lists can be used to create indexes.
For example, which columns hold which data in a sparse data table design is typically determined by the list (within the overall table) to the data belongs. One row (from a first list) might use “Integer1” for the size of the item, and another row (from a second list) might also use “Integer1” for the cost of an item. The schema that is being used would be typically determined by consulting which particular list host a particular item.
Data from different lists (such as from tables 112 and 114) can be stored in a more compact form by sharing columns having compatible data types (such as integer, string, date, and the like). For example, a column having a data type of integer can be used to hold a list number. In similar fashion, a column having a data type of string can be used to hold string data such as contact name, part name, job description, part description, phone number, address and the like.
Trying to query across this data can be difficult since no one column contains data that is aligned to a schema of a particular list. There may also be many lists in the table that are not relevant to the query, or that contain no items that are relevant to the query. Also, data is often stored in a de-normalized fashion, such that a logical “item” has data spread out in different locations (in separate tables, for example). Two schemas may define this separation in different ways, which require queries of different forms. Such data can be efficiently queried by first limiting (or otherwise qualifying) the total number of lists queried, and then aligning the data being queried.
The characteristics for items of these lists are stored in Cols. 1 and 2, except with reverse column orders (with respect to the opposing list). Col. 3 has data for the color which is uniform across the lists except for Connectors. Moreover, Col. 4 has data on the Cost (which only applies to items in the Accessories list), and Destination (which only applies to items in the Connectors list).
It may also be convenient to organize selected lists from the table into groups. The groups can be labeled with (for example) a group identifier. For example, the lists Parts and the list Accessories can be combined into a single group (Car Items, for example). Grouping can be used to facilitate searching amongst lists that originate from, for example, a single web site.
In operation 510, unnecessary lists are typically removed from the query. When querying for Parts or Accessories, items that are not in the defined list type (Parts or Accessories) are not normally consulted. While the query normally explicitly defines which lists to use or not use, other methods can be used to determine and/or to identify entire lists to omit. For example, an index can be made which can efficiently indicate whether if certain items in the list are going to produce results. This index could determine whether the Connectors list has items that do not have Name, Description, or Color, and thus decide to skip that entire set of data. Thus, removing unnecessary lists reduces the total number of lists that are queried.
In another example, it is possible that a query author might wish to define a query to include even those lists which were missing one or more of the fields referenced. In such queries, results can be given by returning empty data for missing fields. The query author can thus define queries which take advantage of both behaviors.
For example, one form of the query syntax can “discover” lists that are associated with a certain field by following indexes in the schema by starting from the field name (or field ID, more precisely) and by following links to discovers lists that are associated with the specified field. This implementation usually requires the field to be indexed, which typically improves performance for queries that rely on a particular field.
In operation 520, the data alignment is defined.
Table 610 contains a mapping of each field to the place it is actually stored depending on which list of data being searched. Although, for simplicity of explanation, the example shows the lists as being comprised by a single table, the mapping might instead point to other locations (such as separate tables). Accordingly, the alignment table can be used to store a pointer to where the data actually resides.
As described above, differences in the structure of the query may need to be resolved. As an example, a design can include the Parts and Accessories lists and includes a Vendor field, but that the Vendor data is actually stored in a separate list. The Parts list's Vendor data is stored in the Manufacturers, while the Accessories list's Vendor data refers to the Designers list. A query over the Parts list that includes the Vendor field will have a different structure than a similar query over the Accessories list, because the Vendor data is in a different location.
Such differences can be resolved by including an additional column in the alignment table that identifies the target list. When structural differences in the query cannot be easily resolved, a query can be constructed for combining each of the individual result sets from otherwise incompatible queries.
For example, the SQL UNION statement can be used to combine potential result sets from the otherwise incompatible queries. The result sets can be manipulated using other logical/set operations such as AND, NOT, OR, XOR, INTERSECTION, ELEMENT, and the like to logically combine result sets. A different alignment table can be constructed for each query for which the result sets are to be combined.
Referring again to
For example, the query can be executed using the following parameters. The logical conditions can be used to specify that the List ID is equal to “Parts” or “Accessories” in response to operation 510, for example. The data to be returned can be specified as “Name,” “Description,” and “Color.” The columns to be pointed to by the alignment table can be specified as “Alignment[ListID].Name,” “Alignment[ListID].Description,” and “Alignment[ListID] Color.”
The result set of the query can be sorted by one of the shared columns. For example, sorts can be applied using a specified permutation of the columns. Additionally, other sorts can be used, such as by grouping the items in accordance with the containing list of the items.
For example, results from a query for items occurring within a range of dates can be given. Lists (contained within a dataset) not having dates associated therewith can be put in an “exclude” list. An alignment table can be constructed using lists that are not in the exclude list. The sorting can be made efficient by sorting the lists in the table first, then by the fields, and then by the value. By sorting by value last, all of the rows in the alignment table are in date order, which increases the efficiency of queries looking for fields and lists that are associated with a range of dates.
The above specification, examples and data provide a complete description of the manufacture and use of embodiments of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
This utility patent application claims the benefit under 35 United States Code §119(e) of U.S. Provisional Patent Application No. 60/859,051 filed on Nov. 14, 2006, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
60859051 | Nov 2006 | US |