The advent of global communications networks (e.g., the Internet) now makes accessible an enormous amount of data. People access and query unstructured and structured data every day. Unstructured data is used for creating, storing and retrieving reports, e-mails, spreadsheets and other types of documents, and consists of any data stored in an unstructured format at an atomic level. In other words, in the unstructured content, there is no conceptual definition and no data type definition—in textual documents, a word is simply a word. Current technologies used for content searches on unstructured data require tagging entities such as names or applying keywords and metatags. Therefore, human intervention is required to help make the unstructured data machine readable. Structured data is any data that has an enforced composition to the atomic data types. Structured data is managed by technology that allows for querying and reporting against predetermined data types and understood relationships.
Programming languages continue to evolve to facilitate specification by programmers as well as efficient execution. In the early days of computer languages, low-level machine code was prevalent. With machine code, a computer program or instructions comprising a computer program were written with machine languages or assembly languages and executed by the hardware (e.g., microprocessor). These languages provided an efficient means to control computing hardware, but made it very difficult for programmers to comprehend and develop sophisticated logic.
Subsequently, languages were introduced that provided various layers of abstraction. Accordingly, programmers could write programs at a higher level with a higher-level source language, which could then be converted via a compiler or interpreter to the lower level machine language understood by the hardware. Further advances in programming have provided additional layers of abstraction to allow more advanced programming logic to be specified much more quickly than ever before. However, these advances do not come without a processing cost.
The state of database integration in mainstream programming languages leaves a lot to be desired. Many specialized database programming languages exist, such as xBase, T/SQL, and PL/SQL, but these languages have weak and poorly extensible type systems, little or no support for object-oriented programming, and require dedicated run-time environments. Similarly, there is no shortage of general purpose programming languages, such as C#, VB.NET, C++, and Java, but data access in these languages typically takes place through cumbersome APIs that lack strong typing and compile-time verification.
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed innovation. This summary is not an extensive overview, and it is not intended to identify key/critical elements or to delineate the scope thereof. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
The disclosed innovation includes language extensions for strongly typed, compile-time checked query and set operations that can be applied to arbitrary data structures, be they object-relational (O-R) mappings, XML, or just regular objects. As is appropriate for a general purpose programming language, the extensions do not mandate a particular object-relational layer; rather, they are introduced as abstractions that can be implemented in multiple environments. Accordingly, there is provided a system that facilitates data querying in accordance with an innovative aspect. The system includes a program component that provides embedded query and set operations in a programming language, and an application component that facilitates application of the query and set operations over a data structure of data. The data can be any kind of data such as that found in a database, a document (e.g., XML), and data sources in a programming language (e.g., C#), for example.
In another aspect, operators are provided that facilitate restriction, projection, testing, aggregation, ordering, grouping, sets, catenation, casting, singleton processing, converting, and partitioning.
In another aspect thereof, deferred execution is provided. When an expression with one or more sequence operators is received, and expression execution is initiated, overall execution is delayed as one or more operators execute to create one or more intermediary sequence objects. The one or more intermediary objects are initialized, and the query results are computed on-the-fly as sequence objects are enumerated.
In another innovative aspect, sequence aliasing is provided to explicitly name a current element of a sequence.
In yet another aspect, a query can be remoted to a data source whereat the query is executed and relevant results returned.
In still another aspect of the subject innovation, create, update, and delete operations are provided as integral to the language.
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the disclosed innovation are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles disclosed herein can be employed, and the innovation is intended to include all such aspects and their equivalents. Other advantages and novel features will become apparent from the following detailed description when considered in conjunction with the drawings.
The innovation is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the innovation can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate a description thereof.
As used in this application, the terms “component” and “system” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers.
The disclosed innovation includes language extensions for strongly typed, compile-time checked query and set operations that can be applied to arbitrary data structures, be they object-relational (O-R) mappings, XML, or just regular objects. As is appropriate for a general purpose programming language, the extensions do not mandate a particular object-relational layer; rather, they are introduced as abstractions that can be implemented in multiple environments, including, for example, ObjectSpaces (a technology that facilitates building services supporting object representations of data in relational databases), MBF (Microsoft Business Framework), and WinFS. Following is an example of code that employs one such operator, a where operator:
Referring initially to the drawings,
At 200, a programming language is received. At 202, query and/or set operations are embedded therein and that can be compile-time checked. At 204, a generic code interface is provided that represents a sequence of objects of a specified type. At 206, another code interface is provided that facilitates remoting of a query to a remote data store or source.
The introduction of generic interfaces (“generics”) in a programming language (e.g., C# Version 2.0) puts in place a foundation for providing better data access in the language. With generics it is possible to preserve strong typing in many scenarios where the all-purpose object type would have otherwise been used.
The disclosed query and set operations revolve around an
Because
Attributes of
In this description, the term sequence type is used for any type constructed from
When sequence types are used to represent collections and database tables, it is beneficial to provide sequence operators for the operations that are commonly performed on sequences. Examples of such operations include filtering, projection, and aggregation. The disclosed sequence operators can be introduced through a series of examples that use the classes below. The classes can be an O-R mapping of a database, but they could equally well just be a set of regular classes.
In the classes, one-to-one and one-to-many relations can be captured as properties of the appropriate class and collection types. For example, an
The generic
The examples that follow assume the existence of two collections:
Because
Filtering: The where operator. To filter a sequence, a where operator is provided. The following where operation returns a sequence of those customers that have a zip code of 98112.
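The filtering behavior described here can be approximated in an existing language. The following is a minimal Python sketch, not the disclosed C# syntax; the `Customer` class, its `zip_code` field, the `where` helper, and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


@dataclass
class Customer:  # hypothetical class, assumed for this sketch
    name: str
    zip_code: int


def where(source: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Yield only the elements of source for which predicate is true."""
    for item in source:
        if predicate(item):
            yield item


# Hypothetical sample data.
customers = [
    Customer("Ada", 98112),
    Customer("Ben", 10001),
    Customer("Cy", 98112),
]

# The lambda parameter plays the role of the current element.
in_98112 = list(where(customers, lambda c: c.zip_code == 98112))
```

In this analog the current element must be named explicitly (`c`), whereas the disclosed operator brings its members into scope automatically.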
The predicate expression can be written as a regular C# Boolean expression and the members of the current element are automatically in scope. When necessary, the entire current element can be referenced using the identifier it, as illustrated below:
A select operator is provided for mapping and projection. The following select operation returns a sequence of the Name fields of each customer:
The select operator evaluates the given expression for each element in the source sequence, producing a sequence of the results. Similar to the where operator, the members of the current element are automatically in scope and the entire current element can be referenced using the identifier it.
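As a rough illustration of this mapping behavior, the sketch below is a Python analog rather than the disclosed syntax. The `select` helper, the `Customer` class, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Customer:  # hypothetical class, assumed for this sketch
    name: str
    phone: str


def select(source, selector):
    """Evaluate selector for each element and yield the results in order."""
    for item in source:
        yield selector(item)


# Hypothetical sample data.
customers = [Customer("Ada", "555-0100"), Customer("Ben", "555-0101")]

# Project each customer onto its name field.
names = list(select(customers, lambda c: c.name))
```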
A select expression can do more than just select a field. For example, the following operation returns a sequence of strings containing customer names and phone numbers.
If a select expression selects a sequence, the result of the select operation is a “flattened” sequence, not a sequence of sequences. The following returns a sequence of the orders of the customers in the customers collection:
Because of flattening, the result is a sequence<Order>, not a sequence<sequence<Order>>.
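The flattening rule can be sketched as follows. This is a Python analog under stated assumptions: `select_flat`, `Customer`, `Order`, and the sample data are hypothetical, and the sketch flattens exactly one level.

```python
from dataclasses import dataclass, field
from itertools import chain


@dataclass
class Order:  # hypothetical class
    order_id: int


@dataclass
class Customer:  # hypothetical class
    name: str
    orders: list = field(default_factory=list)


def select_flat(source, selector):
    """Apply a sequence-valued selector to each element and flatten one level."""
    return chain.from_iterable(selector(item) for item in source)


# Hypothetical sample data; Ben has no orders.
customers = [
    Customer("Ada", [Order(1), Order(2)]),
    Customer("Ben"),
    Customer("Cy", [Order(3)]),
]

# One flat sequence of orders, not a sequence of per-customer lists.
all_orders = list(select_flat(customers, lambda c: c.orders))
```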
Multiple fields can be selected by creating instances of an appropriate type in the select expression. For example, to select the Name and Phone fields from a sequence of customers, a Contact class can be declared:
This class can then be used in a select operation:
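A comparable projection onto a dedicated result type might look like the following Python sketch. The fields of `Contact` and `Customer` and the sample data are assumed for illustration; the disclosed declaration is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Customer:  # hypothetical class
    name: str
    phone: str
    city: str


@dataclass
class Contact:  # hypothetical result type holding just the selected fields
    name: str
    phone: str


# Hypothetical sample data.
customers = [
    Customer("Ada", "555-0100", "Seattle"),
    Customer("Ben", "555-0101", "Boise"),
]

# Build one Contact per customer, keeping only name and phone.
contacts = [Contact(c.name, c.phone) for c in customers]
```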
Sequence operators use a method-like syntax that is ideally suited for composition into path-like queries. The following syntax combines a where and select operation to produce a sequence of the names of those customers that reside in California:
The following syntax produces a sequence of those orders that were placed by customers in California in the year 2003:
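The two composed queries can be approximated with Python comprehensions, which chain filtering, projection, and flattening in one expression. This is only an analog of the path-like syntax; the class names, field names, and data below are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Order:  # hypothetical class
    order_id: int
    order_date: date


@dataclass
class Customer:  # hypothetical class
    name: str
    state: str
    orders: list = field(default_factory=list)


# Hypothetical sample data.
customers = [
    Customer("Ada", "CA", [Order(1, date(2003, 5, 1)), Order(2, date(2004, 1, 9))]),
    Customer("Ben", "WA", [Order(3, date(2003, 7, 4))]),
    Customer("Cy", "CA", [Order(4, date(2003, 12, 31))]),
]

# where + select: names of customers in California.
ca_names = [c.name for c in customers if c.state == "CA"]

# where + flattening select + where: 2003 orders placed by California customers.
ca_orders_2003 = [
    o
    for c in customers
    if c.state == "CA"
    for o in c.orders
    if o.order_date.year == 2003
]
```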
Accordingly,
Referring now to
Accordingly, a count function is provided that computes the number of elements in a sequence. For example, the following counts the number of customers in the 98112 zip code:
An exists function checks whether a sequence contains any elements. The following returns a sequence of those customers that have one or more orders:
The min, max, sum, and avg functions compute the minimum, maximum, sum, and average of a sequence. For example, given a variable order of type Order, the following uses the sum function to compute the order total:
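The count, exists, and sum functions can be sketched in Python as shown below. The `OrderLine` fields (`unit_price`, `quantity`), the `exists` helper, and the sample data are assumptions; the sketch only mirrors the behavior described in the text.

```python
from dataclasses import dataclass, field


@dataclass
class OrderLine:  # hypothetical class
    unit_price: float
    quantity: int


@dataclass
class Order:  # hypothetical class
    lines: list = field(default_factory=list)


@dataclass
class Customer:  # hypothetical class
    name: str
    zip_code: int
    orders: list = field(default_factory=list)


def exists(source) -> bool:
    """Return True if the sequence yields at least one element."""
    return any(True for _ in source)


# Hypothetical sample data.
order = Order([OrderLine(2.5, 4), OrderLine(10.0, 1)])
customers = [
    Customer("Ada", 98112, [order]),
    Customer("Ben", 98112),
    Customer("Cy", 10001),
]

# count: number of customers in the 98112 zip code.
count_98112 = sum(1 for c in customers if c.zip_code == 98112)

# exists: customers that have one or more orders.
with_orders = [c.name for c in customers if exists(c.orders)]

# sum: total of unit price times quantity over the order's lines.
order_total = sum(line.unit_price * line.quantity for line in order.lines)
```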
At 704, an operator is provided that tests a specified condition for all elements of a sequence. As disclosed herein, an all operator returns true if the specified condition is true for all elements in a sequence. The following returns a sequence of those customers with orders that have always included wine:
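A Python analog of the all operator is the built-in `all`. In the sketch below the data layout (customer name mapped to a list of orders, each order a list of item names) is hypothetical. Note one assumption: Python's `all` is true for an empty sequence, so a customer with no orders passes the test; the text does not say how the disclosed operator treats that case.

```python
# Hypothetical data: customer name -> list of orders, each a list of item names.
orders_by_customer = {
    "Ada": [["wine", "cheese"], ["wine"]],
    "Ben": [["beer"], ["wine"]],
    "Cy": [],  # no orders at all
}

# Keep customers for whom every order includes wine.
always_wine = [
    name
    for name, orders in orders_by_customer.items()
    if all("wine" in order for order in orders)
]
```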
The element function throws an exception if the given sequence is empty and may throw an exception if the given sequence contains more than one element.
At 804, a function is provided that extracts a first and a last element of a sequence. As provided herein, first and last functions extract the first or last element of a sequence. Unlike element, first and last do not throw an exception if the sequence contains more than one element.
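The contrast between element and first/last can be sketched in Python as follows. The helper names mirror the text, but their bodies are assumptions: here `element` always rejects a longer sequence (the text says it only may), and `first` and `last` raise on an empty sequence, which the text does not specify.

```python
def element(source):
    """Return the only element; raise if the sequence is empty or longer than one."""
    it = iter(source)
    try:
        value = next(it)
    except StopIteration:
        raise ValueError("sequence is empty") from None
    for _ in it:
        raise ValueError("sequence contains more than one element")
    return value


def first(source):
    """Return the first element; extra elements are ignored."""
    for item in source:
        return item
    raise ValueError("sequence is empty")  # assumed behavior


def last(source):
    """Return the last element; extra elements are ignored."""
    missing = object()
    result = missing
    for item in source:
        result = item
    if result is missing:
        raise ValueError("sequence is empty")  # assumed behavior
    return result
```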
Multiple sort keys may be specified, separated by commas. A sort key may optionally be prefixed with ascending or descending. For example, the following produces a sequence of products ordered by category and, within each category, descending unit price:
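Ordering by several keys with mixed directions can be sketched in Python by relying on the stability of `sorted`: sort by the least significant key first, then by the more significant one. The `Product` class and data are hypothetical, and this is an analog rather than the disclosed ordering syntax.

```python
from dataclasses import dataclass


@dataclass
class Product:  # hypothetical class
    name: str
    category: str
    unit_price: float


# Hypothetical sample data.
products = [
    Product("Tea", "Beverages", 4.0),
    Product("Wine", "Beverages", 12.0),
    Product("Brie", "Dairy", 7.5),
    Product("Milk", "Dairy", 1.2),
]

# Secondary key first: unit price, descending.
by_price_desc = sorted(products, key=lambda p: p.unit_price, reverse=True)

# Primary key second: category, ascending. The sort is stable, so the
# descending price order survives within each category.
ordered = sorted(by_price_desc, key=lambda p: p.category)
```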
Each sort key is an expression of a type that implements
It is possible to perform an operation similar in semantics to the having clause in SQL (Structured Query Language) by adding a where clause after the groupby operator. The following code returns just the Categories where the number of items in stock is less than 600:
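Grouping followed by a filter on the aggregated groups can be sketched in Python as shown below. The `(category, units_in_stock)` tuples are hypothetical data; only the 600 threshold comes from the text.

```python
from collections import defaultdict

# Hypothetical data: (category, units in stock) per product.
products = [
    ("Beverages", 400),
    ("Beverages", 300),
    ("Dairy", 250),
    ("Produce", 100),
    ("Dairy", 200),
]

# Group step: accumulate units in stock per category.
stock_by_category = defaultdict(int)
for category, units in products:
    stock_by_category[category] += units

# Filter step applied after grouping, comparable to SQL's HAVING clause.
low_stock = [
    category for category, total in stock_by_category.items() if total < 600
]
```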
At 1204, an operator is provided that returns a sequence of only unique elements. A distinct operator is used to return a sequence that contains just unique elements. The compiler calls Equals on each object in the sequence to compare it with the other objects in the sequence. For example, the following code returns just the unique zip codes where customers live:
At 1206, an operator is provided that returns objects common to two or more collections. At 1208, an operator is provided that returns objects in at least one sequence that are not in another sequence. An intersect operator returns all the objects that are present in both collections, and an except operator is the complementary operation and returns all the objects that are present in one sequence, but not in the other. The following code returns all the cities where both employees and customers live:
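The distinct, intersect, and except operators can be approximated in Python as follows. The helper names, city lists, and the choice to compare with `==` (echoing the Equals comparison mentioned above) are assumptions; for hashable items a `set` would be faster.

```python
def distinct(source):
    """Yield each element once, comparing with == and keeping first-seen order."""
    seen = []
    for item in source:
        if item not in seen:  # 'in' on a list compares with ==
            seen.append(item)
            yield item


def intersect(left, right):
    """Yield the distinct elements of left that also occur in right."""
    right_items = list(right)
    return (item for item in distinct(left) if item in right_items)


def except_(left, right):
    """Yield the distinct elements of left that do not occur in right."""
    right_items = list(right)
    return (item for item in distinct(left) if item not in right_items)


# Hypothetical sample data.
customer_cities = ["Seattle", "Portland", "Seattle", "Boise"]
employee_cities = ["Portland", "Seattle", "Tacoma"]

unique_cities = list(distinct(customer_cities))
shared_cities = list(intersect(customer_cities, employee_cities))
customer_only = list(except_(customer_cities, employee_cities))
```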
The {c} sequence alias above associates the identifier c with the current element of customers. At 1302, a sequence alias can be introduced immediately before a where, select, any, or all operator, and is in scope in the expressions of each following where, select, any, or all operator. At 1304, when a sequence alias is in scope, the members of that current element can be accessed through the alias. In another implementation, the members of that current element can only be accessed through the alias. Thus, until employing the sequence alias, current member elements are implicit.
At 1306, sequence aliases can be employed when multiple current elements of the same type are within scope substantially simultaneously. At 1308, sequence aliases can be employed when similarly named members are in scope. For example, the following produces a sequence of the most expensive products in each category:
In the innermost where expression, two Product elements are in scope and a sequence alias is needed to use the outer element.
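The need to name the outer element can be seen in a Python analog, where both elements must always be named explicitly. In the sketch below `p` stands for the outer product and `q` for the inner one; the `Product` class and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Product:  # hypothetical class
    name: str
    category: str
    unit_price: float


# Hypothetical sample data.
products = [
    Product("Tea", "Beverages", 4.0),
    Product("Wine", "Beverages", 12.0),
    Product("Brie", "Dairy", 7.5),
    Product("Milk", "Dairy", 1.2),
]

# For each outer product p, compare against the maximum price among inner
# products q that share p's category. Without distinct names for p and q,
# the inner condition could not refer to the outer element.
most_expensive = [
    p
    for p in products
    if p.unit_price
    == max(q.unit_price for q in products if q.category == p.category)
]
```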
At 1310, a sequence alias extends the scope of a current element over each following where, select, any, and all operator. In the following example, the alias c extends over both of the select operators, allowing the second select operator to “reach back” and access the current customer:
A feature of the disclosed sequence operators is that they can provide deferred execution. A sequence object produced by a sequence operator is essentially a proxy for a deferred query. In the following example,
With deferred execution it is not necessary to materialize a query in a separate collection. For example, given a
it is possible to pass a query itself (rather than the results of a query) to the method:
The sequence passed to
Deferred execution provides benefits when sequence objects are composed. For example, consider the following rewritten version of the code above:
Because the custs and orders temporary sequences are not materialized, there is little or no cost associated with breaking the large query into multiple smaller queries.
With respect to a sequence and a database cursor, a database cursor represents the result of a query, and a sequence represents the query itself. A sequence can be enumerated multiple times and each enumeration re-executes the query.
In the code above, if customers were added or removed from the customers collection between the two foreach statements, the second foreach statement will reflect the changes.
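The re-execution behavior can be sketched in Python with an object whose `__iter__` runs the filter afresh on every enumeration. The `Where` class and the tuple-based customer data are hypothetical; the sketch shows deferred execution in general, not the disclosed implementation.

```python
class Where:
    """A deferred filter: stores the source and predicate, runs on enumeration."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        # A new pass over the source is started each time the query is enumerated.
        return (item for item in self.source if self.predicate(item))


# Hypothetical data: (name, zip code) pairs.
customers = [("Ada", 98112), ("Ben", 10001)]

# Building the query does no filtering yet.
query = Where(customers, lambda c: c[1] == 98112)

first_pass = [name for name, _ in query]

# A change to the underlying collection is visible on the next enumeration.
customers.append(("Cy", 98112))
second_pass = [name for name, _ in query]

# Copying into a list materializes the current results as a snapshot.
snapshot = list(query)
customers.append(("Di", 98112))
after_snapshot = [name for name, _ in query]
```

The snapshot keeps its two elements after the last append, while the query itself reflects the third matching customer.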
Materialization of a sequence can be forced by copying the sequence into a collection. For example, the
The code above illustrates a nice way in which sequence operators combine with the existing language. At the cost of one small object (the object created by the where operator) it is possible to pass a query as an argument to the
Accordingly,
An implementation strategy for sequence operators described supra works well for in-memory data structures. However, when a sequence<
The following would be a typical usage scenario:
A problem here can be that the query expressed by the where operator would execute locally. The enumerator of the
To permit the remoting of queries, an
uery
Similar to the sequence<
The
query<
The actual strategy for executing the query is up to
Certain restrictions can apply to predicate and projection expressions when the source sequence is a query<
Deferred execution means that it is possible to compose query trees. Consider the following method that takes a queryable sequence of customers as a parameter:
The following two invocations effectively pass encapsulated query trees to the method:
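The idea of passing a composable query, rather than results, to a method can be sketched as below. This is a deliberately simplified Python illustration: the `Query` class, the string-based conditions, the column names, and the SQL-like rendering are all assumptions, and a real implementation would carry expression trees rather than strings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """A hypothetical deferred query: a table name plus accumulated conditions."""

    table: str
    conditions: Tuple[str, ...] = ()

    def where(self, condition: str) -> "Query":
        # Composition returns a new, larger query; nothing is executed.
        return Query(self.table, self.conditions + (condition,))

    def to_sql(self) -> str:
        # Render the whole composed query as one SQL-like statement.
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql


def ordered_in_2003(custs: Query) -> Query:
    """Further restrict whatever customer query the caller passes in."""
    return custs.where("LastOrderYear = 2003")


all_customers = Query("Customers")

# Two invocations: one with the bare table, one with an already-filtered query.
sql_a = ordered_in_2003(all_customers).to_sql()
sql_b = ordered_in_2003(all_customers.where("State = 'CA'")).to_sql()
```

Because the method receives the query itself, the filter added inside the method and the filter supplied by the caller end up in a single statement that a data source could execute remotely.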
When the custs parameter is further queried in the method, it is possible for the underlying
Accordingly,
Few, if any, language extensions are required for O-R mappings to provide adequate support for Create operations. Referring to the
Accordingly, the operator 1700 provides an interface 1702 for denoting an updateable sequence. An
At 1704, a delete statement takes an
At 1706, an update statement is provided that applies a list of assignment statements to an
Because of deferred execution, sequences passed to delete and update are never actually materialized. Rather, the sequences are represented as expression trees, which is precisely the desired representation for an underlying O-R mapping. The assignment(s) specified in an update statement would likewise be represented as expression trees.
Only where operators can be permitted on updateable sequences. When a where operator is applied to an
The including operator in the query above indicates that each customer's
Referring now to
Generally, program modules include routines, programs, components, data structures, etc., that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which can be operatively coupled to one or more associated devices.
The illustrated aspects of the innovation may also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.
A computer typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer and includes both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media can comprise computer storage media and communication media. Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
With reference again to
The system bus 2008 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. The system memory 2006 includes read-only memory (ROM) 2010 and random access memory (RAM) 2012. A basic input/output system (BIOS) is stored in a non-volatile memory 2010 such as ROM, EPROM, EEPROM, which BIOS contains the basic routines that help to transfer information between elements within the computer 2002, such as during start-up. The RAM 2012 can also include a high-speed RAM such as static RAM for caching data.
The computer 2002 further includes an internal hard disk drive (HDD) 2014 (e.g., EIDE, SATA), which internal hard disk drive 2014 may also be configured for external use in a suitable chassis (not shown), a magnetic floppy disk drive (FDD) 2016 (e.g., to read from or write to a removable diskette 2018), and an optical disk drive 2020 (e.g., to read a CD-ROM disk 2022 or to read from or write to other high-capacity optical media such as a DVD). The hard disk drive 2014, magnetic disk drive 2016 and optical disk drive 2020 can be connected to the system bus 2008 by a hard disk drive interface 2024, a magnetic disk drive interface 2026 and an optical drive interface 2028, respectively. The interface 2024 for external drive implementations includes at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies. Other external drive connection technologies are within contemplation of the subject innovation.
The drives and their associated computer-readable media provide nonvolatile storage of data, data structures, computer-executable instructions, and so forth. For the computer 2002, the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer-readable media above refers to a HDD, a removable magnetic diskette, and a removable optical media such as a CD or DVD, it should be appreciated by those skilled in the art that other types of media which are readable by a computer, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the exemplary operating environment, and further, that any such media may contain computer-executable instructions for performing the methods of the disclosed innovation.
A number of program modules can be stored in the drives and RAM 2012, including an operating system 2030, one or more application programs 2032, other program modules 2034 and program data 2036. All or portions of the operating system, applications, modules, and/or data can also be cached in the RAM 2012. It is to be appreciated that the innovation can be implemented with various commercially available operating systems or combinations of operating systems.
A user can enter commands and information into the computer 2002 through one or more wired/wireless input devices, e.g., a keyboard 2038 and a pointing device, such as a mouse 2040. Other input devices (not shown) may include a microphone, an IR remote control, a joystick, a game pad, a stylus pen, touch screen, or the like. These and other input devices are often connected to the processing unit 2004 through an input device interface 2042 that is coupled to the system bus 2008, but can be connected by other interfaces, such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, etc.
A monitor 2044 or other type of display device is also connected to the system bus 2008 via an interface, such as a video adapter 2046. In addition to the monitor 2044, a computer typically includes other peripheral output devices (not shown), such as speakers, printers, etc.
The computer 2002 may operate in a networked environment using logical connections via wired and/or wireless communications to one or more remote computers, such as a remote computer(s) 2048. The remote computer(s) 2048 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computer 2002, although, for purposes of brevity, only a memory/storage device 2050 is illustrated. The logical connections depicted include wired/wireless connectivity to a local area network (LAN) 2052 and/or larger networks, e.g., a wide area network (WAN) 2054. Such LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, e.g., the Internet.
When used in a LAN networking environment, the computer 2002 is connected to the local network 2052 through a wired and/or wireless communication network interface or adapter 2056. The adapter 2056 may facilitate wired or wireless communication to the LAN 2052, which may also include a wireless access point disposed thereon for communicating with the wireless adapter 2056.
When used in a WAN networking environment, the computer 2002 can include a modem 2058, or is connected to a communications server on the WAN 2054, or has other means for establishing communications over the WAN 2054, such as by way of the Internet. The modem 2058, which can be internal or external and a wired or wireless device, is connected to the system bus 2008 via the serial port interface 2042. In a networked environment, program modules depicted relative to the computer 2002, or portions thereof, can be stored in the remote memory/storage device 2050. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.
The computer 2002 is operable to communicate with any wireless devices or entities operatively disposed in wireless communication, e.g., a printer, scanner, desktop and/or portable computer, portable data assistant, communications satellite, any piece of equipment or location associated with a wirelessly detectable tag (e.g., a kiosk, news stand, restroom), and telephone. This includes at least Wi-Fi and Bluetooth™ wireless technologies. Thus, the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices.
Wi-Fi, or Wireless Fidelity, allows connection to the Internet from a couch at home, a bed in a hotel room, or a conference room at work, without wires. Wi-Fi is a wireless technology similar to that used in a cell phone that enables such devices, e.g., computers, to send and receive data indoors and out, anywhere within the range of a base station. Wi-Fi networks use radio technologies called IEEE 802.11 (a, b, g, etc.) to provide secure, reliable, fast wireless connectivity. A Wi-Fi network can be used to connect computers to each other, to the Internet, and to wired networks (which use IEEE 802.3 or Ethernet). Wi-Fi networks operate in the unlicensed 2.4 and 5 GHz radio bands, at an 11 Mbps (802.11b) or 54 Mbps (802.11a) data rate, for example, or with products that contain both bands (dual band), so the networks can provide real-world performance similar to the basic 10BaseT wired Ethernet networks used in many offices.
Referring now to
The system 2100 also includes one or more server(s) 2104. The server(s) 2104 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 2104 can house threads to perform transformations by employing the invention, for example. One possible communication between a client 2102 and a server 2104 can be in the form of a data packet adapted to be transmitted between two or more computer processes. The data packet may include a cookie and/or associated contextual information, for example. The system 2100 includes a communication framework 2106 (e.g., a global communication network such as the Internet) that can be employed to facilitate communications between the client(s) 2102 and the server(s) 2104.
Communications can be facilitated via a wired (including optical fiber) and/or wireless technology. The client(s) 2102 are operatively connected to one or more client data store(s) 2108 that can be employed to store information local to the client(s) 2102 (e.g., cookie(s) and/or associated contextual information). Similarly, the server(s) 2104 are operatively connected to one or more server data store(s) 2110 that can be employed to store information local to the servers 2104.
What has been described above includes examples of the disclosed innovation. It is, of course, not possible to describe every conceivable combination of components and/or methodologies, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. Accordingly, the innovation is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.