Technology advancements and cost reductions over time have enabled computers to become commonplace in society. Enterprises employ computers to collect and analyze data. For instance, computers can be employed to capture data about business customers that can be utilized to track sales and/or customer demographics. Furthermore, individuals also interact with a plurality of non-enterprise computing devices including home computers, laptops, personal digital assistants, digital video and picture cameras, mobile devices, and the like. As a consequence of computer ubiquity, an enormous quantity of digital data is generated daily by both enterprises and individuals.
Computer operations are commonly performed through instruction sets generally referred to as programming languages. Programming languages are conventionally based upon a common syntax that enables a programmer to write commands in the language, and are continuously evolving to facilitate specification by programmers as well as efficient execution. For example, in the early days of computer languages, low-level machine code was prevalent. With machine code, a computer program or instructions comprising a computer program were written in machine languages or assembly languages and executed by the hardware (e.g., microprocessor). Such languages provided an efficient procedure to control computing hardware, but made it difficult for programmers to comprehend and develop sophisticated logic.
Subsequently, languages were introduced that provided various layers of abstraction. Accordingly, programmers could write programs at a higher level with a higher-level source language, which could then be converted via a compiler or interpreter to the lower level machine language understood by the hardware. Further advances in programming have provided additional layers of abstraction to allow more advanced programming logic to be specified much more quickly than ever before.
Moreover, the state of database integration in mainstream programming languages leaves a lot to be desired. Many specialized database programming languages exist, such as xBase, T/SQL, and PL/SQL, but these languages have weak and poorly extensible type systems, little or no support for object-oriented programming, and require dedicated run-time environments. Similarly, there is no shortage of general purpose programming languages, such as C#, VB.NET, C++, and Java, but data access in these languages typically takes place through cumbersome APIs that lack strong typing and compile-time verification. In addition, such APIs lack the ability to provide a generic interface to query data, data collections, and the like.
The following presents a simplified summary in order to provide a basic understanding of some aspects of the claimed subject matter. This summary is not an extensive overview. It is not intended to identify key/critical elements or to delineate the scope of the claimed subject matter. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
The subject innovation optimizes query translations at compile time in Language-Integrated Query (LINQ) languages via an optimization component, which optimizes algebraic trees and rewrites an expression composed from sequence operators into one or more efficient expressions. In a related aspect, a compiler associated with the optimization component can receive syntax (e.g., query comprehensions, query expressions) to turn into standard sequence operators that can operate on arbitrary collections. The compiler can then perform transformations on the algebraic trees, such as pushing filter conditions closer to the leaves (e.g., downward or upward) and/or combining filter conditions. The filter conditions can reduce unnecessary projections (e.g., elimination at the earliest stage), and “where” conditions can be optimized into join operations. Moreover, ordering and groupings can be pushed to the end of the operation or further up. As such, the optimizations performed by the optimization component can include: changing the order for iterating over collections, expressing nested iterations as joins, arbitrary nesting, pushing filter operations upfront, changing the orders therein, and the like.
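The effect of pushing a filter condition closer to the leaves can be illustrated with a minimal sketch. Python is used here only as a stand-in for the sequence operators; the helper names (counting, filter_late, filter_pushed_down) and the sample data are hypothetical and are not part of the subject innovation. The sketch assumes the predicate depends only on the outer collection and has no side effects.

```python
def counting(predicate):
    """Wrap a predicate so the number of evaluations can be observed."""
    calls = {"n": 0}

    def wrapped(x):
        calls["n"] += 1
        return predicate(x)

    return wrapped, calls


def filter_late(xs, ys, pred_x):
    # Filter applied after the nested iteration: evaluated once per (x, y) pair.
    return [(x, y) for x in xs for y in ys if pred_x(x)]


def filter_pushed_down(xs, ys, pred_x):
    # Filter pushed toward the leaf "xs": evaluated once per x.
    return [(x, y) for x in xs if pred_x(x) for y in ys]


xs = list(range(6))
ys = ["a", "b", "c"]

late_pred, late_calls = counting(lambda x: x % 2 == 0)
early_pred, early_calls = counting(lambda x: x % 2 == 0)

late = filter_late(xs, ys, late_pred)
early = filter_pushed_down(xs, ys, early_pred)
```

Both forms yield the same pairs in the same order; only the number of predicate evaluations differs (one per pair versus one per element of the outer collection).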
According to a further aspect, the optimizer component can customize the optimization process based on notions of the collections being defined. Hence, type information can be collected to analyze the algebraic tree and gather information for optimization. Moreover, some algebraic optimizations can hold true for any sequence operations. The compiler can evaluate different operations, wherein the compiler attempts to locate the operation that has minimum cost and is optimal. For different collection types, which can implement their own specific optimization rules, additional specific and/or customized rules can be valid based on the domain (e.g., the collection being finite or infinite, size of the collection, multiple runs being involved and data that can be passed from runtime to compile time, child nodes involved in the query tree, and the like).
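As a rough illustration of locating the operation with minimum cost, the following sketch compares candidate join strategies using collection sizes. The cost formulas and constants are illustrative assumptions made for this example only, and the function names (candidate_costs, choose_plan) are hypothetical.

```python
def candidate_costs(left_size, right_size):
    """Return estimated costs for hypothetical join strategies.

    Assumed cost model (illustrative only):
      - nested loop: one comparison per pair
      - hash join: two units per element of the build side, one per probe
    """
    return {
        "nested_loop": left_size * right_size,
        "hash_join_build_left": 2 * left_size + right_size,
        "hash_join_build_right": 2 * right_size + left_size,
    }


def choose_plan(left_size, right_size):
    """Pick the strategy with the lowest estimated cost."""
    costs = candidate_costs(left_size, right_size)
    return min(costs, key=costs.get)


large_vs_small = choose_plan(1000, 10)
tiny_vs_tiny = choose_plan(2, 2)
```

Under these assumptions, a large collection joined with a small one favors building the hash table on the small side, while two very small collections favor the plain nested loop.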
In a related methodology, the compiler operates on an algebraic tree and/or receives syntax, and subsequently performs a semantic analysis thereon. Results of the semantic analysis can be presented as a sequence of nodes in the form of the query tree, which can be transformed into sequence operator calls. In one aspect, the query syntax is translated into sequence operators, followed by a compile-time optimization phase that optimizes the code generated earlier. Such optimizations can be a combination of generic optimization rules, which typically are valid for all implementations of the standard sequence operator pattern, in conjunction with domain-specific optimizations that can be defined for a specific implementation of the standard sequence operators. Such rules and/or algebraic laws can be defined by a variety of methods, such as custom attributes or special rewrite rules (e.g., expressed as queries themselves). Some of the optimizations can further employ feedback from instrumented runs of the program. Hence, the compiler can generate a parse tree and perform semantic analysis, wherein the result is a query tree rather than a sequence of calls. By building a query tree (based on semantics) and supplying multiple passes that provide for transformations, expressions can be simplified to optimize execution. It is to be appreciated that in addition to the static/compile-time optimization, the subject innovation can employ a run-time optimization pass that performs further optimization of (in-memory) queries based on statistics and operational characteristics of the collection type on which the LINQ query is executed.
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the claimed subject matter are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways in which the subject matter may be practiced, all of which are intended to be within the scope of the claimed subject matter. Other advantages and novel features may become apparent from the following detailed description when considered in conjunction with the drawings.
The various aspects of the subject innovation are now described with reference to the annexed drawings, wherein like numerals refer to like or corresponding elements throughout. It should be understood, however, that the drawings and detailed description relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.
In general, the LINQ framework 130 defines standard query operators that allow code written in LINQ-enabled languages to filter, enumerate, and create projections of several types of collections using the same syntax. Such collections can include arrays, enumerable classes, XML, datasets from relational databases, and third party data sources. Moreover, the framework can employ features of the .NET Framework, new LINQ-related assemblies, and extensions to the C# and Visual Basic .NET languages. For example, LINQ can be viewed as a set of features that extends powerful query capabilities into the language syntax of C# and Visual Basic. The LINQ framework 130 can introduce standard, easily-learned patterns for querying and updating data, and can be extended to support potentially any type of data store, for example. LINQ can either access structures in memory or be translated into a remote call (e.g., in the form of constructs that are familiar to developers who have worked with database queries, such as structured query language (SQL) queries). Developers can use familiar clauses like ‘Where’ and ‘Order By’, just as they would with a database query, and the collection will return appropriate results.
The optimization component 140 can re-arrange clauses in the query and mitigate tasks of subsequent query operators. Accordingly, the optimization component 140 can facilitate supplying more intuitive ways of writing queries, wherein the optimization component 140 can then supply the proper replacement (e.g., nested operations, which are a more intuitive manner in which a user can write syntax, can be replaced by a join). As such, rather than directly creating sequence operators from an algebraic tree, pre-processing occurs via the optimization component based on algebraic rules 141, 142, 143 (1 to N, where N is an integer) that can be parameterized from external sources, or based on feedback from runtime operation and on size differences between different collections.
The following example illustrates a particular operation of the optimization component 140 of the subject innovation, wherein the conventional standard definition of translating query comprehensions is via a fixed set of rules. For example, in conventional systems, the following query
is conventionally and blindly translated into the following sequence operator expression
Dim Q = Xs.Where(Function(X) P(X)).Where(Function(X) R(X)).Select(Function(X) F(X))
Such code is sub-optimal, as it creates several unnecessary intermediate collections. The subject innovation can optimize such a query into the following
Dim Q = Xs.Where(Function(X) P(X) AndAlso R(X)).Select(Function(X) F(X))
Moreover, the subject innovation can further supply optimization, wherein an additional act to fuse the filter into the final selection can be provided
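The three forms discussed here (two chained filters, one combined filter, and a filter fused into the selection) can be compared in a minimal sketch. Python generators stand in for the standard sequence operators; the operator names (where, select, where_select) and the sample functions p, r, and f are hypothetical placeholders for P, R, and F, and are not taken from any particular library.

```python
def where(source, predicate):
    # Each call adds one intermediate iterator over the source.
    return (x for x in source if predicate(x))


def select(source, projection):
    return (projection(x) for x in source)


def where_select(source, predicate, projection):
    """Fused operator: filter and project in a single pass."""
    for x in source:
        if predicate(x):
            yield projection(x)


# Hypothetical stand-ins for P, R, and F.
def p(x):
    return x % 2 == 0


def r(x):
    return x > 4


def f(x):
    return x * x


xs = range(10)

# Fixed translation: two filters followed by a projection.
unoptimized = list(select(where(where(xs, p), r), f))

# Combined filter: a single "where" with a short-circuiting conjunction.
merged = list(select(where(xs, lambda x: p(x) and r(x)), f))

# Filter fused into the final selection.
fused = list(where_select(xs, lambda x: p(x) and r(x), f))
```

All three forms produce the same result; the rewrites are only valid under the assumption that the predicates have no side effects, since the combined form may skip evaluating the second predicate.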
Since the standard sequence operators represent monads/monoids, an optimizer can employ typically all of the standard monad and monoid laws to optimize queries, in conjunction with any additional laws that are applicable for standard sequence operators that go beyond standard monads (such as join, grouping, sorting, and the like). For example, the optimizer can replace a nested loop by a (hash) join.
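Replacing a nested loop by a hash join can be sketched as follows. This is an illustrative Python version under the assumption that join keys are hashable; the function names and the sample customers and orders data are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(xs, ys, key_x, key_y):
    # Compares every x with every y: len(xs) * len(ys) key comparisons.
    return [(x, y) for x in xs for y in ys if key_x(x) == key_y(y)]


def hash_join(xs, ys, key_x, key_y):
    # Build phase: index ys by key in one pass.
    index = defaultdict(list)
    for y in ys:
        index[key_y(y)].append(y)
    # Probe phase: look up matches for each x.
    return [(x, y) for x in xs for y in index.get(key_x(x), ())]


customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]  # (customer id, name)
orders = [(101, 1), (102, 3), (103, 1)]          # (order id, customer id)

by_loop = nested_loop_join(customers, orders, lambda c: c[0], lambda o: o[1])
by_hash = hash_join(customers, orders, lambda c: c[0], lambda o: o[1])
```

Because the index preserves the original order of the inner collection within each key, the hash join here returns the pairs in the same order as the nested loop.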
In addition, each collection type can also provide domain specific optimization. For example, if it is known that a collection is a set, the optimizer can use the knowledge that the order of elements is irrelevant for example by reordering the iteration over collections
From X In Xs, Y In Ys → From Y In Ys, X In Xs Select X, Y
Or if it is known that a list is sorted, parts of a list may be skipped when doing a join.
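One way in which sortedness allows parts of a list to be skipped during a join is a merge-style join. The sketch below is an illustrative assumption about how such a rule could be realized, not a description of a specific implementation; it assumes both inputs are sorted in ascending order and joins on equal values.

```python
def merge_join(xs, ys):
    """Join two ascending-sorted lists on equal values.

    Elements that cannot match are skipped by advancing whichever
    cursor points at the smaller value.
    """
    result = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            i += 1  # xs[i] is smaller than everything left in ys
        elif xs[i] > ys[j]:
            j += 1  # ys[j] is smaller than everything left in xs
        else:
            # Emit a pair for every y in the run of values equal to xs[i].
            k = j
            while k < len(ys) and ys[k] == xs[i]:
                result.append((xs[i], ys[k]))
                k += 1
            i += 1
    return result


left = [1, 3, 3, 5, 9]
right = [2, 3, 5, 5, 10]
joined = merge_join(left, right)
reference = [(x, y) for x in left for y in right if x == y]
```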
Even if the order of the collection is important, the optimizer component 140 can employ algebraic properties of the various lambda expressions (such as associativity, commutativity, idempotence, neutral elements, and the like) that are passed to the standard sequence operators to optimize queries.
In the example above, there exists a “where” node that has as its left child “Xs” (the source collection) and as its right child a function serving as the predicate. Moreover, there exists another “where” node on top of it, and on top of that a select; such a tree can be replaced with another tree that can be executed more efficiently (e.g., with lower costs for execution). As indicated above, the predicate of the second “where” clause can be pushed inside the predicate of the lowest “where”, so that the second tree has a single “where” whose predicate function of “x” is the combination of the functions (p(x) and r(x)), with the select on top of it. Hence, instead of performing two “where” clauses, only one “where” clause is performed. The final query can traverse the collection once and be performed efficiently, even though the results remain the same. Hence, in general, a fixed translation from the source languages to the query operators is not employed; instead, an analysis is performed that employs general rules, customized rules that are specific to the type of collections, and information based on previous runs.
Typically, trees 260, 270, 280 can represent the syntactic structure of a string according to some formal grammar, wherein the program that produces such trees is in the form of a parser, and the structure starts from a root node and ends in leaf nodes (e.g., parent-child relations). For example, the expression tree representation allows any suitable query processor to implement data operations (Where, Select, SelectMany, a filter function, a grouping function, a transformation function, and the like) therewith. Such a query processor allows data to be queried locally, remotely, or over a wire, regardless of programming language and/or format, wherein the system 200 allows a representation of the query expression to be created, then sent to the data and implemented remotely. Moreover, data in a remote location can be queried in the same way as data in the memory of a local computer.
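The tree replacement described above can be sketched as a small rewrite rule over a query tree. The node classes (Source, Where, Select) and the functions merge_wheres and evaluate are hypothetical names chosen for this illustration; the sketch assumes predicates are free of side effects.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Source:
    name: str


@dataclass(frozen=True)
class Where:
    child: "Node"
    predicate: Callable


@dataclass(frozen=True)
class Select:
    child: "Node"
    projection: Callable


Node = Union[Source, Where, Select]


def merge_wheres(node):
    """Rewrite Where(Where(t, p), r) into Where(t, p and r), bottom-up."""
    if isinstance(node, Source):
        return node
    child = merge_wheres(node.child)
    if isinstance(node, Select):
        return Select(child, node.projection)
    if isinstance(child, Where):
        inner, outer = child.predicate, node.predicate
        return Where(child.child, lambda x: inner(x) and outer(x))
    return Where(child, node.predicate)


def evaluate(node, env):
    """Interpret a query tree against named in-memory collections."""
    if isinstance(node, Source):
        return iter(env[node.name])
    if isinstance(node, Where):
        return (x for x in evaluate(node.child, env) if node.predicate(x))
    return (node.projection(x) for x in evaluate(node.child, env))


tree = Select(
    Where(Where(Source("Xs"), lambda x: x % 2 == 0), lambda x: x > 4),
    lambda x: x * x,
)
optimized = merge_wheres(tree)
env = {"Xs": range(10)}
before = list(evaluate(tree, env))
after = list(evaluate(optimized, env))
```

After the rewrite, the tree contains a single “where” node directly above the source, and evaluation yields the same results as the original tree.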
Upon creation of the expression tree representation, a query processor (not shown) can be implemented to provide a query result. As such the expression tree representation 260, 270, 280 can be employed by any suitable query processor(s) to allow for the querying of data. For example, the query processor(s) can be in form of “plug-in” to allow the utilization of any suitable query operation and/or data operation.
Some of the optimizations can further employ feedback from instrumented runs of the program. Hence, the compiler can generate a parse tree and perform semantic analysis, wherein the result is a query tree rather than a sequence of calls. By building a query tree (based on semantics) and supplying multiple passes that provide for transformations, expressions can be simplified to optimize execution. In addition to the static/compile-time optimization, the subject innovation can employ a run-time optimization pass that performs further optimization of (in-memory) queries based on statistics and operational characteristics of the collection type on which the LINQ query is executed.
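A run-time pass driven by statistics could, for instance, reorder the conjuncts of a filter so that the most selective predicate runs first. The following sketch is one hypothetical realization: the pass rates are measured on a sample, and the names (order_by_selectivity, where_all) are illustrative. It assumes predicates are free of side effects and of comparable cost.

```python
def order_by_selectivity(predicates, sample):
    """Sort predicates so those passing the fewest sample items come first."""
    def pass_rate(predicate):
        return sum(1 for x in sample if predicate(x)) / len(sample)

    return sorted(predicates, key=pass_rate)


def where_all(source, predicates):
    # all() short-circuits, so a selective first predicate saves later calls.
    return [x for x in source if all(p(x) for p in predicates)]


def is_non_negative(x):
    return x >= 0


def is_multiple_of_100(x):
    return x % 100 == 0


data = list(range(1000))
original_order = [is_non_negative, is_multiple_of_100]
tuned_order = order_by_selectivity(original_order, data[:200])

result_original = where_all(data, original_order)
result_tuned = where_all(data, tuned_order)
```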
The AI component 730 can employ any of a variety of suitable AI-based schemes as described supra in connection with facilitating various aspects of the herein described invention. For example, a process for learning explicitly or implicitly how or which rule to employ can be facilitated via an automatic classification system and process. Classification can employ a probabilistic and/or statistical-based analysis (e.g., factoring into the analysis utilities and costs) to prognose or infer an action that a user desires to be automatically performed. For example, a support vector machine (SVM) classifier can be employed. Other classification approaches, including Bayesian networks, decision trees, and probabilistic classification models providing different patterns of independence, can also be employed. Classification as used herein also is inclusive of statistical regression that is utilized to develop models of priority.
As will be readily appreciated from the subject specification, the subject innovation can employ classifiers that are explicitly trained (e.g., via generic training data) as well as implicitly trained (e.g., via observing user behavior, receiving extrinsic information) so that the classifier is used to automatically determine, according to predetermined criteria, which answer to return to a question. For example, with respect to SVMs, which are well understood, SVMs are configured via a learning or training phase within a classifier constructor and feature selection module. A classifier is a function that maps an input attribute vector, x=(x1, x2, x3, x4, xn), to a confidence that the input belongs to a class, that is, f(x)=confidence(class).
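The mapping f(x)=confidence(class) can be illustrated with a minimal linear scorer followed by a logistic squashing function. This is a simple stand-in chosen for brevity, not an SVM and not the classifier of the subject innovation; the weights and bias are hypothetical values that would normally come from a training phase.

```python
import math


def confidence(x, weights, bias):
    """Map an attribute vector x to a confidence in (0, 1) for one class."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical, hand-picked parameters for illustration only.
weights = [0.5, -0.25]
bias = 0.0

neutral = confidence([1.0, 2.0], weights, bias)   # score is exactly 0
likely = confidence([4.0, 0.0], weights, bias)    # positive score
unlikely = confidence([0.0, 4.0], weights, bias)  # negative score
```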
The compiler 810 can accept as input a file having source code associated with processing of a sequence of elements. The source code can include various expressions and associated functions, methods and/or other programmatic constructs. The compiler 810 may process source code in conjunction with one or more components for analyzing constructs and generating or injecting code.
A front-end component 820 reads and performs lexical analysis upon the source code. In essence, the front-end component 820 reads and translates a sequence of characters (e.g., alphanumeric) in the source code into syntactic elements or tokens, indicating constants, identifiers, operator symbols, keywords, and punctuation among other things. The converter component 830 parses the tokens into an intermediate representation. For instance, the converter component 830 can check syntax and group tokens into expressions or other syntactic structures, which in turn coalesce into statement trees. Conceptually, these trees form a parse tree 870. Furthermore and as appropriate, the converter component 830 can place entries into a symbol table 860 that lists symbol names and type information used in the source code along with related characteristics.
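The lexical analysis step can be sketched with a small regular-expression tokenizer. The token categories, the keyword list, and the function name tokenize are assumptions made for this illustration and cover only a tiny query-like subset; the set of identifiers collected at the end is a simplified stand-in for symbol table entries.

```python
import re

# Hypothetical token categories for a small query-like language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:From|In|Where|Select)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[=<>+\-*/]"),
    ("PUNCT", r"[(),.]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Translate a character sequence into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # Lexical error: no token category matches at this position.
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("From X In Xs Where X > 4 Select X")
identifiers = {text for kind, text in tokens if kind == "IDENT"}
```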
A state 880 can be employed to track the progress of the compiler 810 in processing the received or retrieved source code and forming the parse tree 870. For example, different state values indicate that the compiler 810 is at the start of a class definition or functions, has just declared a class member, or has completed an expression. As the compiler progresses, it continually updates the state 880. The compiler 810 can partially or fully expose the state 880 to an outside entity, which can then provide input to the compiler 810.
Based upon constructs or other signals in the source code (or if the opportunity is otherwise recognized), the converter component 830 or another component can inject code to facilitate efficient and proper execution. Rules coded into the converter component 830 or other component indicate what must be done to implement the desired functionality and identify locations where the code is to be injected or where other operations are to be carried out. Injected code typically includes added statements, metadata, or other elements at one or more locations, but this term can also include changing, deleting, or otherwise modifying existing source code. Injected code can be stored as one or more templates or in some other form. In addition, it should be appreciated that symbol table manipulations and parse tree transformations can take place.
Based on the symbol table 860 and the parse tree 870, a back-end component 840 can translate the intermediate representation into output code. The back-end component 840 converts the intermediate representation into instructions executable in or by a target processor, into memory allocations for variables, and so forth. The output code can be executable by a real processor, but output code that is executable by a virtual processor can also be provided.
Furthermore, the front-end component 820 and the back-end component 840 can perform additional functions, such as code optimization, and can perform the described operations as a single phase or in multiple phases. Various other aspects of the components of compiler 810 are conventional in nature and can be substituted with components performing equivalent functions. Additionally, at various stages during processing of the source code, an error checker component 850 can check for errors such as errors in lexical structure, syntax errors, and even semantic errors. Upon detecting an error, the checker component 850 can halt compilation and generate a message indicative of the error.
As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an instance, an executable, a thread of execution, a program and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The word “exemplary” is used herein to mean serving as an example, instance or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Similarly, examples are provided herein solely for purposes of clarity and understanding and are not meant to limit the subject innovation or portion thereof in any manner. It is to be appreciated that a myriad of additional or alternate examples could have been presented, but have been omitted for purposes of brevity.
Furthermore, all or portions of the subject innovation can be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware or any combination thereof to control a computer to implement the disclosed innovation. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive . . . ). Additionally it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
In order to provide a context for the various aspects of the disclosed subject matter,
With reference to
The system bus 918 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 11-bit bus, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Integrated Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Accelerated Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
The system memory 916 includes volatile memory 920 and nonvolatile memory 922. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 912, such as during start-up, is stored in nonvolatile memory 922. By way of illustration, and not limitation, nonvolatile memory 922 can include read only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory 920 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
Computer 912 also includes removable/non-removable, volatile/non-volatile computer storage media.
It is to be appreciated that
A user enters commands or information into the computer 912 through input device(s) 936. Input devices 936 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 914 through the system bus 918 via interface port(s) 938. Interface port(s) 938 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 940 use some of the same type of ports as input device(s) 936. Thus, for example, a USB port may be used to provide input to computer 912, and to output information from computer 912 to an output device 940. Output adapter 942 is provided to illustrate that there are some output devices 940 like monitors, speakers, and printers, among other output devices 940 that require special adapters. The output adapters 942 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 940 and the system bus 918. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 944.
Computer 912 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 944. The remote computer(s) 944 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 912. For purposes of brevity, only a memory storage device 946 is illustrated with remote computer(s) 944. Remote computer(s) 944 is logically connected to computer 912 through a network interface 948 and then physically connected via communication connection 950. Network interface 948 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
Communication connection(s) 950 refers to the hardware/software employed to connect the network interface 948 to the bus 918. While communication connection 950 is shown for illustrative clarity inside computer 912, it can also be external to computer 912. The hardware/software necessary for connection to the network interface 948 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
What has been described above includes various exemplary aspects. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing these aspects, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. Accordingly, the aspects described herein are intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.
Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.