1. Field of the Invention
The present application relates generally to an improved data processing apparatus and method and more specifically to an apparatus and method for performing a derivation of augmented serialized data sets from base serialized data sets.
2. Background of the Invention
Currently, there are two important trends motivating new enterprise information integration methods. The first trend is happening inside the enterprise where there is an increasing demand by enterprise business leaders to be able to exploit information residing outside traditional information technology (IT) silos in efforts to react to situational business needs. The predominant share of enterprise business data resides on desktops, departmental file systems, and corporate intranets in the form of spreadsheets, presentations, email, Web services, HyperText Markup Language (HTML) pages, etc. There is a wealth of valuable information to be gleaned from such data; consequently, there is an increasing demand for applications that may consume the data, combine the data with data in corporate databases, content management systems, and other IT managed repositories, and then transform the combined data into timely information.
Consider, for example, a scenario where a prudent bank manager wants to be notified when a recent job applicant's credit score dips below 500, so that she might avoid a potentially costly hiring mistake by dropping an irresponsible applicant from consideration. Data on recent applicants resides on her desktop, in a personal spreadsheet. Access to credit scores is available via a corporate database. She persuades a contract programmer in the accounting department to build her a Web application that combines the data from these two sources on demand, producing an Atom feed that she may view for changes via her feed reader.
The second trend is happening outside the enterprise where the Web has evolved from primarily a publication platform to a participatory platform, spurred by Web 2.0 paradigms and technologies that are fueling an explosion in collaboration, communities, and the creation of user-generated content. The main driver propelling this advancement of the Web as an extensible development platform is the plethora of valuable data and services being made available, along with the lightweight programming and deployment technologies which allow these “resources” to be mixed and published in innovative new ways.
Standard data interchange formats such as Extensible Markup Language (XML) and JavaScript™ Object Notation (JSON), as well as prevalent syndication formats such as Really Simple Syndication (RSS) and Atom, allow resources to be published in formats readily consumed by Web applications, while lightweight access protocols, such as Representational State Transfer (REST), simplify access to these resources. Furthermore, Web-oriented programming technologies like Asynchronous JavaScript™ and XML (AJAX), PHP: Hypertext Preprocessor (PHP), and Ruby on Rails™ enable quick and easy creation of “mashups”, a term that has been popularized to refer to composite Web applications that use resources from multiple sources.
In one illustrative embodiment, a method, in a data processing system, is provided for data integration in a data processing system. The illustrative embodiments receive a data mashup specification and execute an interleaved sequence of operations as defined by the data mashup specification. In the illustrative embodiments, the interleaved sequence of operations comprises at least one of an import operation, an augment operation, or a publish operation. In executing the interleaved sequence of operations, the illustrative embodiments determine a next operation to execute, form an outer context, and add the outer context to a binding context of the next operation. Responsive to the next operation being the import operation, the illustrative embodiments import a data resource from a data source and generate an input generic feed. Responsive to the next operation being the augment operation, the illustrative embodiments produce a set of augmented generic feeds from a set of input generic feeds. Responsive to the next operation being the publish operation, the illustrative embodiments produce a new data resource from a specified augmented generic feed.
In other illustrative embodiments, a computer program product comprising a computer useable or readable medium having a computer readable program is provided. The computer readable program, when executed on a computing device, causes the computing device to perform various ones, and combinations of, the operations outlined above with regard to the method illustrative embodiment.
In yet another illustrative embodiment, a system/apparatus is provided. The system/apparatus may comprise one or more processors and a memory coupled to the one or more processors. The memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones, and combinations of, the operations outlined above with regard to the method illustrative embodiment.
These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the exemplary embodiments of the present invention.
The invention, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, radio frequency (RF), etc.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java™, Smalltalk™, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The illustrative embodiments are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the illustrative embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The illustrative embodiments provide a mechanism for data integration that allows enterprise mashups (i.e. situational applications) to be built quickly and easily. The data integration mechanism performs data integration logic of an application, thereby allowing the enterprise mashup developer to focus on the application's business logic. In particular, the illustrative embodiments disclose a system for data integration that:
Thus, the illustrative embodiments may be utilized in many different types of data processing environments including a distributed data processing environment, a single data processing device, or the like. In order to provide a context for the description of the specific elements and functionality of the illustrative embodiments,
With reference now to the figures and in particular with reference to
With reference now to the figures,
In the depicted example, server 104 and server 106 are connected to network 102 along with storage unit 108. In addition, clients 110, 112, and 114 are also connected to network 102. These clients 110, 112, and 114 may be, for example, personal computers, network computers, or the like. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to the clients 110, 112, and 114. Clients 110, 112, and 114 are clients to server 104 in the depicted example. Distributed data processing system 100 may include additional servers, clients, and other devices not shown.
In the depicted example, distributed data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages. Of course, the distributed data processing system 100 may also be implemented to include a number of different types of networks, such as for example, an intranet, a local area network (LAN), a wide area network (WAN), or the like. As stated above,
With reference now to
In the depicted example, data processing system 200 employs a hub architecture including north bridge and memory controller hub (NB/MCH) 202 and south bridge and input/output (I/O) controller hub (SB/ICH) 204. Processing unit 206, main memory 208, and graphics processor 210 are connected to NB/MCH 202. Graphics processor 210 may be connected to NB/MCH 202 through an accelerated graphics port (AGP).
In the depicted example, local area network (LAN) adapter 212 connects to SB/ICH 204. Audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, hard disk drive (HDD) 226, CD-ROM drive 230, universal serial bus (USB) ports and other communication ports 232, and PCI/PCIe devices 234 connect to SB/ICH 204 through bus 238 and bus 240. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash basic input/output system (BIOS).
HDD 226 and CD-ROM drive 230 connect to SB/ICH 204 through bus 240. HDD 226 and CD-ROM drive 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. Super I/O (SIO) device 236 may be connected to SB/ICH 204.
An operating system runs on processing unit 206. The operating system coordinates and provides control of various components within the data processing system 200 in
As a server, data processing system 200 may be, for example, an IBM® eServer™ System p® computer system, running the Advanced Interactive Executive (AIX®) operating system or the LINUX® operating system (eServer, System p, and AIX are trademarks of International Business Machines Corporation in the United States, other countries, or both while LINUX is a trademark of Linus Torvalds in the United States, other countries, or both). Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors in processing unit 206. Alternatively, a single processor system may be employed.
Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as HDD 226, and may be loaded into main memory 208 for execution by processing unit 206. The processes for illustrative embodiments of the present invention may be performed by processing unit 206 using computer usable program code, which may be located in a memory such as, for example, main memory 208, ROM 224, or in one or more peripheral devices 226 and 230, for example.
A bus system, such as bus 238 or bus 240 as shown in
Those of ordinary skill in the art will appreciate that the hardware in
Moreover, the data processing system 200 may take the form of any of a number of different data processing systems including client computing devices, server computing devices, a tablet computer, laptop computer, telephone or other communication device, a personal digital assistant (PDA), or the like. In some illustrative examples, data processing system 200 may be a portable computing device which is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data, for example. Essentially, data processing system 200 may be any known or later developed data processing system without architectural limitation.
The mechanisms of the illustrative embodiments for data integration described herein integrate data resources using a process of generic feed augmentation.
Import operations 308 typically retrieve a data resource from a data source via popular Web protocols such as Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Simple Object Access Protocol (SOAP) or the like. Import operations 308 map a retrieved data resource into generic feeds 316 by:
Augment operations 310 produce augmented generic feeds 318 by subsequently applying augment operations to generic feeds 316 produced by import operations 308. For example, a group augmentation operation partitions and aggregates the payloads of an input generic feed from generic feeds 316 according to a specified grouping key. Augmented generic feeds 318 produced by the group augmentation operation have one new payload per distinct grouping key value, where each new payload represents the aggregation of all input payloads having the same grouping key value. Publish operations 312 transform one or more of augmented generic feeds 318 into new data resource 320 by calling an appropriate transformation function specific to the desired data resource type such as XML, RSS, Atom, JSON, CSV, HTML, or other MIME types.
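By way of a non-limiting illustration, the group augmentation operation described above may be sketched as follows. The sketch assumes, purely for illustration, that each feed payload is modeled as a Python dictionary; the function name `group_feed` and the field names `state` and `city` are hypothetical and do not appear in the illustrative embodiments.

```python
def group_feed(entries, key_name, nest_name):
    """Partition payloads by key_name and aggregate the nest_name values per group."""
    groups = {}  # insertion-ordered: groups appear in order of first occurrence
    for payload in entries:
        groups.setdefault(payload[key_name], []).append(payload[nest_name])
    # One new payload per distinct grouping key value.
    return [{key_name: key, nest_name: values} for key, values in groups.items()]


# Hypothetical input generic feed, modeled as a list of payload dictionaries.
feed = [
    {"state": "TX", "city": "Austin"},
    {"state": "CA", "city": "Fresno"},
    {"state": "TX", "city": "Dallas"},
]
grouped = group_feed(feed, "state", "city")
```

In this sketch, the two payloads sharing the grouping key value `TX` are aggregated into a single output payload, while the `CA` payload forms a group of its own.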
Thus, a data mashup is a parameterized program of operators. Each operator corresponds to one of import operations 308, augment operations 310, or publish operations 312. Data integration mechanism 300 receives the data mashup and relevant parameter values in the form of data mashup specification 314 from calling application 322. Data integration mechanism 300 executes the operators of the data mashup specification and returns the integrated data resources produced by executing the data mashup as new data resource 320 to calling application 322.
In the preferred embodiment of the illustrative embodiments, a data mashup is represented as a data flow network of operators that interoperate in a demand-driven data flow fashion. The producer and consumer relationship between operators in the data flow network determines the sequence in which the import, augment, and publish operations of the data mashup are applied. Operators exchange data in the form of tuples. Each tuple may contain one or more named data objects. A data object represents either a generic feed, as might be produced by an operator representing an import or augment operation, or a data resource, as might be produced by an operator representing a publish operation.
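As a minimal sketch of the demand-driven data flow described above, Python generators may stand in for operators: each operator produces an entry only when its consumer requests one. All names (`import_op`, `filter_op`, `publish_op`), the sample rows, and the choice of JSON as the published data resource type are assumptions made for illustration only.

```python
import json

pulled = []  # records which source rows have actually been requested


def import_op(rows):
    """Import: yield one feed entry per source row, only when a consumer asks."""
    for row in rows:
        pulled.append(row["name"])
        yield dict(row)


def filter_op(feed, predicate):
    """Augment: pass through only the entries that satisfy the predicate."""
    for entry in feed:
        if predicate(entry):
            yield entry


def publish_op(feed):
    """Publish: drain the feed and serialize it as a JSON data resource."""
    return json.dumps(list(feed), sort_keys=True)


source = [{"name": "a", "score": 480}, {"name": "b", "score": 710}]

# Wiring the operators together does not move any data yet.
feed = filter_op(import_op(source), lambda entry: entry["score"] < 500)
pulled_before_publish = list(pulled)

# The publish operator, as final consumer, drives the upstream operators.
resource = publish_op(feed)
```

The producer and consumer relationship is expressed here by passing one generator to the next; no source row is read until the publish step requests entries.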
A generic feed is represented by a sequence of nodes according to the XDM data model. Each node in a sequence representing a generic feed corresponds to a feed entry. The root node of the feed entry represents a container for the feed payload. Generic feeds 316 restrict the child nodes of a container node to element nodes; however, other nodes in the sub-tree rooted at the container node may be any XDM node. Child nodes of a container node correspond to the payload of the feed entry. In general, operators iterate over the container nodes of the sequence and perform filtering, joins, aggregation, and other set manipulations that involve the extraction and comparison of attribute and element values of the payload.
Data mashup operators may also have operands. Operands provide an operator with input parameters. A Uniform Resource Locator (URL) of a data resource is an example of an operand that might be provided to an operator representing an import operation. Operands may also be used to define an operator's relationship to other operators in the data mashup. For example, the operands to an operator representing a group augmentation operation would include the operator that produces the generic feed to be grouped and aggregated.
Operands may refer to variables. For example, a URL identifying a data resource that represents hotel reviews might receive the hotel name via a URL variable. A binding context provided to each operator provides the values of any variables the operator requires. The values of variables provided by the binding context might come either from parameters passed to the data mashup by the calling application, or from data imported into the data mashup via the execution of other operators. In one illustrative embodiment, a data mashup exchanges data with an application according to a REST protocol.
The main data processing logic of operators, such as a Merge operator, Filter operator, Annotate operator, Group operator, Transform operator, Sort operator, Union operator, or the like, in the illustrative embodiments may be implemented by evaluating XQuery expressions using XQuery engine 304 over the XDM sequences used to represent the generic feeds and data resources that are input to the operator. There are a variety of ways to implement XQuery engine 304 (e.g. DB2, Oracle, or the like) with bindings to popular programming languages (e.g. PHP, Java, or the like) that may be used by the data integration mechanism to evaluate such expressions. The specific XQuery expression(s) used by a particular operator instance to perform its data manipulation logic may be generated dynamically from a basic template and the operands passed to the operator.
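The dynamic generation of an XQuery expression from a basic template and operands may be sketched as follows. The template text, the function name `build_filter_xquery`, and the variable names `$feed` and `$entry` are hypothetical; the sketch only builds the expression string and assumes that an XQuery engine binds `$feed` and evaluates the result.

```python
# Hypothetical basic template for a Filter operator.
FILTER_TEMPLATE = "for $entry in $feed where $entry/{path} {op} {value} return $entry"

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def build_filter_xquery(path, op, value):
    """Fill the template with operand values and return the XQuery text."""
    if op not in ALLOWED_OPS:
        raise ValueError("unsupported comparison operator: " + op)
    if isinstance(value, str):
        # XQuery string literals escape '&' as an entity and double embedded quotes.
        escaped = value.replace("&", "&amp;").replace('"', '""')
        literal = '"' + escaped + '"'
    else:
        literal = str(value)  # numeric operands are emitted as-is
    return FILTER_TEMPLATE.format(path=path, op=op, value=literal)


query = build_filter_xquery("policy/state", "=", "Texas")
```

The sketch restricts the comparison operator to a fixed set and quotes string operands so that operand values cannot alter the structure of the generated expression.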
Import operators 402 and 404 are responsible for performing import operations. As shown in box 418, Import operator 402 imports a data resource containing policy holder data into a generic feed. The HTTP protocol (as specified by the “protocol” operand) may be used to retrieve the policy holder data from an intranet data source. The URL http://w3.dept3.com/policies.csv (as specified by the “data resource locator” operand) identifies the data resource. The data resource type may be a comma separated values (text/csv MIME type) file (as specified by the “data resource type” operand). An exemplary CSV representation 500 of policy holder data is shown in
Returning to
The output of one straightforward implementation of an ingestion function that maps CSV formatted data to an XML representation is illustrated in
Returning to
The primary xpath statement is given by “//row” in the example and so the payload is extracted from under each of the “row” elements. The secondary xpath statement is given by “./node( )” in the example and so the payload of each entry in the resultant generic feed contains all child elements of the corresponding row element.
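The two-step mapping described above (ingesting CSV data into an XML representation, then extracting a payload from under each repeating element) may be sketched as follows. The column names and the `table` root element are assumptions, since the actual representations appear only in the figures. The standard-library `xml.etree.ElementTree` module supports only a subset of XPath, so the sketch uses `iter("row")` in place of “//row” and the child elements of each row in place of “./node( )”.

```python
import csv
import io
import xml.etree.ElementTree as ET


def ingest_csv(text):
    """Map CSV text to XML: one <row> per record, one child element per column."""
    root = ET.Element("table")
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, "row")
        for column, value in record.items():
            ET.SubElement(row, column).text = value
    return root


def to_generic_feed(root):
    """Build one feed entry per repeating element; its payload is the row's children."""
    feed = []
    for row in root.iter("row"):   # stands in for the primary path "//row"
        feed.append(list(row))     # stands in for the secondary path "./node()"
    return feed


# Hypothetical policy holder data; column names must be valid XML element names.
csv_text = "name,city,state\nAda,Austin,TX\nBob,Dallas,TX\n"
feed = to_generic_feed(ingest_csv(csv_text))
```

Each entry of `feed` is a list of child elements, which corresponds to the restriction that the children of a container node are element nodes.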
Returning to
As shown in box 420, Import operator 404 maps a data resource representing severe weather data from a web data source into a generic feed. The severe weather events for a given state are made available via an RSS feed (data resource type application/rss+xml) at http://www.nws.com/$state (the data resource locator operand). Note that the URL references the variable $state. (Variables are denoted by a $ as the first character.) The value of the variable is provided to the Import operator via its binding context. The binding context provided to each operator is initialized with any input parameters passed to the data mashup when it is invoked. In the example, the value “Texas” is provided for $state (the box labeled “Data mashup binding context”). Import operator 404 replaces the $state variable in the URL with the value “Texas” to form the URL http://www.nws.com/Texas which it then uses to retrieve the RSS feed data resource.
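The variable replacement performed by the Import operator may be sketched with the `string.Template` class of the Python standard library, which happens to use the same $-prefixed variable notation. The function name `resolve_locator` is hypothetical, and modeling the binding context as a dictionary is an assumption made for illustration.

```python
from string import Template


def resolve_locator(locator, binding_context):
    """Replace each $variable in the locator with its value from the binding context."""
    # substitute() raises KeyError if the locator references an unbound variable.
    return Template(locator).substitute(binding_context)


binding_context = {"state": "Texas"}
url = resolve_locator("http://www.nws.com/$state", binding_context)
```

A locator that references a variable absent from the binding context fails fast in this sketch rather than producing a malformed URL.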
Returning to
As shown in box 422, Merge operator 406 is one type of augment operator. Merge operator 406 produces a new generic feed by merging two input generic feeds according to a specified merge condition. In the example, Merge operator 406 merges the generic feeds produced by Import operators 402 and 404 (specified by the “left feed” and “right feed” operands, respectively) in order to produce a new generic feed whose entries represent policy holders affected by severe weather events. Merge operator 406 may be analogous to a relational join operator. Merge operator 406 forms the new feed by concatenating the payloads of the two input feeds that match according to the specified merge condition (provided by the “merge condition” operand). In the example, the payload of the output feed entries produced by Merge operator 406 is comprised of both policy holder payload and severe weather payload that match according to city and state.
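A minimal sketch of the merge just described follows, using a nested-loop comparison analogous to a relational join. Payloads are modeled, as an assumption, as lists of (name, value) pairs so that concatenation keeps every element of both payloads; the helper names and sample data are hypothetical.

```python
def field(payload, name):
    """Return the first value stored under name in a payload, or None."""
    for key, value in payload:
        if key == name:
            return value
    return None


def merge_feeds(left_feed, right_feed, condition):
    """Concatenate the payloads of every left/right pair that satisfies the condition."""
    merged = []
    for left in left_feed:
        for right in right_feed:
            if condition(left, right):
                merged.append(left + right)
    return merged


def same_city_and_state(left, right):
    """Merge condition: both payloads name the same city and state."""
    return (field(left, "city"), field(left, "state")) == (
        field(right, "city"), field(right, "state"))


policies = [
    [("holder", "Ada"), ("city", "Austin"), ("state", "TX")],
    [("holder", "Bob"), ("city", "Reno"), ("state", "NV")],
]
weather = [[("event", "Tornado"), ("city", "Austin"), ("state", "TX")]]
merged = merge_feeds(policies, weather, same_city_and_state)
```

Only the policy holder located in the city and state of a weather event appears in the output feed; unmatched entries are dropped in this sketch.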
Returning to
As shown in box 424, Publish operator 408 is responsible for performing publish operations. In the example, Publish operator 408 transforms the generic feed produced by Merge operator 406 (as specified by the “input feed” operand) into an Atom formatted data resource (as specified by the “output data type” operand). In general, Publish operator 408 transforms a generic feed into a data resource by applying a transformation function specific to the desired output data resource type. Any arguments required by the transformation function are passed as operands to Publish operator 408. In the example, the transformation function that translates an augmented generic feed, such as augmented generic feed 1000 of
As aforementioned, the basic data manipulation logic of an operator is performed through the generation and evaluation of XQuery expressions by an XQuery engine, such as XQuery engine 304 of
In the preferred embodiment of the invention, REST interfaces (i.e. XML over HTTP) are provided for defining a data mashup and for retrieving the result. The data mashup may be described to the data integration system by an XML document. Elements and attributes of the XML representation of the data mashup are understood by the data integration system as data mashup operators and operands. When the data integration system receives the data mashup, the data integration system performs basic processing of the data mashup and returns a URL that can be invoked by an application in order to retrieve the data mashup result. Parameters to the data mashup are provided to the application via typical mechanisms, such as GET or POST mechanisms of the HTTP protocol.
As previously discussed, there are many operators that may be used by the data integration mechanism of the illustrative embodiments. The following is a detailed description of some exemplary data mashup operators according to the illustrative embodiment, although many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.
An Import operator performs an import operation by retrieving a data resource from a data source and mapping it to a generic feed. The Import operator uses a protocol, data resource locator, repeating element specification, and binding context as operands. The previous detailed discussion of
A Publish operator performs a publish operation by transforming a generic feed into a data resource of a specified data type. The Publish operator uses an input feed, binding context, output data type, transformation function, and transformation function arguments as operands. The Publish operator invokes the transformation function with transformation function arguments on the input generic feed to produce a data object of the specified output data type. The output data type may be one of the MIME types for which a transformation function exists. The Publish operator then serializes (forms a string representation of) the data object, producing a data resource. The previous discussion of
A Merge operator performs an augment operation by concatenating the payloads of two different input feeds that match according to a specified merge condition. The Merge operator may also return entries in either input feed that have no corresponding match in the other feed. The Merge operator uses as operands a “left feed”, a “right feed”, a “merge condition”, and an “outer merge specification”. The previous detailed discussion of
A Filter operator performs an augment operation by effectively removing entries from an input feed that fail to satisfy a specified filter condition. The Filter operator uses an input feed and a filter condition as operands.
An Annotate operator performs an augmentation operation by combining each entry of an input feed with all entries of an “annotation feed” that is produced in the context of a given input feed entry. The Annotate operator uses an input feed, a binding context, an annotation operator, and an outer context specification as operands. The annotation operator passed to the Annotate operator may be any type of operator, such as an Import operator, Filter operator, Merge operator, Sort operator, Union operator, Group operator, Publish operator, another Annotate operator, or the like. The outer context specification is used to derive an outer context from a given feed entry. An outer context is a set of variable name-variable value associations; in effect, it is a binding context that is formed anew for each entry of the input feed. An outer context specification is a set of variable name-expression associations that is used to derive the outer context for each feed entry. A given outer context member is formed by applying the expression to each entry. The result of applying the expression to each entry (i.e. a sequence) is then associated with the corresponding variable name. For example, if an outer context specification associates variable “$hotel” with expression “./hotel/name/text( )”, variable “$city” with expression “./hotel/city/text( )”, and variable “$state” with expression “./hotel/state/text( )” then the outer context derived from the entry
would contain the associations “$hotel” with “Palace Hotel”, “$city” with “San Francisco”, and “$state” with “CA”.
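The derivation of an outer context from a feed entry may be sketched as follows. The XML entry shown is a hypothetical reconstruction chosen only to be consistent with the associations stated above; it is not the entry of the illustrative embodiment. Because `xml.etree.ElementTree` does not support the `text( )` node test, the sketch applies `findtext` to the element path instead.

```python
import xml.etree.ElementTree as ET


def derive_outer_context(entry, specification):
    """Evaluate each expression of the specification against one feed entry."""
    return {name: entry.findtext(path) for name, path in specification.items()}


# Hypothetical feed entry (an assumption, not taken from the figures).
entry = ET.fromstring(
    "<entry><hotel>"
    "<name>Palace Hotel</name><city>San Francisco</city><state>CA</state>"
    "</hotel></entry>"
)

# Outer context specification: variable name -> path relative to the entry.
specification = {
    "$hotel": "./hotel/name",
    "$city": "./hotel/city",
    "$state": "./hotel/state",
}
outer_context = derive_outer_context(entry, specification)
```

The resulting dictionary plays the role of the outer context: a set of variable name to variable value associations formed for this one entry.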
The annotation operator operand is evaluated anew for each entry using a binding context formed by combining the outer context derived from each entry with the input binding context. For example, the operation may get the next entry, form an outer context and new binding context, evaluate the operator, and repeat. Each evaluation of the annotation operator operand produces a new augmented feed. Variable names specified in the outer context specification are typically variables referenced by operands of the annotation operator or by operands of the operators contributing to the production of input feeds to the annotation operator; hence, the annotation operator essentially behaves like a function whose result depends upon values in the input feed entries.
For example, an input feed entry may contain information for an IBM approved hotel, and the annotation operator may be an Import operator that retrieves hotel reviews from a web service that requires a hotel name, city, and state as input. In general, the Annotate operator creates one new entry in the result feed for each entry in the annotation feed returned by evaluating the annotation operator. The payload of a new entry in the result feed is formed by concatenating the payload of the input feed entry and the payload of the entry in the annotation feed. Continuing the example, the payload of a given result feed entry would contain information about an IBM approved hotel and a single review for that hotel. Thus, there would be one entry in the result feed per IBM approved hotel and review combination. The default construction is similar to that shown for the Merge operator, which also merges entries of two feeds. Note that alternate result feed constructions are possible. For example, each result feed entry might contain the payload of all corresponding annotation feed entries.
A Group operator performs an augment operation by grouping the entries of an input feed according to the values of specified grouping expressions; thereby, producing one result feed entry per group. The payload of each result feed entry combines the payload of all entries of the input feed that are in the same group. The Group operator uses an input feed, group expressions, and nest expressions as operands. The group operator:
Although not illustrated in XQuery expression 1902, a Group operator may receive multiple group expressions. In such cases, input feed entries are grouped according to the combination of values extracted by applying each of the group expressions. Note that the result of each group or nest expression can be a sequence containing more than one item; therefore, there is not a 1-1 correspondence between the number of group expressions and the number of values in the group key. Nor is there a correspondence between the number of nest expressions and the number of nest expression values. In such cases, the group key and nest key values are formed by combining all values extracted through application of the group or nest expressions. Note that alternate result feed constructions are possible. For example, one might add elements or attributes to the result feed in order to delineate the group key values and/or the nest expression values for a group.
A Transform operator performs an augment operation by reconstructing the payload of each input feed entry. The Transform operator uses an input feed, a transformation context specification, and a payload template, as operands. The transformation context specification is similar to an outer context specification used by an Annotate operator in that it specifies a set of variable-expression associations that are used to form a transformation context, which is a set of variable-value associations computed from each input feed entry. The values of variables in a given transformation context can be substituted for variables referenced in the received payload template. The transform operator produces a result feed as follows:
A transformation context is formed for each entry in input feed 2000 by applying the expressions in the transformation context specification to that entry. An entry in the new feed is then formed by substituting the resulting values into a copy of the payload template. For example, the transformation context computed for the first entry of input feed 2000 would contain the variable-value associations: $title and “High Wind Warning—Dallas, Highway 54 Corridor (Texas)”, $link and “http://www.weather.gov/alerts/TX.html#TXZ057.MAFRFWMAF. 115000”, $description and “FIRE WEATHER WATCH Issued At: 2007-12-26T11:50:00 Expired At: 2007-12-28T03:00:00”, $cityText and “Dallas”, $stateText and “Texas” (the regexp functions extract substrings from strings using regular expression patterns, similar to the regexp functions available in the XPath or Java languages). The payload of the first entry in the result feed is formed from this transformation context by substituting its variable values for the corresponding variables referenced in the payload template.
A Sort operator performs an augment operation by ordering the entries of an input feed. The Sort operator uses an input feed and a sort key specification, as operands. A sort key specification is used to form a sort key for each input feed entry. Each entry is then added to the result feed in the appropriate relative location according to the sort key. The sort key specification contains a set of associated sort expression-ordering attribute pairs. Each sort expression is used to extract a component value of the sort key while the associated ordering attribute determines how result entries are ordered relative to that value.
The Sort operator:
A Union operator performs an augment operation by creating a new feed that contains a copy of each entry in an array of input feeds. The Union operator uses an array of input feeds F[ ], as operands. The Union operator iterates over each input feed F[i] in F and appends a copy of each entry E in F[i] to the result generic feed.
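The iteration just described is small enough to sketch directly. The following Python fragment is a minimal sketch that assumes a generic feed can be modeled as a list of payloads; the function name union is hypothetical.

```python
import copy


def union(feeds):
    """Append a copy of every entry of every input feed, in feed order."""
    result = []
    for feed in feeds:            # each F[i] in F
        for entry in feed:        # each entry E in F[i]
            result.append(copy.deepcopy(entry))
    return result


feeds = [[{"title": "a"}], [{"title": "b"}, {"title": "c"}]]
print(union(feeds))
```

A deep copy is used so that later changes to the result feed cannot alter the input feeds.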
Thus, the mechanisms for data integration integrate data resources using a process of generic feed augmentation. The process of data integration by generic feed augmentation involves execution of an interleaved sequence of import operations, augment operations, and publish operations as defined by a received data mashup specification. An import operation retrieves a data resource from a data source and maps the data resource into a generic feed. A generic feed may be comprised of an ordered set of payloads, each of which represents an instance of some real-world entity such as a stock quote, news article, or customer order. Augment operations may then filter, join, group, sort, or otherwise manipulate payloads of one or more generic feeds in order to produce augmented generic feeds. A publish operation essentially performs the inverse of an import operation, transforming a generic feed into a new data resource, and making the new data resource available to Web or other applications.
If at step 2508 the operation is not an import operation, then the data integration mechanism determines if the operation is an augment operation (step 2512). If at step 2512 the operation is an augment operation, then the data integration mechanism produces an augmented generic feed from one or more of the generic feeds generated by an import operation (step 2514), with the operation proceeding to step 2518 thereafter. A detailed description of step 2514 is described in
From steps 2510, 2514, and 2516, after either an import operation, an augment operation, or a publish operation has completed, the data integration mechanism determines if there are any more operations associated with the data mashup that need to be processed (step 2518). If at step 2518 there are more operations to be processed, the operation returns to step 2504. If at step 2518 there are no more operations to be processed, then the data integration mechanism outputs the new data resource(s) (step 2520), with the operation ending thereafter.
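The dispatch loop over import, augment, and publish operations might be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary-based specification format, the function name run_mashup, and the three lookup tables of callables are hypothetical and are not part of the described mechanism.

```python
def run_mashup(operations, importers, augmenters, publishers):
    """Process each operation of a mashup specification in order."""
    feeds = {}       # generic feeds produced so far, keyed by name
    resources = []   # new data resources produced by publish operations
    for op in operations:
        if op["kind"] == "import":
            # Retrieve a data resource and map it into a generic feed.
            feeds[op["out"]] = importers[op["name"]]()
        elif op["kind"] == "augment":
            # Produce an augmented feed from one or more existing feeds.
            inputs = [feeds[name] for name in op["inputs"]]
            feeds[op["out"]] = augmenters[op["name"]](*inputs)
        elif op["kind"] == "publish":
            # Transform a generic feed into a new data resource.
            resources.append(publishers[op["name"]](feeds[op["input"]]))
        else:
            raise ValueError("unknown operation kind: " + repr(op["kind"]))
    return resources


# Hypothetical specification: import numbers, sort them, publish as text.
operations = [
    {"kind": "import", "name": "numbers", "out": "raw"},
    {"kind": "augment", "name": "sort", "inputs": ["raw"], "out": "sorted"},
    {"kind": "publish", "name": "csv", "input": "sorted"},
]
resources = run_mashup(
    operations,
    importers={"numbers": lambda: [3, 1, 2]},
    augmenters={"sort": sorted},
    publishers={"csv": lambda feed: ",".join(str(x) for x in feed)},
)
print(resources)
```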
If at step 2906 there are more unprocessed entries in the received generic feed, then the data integration mechanism retrieves the first or next unprocessed entry from the received generic feed (step 2910). The data integration mechanism then evaluates the filter condition on the payload of the entry (step 2912). The data integration mechanism determines if the result of the filter condition is true (step 2914). If at step 2914 the result of applying the filter condition is not true, then the operation returns to step 2906. If at step 2914 the result of applying the filter condition is true, then the data integration mechanism adds a new entry to the result generic feed value whose payload is the payload of the entry (step 2916), with the operation returning to step 2906 thereafter.
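The filter loop above can be sketched as follows, assuming a generic feed is modeled as a list of payloads and the filter condition as a Python predicate. The name filter_feed and the sample field names are hypothetical.

```python
def filter_feed(feed, condition):
    """Keep the payload of every entry for which the condition is true."""
    result = []
    for payload in feed:          # first or next unprocessed entry
        if condition(payload):    # evaluate the filter condition
            result.append(payload)
    return result


# Hypothetical payloads echoing the credit score scenario.
applicants = [
    {"name": "A", "credit_score": 480},
    {"name": "B", "credit_score": 710},
]
flagged = filter_feed(applicants, lambda p: p["credit_score"] < 500)
print(flagged)
```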
If at step 3008 there are more payload pairs, then the data integration mechanism retrieves the first or next unprocessed payload pair associated with the generic feeds (step 3010). Then the data integration mechanism evaluates the merge condition on the unprocessed payload pair (step 3012). The data integration mechanism determines if the result of applying the merge condition to the unprocessed payload pair is true (step 3014). If at step 3014 the result of applying the merge condition to the unprocessed payload pair is not true, then the operation returns to step 3008. If at step 3014 the result of applying the merge condition to the unprocessed payload pair is true, then the data integration mechanism constructs a new augmented feed entry whose payload is formed by concatenating the right feed components and left feed components of the current payload pair, and adds the new augmented feed entry to the result generic feed (step 3016), with the operation returning to step 3008 thereafter.
If at step 3008 there are no more payload pairs associated with the generic feeds, then the data integration mechanism determines if the value of the outer merge specification is “left” or “full” (step 3018). If at step 3018 the outer merge specification value is “left” or “full”, then the data integration mechanism adds a new entry to the result generic feed value for each left entry in the left generic feed that had no match in the right generic feed (step 3020). The payload of the new entry is comprised of the payload of the left entry concatenated with a special “no right match” payload element. From step 3020, or if at step 3018 the outer merge specification value is not “left” or “full”, the data integration mechanism determines if the outer merge specification value is “right” or “full” (step 3022). If at step 3022 the outer merge specification value is “right” or “full”, then the data integration mechanism adds a new entry to the result generic feed value for each right entry in the right generic feed that had no match in the left generic feed (step 3024). The payload of the new entry is comprised of the payload of the right entry concatenated with a special “no left match” payload element. From step 3024, or if at step 3022 the outer merge specification value is not “right” or “full”, the data integration mechanism returns the result generic feed value (step 3026), with the operation ending thereafter.
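The merge flow, including the outer merge handling, might be sketched as follows. This is a minimal sketch under several assumptions: payloads are modeled as dictionaries, concatenation is modeled as a dictionary union (which would lose data if both sides used the same key), and the marker payload elements and the name merge are hypothetical.

```python
from itertools import product

NO_RIGHT_MATCH = {"no_right_match": True}   # assumed marker element
NO_LEFT_MATCH = {"no_left_match": True}     # assumed marker element


def merge(left, right, condition, outer=None):
    """Join two feeds on a condition; outer may be 'left', 'right', 'full'."""
    result = []
    matched_left, matched_right = set(), set()
    pairs = product(enumerate(left), enumerate(right))
    for (i, left_payload), (j, right_payload) in pairs:
        if condition(left_payload, right_payload):
            # Concatenate the two payloads into a new result entry.
            result.append({**left_payload, **right_payload})
            matched_left.add(i)
            matched_right.add(j)
    if outer in ("left", "full"):
        for i, left_payload in enumerate(left):
            if i not in matched_left:
                result.append({**left_payload, **NO_RIGHT_MATCH})
    if outer in ("right", "full"):
        for j, right_payload in enumerate(right):
            if j not in matched_right:
                result.append({**right_payload, **NO_LEFT_MATCH})
    return result


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"rid": 1, "score": 480}, {"rid": 3, "score": 700}]
same_id = lambda l, r: l["id"] == r["rid"]
print(merge(left, right, same_id, outer="full"))
```

Enumerating every payload pair mirrors the flow described above; a practical implementation would likely index one feed to avoid the quadratic scan.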
If at step 3108 there are more unprocessed entries in the input feed, then the data integration mechanism retrieves the first or next unprocessed entry from the input feed (step 3110). The data integration mechanism forms an outer context from the payload of the entry using the outer context specification (step 3112). The data integration mechanism then forms a new binding context by combining bindings in the outer context and the original binding context (step 3114). The data integration mechanism retrieves a new augmentation feed by evaluating the annotation operator in the context of the new binding context (step 3116). The data integration mechanism then determines if there are any more unprocessed augmentation feed entries in the new augmentation feed (step 3118). If at step 3118 there are no more unprocessed augmentation feed entries in the new augmentation feed, then the operation returns to step 3108.
If at step 3118 there are more unprocessed augmentation feed entries in the new augmentation feed, then the data integration mechanism retrieves the first or next unprocessed augmentation feed entry from the new augmentation feed (step 3120). The data integration mechanism adds a new entry to the result generic feed value whose payload is formed by concatenating the current payload of the entry from the input generic feed and the payload of the augmentation feed entry from the new augmentation feed (step 3122), with the operation returning to step 3118 thereafter. If at step 3108 there are no more unprocessed entries in the input feed, then the data integration mechanism returns the result generic feed value (step 3124), with the operation ending thereafter.
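The annotate flow can be sketched as a nested loop. The sketch below assumes payloads are dictionaries, expressions are Python callables, and the annotation operator is a function of a binding context that returns a feed; the in-memory REVIEWS table stands in for the review web service and, like the name annotate, is purely hypothetical. Letting outer context bindings override same-named original bindings is also an assumption.

```python
def annotate(feed, binding_context, annotation_operator, outer_context_spec):
    """Pair each input entry with every entry of its annotation feed."""
    result = []
    for payload in feed:
        # Form the outer context from the payload of the current entry.
        outer = {var: expr(payload) for var, expr in outer_context_spec.items()}
        # Combine it with the original binding context.
        new_binding = {**binding_context, **outer}
        # Evaluate the annotation operator anew for this entry.
        for annotation in annotation_operator(new_binding):
            result.append({**payload, **annotation})
    return result


# Hypothetical stand-in for a hotel review web service.
REVIEWS = {("Palace Hotel", "San Francisco", "CA"): ["Great stay", "Fine"]}


def import_reviews(binding):
    key = (binding["$hotel"], binding["$city"], binding["$state"])
    return [{"review": text} for text in REVIEWS.get(key, [])]


hotels = [
    {"hotel": "Palace Hotel", "city": "San Francisco", "state": "CA"},
    {"hotel": "Other Inn", "city": "Austin", "state": "TX"},
]
spec = {
    "$hotel": lambda p: p["hotel"],
    "$city": lambda p: p["city"],
    "$state": lambda p: p["state"],
}
annotated = annotate(hotels, {}, import_reviews, spec)
print(annotated)
```

With this default construction, an input entry whose annotation feed is empty contributes no entries to the result feed.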
If at step 3206 there are more unprocessed entries in the input feed, then the data integration mechanism retrieves the first or next unprocessed entry from the input generic feed (step 3208). The data integration mechanism forms group key values by evaluating the one or more group expressions on the entry (step 3210). The data integration mechanism then forms nest expression values by evaluating the one or more nest expressions on the entry (step 3212). The data integration mechanism then determines if there is an existing entry in the result generic feed value with the formed group key values (step 3214). If at step 3214 there is an existing entry in the result generic feed value with the formed group key values, then the data integration mechanism adds the nest expression values formed for the current input entry into the payload of the existing entry (step 3216), with the operation returning to step 3206 thereafter.
If at step 3214 there is not an existing entry in the result generic feed value with one of the formed group key values, then the data integration mechanism creates a new entry in the result generic feed value and adds the group key values associated with the entry into the payload of the new entry in the result generic feed value (step 3218). Then the data integration mechanism adds the nest expression values associated with the new entry into the payload of the new entry (step 3216), with the operation returning to step 3206 thereafter. If at step 3206 there are no more unprocessed entries in the input feed, then the data integration mechanism returns the result generic feed value (step 3220), with the operation ending thereafter.
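The group flow might be sketched as follows. The sketch makes simplifying assumptions: each group or nest expression is a Python callable returning a single hashable value rather than a sequence, and each result payload is a dictionary with the hypothetical fields group_key and nested.

```python
def group(feed, group_exprs, nest_exprs):
    """Produce one result entry per distinct combination of group values."""
    groups = {}  # dicts preserve insertion order, so groups keep first-seen order
    for payload in feed:
        key = tuple(expr(payload) for expr in group_exprs)
        nest_values = [expr(payload) for expr in nest_exprs]
        if key not in groups:
            # No existing entry for this key: create one holding the key values.
            groups[key] = {"group_key": list(key), "nested": []}
        # Add the nest expression values to the (new or existing) entry.
        groups[key]["nested"].extend(nest_values)
    return list(groups.values())


cities = [
    {"city": "San Francisco", "state": "CA"},
    {"city": "Dallas", "state": "TX"},
    {"city": "Los Angeles", "state": "CA"},
]
by_state = group(cities, [lambda p: p["state"]], [lambda p: p["city"]])
print(by_state)
```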
If at step 3306 there are more unprocessed entries in the input generic feed, then the data integration mechanism retrieves the first or next unprocessed entry from the input generic feed (step 3308). The data integration mechanism forms a transformation context by applying the transformation context specification to the entry (step 3310). The data integration mechanism then forms an instantiated payload by making a copy of the payload template and substituting variable references in the copied payload with the corresponding variable values in the transformation context (step 3312). Then the data integration mechanism creates a new entry in the result generic feed value whose payload is the instantiated payload (step 3314), with the operation returning to step 3306 thereafter. If at step 3306 there are no more unprocessed entries in the input feed, then the data integration mechanism returns the result generic feed value (step 3316), with the operation ending thereafter.
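The transform flow can be sketched with Python's string.Template, which substitutes $name references in a text template. This is only a sketch: representing the payload template as a string, the function name transform, the sample title, and the two regular expressions are assumptions made for illustration.

```python
import re
from string import Template


def transform(feed, context_spec, payload_template):
    """Rebuild each payload by instantiating the template for that entry."""
    template = Template(payload_template)
    result = []
    for payload in feed:
        # Form the transformation context for this entry.
        context = {var: expr(payload) for var, expr in context_spec.items()}
        # substitute() returns a new string; the template itself is unchanged.
        result.append(template.substitute(context))
    return result


# Hypothetical input entry and context specification.
alerts = [{"title": "Wind Advisory: Dallas (Texas)"}]
context_spec = {
    "cityText": lambda p: re.search(r": (\w+) \(", p["title"]).group(1),
    "stateText": lambda p: re.search(r"\((\w+)\)", p["title"]).group(1),
}
payload_template = "<alert><city>$cityText</city><state>$stateText</state></alert>"
transformed = transform(alerts, context_spec, payload_template)
print(transformed)
```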
If at step 3406 there are more unprocessed entries in the input feed, then the data integration mechanism retrieves the first or next unprocessed entry from the input generic feed (step 3408). The data integration mechanism forms a sort key for the entry by applying the sort key specification to the entry (step 3410). The data integration mechanism then makes a copy of the entry and inserts the copy into the result generic feed in the appropriate relative order according to the sort key (step 3412) with the operation returning to step 3406 thereafter. If at step 3406 there are no more unprocessed entries in the input generic feed, then the data integration mechanism returns the result generic feed value (step 3414), with the operation ending thereafter.
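The sort flow, which inserts a copy of each entry at its ordered position, might be sketched as follows. The sketch assumes the sort key specification is reduced to a single Python key function producing comparable values in ascending order; per-expression ordering attributes are not modeled, and the name sort_feed is hypothetical.

```python
import bisect
import copy


def sort_feed(feed, sort_key):
    """Insert a copy of each entry into the result at its ordered position."""
    result, keys = [], []
    for payload in feed:
        key = sort_key(payload)                   # form the sort key
        # bisect_right keeps entries with equal keys in input order.
        position = bisect.bisect_right(keys, key)
        keys.insert(position, key)
        result.insert(position, copy.deepcopy(payload))
    return result


entries = [
    {"name": "b", "rank": 2},
    {"name": "a", "rank": 1},
    {"name": "c", "rank": 2},
]
ordered = sort_feed(entries, lambda p: p["rank"])
print(ordered)
```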
Thus, the illustrative embodiments provide a mechanism for data integration that allows enterprise mashups to be built quickly and easily. The data integration mechanism performs data integration logic of an application, thereby allowing the enterprise mashup developer to focus on the application's business logic. In particular, the illustrative embodiments disclose a mechanism for data integration that enables access to various types of data resources, provides the capability to integrate and augment the data resources retrieved from those sources, and allows for the further transformation and delivery of the augmented data to all types of applications.
As noted above, it should be appreciated that the illustrative embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In one exemplary embodiment, the mechanisms of the illustrative embodiments are implemented in software or program code, which includes but is not limited to firmware, resident software, microcode, etc.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.