The present invention relates generally to network data management tools and, more particularly, but not exclusively, to enabling the automated retrieval, transformation, and/or normalization of arbitrary content over a network.
As is generally known in the art, the volume of digital data over the Internet is expected to continue to increase over the coming years. This may not be so surprising considering that more businesses, educational institutions, and the like, are using the Internet. Thus, there are literally terabytes of data potentially accessible over the Internet.
Such a vast resource of data could provide businesses, researchers, consumers, or the like, with information never available to them in the past. However, despite all of this available data, collecting it into a format that is easy to analyze can be a time-intensive and expensive endeavor.
For example, while search engines may assist a user in finding some information over a network, today's search engines may be unable to access data that is accessible through steps other than those pertaining to a query. Examples of such data include data that is provided through execution of an application, data that requires the user to submit additional information before it can be accessed, or data that is in a less conventional format. Moreover, many of today's search engines may return data in a format that is inconsistent with the user's needs. Thus, it is with respect to these considerations and others that the present invention has been made.
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following drawings. In the drawings, like reference numerals refer to like parts throughout the various figures unless otherwise specified.
For a better understanding of the present invention, reference will be made to the following Detailed Description, which is to be read in association with the accompanying drawings, wherein:
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustrations, specific embodiments by which the invention may be practiced. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Among other things, the present invention may be embodied as methods or devices. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.
In addition, as used herein, the term “or” is an inclusive “or” operator, and is equivalent to the term “and/or,” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a,” “an,” and “the” include plural references. The meaning of “in” includes “in” and “on.”
Briefly stated, the present invention is directed towards employing a set of expressions in a database-like structured language syntax to manage data retrieval, often but not necessarily over a network, and the transformation and/or normalization of arbitrary content. Arbitrary content includes virtually any digital data, whether structured or unstructured. In one embodiment, the retrieval expressions are configured as database-like structured query clauses that may be performed upon at least a non-database arrangement of content over the network, an application, a form, or even a database. As used herein, the term “database-structured query” refers to a form of a query that is configured to interrogate related files, documents, applications, or the like, for data.
In one embodiment, the tools are configured to retrieve content from a wide variety of sources. Such sources include but are not limited to those accessible using various standard protocols over a computer network, files in local storage, or those accessible through execution of an arbitrary application, script, applet, or the like. Processes for transforming data may be composed in a reactive and variable manner based on a physical layout of the data, the presence, or absence of a particular user input or preference, the intended use of the data, and/or a logical structure. After the data is transformed, various tools may be applied to arbitrarily normalize the data. In one embodiment, at least some of the normalization tools may be used to ensure that the data conforms to an application-specific requirement.
A programmer may write scripts, or the like, using a database-like structured programming language, which may then be interpreted by a Runtime System. These scripts may include instructions for various components within the Runtime System on how to retrieve, transform, and/or normalize the desired content.
In particular, a programmer, or other user of the Runtime System, may retrieve data sources as specified by a URI, URL, or the like, using a variety of schemes, including, but not limited to, HTTP, FTP, ODBC, TCP, UDP, or the like, as well as several proprietary schemes, such as “exec” to retrieve data from the output of executing an arbitrary external program; “invoke” to retrieve data from the output of executing code in an arbitrary external component; or even retrieving data by recursively invoking the Runtime System on an arbitrary script. For example, in one embodiment, a user may cause an arbitrary external program to execute and, while it is executing, automatically provide through a script, or the like, various inputs, responses to questions from the program, or the like, and retrieve output data from the program, without requiring the user to continually interact with the executing program.
The user, or programmer, may further, through query clauses in the script, perform conversions and/or transformations on the content by exporting the data for subsequent processing in either a record-based, a byte-based, or in a file-based format. In one embodiment, the data may be automatically converted from physical to a logical format using a lazy execution of a conditionally and variably composed sequence of operations. In one embodiment, at least some of the procedures may perform one or more of the following:
Moreover, a mechanism for automatically generating and performing the procedures may, in one embodiment, be based on a shortest sequence of operations to transform the data from the available physical to a logical format used by the script being executed. However, the invention is not so constrained, and other transformation paths may be selected, for example, but not limited to being based on a cost factor indicative of the computational cost of a transform path and/or a computational speed of the transform path. The sequence of transformation may be determined using a logical translation graph or mapping of conversions.
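By way of a non-limiting illustration, the selection of a transformation path based on a cost factor might be sketched in Python as follows. The converter table, the cost values, and the function name cheapest_path are hypothetical and are chosen only for illustration; the sketch uses Dijkstra's algorithm, which is one of many ways such a selection could be made.

```python
import heapq

# Hypothetical converter table: (source IMT, target IMT) -> relative cost.
# The cost values are illustrative assumptions, not measured figures.
CONVERTER_COSTS = {
    ("application/msword", "text/html"): 5.0,
    ("application/msword", "text/plain"): 2.0,
    ("text/html", "text/xml"): 1.0,
    ("text/plain", "text/xml"): 6.0,
}


def cheapest_path(source, target, costs=CONVERTER_COSTS):
    """Return (total_cost, [IMT, ...]) for the cheapest path, or None."""
    # Build an adjacency list from the flat converter table.
    edges = {}
    for (src, dst), cost in costs.items():
        edges.setdefault(src, []).append((dst, cost))

    # Dijkstra's algorithm: always expand the cheapest known path first.
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        total, imt, path = heapq.heappop(queue)
        if imt == target:
            return total, path
        if imt in settled:
            continue
        settled.add(imt)
        for nxt, cost in edges.get(imt, []):
            if nxt not in settled:
                heapq.heappush(queue, (total + cost, nxt, path + [nxt]))
    return None  # no sequence of converters reaches the target
```

With these assumed costs, a request to convert application/msword to text/xml would be routed through text/html rather than text/plain, because the first route has the lower total cost.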
Normalization of retrieved data may be performed using an arbitrary application-specific logic, in one embodiment. For example, in one embodiment, validation rules may be employed that may be indicated with a URL that resolves to an Extensible Markup Language (XML) specification of the validation procedure. Several types of validation rules are further provided, such as regular expression matching, table lookups based on regular expressions and/or approximate string matching, or the like. In one embodiment, a facility also may be provided for calling out to arbitrary external code.
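As a hedged illustration of the kinds of validation rules mentioned above, the following Python sketch combines a table lookup based on regular expressions with an approximate string match as a fallback. The pattern table, the name table, the cutoff value, and the function name normalize_state are assumptions made for this example only, and the tables cover just a few states.

```python
import difflib
import re

# Hypothetical lookup table: regular expression -> normalized value.
STATE_PATTERNS = [
    (re.compile(r"^(calif(ornia)?|ca)\.?$", re.IGNORECASE), "CA"),
    (re.compile(r"^(n(ew)?\.?\s*y(ork)?|ny)\.?$", re.IGNORECASE), "NY"),
]

# Hypothetical table used for approximate matching of misspelled names.
STATE_NAMES = {"california": "CA", "new york": "NY", "texas": "TX"}


def normalize_state(raw, cutoff=0.75):
    """Return a two-letter code for raw, or None if no rule applies."""
    text = raw.strip()
    # Rule 1: table lookup based on regular expressions.
    for pattern, code in STATE_PATTERNS:
        if pattern.match(text):
            return code
    # Rule 2: approximate string matching against known names.
    close = difflib.get_close_matches(
        text.lower(), list(STATE_NAMES), n=1, cutoff=cutoff
    )
    return STATE_NAMES[close[0]] if close else None
```

A return value of None signals that the criteria were not satisfied, leaving the caller free to decide what action to take.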
The retrieval and integration of digital content as described herein may provide several benefits over more traditional approaches. For example, because the approach automatically carries out many routine data retrievals, transformation, and/or normalization processes, a user or programmer, may instead devote more of their effort towards other activities, for example, such as the data management requirements of the application being developed. Processes that might take hundreds or even thousands of lines of code to implement using traditional techniques can be accomplished as described herein with, perhaps, just dozens of lines of script code.
Client devices 111-112 may include virtually any computing device capable of receiving and sending a message over a network, such as network 104, to and from another computing device, such as content servers 101-103, each other, or the like. The set of such devices generally includes mobile devices that are usually considered more specialized devices with limited capabilities and typically connect using a wireless communications medium, such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, CBs, integrated devices combining one or more of the preceding devices, or virtually any mobile device, and the like. However, the set of such devices may also include devices that are usually considered more general purpose devices and typically connect using a wired communications medium at one or more fixed locations, such as laptop computers, desktops, and the like. Similarly, client devices 111-112 may be any device that is capable of connecting using a wired or wireless communication medium, such as a personal digital assistant (PDA), POCKET PC, wearable computer, and any other device that is equipped to communicate over a wired and/or wireless communication medium.
Client devices 111-112 may be configured with a browser application that is configured to receive and to send content in a variety of forms, including, but not limited to markup pages, web-based messages, audio files, graphical files, file downloads, applets, scripts, cookies, and the like. The browser application may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any mobile markup based language or Wireless Application Protocol (WAP), including, but not limited to a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, JavaScript, Standard Generalized Markup Language (SGML), HyperText Markup Language (HTML), Extensible Markup Language (XML), EXtensible HTML (XHTML), or the like.
Client devices 111-112 may further be configured and arranged to enable a user to provide scripts, commands, or the like, to DCM server 108, to request retrieval, transformation, and/or normalization of data obtained over network 104, from content servers 101-103, and even from client devices 111-112. In one embodiment, a user, programmer, or the like, may prepare database-like structured queries to be scheduled, and/or executed by DCM server 108. Examples of such database-like structured queries are described in more detail below. Client devices 111-112 may employ any of a variety of available applications to develop the scripts, including text editors, word processors, command line interpreters, or the like. Client devices 111-112 may then receive the resulting data from DCM server 108 based on the queries.
Network 104 is configured to couple one computing device to another computing device to enable them to communicate. Network 104 is enabled to employ any form of medium for communicating information from one electronic device to another. Also, network 104 may include a wireless interface, such as a cellular network interface, and/or a wired interface, such as the Internet, in addition to local area networks (LANs), wide area networks (WANs), direct connections, such as through a universal serial bus (USB) port, other forms of computer-readable media, or any combination thereof. On an interconnected set of LANs, including those based on differing architectures and protocols, a router acts as a link between LANs, enabling messages to be sent from one to another. Also, communication links within LANs typically include twisted wire pair or coaxial cable, while communication links between networks may utilize cellular telephone signals over air, analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communications links known to those skilled in the art. Furthermore, remote computers and other related electronic devices could be remotely connected to either LANs or WANs via a modem and temporary telephone link. In essence, network 104 includes any communication method by which information may travel between client devices 111-112, and/or content servers 101-103. Network 104 is constructed for use with various communication protocols including wireless application protocol (WAP), transmission control protocol/internet protocol (TCP/IP), code division multiple access (CDMA), global system for mobile communications (GSM), and the like.
The media used to transmit information in communication links as described above generally includes any media that can be accessed by a computing device. Computer-readable media may include computer storage media that typically embodies computer-readable instructions, data structures, program modules, or other data in a transport mechanism and includes any portable or non-portable storage delivery media.
Content servers 101-103 include virtually any network device that may be configured to provide content over a network. In one embodiment, content servers 101-103 are configured to operate as a web site server. Content servers 101-103 are not limited to web servers, however, and may also operate as a messaging server, a File Transfer Protocol (FTP) server, a database server, application server, or the like. Moreover, while content servers 101-103 may operate as other than a website, they may still be enabled to receive and/or send an HTTP communication.
Devices that may operate as content servers 101-103 include personal computers, desktop computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, network appliances, servers, and the like.
One embodiment of DCM server 108 is described in more detail below in conjunction with
Devices that may operate as DCM server 108 include personal computers, desktop computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, network appliances, servers, and the like.
Although
Network device 200 includes central processing unit 212, video display adapter 214, and a mass memory, all in communication with each other via bus 222. The mass memory generally includes RAM 216, ROM 232, and one or more permanent mass storage devices, such as hard disk drive 228, or the like. Mass memory storage may also include portable storage 226 devices, such as tape drive, optical drive, removable flash memory storage devices, and/or floppy disk drive. The mass memory stores operating system 220 for controlling the operation of network device 200. Any general-purpose operating system may be employed. Basic input/output system (“BIOS”) 218 is also provided for controlling the low-level operation of network device 200. As illustrated in
The mass memory as described above illustrates another type of computer-readable media, namely computer storage media. Computer storage media may include volatile, nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computing device.
The mass memory also stores program code and data. One or more applications 250 are loaded into mass memory and run on operating system 220. Examples of application programs may include transcoders, schedulers, calendars, database programs, word processing programs, messaging programs, HTTP/HTTPS programs, customizable user interface programs, IPSec applications, web crawlers, spreadsheet programs, encryption programs, security programs, FTP servers, and so forth. Runtime System 252 may also be included as an application program within applications 250. In one embodiment, Runtime System 252 may include retrieval manager 254, transformer 256, and normalizer 258. However, the invention is not so limited, and one or more of retrieval manager 254, transformer 256, or normalizer 258 may reside external to Runtime System 252, and/or even on another computing device substantially similar to network device 200.
Retrieval manager 254 is configured to receive a query for data, perform operations over the network to retrieve data requested by the query, and to return the matching data. Examples of database-like structured queries are described in more detail in a co-pending U.S. patent application Ser. No. 09/833,846, entitled “Method And System For Extraction And Organizing Selected Data From Sources On A Network,” which is incorporated herein by reference.
Briefly, sets of query conditions (or clauses) may be created that are used with various network devices to retrieve content from content servers on the network. Typically, the requested content is specified using URLs, but URIs, IP addresses, addresses or locators from other layers of the Open Systems Interconnection (OSI) Basic Reference Model, or the like, may also be employed, without departing from the scope of the invention. Data may also be accessed using proprietary or non-proprietary protocols or schemes such as FTP, IMAP, ODBC, or the like. In addition, retrieval manager 254 supports additional query structures, including: invoke, where data may be retrieved from an output of executing code in an arbitrary external component; exec, where data may be retrieved from an output obtained by executing an arbitrary external program; and webql, where data may be retrieved by recursively invoking the Runtime System 252 on an arbitrary query script.
Each of these query structures may include one or more retrieval options. For example, when fetching an HTTP URI, the user may provide a specific value for a User-Agent, or the like. The query structure enables a range of other mechanisms that allow scripts to specify such options.
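The following Python sketch illustrates, under stated assumptions, how retrieval might be dispatched by scheme while accepting per-request options such as a User-Agent value. The function names, the option names user_agent, timeout, argv, and stdin, and the treatment of the “exec” scheme are all hypothetical; the sketch does not reproduce the syntax or behavior of the Runtime System.

```python
import subprocess
import urllib.request
from urllib.parse import urlsplit


def fetch_http(uri, options):
    """Fetch bytes over HTTP(S), honoring an optional User-Agent value."""
    headers = {"User-Agent": options.get("user_agent", "retrieval-sketch/0.1")}
    request = urllib.request.Request(uri, headers=headers)
    timeout = options.get("timeout", 10)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_file(uri, options):
    """Read bytes backed by a local file named in a file: URI."""
    with open(urlsplit(uri).path, "rb") as handle:
        return handle.read()


def fetch_exec(uri, options):
    """Run an external program, feed it scripted input, return its output.

    Assumption: the command line is passed as the 'argv' option rather
    than being parsed from the URI, which avoids shell quoting issues.
    """
    completed = subprocess.run(
        options["argv"],
        input=options.get("stdin", b""),
        capture_output=True,
        check=True,
    )
    return completed.stdout


# Hypothetical registry mapping a scheme name to its fetch procedure.
FETCHERS = {
    "http": fetch_http,
    "https": fetch_http,
    "file": fetch_file,
    "exec": fetch_exec,
}


def retrieve(uri, **options):
    """Dispatch on the URI scheme and return the retrieved bytes."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme not in FETCHERS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return FETCHERS[scheme](uri, options)
```

In this sketch, supplying the stdin option to the “exec” fetcher stands in for providing inputs to an executing program automatically, so that the user need not interact with the program directly.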
Each supported data retrieval query structure may provide physical access to data in some particular scheme-specific manner. For example, some schemes (e.g., ODBC or the like) may provide programmatic access to data through an Application Programming Interface (API), or the like, in an inherently Record-based manner, in which components of the data are delivered one at a time or in small batches. In another example, other schemes (e.g., http, ftp, etc.) may provide access to the data in the form of a stream of bytes. In a third example, other schemes (e.g., file) may provide access to byte data backed by local files.
Furthermore, retrieval manager 254 may access at least two distinct kinds of data, including data (e.g., results from an ODBC or the like) that are inherently Record-based—where the data comprises a number of smaller components, or data (e.g., text/html, application/pdf) that are inherently byte-based—where the data consists of a sequence of bytes that may be interpreted according to their Internet Media Type (IMT).
The approach for content retrieval used by retrieval manager 254 has at least two benefits over more traditional approaches. First, by abstracting details away from the many ways of accessing data, programmers or users can quickly write complex scripts that may perform complex data retrieval processes from heterogeneous sources, instead of having to write long and/or cumbersome programs using traditional methods. Second, substantial performance benefits may be realized by providing a uniform interface to heterogeneous data sources while preserving all data in its native format. “Native” format, as used herein, refers to a format of the data as originally retrieved by the retrieval manager. Active or formal recognition of the “native format” by the retrieval manager is not required so long as the underlying bits that comprise the data are able to be retrieved.
Transformer 256 is configured to automatically perform dynamic data transformation on retrieved data for virtually any form of data regardless of its original native format. For example, in one embodiment, where the retrieved data is a MS WORD® document, the following script may be employed to fetch the document and convert it to plain text.
In the above example, without explicit indication in the query otherwise, plain text may be chosen as the format to which the document is converted based on a default output format associated with the MS WORD® format, as is further explained below.
As another example, in one embodiment, the following script may be used to convert the document to an HTML format:
As shown above, transformer 256 may convert a wide variety of document formats, using a built-in capability of transforming documents or data sources. Moreover, transformer 256 is configured to employ various intermediate formats to convert to a requested format. For example, a user may request to convert a MS WORD® document into XML. Transformer 256 may perform such transformation, in one embodiment, by determining a sequence of intermediate formats (or IMTs) to employ to ultimately convert the document. Thus, for example, transformer 256 may automatically, and in a manner of which the user may be unaware, convert the document into an HTML document, and then convert the HTML document into XML. Similarly, transformer 256 may automatically determine a sequence of intermediate formats to convert an MS EXCEL® document into XML, or the like. One example of such a user script might be:
This conversion, as noted above, may be performed automatically, and in a manner, such that the script writer does not need to instruct transformer 256 on the intermediate transformation sequences. Such a conversion process will be further discussed herein with reference to
The examples so far have been related to byte-based data. However, transformer 256 is not so limited, and supports record-based data, as well, in which the content may include a sequence of component objects, or the like. Thus, transformer 256 may also convert between byte-based and record-based formats of content, and back again.
The invention provides for at least two ways to convert byte-based data to records. The first approach includes converting the bytes to records using a “natural” interpretation associated with the data's IMT. Such a “natural” interpretation, as used herein, pertains to interpreting a document based on a data structure or type of component object associated with the data's IMT. This data structure or type of component object is applicable or recognized among many different IMTs because it pertains to the logical interpretation of the underlying data and not just the IMT in which the data or information is formatted. For example, for text/csv data, a natural interpretation of the bytes as records is one record per physical line in the document, with the records split into columns by “,” (comma) characters according to the definition of the text/csv standard. As a second example, in one embodiment, the natural interpretation of text/html data as records may directly mirror a <TABLE> tag, or related tag types in the data. In one embodiment, the Transformer 256 has a library of procedures like these examples that convert byte-based data to records for a wide variety of IMTs.
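As a minimal sketch of the “natural” interpretation of text/csv data described above, the following Python function converts bytes into records, one record per line, split into columns at commas. The function name csv_bytes_to_records and the default encoding are assumptions for illustration; the sketch relies on Python's standard csv module rather than on any converter of transformer 256.

```python
import csv
import io


def csv_bytes_to_records(data, encoding="utf-8"):
    """Interpret text/csv bytes as a list of records (lists of columns)."""
    text = data.decode(encoding)
    # newline="" lets the csv module handle line endings itself, so that
    # quoted fields containing commas or line breaks are kept intact.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [row for row in reader]
```

For instance, the bytes of a two-line document would yield two records, each holding the column values found on the corresponding line.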
A second method that may be employed by Transformer 256 to convert byte-based data to records enables the script writer to specify the sorts of component objects desired. Transformer 256 may extract a wide range of objects from a wide variety of document types. Objects, as referred to herein, and similar to above, pertain to data structures or manners of data organization that are independent from a particular data format or IMT, yet are recognized and may be logically retained within data of a particular IMT. Thus, for example, in one embodiment, the script writer may extract hyperlinks from within an HTML document using the following:
As shown above, the from clause invokes transformer 256 “links” converter to extract the hyperlinks from within the specified HTML document, passing them onward for possible subsequent processing as a table, or the like, that may include one record for each hyperlink.
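One possible way to produce a table with one record per hyperlink is sketched below in Python, using the standard html.parser module. The class name LinkCollector, the function name extract_links, and the record fields url and text are hypothetical and do not describe the actual “links” converter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect one record per <a href=...> element in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []
        self._current = None  # record for the anchor being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve relative links against the document's own URL.
                url = urljoin(self.base_url, href)
                self._current = {"url": url, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


def extract_links(html, base_url=""):
    """Return a list of {'url', 'text'} records, one per hyperlink."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.records
```

Each returned record could then be passed onward for subsequent processing in the same way as a row retrieved from a table.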
In addition, other objects may also be employed as options, including, for example:
In one embodiment, for example, a script writer may generate the following, wherein * is defined as a symbol for “all”:
The script writer may also generate, in another embodiment:
As suggested, the output may be converted from MS WORD® or PDF, respectively, to HTML prior to link translation. MS WORD® or PDF are two types of document formats correlated to two different IMTs.
The details of how to implement each of these translations—from text/html to a table of hyperlinks or images, from application/pdf to a table of application/pdf records each representing one page, and the like, may employ a variety of readily available approaches, without departing from the scope of the invention.
Moreover, transformer 256 exposes a series of records substantially similar to how the records may be exposed that are retrieved from a database. Thus, for example, the following queries both convert data to records, where ‘c’ means column and the number indicates a column number:
and
In the first example, data may already be in a desired format. In the second example, the data may automatically be converted from its native format (e.g., searching the HTML data for <TABLE> tags, or the like).
An Internet Media Type (IMT) is a standard machine-understandable label, maintained in a formal registry with the Internet Assigned Numbers Authority, indicating how a given sequence of raw bytes may be interpreted by a computer program. The format of the label refers to a type/subtype for the given data. For example, the IMT text/html indicates that a given piece of content may be interpreted as an HTML document, whereas application/pdf indicates that the content is to be interpreted as a PDF document. IMTs can also indicate that a given sequence of raw bytes is to be interpreted as a composite object comprising several sub-parts. For example, the multipart/mixed IMT indicates that the data is to be broken into several parts, where each part has a distinct IMT. An email message with an attached file is usually encoded as multipart/mixed data with two parts: one part is the email message proper, and the other part is the attachment. As an example, a ZIP archive that includes an HTML file and an Excel spreadsheet may be encoded with IMT application/zip; and then when uncompressed the result may be two objects, one of type text/html and the other of type application/vnd.ms-excel.
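To illustrate how a composite IMT such as multipart/mixed breaks down into parts that each carry a distinct IMT, the following Python sketch builds an email message with one attachment and lists the IMT of every part. It uses Python's standard email package; the function names and the message contents are hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage


def build_example_message():
    """Build an email with a text body and an HTML attachment, as bytes."""
    message = EmailMessage()
    message["Subject"] = "report"
    message.set_content("See the attached page.")
    # Adding an attachment turns the message into multipart/mixed.
    message.add_attachment(
        "<html><body>howdy</body></html>",
        subtype="html",
        filename="page.html",
    )
    return message.as_bytes()


def list_part_types(raw_message):
    """Return the IMT of the message and of each of its sub-parts."""
    message = message_from_bytes(raw_message)
    return [part.get_content_type() for part in message.walk()]
```

Listing the parts of the example message yields the composite type first, followed by the types of the message proper and of the attachment.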
Each retrieved data item may include a native IMT. The native IMT is usually specified by the source (although occasionally it is desirable to force a specific native IMT, and the Runtime System allows scripts to do so).
A Runtime System 252 converter may map a given piece of content, together with its IMT, to a new piece of content of a different IMT. Such a conversion may be written, in one embodiment, as:
CIMT1,IMT2(data)→data’.
For example, transformer 256 may be configured to provide a converter from text/html to text/plain, which corresponds to a function such as:
Ctext/html,text/plain(data)→data’.
One example of applying such a function is:
Ctext/html,text/plain(“<html><body>howdy</body></html>”)=“howdy”.
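A converter from text/html to text/plain along the lines of the example above might be sketched in Python as follows. The names html_to_plain and _TextCollector are hypothetical, and the handling of script and style elements and of whitespace is a simplifying assumption rather than a description of transformer 256.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Gather character data while skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than zero inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_plain(data):
    """Map text/html content to text/plain by dropping all markup."""
    collector = _TextCollector()
    collector.feed(data)
    collector.close()
    # Collapse runs of whitespace so the layout of the markup is ignored.
    return " ".join("".join(collector.chunks).split())
```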
Transformer 256 may use a variety of procedures to convert data from one IMT to another IMT. As another example, in another embodiment, transformer 256 may use an algorithm to convert application/pdf data into either text/plain or text/x-layout. Transformer 256 may further employ an optical character recognition algorithm to convert any sort of image (e.g., image/* data) to application/rtf, application/vnd.ms-excel, application/vnd.ms-powerpoint, text/html, text/plain, text/x-layout, text/xml, or the like.
In addition, transformer 256 may also provide converters that are configured to extract records from byte-based data. Examples include: a text/html document that can be converted into a series of records each of which describes a single hyperlink in the original document. Similarly, a text/html document can be converted into a table of its images. In addition, transformer 256 may be configured to convert an application/pdf document into a sequence of application/pdf objects that represent each individual page in the original. Transformer 256 may also extract data from text/xml data using XPATH expressions. Transformer 256 in another embodiment, may employ a regular expression to convert any kind of text/* document into a sequence of records indicating the matches. In addition, transformer 256 may also provide converters that extract tabular (row and column) structure from text/* data. Any of a variety of available mechanisms to implement each of these translators may be employed without departing from the scope of the invention.
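As an illustrative sketch of converting text into a sequence of records using a regular expression, the following Python function emits one record per match, with columns named c1, c2, and so on. The function name regex_to_records and the column naming are assumptions made for this example.

```python
import re


def regex_to_records(text, pattern):
    """Return one record per match; each capture group becomes a column."""
    records = []
    for match in re.finditer(pattern, text):
        # With no capture groups, the whole match is the single column.
        groups = match.groups() or (match.group(0),)
        record = {f"c{i}": value for i, value in enumerate(groups, start=1)}
        records.append(record)
    return records
```

For example, applying a two-group pattern to a document of name and number pairs would produce records with two columns each.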
Taken as a whole, these converters can be represented, in one embodiment, as a directed graph in which nodes indicate IMTs, and in which there is a transition from node IMT1 to node IMT2 wherever transformer 256 provides a conversion between the two.
In the course of executing a script, the Runtime System 252 may fetch data of one type, IMT1, and convert it to another type, IMT2. This may be accomplished, in one embodiment, by searching its graph of converters for the shortest path between IMT1 and IMT2. This path through the graph corresponds to a sequence of converters that can be applied to the original data to convert it to the desired type.
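A search for the shortest sequence of converters can be sketched in Python with a breadth-first search over a directed graph of IMTs. The graph contents and the function name shortest_route below are hypothetical, and breadth-first search is only one of several ways such a path could be found.

```python
from collections import deque

# Hypothetical converter graph: source IMT -> IMTs reachable in one step.
CONVERTERS = {
    "application/msword": ["text/html", "text/plain"],
    "application/vnd.ms-excel": ["text/html"],
    "text/html": ["text/plain", "text/xml"],
}


def shortest_route(source, target, graph=CONVERTERS):
    """Return the shortest list of IMTs from source to target, or None."""
    if source == target:
        return [source]
    previous = {source: None}  # IMT -> IMT it was first reached from
    queue = deque([source])
    while queue:
        imt = queue.popleft()
        for nxt in graph.get(imt, []):
            if nxt in previous:
                continue  # already reached by a route at least as short
            previous[nxt] = imt
            if nxt == target:
                # Walk the back-pointers to rebuild the route.
                route = [nxt]
                while previous[route[-1]] is not None:
                    route.append(previous[route[-1]])
                return route[::-1]
            queue.append(nxt)
    return None  # no sequence of converters connects the two IMTs
```

Each consecutive pair of IMTs in the returned route names one converter to apply, so the route as a whole gives the sequence of intermediate formats.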
In one embodiment, transformer 256 may automatically determine the most effective way to convert the available data into the format required by a script. The script does not need to specify a “route” (sequence of converters) to take, and script-writers are generally unaware of the various intermediate formats to which their data is converted. Furthermore, in one embodiment, performance may be improved by use of a lazy content conversion, and the converted data may be cached in case it can be reused for subsequent conversions. That is, transformer 256 may employ lazy evaluation, also called delayed evaluation, which includes delaying a computation until such time as the result of the computation is known to be needed.
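Lazy conversion with caching of intermediate results might be sketched in Python as follows. The class name LazyContent, its methods, and the placeholder IMTs used in the usage note are hypothetical; the sketch assumes that a route (a list of IMTs) has already been chosen by some other means.

```python
class LazyContent:
    """Hold data in its native format and convert only on demand."""

    def __init__(self, data, native_imt, converters):
        self.native_imt = native_imt
        self._converters = converters  # {(source IMT, target IMT): callable}
        self._cache = {native_imt: data}  # every format produced so far
        self.conversions_run = 0  # counter, kept only for illustration

    def get(self, route):
        """Return the data in the last IMT of route, reusing cached steps."""
        if route[0] != self.native_imt:
            raise ValueError("route must start at the native IMT")
        current = self._cache[route[0]]
        for src, dst in zip(route, route[1:]):
            if dst not in self._cache:
                # The conversion runs only now, when its result is needed.
                self._cache[dst] = self._converters[(src, dst)](current)
                self.conversions_run += 1
            current = self._cache[dst]
        return current
```

In use, constructing a LazyContent object performs no conversion at all; a first call to get runs only the converters on the requested route, and a later call that shares a prefix of that route reuses the cached intermediate results.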
The approach for transforming data described here may provide several benefits over prior art data retrieval systems. For example, as with retrieval, script-writers or users can generally implement a script to perform a given data retrieval and transformation task using fewer lines of code compared to more traditional programming languages, which may therefore provide benefits in terms of an initial cost of development, as well as a cost of maintenance, and re-use. Considerable performance and scaling benefits may also be realized by retaining a native data format unless and until a different format is required. In addition, a flexible architecture is provided that may make it more straightforward to add or remove capabilities, such as new conversions from one IMT to another, or decoding procedures, new user-directed methods for decomposing bytes into records, and the like.
After the Runtime System has retrieved and transformed some data, a script may specify that it is to be normalized. In one embodiment, the Runtime System may provide a flexible mechanism for normalizing data according to arbitrary application-specific criteria, and for taking various actions in case the criteria are not satisfied.
As illustrated in
In one embodiment, normalizer 258 enables script-writers to define normalization procedures using a simple XML-based language. For example, to use a regular expression lookup table to normalize a piece of data as a U.S. state, one might use the following notation:
As a second example, the following normalization procedure checks a U.S. address by making an external communication, such as a Web Service call (via Perl), to a service such as Geocoder.US's address normalization service:
Normalizer 258 may enable script-writers to identify such XML descriptions using a URI, in one embodiment. For example, a script-writer could put the above “US State” XML document at http://mycorp.com/norm/usstate.xml, and then this URI could be used in a script construct to reference the normalization procedure.
Furthermore, normalizer 258 may allow script-writers or users to aggregate any number of such normalization procedures. For example, one embodiment could allow the procedures to simply be concatenated into a large file. In another embodiment, the script-writer could use a mechanism such as the ZIP archive format or the like to encapsulate a number of procedures in an archive. Normalizer 258 may then provide a procedure for normalizing data according to one specific procedure in such an aggregate. In still another embodiment, the script-writer could use the syntax URL#NAME to reference the normalization procedure NAME in the aggregate located at URL (similar to http URLs such as http://blahcorp.com/index.html#loc).
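The URL#NAME syntax can be split with the Python standard library's `urllib.parse.urldefrag`, which separates a URL from its fragment. The following is a sketch under assumptions: the function `split_procedure_reference` and the aggregate location `http://mycorp.com/norm/all.xml` are hypothetical.

```python
from urllib.parse import urldefrag


def split_procedure_reference(reference):
    """Split 'URL#NAME' into (URL, NAME); NAME is None when no fragment is given."""
    url, name = urldefrag(reference)
    return url, (name or None)


# A reference to the procedure "usstate" inside a hypothetical aggregate file.
aggregate_url, procedure_name = split_procedure_reference(
    "http://mycorp.com/norm/all.xml#usstate"
)
```

A reference without a fragment, such as http://mycorp.com/norm/usstate.xml, yields `None` for the name, which a caller could treat as "the document holds a single procedure".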
In one embodiment, the operation of Normalizer 258 may occur in more detail as follows.
These and similar normalization procedures that may be present in an embodiment of this invention may be implemented using any of a variety of mechanisms, without departing from the scope of the invention.
As well as modifying a given input into some canonical form, normalization procedures can also recognize that no such transformation is possible. If such a fault condition is encountered, in one embodiment, normalizer 258 may allow the script-writer to indicate what action should be taken. Options include (but are not limited to): leaving the original data intact, replacing the data with a special “null” value, halting script execution, or logging the problem in the script execution log.
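The fault-handling options listed above can be illustrated with a small regular-expression lookup table. This is a hedged sketch only: the table `US_STATE_TABLE`, the function `normalize`, the action names `"keep"`, `"null"`, and `"halt"`, and the list-based log are all hypothetical, and logging is modeled here as something that can accompany any of the other actions.

```python
import re

# Hypothetical regular-expression lookup table: pattern -> canonical value.
US_STATE_TABLE = [
    (re.compile(r"^(ca|calif(ornia)?\.?)$", re.IGNORECASE), "CA"),
    (re.compile(r"^(ny|new\s+york)$", re.IGNORECASE), "NY"),
]


class NormalizationError(Exception):
    """Raised when the chosen fault action is to halt script execution."""


def normalize(value, table=US_STATE_TABLE, on_fault="keep", log=None):
    """Return the canonical form of value, or apply the fault action."""
    for pattern, canonical in table:
        if pattern.match(value.strip()):
            return canonical
    # No pattern matched: optionally log, then apply the chosen fault action.
    if log is not None:
        log.append(f"could not normalize {value!r}")
    if on_fault == "keep":    # leave the original data intact
        return value
    if on_fault == "null":    # replace the data with a special null value
        return None
    if on_fault == "halt":    # stop script execution
        raise NormalizationError(value)
    raise ValueError(f"unknown fault action: {on_fault}")
```

For instance, `normalize("Calif.")` returns `"CA"`, while `normalize("Atlantis", on_fault="null")` returns `None`.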
Normalizing and validating data as described herein may provide several benefits over traditional methods. For example, it may seamlessly integrate standard built-in normalization rules, user-configurable normalization procedures, and invocation of arbitrary external code. In addition, normalization procedures may be stored, maintained, and re-used across a plurality of data resources, including scripts and applications that may be distributed over multiple machines on a network, rather than being bound to a specific column of a particular database table on a particular network location, as in many of the traditional approaches.
The operation of certain aspects of the invention will now be described with respect to
Process 300 begins, after a start block, at block 302, where a script writer, or the like, creates a script that may direct a search and retrieval of data. Such a script, as described above, may be composed using a database-like structured query syntax. However, the query may be performed on non-database structured data, and/or databases, applications, or the like. Moreover, the user may employ the above-described select and from clauses, or the like, to create the database-like structured query. The query clauses may then be passed to block 304.
At block 304, the query may be parsed to determine at which locations, such as network sites, applications, and the like, to commence a search, how deep to search a site, and what data to retrieve. In one embodiment, various network crawlers may be employed to search for and retrieve data. In some embodiments, an application may be executed at the network site to obtain the data, a form may be completed to further obtain data, or the like, based on the clauses used within the query.
Processing continues to blocks 305 and 307 where the retrieved data may be transformed into another format, again, based, in part, on the clauses within the query. Block 305 is further discussed in detail with reference to
Processing continues next to block 308, where the data may be manipulated (filtered, sorted, etc.), for example, according to application-specific requirements, or the like.
Processing continues to block 310, where the query may also include a request to normalize the data. Thus, at block 310, the retrieved data may be normalized, such as described above. The data may then be provided to the client device of the requester for further actions. Processing continues to block 311, where the data may be output to external files, network devices, or external executing processes, and/or directed back to earlier stages of Process 300. When completed, Process 300 then returns to a calling process to perform other actions.
Also shown in
Process 400 begins, after a start block, at block 410, where a first Internet Media Type associated with the retrieved data is determined. As discussed above, the first IMT may be explicitly indicated in the received data or by the source of the retrieved data. Such a first IMT may also be forced upon the retrieved data when desired. The first determined IMT serves as a starting point for generating a sequence of conversions or transforms, as is discussed below with reference to block 440. The first IMT is analogous, though by no means limited, to a starting node, such as node 610, in the translation graph 600 shown in
Next, process 400 continues to block 420, where a second IMT to be associated with the retrieved data is determined. In one embodiment, the second IMT may be explicitly entered in a query clause, such as through a “convert to” clause in the above-noted examples. The second IMT may also be implicitly determined based on the intended use of the data, as suggested by component objects referenced in a query clause, such as a “select” clause in the above-noted examples. The second IMT may also be implicitly determined from an indication, locally stored with Runtime System 252 or otherwise, of a default IMT associated with the native IMT of the data source. A default IMT may also be stored and used as the second IMT for all sequences of conversions made by Transformer 256 of
Next, processing flows to block 430, where a sequence selection scheme is determined from a plurality of predetermined selection schemes that are available for application. Each available selection scheme may at least define the criteria or principles that may be applied to determine an ideal or preferred sequence. Such a determined selection scheme may include, though is not limited to, at least one of a logically shortest sequence, a lowest computational cost, or a computationally fastest sequence. The logically shortest sequence refers to the fewest number of total transformations between a given first IMT and second IMT, regardless of other factors such as computational cost or speed, or the like. The lowest computational cost scheme refers to selecting the sequence of conversions that consumes the fewest resources, regardless of speed or number of transforms, or the like. The computationally fastest sequence refers to selecting the sequence that completes the conversion in the shortest amount of time, regardless of the amount of resources consumed or number of transforms, or the like. The determination of which of these selection schemes, or others not listed herein but also applicable, to apply may be based on an explicit indication in a query clause. The employed sequence selection scheme may also be determined from an indication, among available schemes, of a default scheme when, for example, no particular scheme is indicated in a query.
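The three selection schemes can be viewed as one weighted shortest-path search that differs only in how each conversion is weighted. The sketch below uses Dijkstra's algorithm as one possible technique; the source does not name an algorithm. The `EDGES` table, its cost and timing figures, the scheme names, and the choice of `"shortest"` as the default are all hypothetical.

```python
import heapq

# Hypothetical translation graph: (source IMT, target IMT, resource cost, seconds).
EDGES = [
    ("application/pdf", "text/csv", 9.0, 6.0),
    ("application/pdf", "text/html", 2.0, 4.0),
    ("text/html", "text/csv", 2.0, 1.0),
    ("application/pdf", "application/xml", 4.0, 1.0),
    ("application/xml", "text/csv", 3.0, 1.0),
]

# Each scheme maps a conversion's (cost, seconds) to the weight being minimized.
SCHEMES = {
    "shortest": lambda cost, seconds: 1,        # fewest transformations
    "lowest_cost": lambda cost, seconds: cost,  # fewest resources consumed
    "fastest": lambda cost, seconds: seconds,   # least total time
}


def select_sequence(source, target, scheme="shortest", edges=EDGES):
    """Return the list of IMTs on the preferred path, or None if unreachable."""
    weight = SCHEMES[scheme]
    graph = {}
    for src, dst, cost, seconds in edges:
        graph.setdefault(src, []).append((dst, weight(cost, seconds)))
    heap = [(0, [source])]
    settled = set()
    while heap:
        total, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            return path
        if node in settled:
            continue
        settled.add(node)
        for next_imt, step_weight in graph.get(node, []):
            if next_imt not in settled:
                heapq.heappush(heap, (total + step_weight, path + [next_imt]))
    return None
```

With this sample graph, the three schemes choose three different sequences from application/pdf to text/csv: the direct conversion, the route through text/html, and the route through application/xml, respectively.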
After at least this minimal amount of information is determined, processing flows to block 440, where a sequence of transforms may be generated using the first and second IMTs and the sequence selection scheme. Using the first and second IMTs as initial and final conversion formats, respectively, the generation comprises application of the sequence selection scheme to generate a sequence of transforms that best meets or conforms to the principles of the sequence selection scheme. For example, the generated sequence may be based on a shortest path between the first and second IMTs in the translation graph. Alternatively, such a generated sequence may be determined using a computational cost factor associated with each available transformation from one IMT to another IMT. Application of either of these schemes is further discussed below with regard to
After the generation of the sequence at block 440, processing continues to blocks 450 and 460, where the conversions or transformations represented in the sequence are formally applied to the received data. That is, at block 450, a transform or sequence of transforms is applied to the received data, which has been associated with a determined first IMT, to convert the received data into a format consistent with at least one intermediate format. After this application of transforms, the retrieved data is converted at block 460 from the at least one other or intermediate IMT to the data format consistent with the second IMT. Regardless of path or length of sequence, such transformations may be performed without further input or even breaks between the involved steps of transformation. After application of process 400 to received data, the process returns to perform other types of data handling, including, but not limited to normalization, such as described above in conjunction with
The retrieval stage 304 from
The conversion stage 305 from
The transitions in
Decoding 528 includes procedures for decoding, decompressing, decrypting, de-archiving, character set transcoding, and other similar operations. The transitions from byte stream 524 to decoding 528, and from file 526 to decoding 528, indicate that decoding 528 may be configured to operate on byte data originating from a native byte stream or data from local storage.
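Two of the decoding operations named above, decompression and character set transcoding, can be sketched with the Python standard library. The function `decode`, its parameters, and the choice of UTF-8 as the output character set are assumptions made for illustration; the same function accepts bytes whether they were read from a network stream or from a local file.

```python
import gzip


def decode(raw, content_encoding=None, charset="utf-8"):
    """Decompress the bytes if needed, then transcode them to UTF-8."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset).encode("utf-8")


# Bytes that arrive gzip-compressed and encoded as Latin-1.
compressed = gzip.compress("café".encode("latin-1"))
decoded = decode(compressed, content_encoding="gzip", charset="latin-1")
```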
Conversion 530 includes procedures for automatically converting data from one Internet Media Type (IMT) to another IMT, as explained in
The natural decomposition 534 includes converting data from some IMT using the particular conventional view of the IMT in terms of records. For example, the conventional view of a text/csv document in terms of records may involve generating one record per physical line, with the fields of each record delimited by commas as specified by the text/csv standard. Many IMTs have similar conventional decompositions into records. The transition from conversion 530 to natural decomposition 534 indicates that data of any IMT can be converted to records using its conventional decomposition into a default data structure, including potentially mixed or just a single data structure.
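For text/csv, the conventional decomposition into records corresponds closely to what Python's standard `csv` module does. The function name `decompose_csv` and the representation of records as tuples are assumptions made for this sketch.

```python
import csv
import io


def decompose_csv(data, charset="utf-8"):
    """Decompose text/csv bytes into a list of records (tuples of fields)."""
    text = data.decode(charset)
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [tuple(row) for row in reader]


records = decompose_csv(b"name,state\r\nAda,CA\r\n")
```

Here `records` holds one tuple per line of the input, with the first tuple carrying the column names.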
Composition 532 includes a process of aggregating records into a sequence of bytes formatted according to a specific IMT. For example, a table (sequence of records) can be composed into text/csv according to the text/csv standard. The transitions from tabular API 522 to composition 532, and from natural decomposition 534 to composition 532, indicate that records can be composed into bytes, regardless of their origin. The transition from composition 532 to conversion 530 indicates that the bytes generated from a set of records may be converted into another IMT if required. The transition from composition 532 to byte stream 524 indicates that an embodiment may permit a set of composed records to be pushed over network 512 to a network device that can receive it, or passed to an executing external program 514 for processing, backed by file 526, or the like.
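Composition into text/csv is the inverse operation and can likewise be sketched with the standard `csv` module. The function name `compose_csv` is hypothetical, and the records are assumed to be sequences of string fields.

```python
import csv
import io


def compose_csv(records, charset="utf-8"):
    """Compose a sequence of records into text/csv bytes."""
    buffer = io.StringIO(newline="")
    # csv.writer quotes fields containing commas and ends each line with CRLF.
    csv.writer(buffer).writerows(records)
    return buffer.getvalue().encode(charset)


composed = compose_csv([("name", "state"), ("Ada", "CA")])
```

The resulting bytes could then be converted to another IMT, written to a file, or sent over a network connection.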
Translation 307 includes a process of applying additional transformations to the bytes or records retrieved, decoded, converted, composed, and/or decomposed from their sources. User-specified decomposition 552 refers to applying one of many non-conventional procedures to extract records from bytes, such as (but not limited to) extracting links from HTML, images from HTML, individual pages from PDF, etc. The user-specified decomposition 552 was described in greater detail previously in this document. The transitions from composition 532 to user-specified decomposition 552 and from conversion 530 to user-specified decomposition 552 indicate that user-directed decompositions can be invoked on any byte data with an associated IMT, regardless of origin. Direct-access decomposition 554 refers to any form of selection, reconfiguration, or filtering of a set of records. For example, an embodiment may enable the elimination or renaming of columns in tabular data produced by a tabular API 522, or a natural decomposition 534, or the like.
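As an illustration of a user-specified decomposition, the sketch below extracts one record per hyperlink from text/html using the standard library's `html.parser.HTMLParser`. The class `LinkExtractor`, the function `extract_links`, and the (anchor text, destination URL) record layout are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects one (anchor text, destination URL) record per <a href=...>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.records.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html_text):
    """Decompose an HTML document into a list of link records."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.records


links = extract_links(
    '<p>See <a href="http://mycorp.com/">My Corp</a> and '
    '<a href="/docs">the <b>docs</b></a>.</p>'
)
```

Anchors without an `href` attribute are skipped, since they carry no destination URL to record.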
Manipulation 308 includes generating and combining expressions over the columns in a set of records. One embodiment may allow one or more of (but is not limited to) the following: arithmetic operations, string operations, logical operations, date/time operations, array operations, and the like, or arbitrary compositions of such operations. For example, a user-specified decomposition 552 may generate records containing the hyperlinks in a text/html document where each record comprises the anchor text and the destination URL, and manipulation 308 may allow an expression 562 that is the concatenation of the link anchor text, followed by “(” (opening parenthesis), followed by the destination URL, followed by “)” (closing parenthesis). These expressions generally correspond directly to the logic of the application being implemented. Manipulation 564 refers to the use of arbitrary expressions 562 in order to perform standard database operations on the data, such as (but not limited to) filtering, sorting, grouping, aggregating and/or joining the data.
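The link-label expression described above, combined with simple filtering and sorting, might look like the following sketch. The function names `link_label` and `manipulate`, the filter on absolute http(s) URLs, and the case-insensitive sort are hypothetical choices made only for illustration.

```python
def link_label(record):
    """Expression: anchor text, then "(", then the destination URL, then ")"."""
    anchor_text, url = record
    return f"{anchor_text}({url})"


def manipulate(records):
    """Filter to absolute http(s) links, sort by anchor text, then apply the expression."""
    kept = [record for record in records if record[1].startswith("http")]
    kept.sort(key=lambda record: record[0].lower())
    return [link_label(record) for record in kept]


labels = manipulate([
    ("Zeta", "http://z.example"),
    ("local", "/x"),
    ("alpha", "https://a.example"),
])
```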
Normalization 310 includes validating the data to check that it satisfies specific constraints as specified in normalization 572, and/or modifying the data to ensure that the constraints are satisfied. Normalization was described in great detail previously in this document.
Output 311 includes passing of records on for subsequent processing. The transitions from output 311 to tabular API 522, and from output 311 to composition 532, indicate that an embodiment may direct that the entire process depicted in
Application of the other two schemes, the lowest computational cost scheme and the computationally fastest sequence scheme, would involve assessment of the cost factors, such as cost factors 611, 612, 621, 631, and 641, which are shown in
Cost factors, such as factors 611 and 641, are shown in
Regardless of the manner in which a sequence is determined, the resulting generated sequence from block 440 may include at least one intermediate IMT, as discussed above. Such a sequence would also include the necessary conversions to and from this at least one intermediate IMT. This at least one intermediate IMT is included in the resulting sequence in a manner that is independent of any explicit indication of the IMT within any query clause in the query. Rather, this at least one other IMT may be determined based on a query clause in the query that indicates at least one component object from which at least one record is extracted from the retrieved data. As noted above, if a component object indicated in a query clause references tables, then the initial or first IMT, prior to converting to a second and final IMT, needs to be converted to an IMT that is also compatible with the indicated component object. The fulfillment of this requirement is apparent in the generated sequence. An end user may be unaware of this necessary conversion, yet is still able to obtain data from an otherwise incompatible IMT through the application of the conversion process 400 disclosed herein.
It will be further understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These program instructions may be provided to a processor to produce a machine, such that the instructions, which execute on the processor, create means for implementing the actions specified in the flowchart block or blocks. The computer program instructions may be executed by a processor to cause a series of operational steps to be performed by the processor to produce a computer-implemented process, such that the instructions, which execute on the processor, provide steps for implementing the actions specified in the flowchart block or blocks. The computer program instructions may also cause at least some of the operational steps shown in the blocks of the flowcharts to be performed in parallel. Moreover, some of the steps may also be performed across more than one processor, such as might arise in a multi-processor computer system. In addition, one or more blocks or combinations of blocks in the flowchart illustrations may also be performed concurrently with other blocks or combinations of blocks, or even in a different sequence than illustrated without departing from the scope or spirit of the invention.
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified actions, combinations of steps for performing the specified actions and program instruction means for performing the specified actions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified actions or steps, or combinations of special purpose hardware and computer instructions.
The above specification, examples, and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
This application claims the benefit of U.S. Provisional Application Ser. No. 60/891,935, filed Feb. 27, 2007, the benefit of the earlier filing date of which is hereby claimed under 35 U.S.C. § 119(e), and which is further incorporated herein by reference.
Number | Date | Country
---|---|---
60891935 | Feb 2007 | US