Embodiments are directed generally to a system and method for inserting document text into a database and for retrieving portions of the document text from that database. In particular, various embodiments can comprise a system and method for generating one or more keys from selected attributes occurring in input information, and for inserting output information comprising the keys into a database.
With respect to
In various embodiments, the information storage and retrieval application 105 can comprise one or more servlets that include a sequence of programmed instructions that, when executed by a processor of the server 101, cause the server 101 to be configured to perform database insertion and retrieval functions as described herein.
The database 107 can comprise a memory manager 109 and a storage device 111 provided in communication with the memory manager 109. In various embodiments, the database 107 can store and retrieve information or data in response to one or more Structured Query Language (SQL) instructions. The storage device 111 can comprise a hard disk drive configured to store information in accordance with SQL. Further, the memory manager 109 can comprise a database manager that includes a local memory 112. In various embodiments, the local memory 112 of the memory manager 109 can comprise a hash table index 113 and recently accessed database information from the storage device 111. In various embodiments, the local memory 112 of the memory manager 109 can have a shorter access time than the storage device 111. For example, the local memory 112 can comprise a Random Access Memory (RAM) and the storage device 111 can comprise a hard disk drive, in which case the local memory 112 can have an access time on the order of ten times faster than the storage device 111. In various embodiments, the local memory 112 can comprise a fixed memory size specified by a target threshold size parameter. The memory manager 109 can be configured to remove the oldest information in local memory 112 to provide capacity to store the transformed information and maintain the size of the local memory 112 below the target threshold size. The target threshold size and the frequency of checking whether or not the target threshold size has been exceeded can each be configurable parameters controlled by the user.
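The eviction behavior described for the local memory 112 can be illustrated with a minimal Java sketch. The class name BoundedLocalMemory is hypothetical, the target threshold size is assumed to be a count of entries rather than a byte size, and the threshold is checked on every insertion rather than at a configurable frequency as the description allows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a local memory that drops its oldest entry once a target size is exceeded. */
class BoundedLocalMemory<K, V> extends LinkedHashMap<K, V> {
    // Assumption: the threshold counts entries; the description does not fix the unit.
    private final int targetThresholdSize;

    BoundedLocalMemory(int targetThresholdSize) {
        // accessOrder = false keeps insertion order, so the eldest entry is the oldest one stored.
        super(16, 0.75f, false);
        this.targetThresholdSize = targetThresholdSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each put; returning true removes the eldest entry.
        return size() > targetThresholdSize;
    }
}
```

With a threshold of two, storing a third entry removes the first one stored, which keeps the local memory at or below the target size.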
The client device 102 can comprise a Personal Computer (PC) or workstation including, but not limited to, a desktop PC, laptop PC, tablet PC, Personal Digital Assistant (PDA), cellular terminal or handset, wireless terminal or handset, Internet appliance, or any other such device. In various embodiments, the client device 102 can comprise a communication interface configured to accomplish packet-based communication using the network 103. For example, the client device 102 can include a browser application such as Microsoft® Internet Explorer™ available from Microsoft Corporation of Redmond, Wash., or Mozilla Firefox™ available from the Mozilla Foundation of Mountain View, Calif. In various embodiments, the client device 102 can communicate with the server 101 using the network 103 in accordance with the HyperText Transfer Protocol (HTTP). For example, a user can establish a session with the server 101 by entering the Uniform Resource Locator (URL) associated with the server 101 into an address field of the browser application. In various embodiments, the client device 102 can also comprise a standard set of hardware and software such as, but not limited to, a processor, Read Only Memory (ROM), Random Access Memory (RAM), communication ports, user interface, operating system, application programs, as well as standard peripherals such as, but not limited to, a data entry device such as a keyboard, a pointing and selection device such as a mouse or trackball, and a display. The operating system can be configured to support application programs configured to accept user input via the user interface in the form of interactive pages comprising static and dynamic display data and data entry fields.
In various embodiments, the network 103 can comprise a packet-based network configured to transfer packet-based information. For example, the network 103 can comprise an Internet Protocol (IP) based network in which information is transferred in accordance with the Transmission Control Protocol (TCP)/IP standard such as, for example, the Internet. In various embodiments, the network 103 can comprise an intranet, a wireless communication network such as Global System for Mobile Communications (GSM) or Code Division Multiple Access (CDMA), a satellite communication network, or a Local Area Network (LAN) or Wireless LAN based on, for example, the IEEE 802.11 standard. Other variations are possible. For example, the network 103 can also comprise a connection-based network such as, for example, the Public Switched Telephone Network (PSTN).
With respect to
In various embodiments, the input/output portion 150 can comprise a sequence of Java™ instructions that configure the information storage and retrieval application 105 to input and output information in accordance with the HyperText Transfer Protocol (HTTP). Other embodiments are possible. For example, in various alternative embodiments, the information storage and retrieval application 105 can comprise one or more Common Gateway Interface (CGI) scripts.
Further, in various embodiments, the translator portion 160 can comprise a markup language translator configured to read input information and translate the input information into output information in accordance with translation instructions. In various embodiments, the input information and output information can each be a text stream formatted in accordance with the Extensible Markup Language (XML). Further, the markup language translator can be configured to perform Extensible Stylesheet Language Transformations (XSLT) in accordance with translation instructions specified by one or more Extensible Stylesheet Language (XSL) stylesheets 165. The translator portion 160 can accept the input information as an input file or as a document contained in an input file. The translator portion 160 can provide the output information as an output file. The translator portion 160 can thus operate as an XSLT parser configured to translate a first XML document into a second XML document, for example. In various embodiments, the stylesheets 165 can be instantiated at time of application installation. In various embodiments, the stylesheets 165 are maintained in non-volatile storage of the server 101, but are not included in the database 107.
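As an illustration of translating a first XML document into a second XML document under the control of a stylesheet, the following minimal Java sketch uses the standard javax.xml.transform API. The class name XsltTranslator and the method name translate are hypothetical, and passing the documents as strings rather than files is an assumption made to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch of a translator that applies an XSL stylesheet to XML input. */
class XsltTranslator {
    static String translate(String inputXml, String stylesheetXsl) {
        try {
            // Compile the stylesheet into a Transformer.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheetXsl)));
            StringWriter output = new StringWriter();
            // Apply the translation instructions to the input information.
            transformer.transform(new StreamSource(new StringReader(inputXml)),
                    new StreamResult(output));
            return output.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Translation failed", e);
        }
    }
}
```

A caller would supply the stylesheet text read from non-volatile storage of the server; the sketch does not model how the stylesheets 165 are stored or selected.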
In various embodiments, the database interface portion 170 can be configured to communicate with the database 107. For example, the database interface portion 170 can be configured to generate and output to the database 107 an information storage request or an information retrieval request. The information storage and information retrieval requests can be formatted in accordance with the Structured Query Language (SQL). Database requests from the database interface portion 170 can be received by the memory manager 109 of the database 107. In various embodiments, the database interface portion 170 can comprise a Java™ servlet.
In operation, in various embodiments, the translator portion 160 can be configured to receive input information and translate the input information, in accordance with translation instructions specified by one or more stylesheets 165, into output information to be stored in the database 107. In particular, the translator portion 160 can be configured to generate a key from an attribute occurring in the input information, the input information being formatted in accordance with a markup language. In various embodiments, the key can be an index key used for retrieving the output information from the database 107. A different key can be associated with each of many different types of attributes. In various embodiments, the attributes in the input information that are used by the translator portion 160 to generate the keys can be defined in one or more stylesheets 165. The stylesheets 165 can be customized to generate keys from a variety of attribute types according to the needs of the user.
Furthermore, stylesheets 165 can be used to specify to the translator portion 160 the manner in which to add the keys to a hash table index. In various embodiments, the hash table index can comprise an internal database index.
With respect to
In various embodiments, the database interface portion 170 can be configured to apply an insertion instruction page 303 to select insertion of the output information 302 into the database 107 as either a single document or file, or as several compressed documents or files. The insertion instruction page 303 can comprise a markup language file such as, for example, a HyperText Markup Language (HTML) page. The database interface portion 170 can then upload the input information 301 for insertion into the database 107. In various embodiments, the database interface portion 170 can comprise a Java™ servlet. The input information 301 can comprise XML formatted information. In various embodiments, the input information 301 can be compressed using a compression algorithm such as, for example, the java.util.zip Java™ compression utility of the Java™ 2 Platform Std. Ed. v 1.4.2 available from Sun Microsystems of Santa Clara, Calif. In various alternative embodiments, another ZIP compression algorithm can be used such as, for example, PKZIP available from PKWARE, Inc. of Milwaukee, Wis., or the WinZip™ product available from WinZip Computing.
Furthermore, in various embodiments, the translator portion 160 can be configured to generate multiple levels of identifiers. Each level of identifiers can be hierarchically related to another one of the levels (for example, the immediately preceding level or the immediately following level). In various embodiments, a top-level identifier can serve to identify an entire input information 301 file such as, for example, an XML file. Multiple sub-level identifiers can be provided, wherein each sub-level identifier serves to identify any XML in the input information 301 that meets the attribute criteria specified in the applicable stylesheet 165. Further, the translator portion 160 can be configured to index all of the identifiers, or keys, by associating each sub-level identifier with its immediately preceding (for example, next highest priority) sub-level identifier, and by associating each sub-level identifier with its top-level identifier.
Example input information 301 is set forth in Table 1 below. As shown in Table 1, the input information 301 can comprise an XML file.
Upon receiving the input information 301 shown in Table 1, the translator portion 160 can apply the first stylesheet 165 to generate the identifiers. For example, if the stylesheet 165 specifies the “ID” attribute in the input information 301 to be used to generate identifiers, the translator portion 160 can generate one identifier for every occurrence of the “ID” attribute encountered in the input information 301. Each generated identifier is included in the output information 302. Thus, the output information 302 generated by the translator portion 160 can comprise one or more of the identifiers, each of which corresponds to an occurrence of the selected attribute(s) in the input information 301, identifies the information associated with the attribute in the input information 301, and is added or inserted into the database 107.
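The generation of one identifier per occurrence of a selected attribute, together with the hierarchy of identifiers, can be sketched in Java as follows. This sketch walks the XML with the standard DOM API instead of applying a stylesheet, so it is a plain-Java stand-in for the stylesheet-driven approach rather than the described mechanism. The names KeyGenerator, Key, and generateKeys are hypothetical, and taking the top-level identifier from the root element's attribute is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch: one key per occurrence of a chosen attribute, linked to its parent and top level. */
class KeyGenerator {
    static final class Key {
        final String id;          // sub-level identifier taken from the attribute value
        final String parentId;    // immediately preceding level identifier, or null at the top
        final String topLevelId;  // identifier of the whole input document (assumed: root attribute)

        Key(String id, String parentId, String topLevelId) {
            this.id = id;
            this.parentId = parentId;
            this.topLevelId = topLevelId;
        }
    }

    static List<Key> generateKeys(String xml, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            List<Key> keys = new ArrayList<>();
            walk(root, attributeName, null, root.getAttribute(attributeName), keys);
            return keys;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input information", e);
        }
    }

    // Depth-first walk in document order; each element carrying the attribute yields a key.
    private static void walk(Element element, String attributeName, String parentId,
                             String topLevelId, List<Key> keys) {
        String current = parentId;
        if (element.hasAttribute(attributeName)) {
            current = element.getAttribute(attributeName);
            keys.add(new Key(current, parentId, topLevelId));
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, attributeName, current, topLevelId, keys);
            }
        }
    }
}
```

Each resulting key associates a sub-level identifier both with its immediately preceding identifier and with the top-level identifier, which is the association the index relies on at retrieval time.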
In various embodiments, the output information 302 can comprise keys in a hash table index. A second stylesheet 165 can be used to specify to the translator portion 160 the manner in which to add the keys to a hash table index. The hash table index can comprise an internal database index. In various embodiments, the hash table index can be stored using the hash table 113 of the memory manager 109.
Example output information 302 is set forth in Table 2 below. As shown in Table 2, the output information 302 can comprise an XML file.
With respect to
After insertion into the database 107, the inserted document text, for example, markup language information of the input information 301, can be retrieved from the database 107 using the hash table index (for example, output data 302). With respect to
Although six keys are shown in Table 3, it is to be understood that any number of keys can be included in the hash table index. The input/output portion 150 can forward the database read request to the database interface portion 170. Upon receiving the database read request, the database interface portion 170 can search the keys in the hash table index 113, via table look-up or other method, for the identifier contained in the database read request. For example, the database interface portion 170 can perform a table lookup of the keys in the hash table index 113 to determine that the second key in Table 3 corresponds to the specific identifier (“my.test.link”) contained in the example database read request. The database interface portion 170 can then form a database request using the sub-level identifier and top-level identifier located in the hash table index 113. The database interface portion 170 can then send the database request to the database 107.
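The table lookup described above can be sketched with a standard Java HashMap. The class name HashTableIndex, its methods, and the SQL text (including the table and column names) are hypothetical and serve only to show how a located key could supply the top-level identifier for a database request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a hash table index mapping sub-level identifiers to top-level identifiers. */
class HashTableIndex {
    // Assumed request shape; the table and column names are illustrative only.
    static final String SELECT_DOCUMENT_SQL =
            "SELECT content FROM documents WHERE top_level_id = ?";

    private final Map<String, String> topLevelBySubLevel = new HashMap<>();

    void addKey(String subLevelId, String topLevelId) {
        topLevelBySubLevel.put(subLevelId, topLevelId);
    }

    /** Returns the top-level identifier for the requested identifier, or empty if no key matches. */
    Optional<String> findTopLevel(String requestedId) {
        return Optional.ofNullable(topLevelBySubLevel.get(requestedId));
    }
}
```

For the example identifier, a lookup of “my.test.link” would yield “my.test”, which would then be bound to the query parameter; an empty result corresponds to the case in which no matching key exists.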
In various embodiments, upon receiving the database request, the memory manager 109 of the database 107 can determine if the information corresponding to the identifier is contained in local memory 112 at the memory manager 109. If so, then the memory manager 109 can return the information (for example, XML) associated with the identifier in the database request to the database interface portion 170, without reading the information from the storage device 111. Because the local memory 112 has a faster access time latency than the storage device 111, storing information locally using the memory manager 109 reduces the access time to the client device 102 to obtain the requested information.
If the requested information is not contained in the local memory 112 of the memory manager 109, then the memory manager 109 performs a database read operation to obtain the requested information from the storage device 111. The memory manager 109 also can add the information read from the storage device 111 to a hash table contained in local memory 112, for faster access to the information in response to subsequent requests for it. In various embodiments, the information obtained from the database can comprise the entire file or entire amount of information associated with the top-level identifier. For example, locating the key “ID=‘my.test.link’, Top-level=‘my.test’” will result in the database 107 returning the entire file (for example, XML document) associated with the “my.test” top-level identifier.
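The hit and miss paths of the memory manager can be illustrated with the following minimal Java sketch. The class name ReadThroughMemoryManager is hypothetical, the storage device is modeled as a function from a top-level identifier to document text, and the read counter exists only to make the behavior observable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a memory manager that serves reads from local memory when possible. */
class ReadThroughMemoryManager {
    private final Map<String, String> localMemory = new HashMap<>();
    private final Function<String, String> storageDevice; // stand-in for the slower storage read
    private int storageReads = 0;

    ReadThroughMemoryManager(Function<String, String> storageDevice) {
        this.storageDevice = storageDevice;
    }

    String read(String topLevelId) {
        String cached = localMemory.get(topLevelId);
        if (cached != null) {
            return cached; // hit: no read from the storage device
        }
        storageReads++;
        String document = storageDevice.apply(topLevelId);
        if (document != null) {
            localMemory.put(topLevelId, document); // keep for subsequent requests
        }
        return document;
    }

    int storageReads() {
        return storageReads;
    }
}
```

Reading the same top-level identifier twice reaches the storage stand-in only once; the second request is answered from local memory.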
In various embodiments, upon receiving the information from the database 107, the database interface portion 170 can forward the received information to the translator portion 160. The translator portion 160 can apply a third stylesheet 165 that parses the information received from the database to strip out unwanted information prior to presenting or outputting the information to the client device 102. For example, for the database access request comprising the sub-level identifier “my.test.link,” the translator portion 160 can remove all but the following information as shown in Table 4:
In this case, for information flowing from the database to the client device, the information obtained from the database 107 can comprise information input to the translator portion 160, and the transformed information provided to the client device can comprise information output by the translator portion 160. The transformed database output information can then be forwarded to the input/output portion 150 and transferred to the client device 102 for further processing such as, for example, display to a user.
Therefore, unlike other databases available for maintaining markup language information, various embodiments comprising a system and method for inserting document text into a database and for retrieving portions of the document text from that database as described herein can provide, among other things, improved speed and efficiency in indexing and searching of information as well as improved speed of information retrieval from a database, because only the desired data is transferred to the requesting device. Further, various embodiments can be implemented using a relatively small number of instructions compared to other systems. While other databases use XPATH mechanisms to extract markup language from a database, various embodiments use unique keys created from attribute names to identify and obtain information from a database. In addition, various embodiments comprising the customized stylesheets allow the user the capability to customize how information is parsed into the database and also how information is displayed to the user.
With respect to
With respect to
With respect to
In various embodiments, the stylesheets 165 of
With respect to
Control can then proceed to 607, at which the file for database insertion can be received by the input/output portion 150 of the database servlet. Upon recognizing a file for database insertion, the input/output portion 150 can forward the file to the translator portion 160. Control can then proceed to 609, at which, upon receiving the input information (for example, the file for database insertion), the translator portion 160 can select the first stylesheet 165. In various embodiments, the first stylesheet 165 can be retrieved from a memory of the server 101 or using the network 103. Control can then proceed to 611, at which the translator portion 160 can apply the first stylesheet 165 to the received input information to generate a key for each occurrence in the input information of one of the attributes specified in the first stylesheet 165. In various embodiments, the key can comprise one or more identifiers. Control can then proceed to 613, at which the translator portion 160 can construct a hierarchy of related identifiers as the keys are generated. In various embodiments, the keys can comprise, for example, a first sub-level identifier and another identifier that is the immediately preceding level identifier to which the first sub-level identifier belongs. Control can proceed to 615, at which the translator portion 160 can determine if the end of the input information has been reached (for example, end of file). If not, then control can return to 611 to search for the next attribute in the input information selected by the first stylesheet 165, until keys have been generated for all matching attributes found in the input information.
Control can then proceed to 617, at which the translator portion 160 can select the second stylesheet 165. In various embodiments, the second stylesheet 165 can be retrieved from a memory of the server 101 or using the network 103. Referring to
Control can then proceed to 623, at which the database interface portion 170 can retrieve the insertion instruction page 303 from the database 107. The insertion instruction page 303 can comprise a markup language file such as, for example, a HyperText Markup Language (HTML) page. Control can then proceed to 625, at which the database interface portion 170 can apply the insertion instruction page 303 to select the insertion mode for adding the input information 301 into the database 107. Control can proceed to 627, 629, or 631 for insertion of the input information 301 into the database 107 in accordance with the insertion instruction page 303. For example, at 627, the database interface portion 170 can format the input information 301 for insertion into the database 107 without using any compression. Alternatively, at 629, the database interface portion 170 can format the input information 301 for insertion into the database 107 by performing data compression of the input information 301 as a single document. In various embodiments, the input information 301 can be compressed using a compression algorithm such as, for example, the java.util.zip compression utility. Alternatively, at 631, the database interface portion 170 can format the input information 301 for insertion into the database 107 by performing data compression of the input information 301 as multiple distinct files. For example, if the input information 301 is received as a single ZIP file, then the database interface portion 170 can unzip the ZIP file and insert individually each compressed file that is included in the ZIP file. In various embodiments, the database interface portion 170 can be configured to insert the input information 301 into the database 107 using the METHOD=“POST” HTML instruction.
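The single-document compression mode and the splitting of a received ZIP file into its individual entries can be sketched with the java.util.zip package. The class name InsertionFormatter and its method names are hypothetical. As a simplification, the sketch decompresses each entry to text so that the round trip can be checked, whereas the description inserts each file in compressed form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch of formatting input information for compressed insertion. */
class InsertionFormatter {
    /** Compresses one document into a ZIP archive holding a single entry. */
    static byte[] compressSingle(String entryName, String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Splits a ZIP archive into its entries so that each file can be handled individually. */
    static Map<String, String> unzipEntries(byte[] zipBytes) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[4096];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                // read() returns -1 at the end of the current entry.
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                entries.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

The uncompressed mode needs no formatting step and is therefore not shown.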
Control can then proceed to 633, at which the database interface portion 170 can store in, or upload to, the database 107, the input information 301 from 627 or the compressed input information 301 from 629 or 631 as either a single document or file, or as several compressed documents or files.
With respect to
The method can then proceed to 705, at which the information storage and retrieval application 105 (for example, database servlet) can receive the database read request from the client device 102. In particular, upon receiving a database read request from the client device 102, the input/output portion 150 can forward the database read request to the database interface portion 170. For example, the database read request can comprise the sub-level identifier “ID=‘my.test.link’.”
Control can then proceed to 707, at which, upon receiving the database read request, the database interface portion 170 can search the keys in the hash table index 113, via table look-up or other method, for the identifier contained in the database read request. For example, the database interface portion 170 can perform a table lookup of the keys in the hash table index 113 to determine the key that corresponds to the specific identifier contained in the database read request. Control can then proceed to 709, at which the database interface portion 170 can determine if the hash table index 113 contains keys matching the specific identifier contained in the database read request. If not, control can proceed to 711, at which the database interface portion 170 can send (via the input/output portion 150) an error message to the client device 102 indicating no matching entry in the database 107. In various embodiments, the error message can comprise an HTTP response indicating request failure.
If a key is located within the hash table index 113, then control can proceed to 713, at which the database interface portion 170 can form a database request using the sub-level identifier, if received, and top-level identifier located in the hash table index 113, and then send the database request to the database 107.
Control can then proceed to 715, at which, upon receiving the database request, the memory manager 109 of the database 107 can determine if the information corresponding to the identifier is contained in local memory 112 at the memory manager 109. If so, then control can proceed to 717, at which the memory manager 109 can return the information (for example, XML) associated with the identifier in the database request to the database interface portion 170, without reading the information from the storage device 111. Because the local memory 112 has a faster access time latency than the storage device 111, storing information locally using the memory manager 109 reduces the access time to the client device 102 to obtain the requested information.
If the requested information is not contained in the local memory 112 of the memory manager 109, then control can proceed to 719, at which the memory manager 109 performs a database read operation to obtain the requested information from the storage device 111. In various embodiments, the information obtained from the database storage device 111 can comprise the entire file or entire amount of information associated with the top-level identifier. For example, locating the key “ID=‘my.test.link’, Top-level=‘my.test’” will result in the database 107 returning the entire file (for example, XML document) associated with the “my.test” top-level identifier.
Control can then proceed to 721, at which, upon receiving the information from the database 107, the database interface portion 170 can forward the received information to the translator portion 160. The translator portion 160 can apply a third stylesheet 165 that parses the information received from the database to strip out unwanted information prior to presenting or outputting the information to the client device 102, such that the transformed information returned to the client device 102 is only the information associated with the selected sub-level identifier, and not the remaining information in the document stored in the database. Therefore, only the information needed by the client device 102 is actually transferred to the client device 102, resulting in more efficient and timely responses to database requests. In various embodiments, the translator portion 160 can be configured to perform an XSL translation that results in only pertinent data being obtained. For example, the translator portion 160 can be configured to extract information by identifier and by attributes passed to the database 107. Values that do not agree with the attributes can be removed. Elements that do not contain the attributes or match can be passed back to the client.
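The stripping step can be illustrated with a minimal Java sketch in which a stylesheet copies only the element whose “ID” attribute equals the requested sub-level identifier. The stylesheet text, the class name SubLevelExtractor, and the parameter name id are all hypothetical; the actual third stylesheet 165 is not reproduced here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: keep only the element matching a requested sub-level identifier. */
class SubLevelExtractor {
    // Assumed stylesheet: copies every element whose ID attribute equals the "id" parameter.
    private static final String EXTRACTION_STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:param name='id'/>"
          + "<xsl:template match='/'><xsl:copy-of select='//*[@ID=$id]'/></xsl:template>"
          + "</xsl:stylesheet>";

    static String extract(String documentXml, String subLevelId) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(EXTRACTION_STYLESHEET)));
            transformer.setParameter("id", subLevelId); // identifier from the read request
            StringWriter output = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(documentXml)),
                    new StreamResult(output));
            return output.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }
}
```

Given the whole document associated with the “my.test” top-level identifier, a request for “my.test.link” would yield only the matching element and its content, so the remainder of the document is not transferred to the client device.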
Referring to
If at 727 the memory manager 109 determines that the local memory 112 size has exceeded the target threshold size, then control can proceed to 729, at which the memory manager 109 can remove the oldest information in local memory 112 to provide capacity to store the transformed information and maintain the size of the local memory 112 below the target threshold size. In various embodiments, the target threshold size is configurable and can be modified by, for example, updating an input parameter specifying the target threshold size contained in a configuration file.
Control can then proceed to 731, at which the input/output portion 150 can send the transformed database information to the client device 102 for further processing such as, for example, display to a user. In various embodiments, the transformed information obtained from the database can be output to the client device 102 as an HTTP response. Control can then proceed to 733, at which the method can end.
Thus has been disclosed a system and method for inserting document text into a database and for retrieving portions of the document text from that database. The system and method can provide, among other things, improved speed and efficiency in indexing and searching of information as well as improved speed of information retrieval from a database, because only the desired data is transferred to the requesting device.
Various embodiments can be implemented using hardware and software components including the PC and related peripherals as described herein. However, it is further apparent to those skilled in the art that the disclosed system may be readily implemented in software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware using standard logic circuits or a VLSI design. Other hardware or software can be used to implement the systems in accordance with this invention depending on the speed and/or efficiency requirements of the systems, the particular function, and/or a particular software or hardware system, microprocessor, or microcomputer system being utilized. The system and method herein can be readily implemented in hardware and/or software using any known or later developed systems or structures, devices and/or software by those of ordinary skill in the applicable art from the functional description provided herein and with a general basic knowledge of the computer and mark-up language arts.
Moreover, the disclosed methods may be readily implemented in software executed on a programmed general-purpose computer, a special purpose computer, a microprocessor, or the like. In these instances, the systems and methods of this invention can be implemented as a program embedded on a personal computer, such as a Java™ or CGI script, as a resource residing on a server or graphics workstation, as a routine embedded in a dedicated encoding/decoding system, or the like. The system can also be implemented by physically incorporating the system and method into a software and/or hardware system, such as the hardware and software systems of an image processor.
While embodiments of the invention have been described above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the applicable arts. Accordingly, the embodiments of the invention, as set forth above, are intended to be illustrative, and should not be construed as limitations on the scope of the invention. Various changes may be made without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be determined not by the embodiments illustrated above, but by the claims appended hereto and their legal equivalents.