The present disclosure relates in general to databases, and, in particular, to methods and apparatus for modifying a plurality of markup language files.
The vast majority of documents we create and/or archive are stored electronically. In order to quickly find certain documents, the relevant data from these documents is typically extracted, catalogued, and organized in a database to make the documents searchable in a document review application. For example, as part of the discovery process in a lawsuit, millions of documents may need to be reviewed.
One type of document that is frequently reviewed is a web page. Web pages are defined by a markup language file, such as a hypertext markup language (HTML) file. Web pages typically contain links to other web pages, and the path to the linked web page is stored in the markup language file. However, the process of bringing the documents into the document review application typically renames the files, thereby breaking these links.
Briefly, methods and apparatus for modifying a plurality of markup language files are disclosed. In general, web pages are renamed as they are brought into a document review application, and a data structure is created that associates the old name of each web page with the new name of each web page. Then, all of the links in the web pages are modified to also use the new names. As a result, users of the document review application may review the web pages with functional links.
Turning now to the figures, the present system is most readily realized in a network communications system 100. A block diagram of certain elements of an example network communications system 100 is illustrated in
The web server 106 stores a plurality of files, programs, and/or web pages in one or more databases 108 for use by the client devices 102 as described in detail below. The database 108 may be connected directly to the web server 106 and/or via one or more network connections. The database 108 stores data as described in detail below.
One web server 106 may interact with a large number of client devices 102. Accordingly, each server 106 is typically a high end computer with a large storage capacity, one or more fast microprocessors, and one or more high speed network connections. Conversely, relative to a typical server 106, each client device 102 typically includes less storage capacity, a single microprocessor, and a single network connection.
In this example, user 114a is using client device 102a and client device 102b. For example, user 114a may be reviewing documents displayed on a desktop display of client device 102a and coding those documents using a touch screen on client device 102b.
Each of the devices illustrated in
The memory 208 may include various types of non-transitory memory including volatile memory and/or non-volatile memory such as, but not limited to, distributed memory, read-only memory (ROM), random access memory (RAM) etc. The memory 208 typically stores a software program that interacts with the other devices in the system as described herein. This program may be executed by the processing unit 204 in any suitable manner. The memory 208 may also store digital data indicative of documents, files, programs, web pages, etc. retrieved from a server and/or loaded via an input device 214.
The interface circuit 212 may be implemented using any suitable interface standard, such as an Ethernet interface and/or a Universal Serial Bus (USB) interface. One or more input devices 214 may be connected to the interface circuit 212 for entering data and commands into the main unit 202. For example, the input device 214 may be a keyboard, mouse, touch screen, track pad, camera, voice recognition system, accelerometer, global positioning system (GPS), and/or any other suitable input device.
One or more displays, printers, speakers, monitors, televisions, high definition televisions, and/or other suitable output devices 216 may also be connected to the main unit 202 via the interface circuit 212. One or more storage devices 218 may also be connected to the main unit 202 via the interface circuit 212. For example, a hard drive, CD drive, DVD drive, and/or other storage devices may be connected to the main unit 202. The storage devices 218 may store any type of data used by the device 200. The computing device 200 may also exchange data with one or more input/output (I/O) devices 220, such as network routers, cameras, audio players, thumb drives, etc.
The computing device 200 may also exchange data with other network devices 222 via a connection to a network 110. The network connection may be any type of network connection, such as an Ethernet connection, digital subscriber line (DSL), telephone line, coaxial cable, wireless base station 230, etc. Users 114 of the system 100 may be required to register with a server 106. In such an instance, each user 114 may choose a user identifier (e.g., e-mail address) and a password which may be required for the activation of services. The user identifier and password may be passed across the network 110 using encryption built into the user's browser. Alternatively, the user identifier and/or password may be assigned by the server 106.
In some embodiments, the device 200 may be a wireless device 200. In such an instance, the device 200 may include one or more antennas 224 connected to one or more radio frequency (RF) transceivers 226. The transceiver 226 may include one or more receivers and one or more transmitters operating on the same and/or different frequencies. For example, the device 200 may include a Bluetooth transceiver 226, a Wi-Fi transceiver 226, and diversity cellular transceivers 226. The transceiver 226 allows the device 200 to exchange signals, such as voice, video and any other suitable data, with other wireless devices 228, such as a phone, camera, monitor, television, and/or high definition television. For example, the device 200 may send and receive wireless telephone signals, text messages, audio signals and/or video signals directly and/or via a base station 230.
In general, web pages are renamed as they are brought into a document review application, and a data structure is created that associates the old name of each web page with the new name of each web page. Then, all of the links in the web pages are modified to also use the new names. As a result, users of the document review application may review the web pages with functional links.
More specifically, in this example, the process 300 begins when the processor 204 receives a first markup language file (block 302). For example, the processor may read a first hypertext markup language (HTML) file into an electronic document review application. The processor 204 then renames the first markup language file from a first name to a second different name (block 304). For example, the processor may rename the file from “ProductDescription.htm” to “0001.htm.” The processor 204 then optionally removes or modifies footer information and/or converts the first markup language file to a Portable Document Format (PDF) file (block 306). An example of a portion of an HTML file 702/704 before and after footer information is removed is illustrated in
The processor 204 then stores the first markup language file in an electronic document review database using the second name (block 308). For example, the processor may store the document as “0001.htm” in a legal discovery application environment. The processor 204 then creates a data structure including an association between the first name and the second name (block 310). For example, the processor may store “ProductDescription.htm” in association with “0001.htm” in the electronic document review database.
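By way of illustration only, blocks 304, 308, and 310 may be sketched as follows. The sketch assumes an in-memory dictionary stands in for the electronic document review database and that new names are four-digit sequence numbers that keep the original file extension; the function name `ingest` and both of these choices are hypothetical and are not taken from the disclosure.

```python
import os


def ingest(documents, start=1):
    """Rename documents sequentially and record the old-to-new association.

    documents: dict mapping each original file name to its markup text.
    Returns (store, name_map), where store maps each new name to the markup
    text (a stand-in for the review database) and name_map maps each
    original name to its new name.
    """
    store = {}
    name_map = {}
    for number, (old_name, markup) in enumerate(documents.items(), start=start):
        # Keep the original extension, e.g. ".htm".
        _, extension = os.path.splitext(old_name)
        new_name = f"{number:04d}{extension}"
        store[new_name] = markup          # block 308: store under the second name
        name_map[old_name] = new_name     # block 310: associate first and second names
    return store, name_map
```

For instance, ingesting “ProductDescription.htm” as the first document would yield the association “ProductDescription.htm” to “0001.htm” under these assumptions.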
The processor 204 then determines that a link in a second different markup language file includes the first name (block 312). For example, the processor may find a hypertext reference (HREF) attribute in another HTML file that includes “ProductDescription.htm.” The processor 204 then creates a modified second markup language file including a modified link by modifying the link in the second markup language file to include the second name (block 314). For example, the processor may replace “ProductDescription.htm” in the second HTML file with “0001.htm.” Examples of portions of HTML files before and after modification are illustrated in
The processor 204 then stores the modified second markup language file in the electronic document review database (block 316). For example, the processor may store the modified document in the legal discovery application environment. The processor 204 then receives a user selection of the modified link (block 318). For example, the user of the legal discovery application may click on the hyperlink containing “0001.htm.” The processor 204 then displays the first markup language file in response to receiving the user selection (block 320). For example, the processor may show the webpage “0001.htm.”
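One possible way to carry out blocks 312 and 314 is sketched below. It is a minimal example, not the disclosed implementation. It assumes that links appear as quoted HREF attributes, that renamed files are stored in a single flat location so that only the file name portion of each link matters, and that a dictionary such as `name_map` holds the association between old and new names. The function name `rewrite_links` is hypothetical.

```python
import posixpath
import re

# Matches a quoted href attribute: prefix, quote character, link target.
HREF = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_links(markup, name_map):
    """Return markup with every link to a renamed file pointing at its new name."""

    def replace(match):
        prefix, quote, target = match.groups()
        # Separate the path from any trailing ?query or #fragment.
        split = re.search(r"[?#]", target)
        path = target[:split.start()] if split else target
        rest = target[split.start():] if split else ""
        new_name = name_map.get(posixpath.basename(path))
        if new_name is None:
            return match.group(0)  # not a renamed file; leave the link untouched
        return f"{prefix}{quote}{new_name}{rest}{quote}"

    return HREF.sub(replace, markup)
```

Because this sketch matches on the file name alone, a link to an unrelated site that happens to end in the same file name would also be rewritten; a fuller implementation might compare complete paths before substituting.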
In summary, persons of ordinary skill in the art will readily appreciate that methods and apparatus for modifying a plurality of markup language files have been provided. The foregoing description has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the exemplary embodiments disclosed. Many modifications and variations are possible in light of the above teachings. It is intended that the scope of the invention be limited not by this detailed description of examples, but rather by the claims appended hereto.