The exemplary embodiment relates to a system and method for encoding and handling self-contained and incremental document history for documents encoded in a markup language such as Extensible Markup Language (XML).
XML is a widely used standard for encoding document information. For example, many word processing programs save documents in an XML format as a way of preserving the content and arrangement of the document. Additionally, XML documents (or XML files) may be passed between distinct software applications as a way of exchanging data in a universal format. These XML documents may change over time through the addition and deletion of information in the document. However, there is currently no universal way of keeping track of these changes within the document in a manner that allows any application or user to determine the history of changes in the document (“history”). Change history of an XML document may be useful for numerous reasons. Among other things, an application or user may wish to view a prior version of the document or merge a version of the document with another version of the same or a different document. Currently, such XML document management requires a content management system or database separate from the XML document itself. Models and operations pertaining to document change history are therefore hidden from view, and there exists no universal and reliable mechanism to allow for making this information manageable across diverse platforms. Additionally, many versioning systems are compatible and optimized only from the implementing application's point of view. Therefore, some basic universal functionality relating to the encoding and management of the incremental change history of an XML document within the XML document itself is of interest as XML documents become more prevalent within the user community.
In one aspect of the exemplary embodiment, a computer-implemented method for processing a markup language document is provided. The method includes receiving a first and second version of a target document into computer memory. Using a computer processor, either the first or second version of the target document is encapsulated within an encapsulating document. The method then encodes a change history corresponding to a difference between the first version and second version of the target document, and encapsulates the change history within the encapsulating document. The encapsulating document is output.
In another aspect, a storage medium containing, in a computer readable form, an encapsulating document is provided. The encapsulating document includes a version of a target document and an encoded change history corresponding to a difference between versions of the target document. The encoded change history includes at least one versioning point which includes a version difference expressed in a change description language.
In yet another aspect, a computer-based system for encoding a target document and its change history within an encapsulating document is provided. The system includes a computer processor and computer memory which stores an encapsulation module. The encapsulation module is configured to receive a first and a second version of a target document in computer memory and, using the computer processor, encapsulate one of the first and second versions of the target document within an encapsulating document. The encapsulation module is further configured to encode a change history corresponding to a difference between the first version and second version of the target document, encapsulate the change history within the encapsulating document, and output the encapsulating document.
Disclosed herein are a system and method for encoding and managing self-contained and incremental document history for markup language documents, which are referred to herein for convenience as XML documents.
The exemplary method and system encapsulate together an XML document (target XML document) and its change history. A target XML document's change history contains descriptions of one or more significant states (versioning points) a document adopted during its existence. This change history is encapsulated as a section within a single standalone XML document along with a version of the target XML document itself. Moreover, the exemplary method and system focus on the coherence of this change history information, which allows for the processing of the encapsulated change history within the XML document and retrieval of prior or subsequent versions. For example, the exemplary method and system include a change history section within the standalone XML document suited to capture and encapsulate document versioning information and to allow for operations that enable basic usage of such encapsulated data. Specifically, the change history section describes a set of transformations between versioning points that allow for navigation within the history of the target XML document and consistent extraction of document versions. The change history section also allows for the creation of new versioning points and branches, as well as the merging of existing version branches. Each of the above operations produces novel and consistent encapsulations of a version of a target XML document and its change history within the encapsulating XML document.
Characteristics of the change history data model include the use of a universal XML data structure and related transformations that allow for the abstraction of the change history from the underlying storage system and execution model. These characteristics favor long term preservation of XML documents, infrastructure and vendor independence, and open the way to interoperable processing of XML versioning information.
The exemplary system and method utilize a particular namespace in order to embed a target XML document (which may itself use any syntax and vocabulary) without any change to the content and tag/attribute set of the target XML document. The document change history is encoded using a specific vocabulary that captures change operations in a formal and universal way such that an XML diff-enabled processor can generate differences between document versions (version differences). An XML diff-enabled processor (“diff engine”) is any application or processor that computes differences between XML documents. The output from a diff engine is generally in the form of a change description language that describes, in a formalistic manner, the differences between any two documents. A “delta” (Δ), as used herein, is the difference between two document versions described in a change description language. Any of the several commercially or publicly available XML diff engines that can be adapted to generate formal and reliable deltas may be used in conjunction with the exemplary method and system.
With reference to
The encapsulating XML document 20 is represented logically as a tree of interconnected nodes:
x-version[x-bodyvi[d],x-history[v0 . . . vi . . . vn-1 ]] (1)
The x-bodyvi node contains a version vi of target XML document d, and the x-history node begins a subtree containing a change history that includes versioning points v0 to vn-1. The full syntax of the encapsulating document 20 may be provided through a RelaxNG Schema (see Appendices A and B for an example), or through other suitable markup language mechanism.
With respect to the directed graph 22a of
The exemplary system may utilize a diff engine that operates to formalize the changes between versions of a target document. The signature for the diff engine can be represented as:
diff(config, d, d′)→Δ (2)
where config is a set of parameters used to configure the diff engine (e.g., filter to apply, commutative/non-commutative mode, algorithm, etc.), d is a first document, and d′ is a second document. In the exemplary embodiment, d and d′ are different versions of the same target XML document. The output of the diff engine (Δ) represents a set of basic operations (δ) described in a change description language. The basic operations δ are selected from a predetermined set of basic operations available to the diff engine. The output has the following properties:
δ denotes a basic operation that modifies an XML document. Examples of δ include the insert, insert-attr, and delete operations shown above, and are discussed in more detail below. A series of basic operations δ with commutative properties are denoted as a “commutative snapshot.” A commutative snapshot is a series of basic operations δ performed on an XML tree (such as a version of the target XML document) that, when performed in any order, consistently produce the same resultant XML tree. As indicated above, Δ can represent a null (empty) set (i.e., no change between the versions), a commutative snapshot, or a sequence of commutative snapshots. It will be appreciated that, in general, not all versions have an empty set as the Δ.
The basic operations δ specify paths, denoted p or pp, to designate the tree location where the modification should be applied. The paths can be described by the following grammar:
p::=pp|pp/@nm
pp::=i/pp|i (5)
where i is a positive integer and nm is an attribute name. The paths are interpreted relative to the root of the encapsulating document (i.e., the x-body element), and are easily translated to XPath expressions. XPath expressions are expressions formed in a query language that are used to select nodes from an XML document. For example, the path p 1/2/1 translates into the XPath expression *[1]/*[2]/*[1], and path p 1/3/@id into *[1]/*[3]/@id. In these examples, the *[n] notation refers to a node level within a document and the @id notation refers to an element attribute. The basic operations listed above all require a path pp as an operand in order to perform their particular function. The insert and insert-attr operations additionally require a tree A as an operand. The tree A may be a single XML node such as a <p> (paragraph) node or a tree containing multiple nodes (such as a <p> element containing multiple <p> children). Specifically, the insert operation receives two parameters, a path pp and tree A, and inserts the tree A at location pp in the target XML document. The insert-attr operation performs in a manner similar to the insert operation, except that the tree A is inserted in the target XML document at the path designated by pp and the attribute value of nm. In order for the target XML document to be considered well-formed, the tree A must be a leaf within the target XML document. The delete operation simply deletes whatever tree is found at path pp.
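As an illustration of the path grammar (5), the translation from a path p to an XPath expression can be sketched in a few lines of Python. This is a minimal sketch only: the function name path_to_xpath and the regular expression used to validate paths are assumptions made for this example and are not part of the change description language itself.

```python
import re

# Grammar (5): p ::= pp | pp/@nm ; pp ::= i/pp | i, with i a positive integer.
# The attribute-name pattern below is a simplification chosen for this sketch.
_PATH_RE = re.compile(r"^[1-9][0-9]*(?:/[1-9][0-9]*)*(?:/@[A-Za-z_][\w.\-]*)?$")


def path_to_xpath(p: str) -> str:
    """Translate a path such as '1/2/1' or '1/3/@id' into an XPath expression."""
    if not _PATH_RE.match(p):
        raise ValueError(f"not a valid path: {p!r}")
    steps = []
    for step in p.split("/"):
        if step.startswith("@"):
            steps.append(step)           # attribute step is kept unchanged
        else:
            steps.append(f"*[{step}]")   # integer i becomes the positional step *[i]
    return "/".join(steps)


print(path_to_xpath("1/2/1"))    # *[1]/*[2]/*[1]
print(path_to_xpath("1/3/@id"))  # *[1]/*[3]/@id
```

Because each integer maps to exactly one positional step, the translation is purely syntactic and does not need access to the document.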
In order to increase the generality of basic operations δ capable of being processed by the exemplary method and system, the following supplemental operations extend the existing basic operations δ through definitions based on the fundamental δ operations described above:
d>{move(pp1, pp2)}>d′ <==>
d>{insert(pp2, get(d, pp1)) delete(pp1)}>d′ (6)
d>{copy(pp1, pp2)}>d′ <==>
d>{insert(pp2, get(d, pp1))}>d′ (7)
d>{replace(pp1, A)}>d′ <==>
d>{insert(pp1, A) delete(pp1⊕§(pp1))}>d′ (8)
Thus, the move operation simply inserts the tree found at path pp1 at path pp2 and deletes the tree at path pp1. The copy operation inserts the tree found at path pp1 at the path designated by pp2, and the replace operation inserts tree A at path pp1 and then deletes the tree previously at path pp1 in original document d.
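The relationship between the supplemental operations and the basic insert and delete operations can be sketched in Python using the standard xml.etree.ElementTree module. The sketch rests on several assumptions that are not taken from the description: paths are interpreted relative to an element passed in as root, the operations are applied sequentially in the order written in definitions (6) to (8), and the helper names (get, insert, delete, next_sibling, copy_tree) are chosen for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def _steps(pp):
    return [int(s) for s in pp.split("/")]


def _parent_and_index(root, pp):
    *parent_steps, i = _steps(pp)
    parent = root
    for s in parent_steps:
        parent = parent[s - 1]          # paths are 1-based, Python is 0-based
    return parent, i - 1


def get(root, pp):
    """Return the subtree located at pure path pp."""
    parent, idx = _parent_and_index(root, pp)
    return parent[idx]


def insert(root, pp, tree):
    """Insert a copy of tree so that it ends up at location pp."""
    parent, idx = _parent_and_index(root, pp)
    parent.insert(idx, copy.deepcopy(tree))


def delete(root, pp):
    """Delete whatever subtree is found at pp."""
    parent, idx = _parent_and_index(root, pp)
    del parent[idx]


def next_sibling(pp):
    """pp ⊕ §(pp): the same path with its last index incremented by one."""
    *head, i = _steps(pp)
    return "/".join(str(s) for s in head + [i + 1])


def move(root, pp1, pp2):               # definition (6)
    insert(root, pp2, get(root, pp1))
    delete(root, pp1)


def copy_tree(root, pp1, pp2):          # definition (7)
    insert(root, pp2, get(root, pp1))


def replace(root, pp1, tree):           # definition (8)
    insert(root, pp1, tree)
    delete(root, next_sibling(pp1))     # the old subtree was shifted one position


doc = ET.fromstring("<doc><a><x/></a><b/></doc>")
move(doc, "1/1", "2/1")                 # <x/> now lives under <b>
replace(doc, "1", ET.Element("c"))      # <a> is replaced by <c>
print([child.tag for child in doc])     # ['c', 'b']
```

In the replace case the deletion targets pp1⊕§(pp1) rather than pp1 because, once A has been inserted at pp1, the subtree that previously occupied pp1 sits one position further along among its siblings.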
The encapsulating document 20 shown in
Versioning points 58, 60, 62 within an encapsulating document 20 are represented through the use of a dedicated XML element (e.g. <version>) associated with an id attribute that uniquely identifies the version. Each version element may contain from zero to multiple delta elements. In
The delta elements 54, 56 capture the transition from the focused versioning point 27 to another versioning point. This information is conveyed by an attribute (e.g., “fwd” for forward links from an earlier version to a later version and “bwd” for backward links) contained in the version element 58, 60, 62. A forward link links a prior version to a later version, and a backward link links a later version to a prior version. As noted above, each Δ contains at least one non order-significant sequence of δ operations, or “snapshot”. Sequences of delta elements with the same orientation correspond to the Δ1; . . . ; Δk syntactic form.
Each basic operation δ is described using a dedicated element name according to its semantics (insert, insert-attr, delete, move, copy, etc.). The path information may be concisely encoded, e.g., through an “ipath” attribute in the delta element. Copies of subtrees may be expressed through a “copy” attribute attached to the delta element. In that case, the copy attribute is an ipath with respect to the focused versioning point.
Since basic operations δ all require at least a path operand, basic operations δ may be written as δp to express that a basic operation δ is realized on a path p. Each basic operation δ belonging to a snapshot complies with a structural constraint that ensures orthogonality such that the snapshot is indeed commutative. In order to ensure orthogonality, it is assumed that both paths in every pair of δp do not designate sibling trees, and that one path does not designate a sibling tree of the parent node designated by the other path. This assumption is in place to avoid conflicting δp's.
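The commutativity that the orthogonality constraint is meant to guarantee can be checked by brute force on small examples. The following Python sketch is not the structural test described above; it simply applies a candidate snapshot in every possible order and compares the results. Trees are modeled as nested lists of the form [tag, child, child, ...] and operations as tuples, both of which are assumptions made for this illustration.

```python
import copy
from itertools import permutations


def apply(tree, ops):
    """Apply insert/delete operations in the given order to a copy of tree."""
    out = copy.deepcopy(tree)
    for kind, pp, *rest in ops:
        *parent_steps, i = [int(s) for s in pp.split("/")]
        parent = out
        for s in parent_steps:
            parent = parent[s]          # index 0 holds the tag, children start at 1
        if kind == "insert":
            parent.insert(i, copy.deepcopy(rest[0]))
        elif kind == "delete":
            del parent[i]
        else:
            raise ValueError(f"unknown operation: {kind}")
    return out


def is_commutative(tree, ops):
    """True if every ordering of ops yields the same resulting tree."""
    results = []
    for order in permutations(ops):
        try:
            results.append(apply(tree, order))
        except IndexError:              # some ordering addresses a missing node
            return False
    return all(r == results[0] for r in results)


doc = ["doc", ["a"], ["b", ["y"]]]

# Operations on unrelated subtrees: the order does not matter.
print(is_commutative(doc, [("insert", "1/1", ["x"]), ("delete", "2/1")]))  # True

# Operations on the same sibling position: the order changes the outcome.
print(is_commutative(doc, [("insert", "1", ["x"]), ("delete", "1")]))      # False
```

The exhaustive check grows factorially with the number of operations, so it is only useful for illustrating or testing small snapshots.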
Version differences Δ can be encoded in a document change history in multiple ways. The version differences Δ can be encoded with respect to the focused version of the target XML document, or they can be encoded in a “linear” mode. A change history with version differences Δ encoded in linear mode chronologically describes each incremental change between versions starting at the earliest version and ending with the latest version. In a linear mode encoding, all the delta elements only have “fwd” encodings and no “bwd” encodings since each delta is encoded with respect to the next version chronologically. For example, the delta change elements 54 and 56 of
To help remedy this redundancy,
The output from the diff engine describes the changes that occur between versions within an XML tree. The changes are described in a change description language such that the changes can be performed on the original XML document to produce the changed XML document. The analytical transformation of document d into a changed document d′ is noted by applying Δ as follows:
d>Δ>d′ (9)
In other words, the description (9) above is a logical assertion saying that a well-formed document d is changed into a well-formed document d′ after the application of the well-formed Δ operation. A Δ operation is well-formed if it adheres to the properties listed below.
Formally, for any subtree A, path p, document d and d′, version differences Δi and δi, the document transformations can have the following abstract properties:
The definitions of the (a-ins) and (a-ins-@) properties make use of function get, which extracts the subtree rooted at a given location p. The invar property used in (a-ins), (a-ins-@) and (a-del) expresses that the subtree defined by the operand path pp is not modified by the Δ operation performed on original document d. Mathematically, the invar property is defined as:
The ⊕ and § functions used in the properties above are inductively defined. Specifically, the path addition function ⊕ is defined over pure paths (noted pp), is commutative, and operates on paths of any depth. In other words, the path addition function ⊕ combines two given paths into a single path for evaluation. Mathematically, the path addition function ⊕ is defined as follows:
i/pp⊕j/pp′=(i+j)/(pp⊕pp′) (10)
i/pp⊕j=(i+j)/pp (11)
i⊕j/pp′=(i+j)/pp′ (12)
i⊕j=(i+j) (13)
The fingerprint extraction function § calculates the depth level of a given path. Mathematically, the fingerprint extraction function § is defined as follows:
§(i/pp)=0/§(pp) (14)
§(i)=1 (15)
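Definitions (10) to (15) can be transcribed directly into Python. In this sketch, paths are represented as strings such as "1/2/3", and the function names path_add and fingerprint are assumptions chosen for readability.

```python
from itertools import zip_longest


def _parse(pp):
    return [int(s) for s in pp.split("/")]


def path_add(pp1, pp2):
    """Path addition ⊕, definitions (10)-(13): add indexes level by level.

    When one path is longer, its remaining steps are kept unchanged, which
    is what (11) and (12) state.
    """
    pairs = zip_longest(_parse(pp1), _parse(pp2), fillvalue=0)
    return "/".join(str(i + j) for i, j in pairs)


def fingerprint(pp):
    """Fingerprint §, definitions (14)-(15): 0 at every level except the last."""
    depth = len(_parse(pp))
    return "/".join(["0"] * (depth - 1) + ["1"])


print(fingerprint("1/2/3"))                      # 0/0/1
print(path_add("1/2/3", fingerprint("1/2/3")))   # 1/2/4
print(path_add("1/2", "3"), path_add("3", "1/2"))  # 4/2 4/2
```

Adding a path to its own fingerprint therefore yields the path of the next sibling at the same depth, which is how the expression pp1⊕§(pp1) is used in the replace operation.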
The (a-seq) property states that there exists an intermediate document d″ such that when Δ1 is applied to original document d, d″ is produced, and when Δ2 is applied to the intermediate document d″, then the changed document d′ is produced.
The (a-snap) property states that when a snapshot (i.e., a series of basic operations δ) is applied to an original document d, then changed document d′ is produced, no matter what order the basic operations δ were performed. This supports the notion of a commutative property. In other words, all deltas Δ (which may comprise one or more snapshots) are pairwise orthogonal.
The (a-void) property states that if a version difference Δ containing no changes is applied to original document d, then the changed document d′ will be the same as original document d.
The (a-ins) property states that after an insert operation is performed on original document d, (by inserting tree A at path pp), then the changed document d′ will contain the new tree A at path pp. The (a-ins-@) property is the same as the (a-ins) property, except that it applies when the insert-attr operation inserts tree A in original document d at the path with the attribute value denoted by operand pp/@nm.
The (a-del) property states that after the delete operation is performed on original document d at path pp, the tree previously at path pp in d will no longer exist in changed document d′.
The (a-del-@) property states that after a delete operation is performed on original document d (by deleting the tree at path pp), the get function will return a null value for the changed document d′ at path pp/@nm.
An inverted delta (Δ) describes the changes that will restore an operand (i.e., a target XML document) to a prior state before a change was made. Inverted version differences Δ's are used to increase operational optimization in the exemplary method and system. The inversion of a version difference Δ requires knowing the original operand on which changes will be applied.
The inversion function can be inductively defined by the following:
Delta inversion is characterized by the following property:
d>Δ>d′===>d′> invert(d, Δ)>d (22)
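Property (22) can be illustrated with a small Python sketch. The inversion rules used here are assumptions consistent with the surrounding description, restricted to the insert and delete operations: the inverse of insert(pp, A) is delete(pp), and the inverse of delete(pp) is insert(pp, get(d, pp)), which is why the original operand d is needed. Trees are modeled as nested lists [tag, child, ...] and a delta as a list of operations applied in order; these representations are chosen for illustration only.

```python
import copy


def _walk(tree, steps):
    node = tree
    for i in steps:
        node = node[i]                  # index 0 holds the tag, children start at 1
    return node


def get(tree, pp):
    """Return the subtree found at path pp."""
    return _walk(tree, [int(s) for s in pp.split("/")])


def apply(tree, delta):
    """Apply the operations of delta in order and return the new tree."""
    out = copy.deepcopy(tree)
    for op in delta:
        *parent_steps, i = [int(s) for s in op[1].split("/")]
        parent = _walk(out, parent_steps)
        if op[0] == "insert":
            parent.insert(i, copy.deepcopy(op[2]))
        elif op[0] == "delete":
            del parent[i]
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return out


def invert(tree, delta):
    """Return a delta that undoes `delta`, given the tree it was applied to."""
    inverse = []
    current = tree
    for op in delta:
        if op[0] == "insert":
            inverse.append(("delete", op[1]))
        elif op[0] == "delete":
            # The deleted subtree must be read from the operand before deletion.
            inverse.append(("insert", op[1], copy.deepcopy(get(current, op[1]))))
        else:
            raise ValueError(f"unknown operation: {op[0]}")
        current = apply(current, [op])
    inverse.reverse()                   # undo the last change first
    return inverse


d = ["doc", ["a"], ["b", ["x"]]]
delta = [("insert", "1", ["p"]), ("delete", "3/1")]
d_prime = apply(d, delta)
print(apply(d_prime, invert(d, delta)) == d)   # True
```

The sketch also makes visible why an inverted delta can be more compact than the original: an inserted subtree A is replaced by a bare delete(pp), so A no longer needs to be stored in the history.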
The inversion of version differences Δ provides useful functionality which allows for optimal navigation within an XML tree. Moreover, it allows for a more compact representation of changes, especially when successive versions represent documents which contain only incremental changes, which is common in a standard document life-cycle.
Indeed, in such cases, subgraphs of the form:
vi→insert(p,A) vj→insert(p′,B) vk (23)
can be rewritten using delta inversion as:
vi←delete(p) vj←delete(p′) vk (24)
In other words, description (23) describes changing from version i to version j to version k using the insert operations. Description (24) uses an inverted Δ to illustrate going from version k to version j to version i by working backwards from version k. This allows the system to move efficiently between versions without having to recalculate a very long list of changes. Thus, one benefit of using inverted Δ's is that subtrees A and B are not redundantly stored (once inside the history and once inside the current instance itself). In case the focus is set to a non-terminal versioning point (e.g., version j in the example above), deltas and inverted deltas may be combined to form:
vi←delete(p) vj→insert(p′,B) vk
which is still quite meaningful as the subtree B is only stored once inside the delta (indeed, the target XML document is conformant to vj and does not comprise the B subtree).
Note that inversion of the operands of a diff operation also leads to reversed deltas (or a delta operationally equivalent to a reversed delta):
diff(c, d, d′)=Δ===>diff(c, d′, d)=invert(d, Δ)
The encapsulation of change history data for a target XML document within an encapsulating XML document allows for multiple management operations to be performed. The operations performed by the exemplary method and system allow for the creation of an encapsulating document, the versioning of a modified target XML document, the merging of two separate versions of a target XML document, the focusing of a target XML document on a specified version, and the extraction of a target XML document from an encapsulating document.
The system 100 includes data memory 104 for storing the target XML document 2 and the version number 106 while the document 2 is being processed. Main memory 108 of the system 100 stores an encapsulation module 110, versioning module 112, merging module 114, focusing module 116, extraction module 118, and diff engine 120. Outputs from modules 110, 112, 114, 116, 118, 120 may be stored in memories 104, 108 or output via an output device 124 to a client terminal 140, optionally through a network 132 such as the internet.
The encapsulation module 110 receives as input the target XML document 2 via the input device 102, and encapsulates the target XML document 2 within a newly created encapsulating document 20. The encapsulation module 110 then outputs the encapsulating document 20 via output device 124, or stores it in memory 104. The versioning module 112 receives as input an encapsulating document 20 or 20′ and a modified target XML document 2′ (which may have been created by a user through modifications to target XML document 2). The versioning module 112 creates a versioning point within the input encapsulating document 20 with the assistance of the diff engine 120. The versioning module 112 then outputs the modified encapsulating document 20 via output device 124, or stores it in memory 104. The merging module 114 receives as input an encapsulating document 20 (or 20′) and a version number or other user-selected identifier 106. The merging module 114 merges the target XML document within the encapsulating document 20 with a versioning point identified by the input version number 106. The merging module 114 then encapsulates the merged target XML document within an encapsulating document 20 and outputs it. The focusing module 116 receives as input an encapsulating document 20 (or 20′) and a selected version number 106. The focusing module 116 then re-encodes the focused encapsulating document 20 such that the target XML document and history contained within the encapsulating document reflect the version state indicated by the input version number 106. The focusing module 116 then outputs the re-encoded encapsulating document 20. The extraction module 118 receives as input an encapsulating document 20 (or 20′) and extracts the encapsulated target XML document 2 from within the encapsulating document 20. The extraction module 118 then outputs the extracted target XML document 126.
The encapsulation module 110, versioning module 112, merging module 114, focusing module 116, extraction module 118, and diff engine 120 may be implemented as hardware, software, or a combination thereof. In the exemplary embodiment, the components 110, 112, 114, 116, 118, 120 comprise software instructions stored in main memory 108, which are executed by a computer processor 128. The processor 128, such as the computer's CPU, may control the overall operation of the computer system 100 by execution of processing instructions stored in memory 108. Components 102, 104, 108, 110, 112, 114, 116, 118, 120, 124, 128 may be connected by a data control bus 130. As will be appreciated, system 100 may include fewer components. For example, the merging and focusing modules 114, 116 may be omitted.
As will be appreciated, the document history encapsulation and management system 100 may comprise one or more computing devices, such as a personal computer, PDA, laptop computer, server computer, or combination thereof. Memories 104, 108 may be integral or separate and may represent any type of computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, the memories 104, 108 comprise a combination of random access memory and read only memory. In some embodiments, the processor 128 and memory 104 and/or 108 may be combined in a single chip.
With reference to
As a general overview, the encapsulation operation can be represented by the following signature:
create-history(d)→x-version[x-bodyv0[d],x-history[v0]]
where d is a target XML document 2. The signature reflects that target XML document d is encapsulated inside the <x-body> element (as shown in
The method begins at step S2. At step S4, a version of target XML document 2 is received into data memory 104.
At step S6, an initial version number is assigned to the target XML document 2, as shown by 26 (
At step S8, an initial change history element 22 (
At step S10, the original version of the target XML document 2 and initial change history 22 (
At step S12, the encapsulating document 20 is output to memory 104, or to another output device such as client terminal 140 via the output device 124.
The method ends at step S14.
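The encapsulation steps S4 to S12 can be sketched in Python with the standard xml.etree.ElementTree module. The element names x-version, x-body and x-history follow expression (1), and the version element with its id attribute follows the description of versioning points. The namespace URI, the version attribute placed on x-body to record the focused version, and the function name create_history are hypothetical choices made for this sketch; the actual syntax is governed by the schema of the encapsulating document.

```python
import copy
import xml.etree.ElementTree as ET

NS = "urn:example:x-version"   # hypothetical namespace URI, for illustration only


def create_history(target_root, initial_id="v0"):
    """Wrap a target document and an initial, empty change history.

    Returns the root element of a new encapsulating document of the form
    x-version[x-body[d], x-history[v0]].
    """
    wrapper = ET.Element(f"{{{NS}}}x-version")

    # The target document is embedded unchanged inside x-body. The "version"
    # attribute recording the focused version is an assumption of this sketch.
    body = ET.SubElement(wrapper, f"{{{NS}}}x-body", {"version": initial_id})
    body.append(copy.deepcopy(target_root))

    # The initial history holds a single versioning point and no deltas.
    history = ET.SubElement(wrapper, f"{{{NS}}}x-history")
    ET.SubElement(history, f"{{{NS}}}version", {"id": initial_id})
    return wrapper


target = ET.fromstring("<article><p>Hello</p></article>")
encapsulating = create_history(target)
print(ET.tostring(encapsulating, encoding="unicode"))
```

Placing the wrapper elements in their own namespace keeps them distinct from whatever vocabulary the target document uses, so the target's own tags and attributes are left untouched.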
With reference to
With reference to
As a general overview,
create-version(x-version[x-bodyvi[d], x-history[vi]], d′)→x-version[x-bodyvj[d], x-history[vi→Δvj]] with diff(c, d, d′)=Δ (28)
where vi is the version 62 of the input target XML document 2, vj is the new versioning point 27, c is a set of parameters used to configure the diff engine, d is the original target XML document 2, and d′ is a modified target XML document 2′ (i.e., a new version of the original document 2).
With reference to
At step S24, the diff engine 120 computes basic operations δ between the original version (e.g., v0) of the target XML document 2 within the encapsulating document 20 and the modified version (e.g., v1) of the target XML document 2′. In the exemplary method, the previous version number 62 is increased by one (i.e., from v0 to v1) to create the new version number 26.
At step S30, the calculated basic operations δ for the new target XML document are encoded into the history 22 of the modified encapsulating document 20 with respect to the new version number 26.
At step S32, the new modified encapsulating document 20 is output to memory 104, or to another output device such as client terminal 140 via the output device 124.
The method ends at S34.
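The bookkeeping performed by the versioning operation can be sketched with a small in-memory model in Python. Everything in this sketch is an assumption made for illustration: the Encapsulation dataclass, the representation of the history as a dictionary of links, the choice to store the modified document as the new focused body, and the coarse_diff stand-in, which replaces the whole document and is in no way a real diff engine.

```python
from dataclasses import dataclass, field


@dataclass
class Encapsulation:
    body: object                 # focused version of the target document
    focus: str                   # identifier of the focused versioning point
    # version id -> list of (link kind, other version id, delta)
    history: dict = field(default_factory=dict)


def create_version(enc, modified, diff, config=None):
    """Add a versioning point for `modified` and return a new Encapsulation."""
    delta = diff(config, enc.body, modified)        # diff(c, d, d') = delta
    old_id = enc.focus
    new_id = f"v{int(old_id[1:]) + 1}"              # e.g. v0 -> v1
    history = {vid: list(links) for vid, links in enc.history.items()}
    history.setdefault(old_id, []).append(("fwd", new_id, delta))
    history.setdefault(new_id, [])
    return Encapsulation(body=modified, focus=new_id, history=history)


def coarse_diff(config, d, d_prime):
    """Stand-in diff engine: no change, or replace the whole document."""
    if d == d_prime:
        return []
    return [("delete", "1"), ("insert", "1", d_prime)]


enc0 = Encapsulation(body=["doc", ["p", "a"]], focus="v0", history={"v0": []})
enc1 = create_version(enc0, ["doc", ["p", "b"]], coarse_diff)
print(enc1.focus)                    # v1
print(enc1.history["v0"][0][:2])     # ('fwd', 'v1')
```

Incrementing the previous version number is sufficient for a linear history; a history with branches would need a scheme that guarantees unique identifiers.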
With reference to
As a general overview,
merge(x-version[x-bodyvi[d], x-history[vi]], vj)→x-version[x-bodyvk[d′], x-history[vi→Δivk, vj→Δjvk]] with dvi>Δi>d′ and dvj>Δj>d′ (29)
where d is the target XML document 2, vi is the focused version 2′ of the target XML document, vj is the input version number 106 to merge with version vi, d′ is the merged target XML document, and vk is the new merged version 60 of the target XML document 2′. Basically, the merge creates a new version vk with a set of deltas (Δi and Δj) such that if Δi is applied to the target XML document focused on version vi and Δj is applied to version vj of the target XML document, then the merged document version vk is produced in both cases.
With reference to
At step S44, the merging module 114 extracts the focused version 62 (
At step S46, the merging module 114 focuses the encapsulating document 20 onto the version corresponding to input version number 106. This step provides for easier extraction of the version of the target document 2 corresponding to the input version number 106. As explained below, once the newly focused version (corresponding to the input version number 106) is extracted, it is easily merged with the previously extracted version 62 of the target XML document 2. The merging module 114 may utilize the focusing module 116 for this step.
At step S48, the merging module 114 extracts the focused version of target XML document 2 (which is focused on the input version number 106) from the encapsulating document 20 and stores this version of target XML document 2 in memory. The merging module 114 may utilize the extraction module 118 to perform this step.
At step S50, the merging module 114 creates a new version of target XML document 2 by merging the two extracted target XML document versions (one version corresponding to the input version number 106 and the other version corresponding to the previously focused version number 62) into a new target XML document 2. Any merging algorithm which is capable of merging XML document trees may be used. The merging module 114 also creates a new version number 26 that will be used for the versioning point in the change history 60.
At step S52, the merging module 114 replaces the target XML document 2 in the encapsulating document 20 with the merged target XML document 2′, to form a new encapsulating document 20.
At step S54, the merging module 114 re-encodes the versioning points in the encapsulating document 20 change history 60 relative to the new version number 26 created in step S50. As discussed above, the previous versioning points 62, 106 include new deltas which will allow transformation from the previous versioning points 62, 106 to the new versioning point 26. Optionally, the new versioning point 26 may contain deltas to allow for transformation backwards to the previous versioning points 62, 106.
At step S56, the newly modified encapsulating document 20 is output to memory 104, or to another output device such as client terminal 140 via the output device 124.
The method ends at step S58.
With reference to
As a general overview,
focus(x-version[x-bodyvi[d], x-history[vi]], vj)→x-version[x-bodyvj[d′], x-history[vj]] with vi→Δi . . . →Δjvj and d>Δi; . . . ; Δj>d′ (30)
where d is the target XML document 2, vi is the originally focused version 62 of the target XML document 2, vj is the input version number 106 on which to focus, and d′ is the focused target document corresponding to focused version vj. Specifically, the focus operation finds and applies the set of deltas (Δi to Δj) that will transform the target XML document 2 from version vi to version vj. Note that the set of deltas may contain deltas of the form vn←Δn vm which requires the computation of one or more inverse deltas.
With reference to
At step S74, the focusing module 116 applies the appropriate deltas contained within the document change history 22 to the target XML document 2 such that the target XML document 2 is in the state corresponding to the input version number 106. For example, with reference to
At step S78, the focusing module 116 sets the focused version point (i.e., the version link) 26 to the input version number 106 so that the focused target XML document 2 is directly tied to a versioning point in the change history 22.
At step S80, the focusing module 116 outputs the encapsulating document 20 containing the focused target XML document 2 and re-encoded change history 60 to memory 104, or to another output device such as client terminal 140 via the output device 124.
The method ends at step S82.
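The core of the focusing operation, finding and applying a chain of deltas from the focused version to the requested version, can be sketched in Python as follows. The sketch is deliberately partial and rests on stated assumptions: the history is modeled as a dictionary mapping (source, destination) pairs to deltas, trees are nested lists [tag, child, ...], only stored deltas are followed in their stored direction, and neither the computation of inverse deltas nor the re-encoding of the history relative to the new focus is shown.

```python
import copy
from collections import deque


def apply(tree, delta):
    """Apply insert/delete operations in order and return the new tree."""
    out = copy.deepcopy(tree)
    for op in delta:
        *parent_steps, i = [int(s) for s in op[1].split("/")]
        parent = out
        for s in parent_steps:
            parent = parent[s]          # index 0 holds the tag
        if op[0] == "insert":
            parent.insert(i, copy.deepcopy(op[2]))
        elif op[0] == "delete":
            del parent[i]
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return out


def find_chain(edges, start, goal):
    """Breadth-first search for the deltas leading from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        vid, chain = queue.popleft()
        if vid == goal:
            return chain
        for (src, dst), delta in edges.items():
            if src == vid and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [delta]))
    raise KeyError(f"no chain of deltas from {start} to {goal}")


def focus(body, focused, edges, target):
    """Return the document at `target` together with the new focus id."""
    for delta in find_chain(edges, focused, target):
        body = apply(body, delta)
    return body, target


edges = {
    ("v0", "v1"): [("insert", "2", ["p", "two"])],
    ("v1", "v2"): [("delete", "1")],
}
doc_v0 = ["doc", ["p", "one"]]
print(focus(doc_v0, "v0", edges, "v2"))   # (['doc', ['p', 'two']], 'v2')
```

A breadth-first search returns a shortest chain in terms of the number of versioning points crossed, which is one reasonable choice when several routes exist through a branched history.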
Method for Extracting a Target XML Document from within an Encapsulating Document (Step E)
With reference to
As a general overview,
extract(x-version[x-bodyvi[d], x-history[vi]])→d (31)
where “x-version[x-bodyvi[d], x-history[vi]]” is the encapsulating document 20, d is the target XML document, and vi is the currently focused version 62 of the encapsulating document 20.
With respect to
At step S94, the extraction module 118 isolates and extracts the target XML document 2 from the encapsulating document 20. This can be performed by any XML parser capable of extracting nodes and/or subtrees from an XML document.
At step S96, the extraction module 118 outputs the extracted target XML document 2 to memory 104, or to another output device such as client terminal 140 via the output device 124 (if not already stored in memory 104, 108).
The method ends at step S98.
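Since extraction only requires isolating the subtree under the body element, it can be sketched with any standard XML parser. The Python example below uses xml.etree.ElementTree; the namespace URI, the xv prefix, the version attribute, and the sample document are hypothetical and serve only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

NS = {"xv": "urn:example:x-version"}   # hypothetical namespace, for illustration


def extract(encapsulating_xml: str) -> ET.Element:
    """Return the target document embedded in the x-body element."""
    root = ET.fromstring(encapsulating_xml)
    body = root.find("xv:x-body", NS)
    if body is None or len(body) != 1:
        raise ValueError("expected exactly one target document inside x-body")
    return body[0]


sample = """
<xv:x-version xmlns:xv="urn:example:x-version">
  <xv:x-body version="v0"><article><p>Hello</p></article></xv:x-body>
  <xv:x-history><xv:version id="v0"/></xv:x-history>
</xv:x-version>
"""
target = extract(sample)
print(target.tag, target[0].text)   # article Hello
```

The history section is simply ignored during extraction, so the operation does not depend on how the deltas are encoded.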
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.