The following description includes several examples and/or embodiments of computer-driven systems and/or methods for carrying out automated information storage, processing and/or access. In particular, the examples and embodiments are focused on systems and/or methods oriented specifically for use with the U.S. National Archives and Records Administration (NARA). However, it will be recognized that, while one or more portions of the present specification may be limited in application to NARA's specific requirements, most if not all of the described systems and/or methods have broader application. For example, the implementations described for storage, processing, and/or access to information (also sometimes referred to as ingest, storage, and dissemination) can also apply to any institution that requires and/or desires automated archiving and/or preservation of its information, e.g., documents, email, corporate IP/knowledge, etc. The term “institution” includes at least government agencies or entities, private companies, publicly traded corporations, universities and colleges, charitable or non-profit organizations, etc. Moreover, the term “electronic records archive” (ERA) is intended to encompass a storage, processing, and/or access archive for any institution, regardless of nature or size.
As one example, NARA's continuing fulfillment of its mission in the area of electronic records presents new challenges and opportunities, and the embodiments described herein that relate to the ERA and/or asset catalog may help NARA fulfill its broadly defined mission. The underlying risk associated with failing to meet these challenges or realizing these opportunities is the loss of evidence that is essential to sustaining a government's or an institution's needs.
At Ingest—the ERA needs to identify and capture all components of the record that are necessary for effective storage and dissemination (e.g., content, context, structure, and presentation). This can be especially challenging for records with dynamic content (e.g., websites or databases).
Archival Storage—Recognizing that in the electronic realm the logical record is independent of its media, the four illustrative attributes of the record (e.g., content, context, structure, and presentation) and their associated metadata, still must be preserved “for the life of the Republic.”
Access—NARA will not fulfill its mission simply by storing electronic records of archival value. Through the ERA, these records will be used by researchers long after the associated application software, operating system, and hardware all have become obsolete. The ERA also may apply and enforce access restrictions to sensitive information while at the same time ensuring that the public interest is served by consistently removing access restrictions that are no longer required by statute or regulation.
Data Management—The amount of data that needs to be managed in the ERA can be monumental, especially in the context of government agencies like NARA. Presented herewith are embodiments that are truly scalable solutions that can address a range of needs—from a small focused Instance through large Instances. In such embodiments, the system can be scaled easily so that capacity in both storage and processing power is added when required, and not so soon that large excess capacities exist. This will allow for the system to be scaled to meet demand and provide for maximum flexibility in cost and performance to the institution (e.g., NARA).
Satisfactorily maintaining authenticity through technology-based transformation and re-representation of records is extremely challenging over time. While there has been significant research about migration of electronic records and the use of persistent formats, there has been no previous attempt to create an ERA solution on the scale required by some institutions such as NARA.
Migrations are potentially lossy transformations, so techniques are needed to detect and measure any actual loss. The system may reduce the likelihood of such loss by applying statistical sampling, based on human judgment for example, backed up with appropriate software tools, and/or institutionalized in a semi-automatic monitoring process.
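The sampling approach described above can be sketched as follows. This is a minimal illustration only: the function names, the fixed seed, and the `has_loss` callback (which stands in for a human reviewer or a comparison tool) are assumptions introduced for the sketch and are not part of the described system.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (original file, migrated file); hypothetical identifiers


def sample_for_review(pairs: Sequence[Pair], sample_size: int, seed: int = 0) -> List[Pair]:
    """Draw a simple random sample of migrated files for loss review."""
    rng = random.Random(seed)  # fixed seed so a review batch can be reproduced
    k = min(sample_size, len(pairs))
    return rng.sample(list(pairs), k)


def estimate_loss_rate(sample: Sequence[Pair], has_loss: Callable[[str, str], bool]) -> float:
    """Fraction of sampled pairs flagged as having lost information."""
    if not sample:
        return 0.0
    flagged = sum(1 for original, migrated in sample if has_loss(original, migrated))
    return flagged / len(sample)
```

A monitoring process could compare the estimated rate against a threshold chosen by the institution and route the flagged pairs to an archivist for judgment.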
Table 1 summarizes the “lessons learned” by the Applicants from experience with migrating different types of records to a Persistent Object Format (POF).
It is currently not possible to migrate a number of file formats in a way that will be acceptable for archival purposes. One aspect is to encourage the evolution and enhancement of third-party migration software products by providing a framework into which such commercial off-the-shelf (COTS) software products could become part of the ERA if they meet appropriate tests.
When an appropriate POF cannot be identified to reduce the chances of obsolescence, the format may need to be migrated to a non-permanent but more modern, proprietary format (this is known as Enhanced Preservation). Even POFs are not static, since they still need executable software to interpret them, and future POFs may need to be created that have less feature loss than an older format. Thus, the ERA may allow migrated files to be migrated again into a new and more robust format in the future. Through the Dutch Testbed Project, the Applicants have found that it is normally better to return to the original file(s) whenever such a re-migration occurs. Thus, when updating a record, certain example embodiments may revert to an original version of the document and migrate it to a POF accordingly, whereas certain other example embodiments may not be able to migrate the original document (e.g., because it is unavailable, in an unsupported format, etc.) and thus may be able to instead or in addition migrate the already-migrated file. Thus, in certain example embodiments, a new version of a record may be derived from an original version of the record if it is available or, if the original is not available, the new version may be derived from any other already existing derivative version (e.g., of the original). As such, an extensible POF for certain example embodiments may be provided.
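The source-selection rule described above (prefer the original, otherwise fall back to an existing derivative) can be sketched as follows. The `RecordVersion` class, its fields, and the `supported_formats` parameter are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RecordVersion:
    version_id: str
    file_format: str
    is_original: bool
    available: bool = True


def choose_migration_source(
    versions: List[RecordVersion], supported_formats: Set[str]
) -> Optional[RecordVersion]:
    """Pick the version to migrate from: the original if usable, else a derivative."""

    def usable(version: RecordVersion) -> bool:
        return version.available and version.file_format in supported_formats

    # First choice: the original version of the record.
    for version in versions:
        if version.is_original and usable(version):
            return version
    # Fallback: any already existing derivative version.
    for version in versions:
        if not version.is_original and usable(version):
            return version
    return None  # nothing can be migrated
```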
In view of the above aspects of the OAIS Reference Model, the ERA may comprise an ingest module to accept a file and/or a record, a storage module to associate the file or record with information and/or instructions for disposition, and an access or dissemination module to allow selected access to the file or record. The ingest module may include structure and/or a program to create a template to capture content, context, structure, and/or presentation of the record or file. The storage module may include structure and/or a program to preserve authenticity of the file or record over time, and/or to preserve the physical access to the record or file over time. The access module may include structure or a program to provide a user with ability to view/render the record or file over time, to control access to restricted records, to redact restricted or classified records, and/or to provide access to an increasing number of users anywhere at any time.
During the “Identify” stage, producers and archivists develop a Disposition Agreement to cover records. This Disposition Agreement contains disposition instructions, and also a related Preservation and Service Plan. Producers submit records to the ERA System in a SIP. The transfer occurs under a pre-defined Disposition Agreement and Transfer Agreement. The ERA System validates the transferred SIP by scanning for viruses, ensuring the security access restrictions are appropriate, and checking the records against templates. The ERA System informs the Producer of any potential problems, and extracts metadata (including descriptive data, described in greater detail below), creates an Archival Information Package (or AIP, also described in greater detail below), and places the AIP into Archival Storage. At any time after the AIP has been placed into Archival Storage, archivists may perform Archival Processing, which includes developing arrangement, description, finding aids, and other metadata. These tasks will be assigned to archivists based on relevant policies, business rules, and management discretion. Archival processing supplements the Preservation Description Information metadata in the archives.
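The transfer and validation steps described above can be sketched as a small pipeline. This is an illustration under stated assumptions: the `SIP` and `AIP` classes, the `Check` callable type, and the template check are hypothetical, and the sketch stops ingest when any problem is reported, which the text does not specify. Virus scanning and access-restriction checks would be supplied as further `Check` callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SIP:
    producer: str
    agreement_id: str  # the governing Disposition/Transfer Agreement (hypothetical field)
    records: List[Dict[str, str]]


@dataclass
class AIP:
    agreement_id: str
    records: List[Dict[str, str]]
    metadata: Dict[str, str] = field(default_factory=dict)


Check = Callable[[SIP], List[str]]  # a check returns a list of problem descriptions


def check_against_template(required: List[str]) -> Check:
    """Build a check that every record carries the attributes a template expects."""

    def check(sip: SIP) -> List[str]:
        problems = []
        for index, record in enumerate(sip.records):
            for name in required:
                if not record.get(name):
                    problems.append(f"record {index}: missing {name}")
        return problems

    return check


def ingest(sip: SIP, checks: List[Check], storage: List[AIP]) -> List[str]:
    """Validate a SIP; on success extract metadata, build an AIP, and store it."""
    problems: List[str] = []
    for check in checks:
        problems.extend(check(sip))
    if problems:
        return problems  # reported back to the Producer (assumed to block ingest)
    metadata = {"producer": sip.producer, "record_count": str(len(sip.records))}
    storage.append(AIP(sip.agreement_id, sip.records, metadata))
    return []
```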
At any time after the AIP has been placed into Archival Storage, archivists may perform Preservation Processing, which includes transforming the records to authentically preserve them. Policies, business rules, Preservation and Service Plans, and management discretion will drive these tasks. Preservation processing supplements the Preservation Description Information metadata in the archives, and produces new (transformed) record versions.
With respect to the “Make Available” phase, at any time after the AIP has been placed into Archival Storage, archivists may perform Access Review and Redaction, which includes performing mediated searches, verifying the classification of records, and coordinating redaction of records where necessary. These tasks will be driven by policies, business rules, and access requests. Access Review and Redaction supplements the Preservation Description Information metadata in the archives, and produces new (redacted) record versions. Also, at any time after the AIP has been placed into Archival Storage, Consumers may search the archives to find records of interest.
The Preservation system-level package includes the services necessary to manage the preservation of the electronic records to ensure their continued existence, accessibility, and authenticity over time. The Preservation system-level service also provides the management functionality for preservation assessments, Preservation and Service Level plans, authenticity assessment and digital adaptation of electronic records. The Archival Storage system-level package includes the functionality to abstract the details of mass storage from the rest of the system. This abstraction allows this service to be appropriately scaled and allows new technology to be introduced independent of the other system-level services according to business requirements. The Dissemination system-level package includes the functionality to manage search and access requests for assets within the ERA System. Users have the capability to generate search criteria, execute searches, view search results, and select assets for output or presentation. The architecture provides a framework to enable the use of multiple search engines offering a rich choice of searching capabilities across assets and their contents.
The Local Services and Control (LS&C) system-level package includes the functional infrastructure for the ERA Instance including a user interface portal, user workflow, security services, external interfaces to the archiving entity and other entities' systems, as well as the interfaces between ERA Instances. All external interfaces are depicted as flowing through LS&C, although the present invention is not so limited.
The ERA System contains a centralized monitoring and management capability called ERA Management. The ERA Management hardware and/or software may be located at an ERA site. The Systems Operations Center (SOC) provides the system and security administrators with access to the ERA management Virtual Local Area Network. Each SOC manages one or more Federations of Instances based on the classification of the information contained in the Federation.
Also shown are the three primary data stores for each Instance:
This diagram provides a representative illustration of how a federated ERA system can be put together, though it will be appreciated that the same is given by way of example and without limitation. Also, the diagram describes a collection of Instances at the same security classification level and compartment that can communicate electronically via a WAN with one another, although the present invention is not so limited. For example,
The ERA's components may be structured to receive, manage, and process a large amount of assets and collections of assets. Because of the large amount of assets and collections of assets, it would be advantageous to provide an approach that scales to accommodate the same. Beyond the storage of the assets themselves, a way of understanding, accessing, and managing the assets may be provided to add meaning and functionality to the broader ERA. To serve these and/or other ends, an asset catalog including related, enabling features may be provided.
In particular, to address the overall problems of scaling and longevity, the asset catalog and storage system federator may address the following underlying problems, alone or in various combinations:
Electronic records are manifested, in some way, as electronic data files. There are several requirements for managing the relationship between electronic records and data files. These requirements include, but are not limited to: 1) ensuring that all data files stored in the system are associated with the records they constitute; 2) specifying the relationship of each ingested data file with an electronic record; 3) specifying the relationship of each transformed data file to an electronic record; and 4) verifying the data files associated with electronic records contained in a transfer.
The relationship between electronic records and data files appears simple at first glance, but is in reality somewhat complex, particularly when considering the relationship between an individual electronic record and data files, as is required by requirements 2) and 3) above. Although it is tempting to think of electronic records as being directly composed of data files, this is incorrect, as explained in more detail below.
The present invention solves this complexity through an intermediate layer called a digital component extractor, which establishes a bridge between electronic records and data files. This bridge allows archivists and transferring entities to model the true semantic relationship between individual electronic records and data files.
The concept of a record originates in the archival and records management domains, where a record represents a “unit of recorded information”. As used herein, the term “record” means a unit of recorded information created, received, and maintained as evidence or information by an organization or person, in pursuance of legal obligations or the transaction of business.
This definition has a conceptual basis, in the sense that records are recognized and understood by humans to represent information. It is necessary when discussing electronic records to distinguish the archival and records management term “record” from the computer science concept of the same name. The computer science concept of “record” formally represents a matrix-tuple in linear algebra which is analogous to a row in a database table. The present invention uses the unqualified term “record” to indicate the archival and records management concept, and uses the qualifier “tuple record” to indicate the computer science concept. As used herein, the term “tuple record” means a matrix-tuple (defined by linear algebra), which is a finite function that maps field names to a certain value.
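To make the “tuple record” notion concrete, the following sketch models one as a read-only mapping from field names to values, analogous to one row of a database table. The helper name and the field names are hypothetical and chosen only for illustration.

```python
from types import MappingProxyType
from typing import Any, Mapping


def make_tuple_record(**fields: Any) -> Mapping[str, Any]:
    """Return a read-only finite mapping from field names to values."""
    return MappingProxyType(dict(fields))


# One row of a hypothetical service table.
row = make_tuple_record(name="Jane Doe", rank="Sergeant", service_start="1968-03-01")
```

A tuple record of this kind carries no archival meaning by itself; whether one or many such rows make up a record in the archival sense is a separate question addressed later in this description.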
Archivists and records managers typically manage numerous records. The requirements discussed above require the system to manage not only records (in the plural), but also individual records (in the singular). The requirement to manage both individual and plural records presents several questions, including, but not limited to: 1) what defines the exact extent of an individual record? and 2) where precisely does an individual record start and where precisely does it end?
The answers to these questions must be precisely specified in the context of electronic records, where individual electronic records are managed independently.
Given the conceptual nature of records, a conceptual approach to defining the exact extent of a particular individual record is needed. A record can be said to exhibit a characteristic known as strong “semantic coherence,” which is implied by the “unit of recorded information” phrase in the definition of a record. As used herein, the term “semantic coherence” is defined as a conceptual meaning that is closely related through connections and consistency, and holds together firmly as parts of the same mass.
Semantic coherence covers a scale, from weak (no coherence) to strong (high coherence), and the exact point on the scale for any particular set of information will involve subjective (archival) judgment. A record represents conceptual meaning that “sticks together” strongly enough on the semantic coherence scale to be considered an individual record.
Consider the following examples of semantic coherence:
Consider a record of a particular veteran's military service. Information about that individual's service dates, ranks, and defined benefits is strongly logically connected. Is the same information for a different individual the same record? No, because the logical connection for information about one particular individual is very strong whereas the logical connection for information across individuals is weaker.
Consider again a record of a veteran's military service. Now consider information about a battle plan for a particular military engagement in which the individual participated. Is the battle plan part of the individual's military service record? No. While the battle plan is in itself a record (and is loosely connected to the individual's service record), its meaning is inconsistent with the service record, and it is therefore a separate record.
Put another way, strong semantic coherence is the characteristic that allows a distinction between one particular record and another particular record.
With paper records, archivists often do not identify individual records, due to time and resource constraints. Instead, archivists typically manage records in the aggregate. With electronic records, archivists may have the capability and desire to identify individual electronic records as standard practice.
Each individual record has an attribute that defines its particular “record type.” As used herein, the term “record type” refers to the abstract form of the records, such as letter, memo, greeting card, or portrait, etc. As such, each record type represents a distinctive class of electronic records defined by their form, function, or use. Consider the following example of record types:
A parish church will typically maintain many different types of electronic records, including baptismal records, deeds to parish properties, ledgers of the parish financial accounts, minutes of parish meetings, and official parish correspondence. Each of these different record types has a distinct intellectual form. For example, baptismal records almost always list at least the name of the person baptized, the date and place of birth, and the date and place of the baptism. In contrast, financial account ledger records might include a chart of accounts with debit/credit entries. It would be rather surprising to find an infant's birth date in a financial ledger.
The abstract form of a record type is specified by a “record type template.” As used herein, a “record type template” is a template that identifies specific attributes for a specific type of record. The record type template specifies the essential characteristics of the record, which are used to ensure authenticity.
Referring again to Example 3, the record type template for baptismal records would identify the information expected in that type of record, such as the name of the person baptized, date and place of birth, etc.
The Record Type Template also specifies the essential characteristics of the record, which are used to ensure authenticity as documented in co-pending, commonly assigned U.S. Application (Attorney Docket No 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.
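A record type template of the kind described above can be sketched as a named set of expected attributes with a check for missing ones. The class name, the attribute names, and the dictionary representation of a record are assumptions made for this illustration; they mirror the baptismal record example rather than any actual template format.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class RecordTypeTemplate:
    record_type: str
    required_attributes: FrozenSet[str]

    def missing_attributes(self, record: Dict[str, str]) -> List[str]:
        """List the expected attributes that are absent or empty in a record."""
        return sorted(a for a in self.required_attributes if not record.get(a))


# Hypothetical template for the baptismal records of the parish church example.
BAPTISMAL = RecordTypeTemplate(
    "baptismal record",
    frozenset({"name", "birth_date", "birth_place", "baptism_date", "baptism_place"}),
)
```

For instance, `BAPTISMAL.missing_attributes(record)` returns an empty list for a record that carries all five attributes, and otherwise names the ones that are absent.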
Electronic records are accumulated and organized into “record aggregates” to facilitate organization and archival processing. As used herein, the term “record aggregate” means an intellectual aggregation of documentary materials that arises because the materials result from the same accumulation or filing process, the same function, or the same activity; have a particular form; or have some other relationship arising out of their creation, receipt, or use; or because the aggregate was required for the purposes of archival arrangement. Record aggregates may be composed of other record aggregates, or records.
Record aggregates can themselves be accumulated and organized into higher order record aggregates. Consider the following example of record aggregates:
An archivist might place military service records into an aggregate for the branch of the military (e.g., Army) which itself is within an aggregate for the Department of Defense, which itself is within an aggregate for the Federal Government.
Record aggregates may follow standard levels: record groups, collections, series, file units, and items. Each record aggregate has name and title attributes which help identify it. Record aggregates may be composed of other record aggregates, or electronic records.
Record aggregates may either be homogeneous, i.e., they contain electronic records of the same record type, or heterogeneous, i.e., they contain electronic records of different record types.
Like electronic records, record aggregates have a degree of semantic coherence—they are organized according to principles of original order and provenance, which ensures that related electronic records are aggregated together. However, the semantic coherence that binds together a record aggregate is somewhat weaker than the semantic coherence that binds together a particular individual record. Put another way, an individual record within an aggregate has an independent identity because its semantic coherence is “strong enough” to be considered a record.
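The nesting of record aggregates, and the distinction between homogeneous and heterogeneous aggregates, can be sketched with a simple composite structure. The class and field names are hypothetical, and treating an empty aggregate as homogeneous is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class Record:
    title: str
    record_type: str


@dataclass
class RecordAggregate:
    name: str
    title: str
    members: List[Union["RecordAggregate", Record]] = field(default_factory=list)

    def record_types(self) -> Set[str]:
        """Collect the record types found anywhere below this aggregate."""
        types: Set[str] = set()
        for member in self.members:
            if isinstance(member, Record):
                types.add(member.record_type)
            else:
                types |= member.record_types()
        return types

    def is_homogeneous(self) -> bool:
        # Assumption: an aggregate with no records counts as homogeneous.
        return len(self.record_types()) <= 1


# The military service example: Army within Department of Defense within Federal Government.
army = RecordAggregate("army", "Army", [
    Record("Service record A", "military service record"),
    Record("Service record B", "military service record"),
])
defense = RecordAggregate("dod", "Department of Defense", [army])
federal = RecordAggregate("fed", "Federal Government", [defense])
```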
Computer software applications operate on data files, and data files represent the atomic unit of recorded information for computers. Where electronic records are conceptual in nature, data files are clearly physical. As used herein, the term “data file” means: 1) a collection of data that is stored together and treated as a unit by a computer software application; and 2) related data (e.g., numeric, textual, and/or graphic information) and fields that are organized in a strictly prescribed form and format. This definition includes two characteristics of data files, which are described in more detail below.
The first characteristic is that data files typically require interpretation by a computer software application, which the OAIS model calls “access software.” The OAIS definition for “access software” is a type of software that presents part of or all of the information content of an Information Object in forms understandable to humans or systems.
While it is conceivable that a person might look at all the individual bits of a data file to try to make sense of it, people generally use access software to present the information in some usable manner. The access software performs some kind of “presentation processing” to accomplish this. “Presentation processing” is defined as the software processing algorithms (including transformation, consolidation, tabulation, formatting, rendering, querying, filtering, interpretation, etc.) which access software employs to present the information contained in data files in a form understandable to humans.
Presentation processing covers a scale, from low (little to no processing required) to high (complex processing required), and the exact point on the scale for any particular set of information will involve subjective judgment. Presentation processing often involves presenting data files visually, but could also include presenting data files audibly or through any other human sensory perception.
Some data files are “eye readable” with minimal presentation processing. “Eye readable” is defined as data files whose information is inherently understandable to humans through visual inspection using access software that supports minimal presentation processing.
Only the simplest of data files are eye readable and most data files are completely unintelligible without a high degree of presentation processing. Using access software specifically suited to presenting a certain class of data files is necessary when the access software performs a high degree of software processing because without this access software, the information in the data files would be incomprehensible. Consider the following examples:
A fixed-length tabular dataset might be composed of one data file that structures tabular data into a regular row/column format that can easily be read and understood by a person. In this case, using access software might be optional.
A single web page might be composed of dozens of individual data files. For example, the web page might include multiple Hyper-Text Markup Language (HTML) data files, multiple Cascading Style Sheet (CSS) data files, client-side JavaScript script files, and multiple image files in various formats, such as Graphics Interchange Format (GIF) and Portable Network Graphics (PNG).
While a person could look through the individual bytes in each of these individual files, doing so would not provide an accurate sense of the data files' information content. This is because the access software, a web browser, actually performs a great deal of software processing to apply style sheets to transform and render content, more software processing to render images, and more software processing to render the behavior contained in the client-side scripts. This kind of software processing cannot easily be imagined or replicated by a person, so using access software is required.
Many data file formats are either undocumented, or are essentially incomprehensible to a person. For example, Microsoft Word's native binary (DOC) data file format is incompletely documented (due to the fact that it is proprietary) and is incomprehensible to a person who might look at the individual bytes within the data file. Using access software for these kinds of data files is required.
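As an illustration of low presentation processing, the fixed-length tabular dataset from the examples above can be read with a few lines of code. The column names and widths here are hypothetical; in practice they would come from documentation of the data file's format.

```python
from typing import Dict, List, Tuple


def parse_fixed_width(text: str, columns: List[Tuple[str, int]]) -> List[Dict[str, str]]:
    """Split each line into named fields using fixed column widths."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row, start = {}, 0
        for name, width in columns:
            row[name] = line[start:start + width].strip()
            start += width
        rows.append(row)
    return rows


# Hypothetical layout: surname (10 characters), given name (10), year (4).
COLUMNS = [("surname", 10), ("given", 10), ("year", 4)]
SAMPLE = "\n".join(
    f"{surname:<10}{given:<10}{year:<4}"
    for surname, given, year in [("Doe", "Jane", "1968"), ("Smith", "John", "1971")]
)
```

By contrast, no comparably small routine could stand in for a web browser or a word processor, which is why access software is required for the other two examples.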
Historically, data files created in the earlier days of computing required little presentation processing, but as computers, software, data, and algorithms have increased in complexity over time, the amount of required presentation processing has also increased.
The second characteristic is that data files have a prescribed form and format. The above examples reference several data file formats, including Hyper-Text Markup Language (HTML) and Microsoft Word's native binary (DOC). This prescribed form and format is specified by a “data file type template.” As used herein, the term “data file type template” means a set of specifications about a data type that governs its format and behaviors.
The “specifications” in the above definition are essentially the instructions required by the access software to perform presentation processing.
Data files are often aggregated to facilitate management and presentation processing. In the web page example (Example 6), the web page is composed of many individual data files, which is known as a “data file set.” The term “data file set” means one or more data files that are logically related for purposes of presentation processing by access software.
Data file sets can either be “explicit,” or “implicit.” “Explicit” data file sets are defined by information contained in the data files, whereas “implicit” data file sets are defined through inscrutable software processing algorithms. Consider these examples:
Consider again the example of a web page. When an HTML data file refers to a CSS style sheet data file, it does so explicitly by data file name. This name can be resolved to find the CSS data file.
Consider an example of a set of database tables that include multiple data files for different kinds of information. One data file might contain simple data, another might contain binary data, and yet another data file might contain index information. The relationship between these data files is implicit, meaning it is not specified within the data files. Only the database application software defines these relationships as part of its presentation processing.
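For an explicit data file set, the member files can be discovered from the data files themselves. The following sketch collects the file names that an HTML data file refers to, using Python's standard `html.parser` module. The choice of tags and the function names are assumptions for illustration; a real web page may reference files in other ways as well.

```python
from html.parser import HTMLParser
from typing import List


class ReferenceCollector(HTMLParser):
    """Collect file names that an HTML data file refers to explicitly."""

    def __init__(self) -> None:
        super().__init__()
        self.references: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("href"):
            self.references.append(attributes["href"])  # e.g. a CSS style sheet
        elif tag in ("script", "img") and attributes.get("src"):
            self.references.append(attributes["src"])  # scripts and images


def explicit_data_file_set(html_name: str, html_text: str) -> List[str]:
    """Return the HTML file plus every file it names, in document order."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return [html_name] + collector.references
```

No equivalent routine exists for an implicit data file set such as the database tables above, because the relationships are held only in the application software.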
As discussed above, electronic records are conceptual and data files are physical. Electronic records are manifested in some way as electronic data files, but the manner in which the electronic records are manifested must first be determined.
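First, the options to describe the relationship between electronic records and data files should be considered. An individual record may be composed of: 1) one data file; 2) multiple data files; 3) a portion of one data file; or 4) portions of multiple data files.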
All of these options may apply, as explained in the following examples, which extend the example of the parish church (Example 3).
The parish church maintains each baptismal record as a separate word processing document data file, and its financial ledger as a separate spreadsheet data file. In this case, there is a one-to-one correspondence between a record and each data file.
The parish church maintains two separate spreadsheet data files for its financial ledger record, one spreadsheet for the balance statement and a second spreadsheet for the profit/loss statement. In this case, one record is composed of multiple data files.
The parish church has a sophisticated content management software application to manage all of its documents. The content management application stores all documents (including baptismal records, correspondence, financial ledgers, etc.) in one single database data file. In this case, one record is composed of a portion of one data file.
Again, the parish church has a sophisticated content management software application to manage all of its documents. The content management application stores all documents in one single database data file and all metadata about the documents in a separate database data file. In this case, one record is composed of portions of multiple data files.
In Examples 10-13, the intellectual form, content, and number of electronic records remain fixed, while the relationship of those electronic records to data files varies, depending on the particulars of how the parish church manages and uses its data files at a specific point in time.
The reason that the relationship varies between a record and data files is that a record has strong semantic coherence, while data files may not have strong semantic coherence. A particular data file might contain many different kinds of information, or even bits and pieces of information, which sometimes cannot be eye readable without significant presentation processing and access software. In other words, semantic coherence is not a requirement for data files per se—the semantic coherence is realized by the presentation processing and access software and the human understanding gained through using that software.
The relationship between electronic records and data files, then, is potentially many-to-many at a portion level—a record might be composed of one or more portions of data files, and data files might contain one or more portions of electronic records.
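The many-to-many, portion-level relationship can be sketched as a mapping from records to portions of data files, together with its inverse. Representing a portion as a byte range is an assumption of this sketch (a portion could equally be the result of a query), and the record identifiers and file names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Portion:
    data_file: str
    offset: int  # assumed byte offset within the data file
    length: int  # assumed length in bytes


# Each record is composed of portions of one or more data files.
record_portions: Dict[str, List[Portion]] = {
    "baptismal-0001": [Portion("documents.db", 0, 512), Portion("metadata.db", 0, 64)],
    "ledger-2005": [Portion("documents.db", 512, 2048), Portion("metadata.db", 64, 64)],
}


def files_to_records(mapping: Dict[str, List[Portion]]) -> Dict[str, Set[str]]:
    """Invert the mapping: which records does each data file contribute to?"""
    inverse: Dict[str, Set[str]] = defaultdict(set)
    for record_id, portions in mapping.items():
        for portion in portions:
            inverse[portion.data_file].add(record_id)
    return dict(inverse)
```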
Based on Examples 10-13, it should be appreciated that the gap between electronic records (conceptual view) and data files (physical view) must be bridged. As the InterPARES I Preservation Task Force concluded, “Digital data inscribed on a physical medium do not have the form of a record. It is necessary to transform the inscribed bits into the form of the record.” (“Preserving Electronic Records,” Presentation on the work of the InterPARES I Preservation Task Force, Jun. 19, 2002)
The present invention provides a solution to the gap between electronic records and data files by adding a logical view which transforms between the conceptual and physical views. To perform this task, the present invention provides a “digital component extractor.” As used herein, the term “digital component extractor” is defined as a software component that extracts digital components from a data file set, guided by a set of instructions. A “digital component” is defined herein as a set of digital information that exhibits strong semantic coherence and is expressed as a bit stream.
The purpose of the digital component extractor is to extract digital components from data files in a data file set that together comprise a record.
One implication of this model is that electronic records are composed of digital components (which exhibit strong semantic coherence) and not data files (which can exhibit any range of semantic coherence, including none whatsoever). Another implication is that digital component extractors are instructed as to how to extract digital components from data file sets.
Digital component extractors establish the map between data files and electronic records, and because this map is many-to-many, the exact method by which digital component extractors extract digital components varies. Consider the following examples:
If there is a one-to-one correspondence between a record and a data file, the digital component extractor simply needs to return the specified data file as the digital component. For example, a digital component extractor for a record that corresponds to a single word processing document data file would simply return that data file as the digital component.
If a record is composed of portions from one data file, the digital component extractor includes an algorithm to extract portions of the specified data file. For example, a digital component extractor for a record that corresponds to an e-mail archive data file would extract individual e-mails as digital components.
If a record is composed of portions from more than one data file, the digital component extractor includes an algorithm to extract portions of the specified data files. For example, a digital component extractor for a record that corresponds to a document spread across multiple database tables (and data files) in a content management software application would perform appropriate queries on those database tables to extract the digital component.
Put another way, digital component extractors contain the instructions necessary to extract digital components from data file sets.
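The three cases above can be sketched as three small extractor factories. This is a minimal illustration under stated assumptions: a data file set is modelled as a dictionary of file names to bytes, a simple separator stands in for a real e-mail archive parser, and a keyed line lookup stands in for database queries. All function names are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: a data file set is modelled as {file name: file content}.
DataFileSet = Dict[str, bytes]
Extractor = Callable[[DataFileSet], List[bytes]]


def whole_file_extractor(name: str) -> Extractor:
    """One-to-one case: return the named data file unchanged."""
    return lambda files: [files[name]]


def split_extractor(name: str, separator: bytes) -> Extractor:
    """Portions of one file: split on a separator (stand-in for a real parser)."""
    def extract(files: DataFileSet) -> List[bytes]:
        return [part for part in files[name].split(separator) if part]
    return extract


def keyed_extractor(names: List[str], key: bytes) -> Extractor:
    """Portions of several files: gather the rows sharing one key.

    Each file holds lines of the form b"key|value"; this stands in for
    queries against several database tables.
    """
    def extract(files: DataFileSet) -> List[bytes]:
        parts = []
        for name in names:
            for line in files[name].splitlines():
                row_key, _, value = line.partition(b"|")
                if row_key == key:
                    parts.append(value)
        return [b"\n".join(parts)]
    return extract
```

In each case the caller sees the same interface (a data file set in, a list of digital components out), while the instructions that drive the extraction differ.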
Table 2 documents the approaches for specifying digital component extractors, and their advantages and disadvantages.
It would be efficient for transferring entities to establish intellectual control over the semantic coherence of their electronic records as they develop their information systems, but this will not always happen. It would also be efficient if transferring entities, with assistance from the archivist, at least defined their electronic records before the point of transfer, but again this will not always happen, because this is a burden on records officers. The system of the present invention imputes digital component extractors from templates as discussed below, and this generally will be acceptable. In the cases where none of these approaches work, the ERA must allow archivists to establish intellectual control over the electronic records at an item level through defining the digital component extractors.
Generally, ERA imputing the digital component extractors from the relevant templates will work quite well. Consider this example:
The record type template indicates a particular set of records is correspondence, and the data file template indicates the data file is in Microsoft Outlook (PST) format. A reasonable set of digital component extractors can be imputed that extract individual e-mails into separate digital components. Each digital component represents an individual e-mail, which exhibits strong semantic coherence.
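A rough sketch of such an imputed extractor follows. The Python standard library has no PST reader, so this example substitutes the open mbox format (via the `mailbox` module) as a stand-in; that substitution, the sample messages, and the function names are assumptions made purely for illustration.

```python
import mailbox
import os
import tempfile

# Two tiny messages in mbox format (illustrative content only).
SAMPLE_MBOX = (
    "From alice@example.org Mon Jan  1 00:00:00 2007\n"
    "From: alice@example.org\n"
    "Subject: First\n"
    "\n"
    "Hello.\n"
    "\n"
    "From bob@example.org Mon Jan  1 00:01:00 2007\n"
    "From: bob@example.org\n"
    "Subject: Second\n"
    "\n"
    "Goodbye.\n"
)


def extract_emails(mbox_path):
    """Return each message in an mbox file as its own bit stream."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [message.as_bytes() for message in box]
    finally:
        box.close()


def demo():
    """Write the sample archive to a temporary file and extract it."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        return extract_emails(path)
```

Each element returned by `demo()` corresponds to one e-mail, that is, one digital component with strong semantic coherence.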
In some rare cases, there may be no workable digital component extractors, because they are not defined by either the transferring entity or archivist, and the ERA system cannot impute reasonable alternatives. Consider this example:
The record type template indicates a particular set of records is geospatial information, and the data file template indicates the data file is in an unknown proprietary format that is not human readable and not documented. ERA cannot impute a reasonable set of digital component extractors because it is not aware of the data type format.
In the case where there are no workable digital component extractors, the ERA of the present invention will create a default set of digital component extractors, known as “placeholder digital component extractors,” which are defined as a set of digital component extractors that assume each data file is a single digital component.
The levels of available preservation, access, and authenticity services that the ERA of the present invention can provide may be constrained for electronic records with placeholder digital component extractors, so these should be the exception rather than the norm. In other words, placeholder digital component extractors are only consistent with the most basic level of service in ERA.
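The fallback from imputed extractors to placeholder extractors might be sketched as follows. The registry, its keys, and the function names are hypothetical; the specification does not prescribe how imputation is implemented.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a data file set is modelled as {file name: file content}.
DataFileSet = Dict[str, bytes]
Extractor = Callable[[DataFileSet], List[bytes]]


def placeholder_extractor(files: DataFileSet) -> List[bytes]:
    """Placeholder behaviour: every data file is exactly one digital component."""
    return [files[name] for name in sorted(files)]


def line_per_component(files: DataFileSet) -> List[bytes]:
    """Toy imputed extractor: each line of each file is one component."""
    return [line for name in sorted(files) for line in files[name].splitlines()]


# Hypothetical registry keyed by (record type, data file format).
KNOWN_EXTRACTORS: Dict[Tuple[str, str], Extractor] = {
    ("correspondence", "text-lines"): line_per_component,
}


def impute_extractor(record_type: str, file_format: str) -> Tuple[Extractor, bool]:
    """Return (extractor, is_placeholder) for the given pair of templates."""
    extractor = KNOWN_EXTRACTORS.get((record_type, file_format))
    if extractor is None:
        return placeholder_extractor, True
    return extractor, False
```

Returning the `is_placeholder` flag alongside the extractor is one way to let downstream services know that only the most basic level of service applies.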
All of the entities modeled by the present invention, such as electronic records, record aggregates, digital components, data files, etc., must be identifiable and resolvable. An approach to identifiers is more fully documented in co-pending, commonly assigned U.S. Application (Attorney Docket 4870-9), filed Apr. 26, 2007, entitled SYSTEM AND METHOD FOR AN IMMUTABLE IDENTIFICATION SCHEME IN A LARGE SCALE COMPUTER SYSTEM.
All identifiers within the ERA must exhibit the following characteristics:
An approach to generating identifiers according to the present invention involves using a cryptographic hash algorithm (such as SHA-256) based on the initial content of the thing being identified. This approach meets the required constraints.
It should be noted that some entities have an identity which is independent of their content. For example, the identity of a record is independent of the content of the digital components and/or data files that make up any particular version of that record. New versions of electronic records can arise from redaction and preservation activities, and each record version will have its own independent identifier that is related back to the record.
In these cases, the identifier will be generated from the content of the entity when it is first created within ERA and immutable thereafter. Thus, the identifier for electronic records would be generated and assigned when the record is created within ERA based on the content of the first version's digital components, and that identifier would be immutable thereafter.
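A minimal sketch of this identifier scheme, using SHA-256 from Python's `hashlib`, is shown below. The `kind` label (used so that a record and its first version do not share an identifier), the length-prefixing of each bit stream, and the `Record` class are illustrative assumptions rather than details taken from the specification or the co-pending application.

```python
import hashlib
from typing import Iterable, List


def content_identifier(kind: str, streams: Iterable[bytes]) -> str:
    """Hash a kind label plus the initial content into a hex identifier.

    Each chunk is length-prefixed so that chunk boundaries are unambiguous
    (an assumption of this sketch).
    """
    digest = hashlib.sha256()
    for chunk in [kind.encode("utf-8"), *streams]:
        digest.update(len(chunk).to_bytes(8, "big"))
        digest.update(chunk)
    return digest.hexdigest()


class Record:
    """A record whose identifier is fixed when it is first created."""

    def __init__(self, first_components: List[bytes]):
        self._identifier = content_identifier("record", first_components)
        self.version_ids = [content_identifier("version", first_components)]

    @property
    def identifier(self) -> str:
        # Read-only: later versions never change the record's identifier.
        return self._identifier

    def add_version(self, components: List[bytes]) -> str:
        """Register a new version; it gets its own independent identifier."""
        version_id = content_identifier("version", components)
        self.version_ids.append(version_id)
        return version_id
```

Adding a redacted version appends a new version identifier, while `identifier` continues to return the value derived from the first version's digital components.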
An approach to preservation and authenticity issues is more fully documented in co-pending, commonly assigned U.S. application (Attorney Docket 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.
The notion of digital components and digital component extractors has some interesting implications for preservation. The InterPARES I Preservation Task Force states “It is impossible to preserve an electronic record. It is only possible to preserve the ability to reproduce an electronic record.” (“Preserving Electronic Records”, Presentation on the work of the InterPARES I Preservation Task Force, Jun. 19, 2002.) A record's digital components, along with access software, allow reproduction of the electronic record. As such, the preservation strategy of the present invention ensures the digital component extractors produce digital components that authentically represent the record. This means that digital component extractors must honor the essential characteristics associated with the record (and which are specified in the record type template).
The process of redaction involves deleting specific content from a record to produce a new version of the record, and the new version of the record typically has reduced access restrictions.
In the electronic record context, digital content is contained in both data files and digital components, so in theory redaction (deleting digital content) could occur in either place. In practice, most redaction tools redact content from data files, so the present invention will support this approach. This means that redaction will occur against data files, which will produce a new version of the data files, and the digital component extractors will produce new digital components from these redacted data files. This process will result in a new version of the record that is composed of redacted digital components that have been extracted from redacted data files.
Like records, original order and arrangement are conceptual and not physical. Thus, order and arrangement both apply to records, but not data files. The order of data files is essentially arbitrary and meaningless from an archival context, since data files exhibit low semantic coherence.
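The redaction flow described above (redact the data files, then re-run the extractors) can be sketched as follows. The byte-replacement redaction, the mask text, and the function names are hypothetical simplifications of what a real redaction tool would do.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a data file set is modelled as {file name: file content}.
DataFileSet = Dict[str, bytes]
Extractor = Callable[[DataFileSet], List[bytes]]


def redact_files(files: DataFileSet, secret: bytes,
                 mask: bytes = b"[REDACTED]") -> DataFileSet:
    """Produce a new version of the data files; the originals are untouched."""
    return {name: content.replace(secret, mask) for name, content in files.items()}


def extract_lines(files: DataFileSet) -> List[bytes]:
    """Toy extractor: each line of each file is one digital component."""
    return [line for name in sorted(files) for line in files[name].splitlines()]


def new_record_version(files: DataFileSet, extractor: Extractor,
                       secret: bytes) -> Tuple[DataFileSet, List[bytes]]:
    """Redact the data files, then re-extract the digital components."""
    redacted_files = redact_files(files, secret)
    return redacted_files, extractor(redacted_files)
```

Because the same extractor is applied to the redacted files, the new record version is built from redacted digital components without any change to the extraction instructions.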
It is possible that electronic records might have no meaningful original order, in the same way paper records might have no meaningful original order. In these cases, the present invention will follow the advice of Frank Boles in “Disrespecting Original Order” to maintain records in a state of simple usability. (Boles, F., “Disrespecting Original Order”, The American Archivist, Vol. 45 No. 1, pp. 26-32, 1982.) Simple usability for electronic records implies dynamic sorting, filtering, and querying capabilities.
It is possible that the digital component extractors of the present invention will be executed to produce a physical representation of a digital component. In this case, a digital component would be a bit stream serialized as a managed file within the system. It is also possible that the digital component extractors will be executed on-demand to produce a transient digital component, as needed. In this case, a digital component would be a transient in-memory bit stream. The present invention allows for both options, and the decisions on which to use will be a matter of policy and design.
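The two options can be contrasted in a short sketch. The extractor, the file names, and the use of a temporary directory as the "managed" location are assumptions for illustration only.

```python
import io
import os
import tempfile
from typing import Callable, Dict

# Assumption: a data file set is modelled as {file name: file content}.
DataFileSet = Dict[str, bytes]


def first_line(files: DataFileSet) -> bytes:
    """Toy extractor: the first line of notes.txt is the digital component."""
    return files["notes.txt"].splitlines()[0]


def transient_component(extractor: Callable[[DataFileSet], bytes],
                        files: DataFileSet) -> io.BytesIO:
    """On-demand option: the component exists only as an in-memory stream."""
    return io.BytesIO(extractor(files))


def materialized_component(extractor: Callable[[DataFileSet], bytes],
                           files: DataFileSet, directory: str, name: str) -> str:
    """Persistent option: the component is serialized as a file on disk."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(extractor(files))
    return path


def demo():
    """Return the bytes obtained through each option for the same input."""
    files = {"notes.txt": b"first entry\nsecond entry"}
    in_memory = transient_component(first_line, files).read()
    with tempfile.TemporaryDirectory() as folder:
        path = materialized_component(first_line, files, folder, "component.bin")
        with open(path, "rb") as handle:
            on_disk = handle.read()
    return in_memory, on_disk
```

Both paths yield the same bit stream; the difference lies in whether storage or recomputation cost is paid, which is the policy and design trade-off noted above.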
Templates play a large part in NARA's vision of the ERA both as a means to manage electronic records, in respect to scheduling, and as a means to preserve records, in respect to defining preservation formats and processing.
Because there are many potential applications of templates, and because templates are sometimes described by examples of documents that conform to the templates rather than the template itself, there is a need to define what templates are and how they are used.
As discussed in more detail below, the present invention utilizes a taxonomy of templates and the relationships between templates and instances of templates to identify and manage records. The present invention also utilizes the relationship between hierarchical templates and hierarchical information using a matrix. Furthermore, the present invention provides for managing templates.
It is helpful to begin with an example of templates and instances of templates, and to provide an illustrative listing of some kinds of templates that might be used within the ERA system of the present invention.
According to the present invention, the use of templates may be associated with all of the following:
It can therefore be seen that templates are being used according to the present invention to:
Some of these uses of templates have been described with reference to instantiations of the templates and some have been described with reference to the templates themselves. It is necessary to distinguish between templates and instances of templates.
Using XML technologies as an example, the following illustrates templates, and instances of documents that conform to or are generated by those templates, that might be used in the preservation and presentation of a document displayed on a web page.
The first template is an XML schema that defines the structure of the record catalog which lists the digital objects that are part of the web page and their hierarchical relationships. An instance of that template is a selection from the record catalog for the page in question.
Referring to
Referring to
Referring to
Other types of templates may orchestrate a sequence of pages. The instantiation of that template is the web page, which is the record that is being preserved.
Additional templates may be involved in defining the behavior of a web application, including templates that define the work flow within the application, templates that define the orchestration of pages within the application and templates that describe the animation of items on a page.
Table 3 provides an overview of some of the types of templates that may occur in the ERA of the present invention. Although each example has been mapped to an appropriate XML syntax that might be used to create the template, it should be appreciated that the present invention is not limited to the use of any particular format. It should also be appreciated that the list of templates in Table 3 is not intended to be exhaustive. There are many possible applications for templates and there are other XML technologies, and non-XML technologies, which may be used.
Templates may be used to define the relationships between records in the archives, such as defining the original order of records, the structure of the record catalog, and the structure of transfers to the archives or the delivery of copies to users (Submission Information Packages and Dissemination Information Packages).
Capturing the original order of a record represents a case where a template can be used within a template. The structure of the Record Catalog can be described in a template that defines the information elements that make up an entry in the catalog. The content of some of those information elements may be other templates, or they may become values in the instantiation of an object that conforms to another template.
Templates may be used to define the content and structure of records schedules and other Life Cycle Documents.
Templates may be used to define the structure of record description, and the elements of information that compose the metadata of records.
A template for Archival Metadata, which includes description and life cycle data, will define which elements of information must be present, what type of information they should contain, and how they are related to each other.
Templates may be used as inputs to processes that transform digital objects in the archive, including templates that may be used to define the presentation of assets to users.
The System component templates cover the widest variety of use of templates. This includes defining persistent object formats, defining the information needed by a processor to render those formats in a current format, defining the choreography and behaviors of objects in aggregate multimedia records, etc.
The System Components will be constantly evolving, adding new templates as new digital technologies evolve. Each type of system component will have its own family of templates.
Templates may be used to define the structure of component description. The ERA system will archive itself and be self-describing. Templates will define elements of information needed for components to be self describing.
Templates may also be used to define the nature and rights of entities and the access restrictions on assets in the archive.
A records-centric access model will define restrictions and rights in relation to records using the internal structure of the records themselves. Templates will define the instructions on records and create the framework for aligning identity—role—authorization to protect the records.
Templates may further be used to describe system services and orchestrate services within work flow processes.
The Service Architecture describes the arrangement and delivery of services in the ERA system of the present invention, including the work flow processes and the functionality at each step in the process. Templates, expressed for example in Business Process Execution Language (BPEL), may be used to describe the orchestration of functional services and, at a lower level, to describe the inputs and outputs of each individual functional service, using for example Web Services Description Language (WSDL).
A hierarchical scheme according to the present invention may be implemented for managing templates. The introduction of hierarchy to the management of templates adds another level of abstraction. A template abstracts from a specific instance to the general case. Such a template is associated to a single type of object. With hierarchy, another layer of abstraction may be added that can be applied to any of: 1) the template, 2) the content which it controls, or 3) both.
As an object subject to a hierarchical arrangement, the template becomes a mirror of the organization of objects into increasingly larger aggregate structures, which is a method of organization common to the ERA system of the present invention as a whole.
Templates can have a hierarchical connotation either because: (a) the template itself can only be instantiated with reference to a hierarchy of templates which collectively define its content, or (b) the object the template describes can only be instantiated with reference to a hierarchy of digital items or conceptual arrangements of digital items.
In the first case (a), instantiating the template requires retrieving elements from within different templates within a hierarchy. For example, Life Cycle Data document templates (Transfer Agreements, Disposition Agreements, etc) will have their own specific information elements but will also likely share a set of information elements common to all Life Cycle Data documents.
The template hierarchy might look like:
ERA.xsd (elements common to the ERA, such as identifiers)
In XML Schema, this may be implemented by having each template in each child level of the template hierarchy begin with an <include/> instruction that incorporates in the child template all the data elements described in its parent, which in turn will <include/> all the data elements in its parent, etc.
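The chained include mechanism can be illustrated by resolving a small in-memory template hierarchy. Only `ERA.xsd` is named in the text; the child schema names (`LifeCycleData.xsd`, `TransferAgreement.xsd`), the element names, and the resolver function are hypothetical. The sketch uses Python's `xml.etree.ElementTree` to follow `xs:include` references and does not perform full XML Schema processing.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical three-level hierarchy held in memory instead of on disk.
SCHEMAS = {
    "ERA.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Identifier" type="xs:string"/>
</xs:schema>""",
    "LifeCycleData.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="ERA.xsd"/>
  <xs:element name="Version" type="xs:string"/>
</xs:schema>""",
    "TransferAgreement.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="LifeCycleData.xsd"/>
  <xs:element name="TransferDate" type="xs:date"/>
</xs:schema>""",
}


def element_names(schema_name, schemas=SCHEMAS, seen=None):
    """Collect top-level element names, parents first, following xs:include."""
    seen = set() if seen is None else seen
    if schema_name in seen:  # guard against circular includes
        return []
    seen.add(schema_name)
    root = ET.fromstring(schemas[schema_name])
    names = []
    for include in root.findall(XS + "include"):
        names.extend(element_names(include.get("schemaLocation"), schemas, seen))
    names.extend(element.get("name") for element in root.findall(XS + "element"))
    return names
```

Resolving the lowest-level template therefore yields its own elements together with everything inherited from each ancestor in the hierarchy.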
In the second case (b), to instantiate a document that conforms to a template requires retrieving elements of information from hierarchically organized assets within the archive.
For example, the template for archival metadata may include elements of information, some of which are associated with a record catalog item that represents the entire record as a concept (the parent or root element of the record), while other elements of information are associated with individual digital items that are components of the record.
To create a document that represents the archival metadata for a specific digital item, and which conforms to the archival metadata template, requires retrieving all the information elements from each level in the record's internal hierarchy from that digital item up to the record's “root”.
For example, suppose that the family of a noted physicist donates her personal papers to NARA. The record hierarchy might look like:
Metadata that describes the <Origin> of the record will likely be associated with the highest level in the record hierarchy, the “//Curie Collection” level, as the description of <Origin> applies to all the documents in that collection.
Metadata that describes the <Digital Object Type> of a specific document will be associated with a specific document, such as “//Curie Collection/Professional Papers/Research Activities/Reagents”.
To create an instance of the metadata for the “//Reagents” document requires the accretion of the metadata for itself and all its ancestors as we traverse the record hierarchy up to the collection level.
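The accretion of metadata up the record hierarchy can be sketched as a walk from the item to the collection root. The path strings follow the example above; the metadata values, the dictionary-based store, and the rule that the nearest level wins when a key appears more than once are assumptions of this sketch.

```python
# Metadata attached at individual levels of the hierarchy (illustrative values).
METADATA = {
    "//Curie Collection": {"Origin": "Donated by the family"},
    "//Curie Collection/Professional Papers": {},
    "//Curie Collection/Professional Papers/Research Activities": {},
    "//Curie Collection/Professional Papers/Research Activities/Reagents": {
        "Digital Object Type": "document",
    },
}


def ancestors(path):
    """Yield the item's own path and then each ancestor, up to the root."""
    parts = path.lstrip("/").split("/")
    for depth in range(len(parts), 0, -1):
        yield "//" + "/".join(parts[:depth])


def accrete_metadata(path, store):
    """Merge metadata from the item and all of its ancestors.

    A value set closer to the item takes precedence over one set higher up
    (an assumption; the specification does not state a conflict rule).
    """
    merged = {}
    for node in ancestors(path):
        for key, value in store.get(node, {}).items():
            merged.setdefault(key, value)
    return merged
```

For the Reagents document, the result combines its own Digital Object Type with the Origin recorded once at the collection level.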
The possible intersections of templates and hierarchies can be presented in a matrix as shown in Table 4. Along one axis are the templates; either derived from a hierarchy or self-contained. Along the other axis are the conforming content, again either derived from a hierarchy or self-contained.
The matrix below illustrates where some types of templates may fall in the matrix.
In a self-describing system, each template is both a functional component of the system and a record in the system. As a record in the system, the template is treated the same as any other record, with its own metadata, life cycle management, and preservation. The ERA system of the present invention may be regarded, therefore, as an aggregate record, with its own hierarchy of documents, so that part of our ERA record hierarchy might look like
Each instance of a system component, including templates, has its own archival metadata (metadata that describes a record). This latter metadata makes the component self describing.
For example, a WSDL file is an instance of the template for defining a service and a BPEL file is an instance of the template that defines a work flow.
The archival metadata of the WSDL file will include information such as:
This sort of information could be included in the WSDL file as comments (or <Documentation/> elements) but would not be very manageable as a result. The system would not be able to apply its record management functionality, which is based on archival metadata held exterior to the digital object the metadata describes, to its own templates.
To make description of the system components manageable, they should be described using the same archival metadata templates as for any record.
While there will be a defined template for a service in the ERA (such as the XML Schema for WSDL), the present invention may use another template, the Archival Metadata schema, as the template to describe the service as a component of the system.
As templates evolve, the life cycle data elements in their description capture that evolution, such as the version. When a change to a template changes the behavior of the system, the earlier version of the template is preserved as a record so that the previous behavior of the system can be understood.
Templates will evolve as ERA evolves. As such, templates, as records in ERA, will be versioned and managed. Life cycle data elements or records will include the version of the templates they use. Versioning will allow new templates to be introduced without creating problems with validation. Whether life cycle content that is subject to validation against templates should be updated as templates evolve will be a policy decision applied to each template.
Each process to update a template may be a standard work flow in the ERA, and described in its own template, which will include appropriate approval and authorization steps as determined in policy.
Templates, as records, will have their own fixity information to ensure their integrity and the life cycle data of objects modified by templates will record which version of which template was used.
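Template versioning with fixity information might be sketched as follows. The `TemplateStore` class, the use of SHA-256 as the fixity value, and the shape of the life cycle entry are hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


class TemplateStore:
    """Keeps every version of every template together with a fixity digest."""

    def __init__(self) -> None:
        # template name -> list of (content, SHA-256 digest), oldest first
        self._versions: Dict[str, List[Tuple[bytes, str]]] = {}

    def publish(self, name: str, content: bytes) -> int:
        """Store a new version and return its 1-based version number."""
        history = self._versions.setdefault(name, [])
        history.append((content, hashlib.sha256(content).hexdigest()))
        return len(history)

    def get(self, name: str, version: int) -> bytes:
        """Return a stored version after re-checking its fixity digest."""
        content, digest = self._versions[name][version - 1]
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("fixity check failed for %s v%d" % (name, version))
        return content


def life_cycle_entry(object_id: str, template_name: str, version: int) -> dict:
    """Record which version of which template was applied to an object."""
    return {"object": object_id, "template": template_name, "template_version": version}
```

Earlier versions remain retrievable after a new version is published, so the previous behavior of the system can still be reconstructed from the stored templates and the life cycle entries that reference them.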
The concept of managing templates can be extended to apply to every component of the system. Each software component of the ERA system should be described and held in the ERA. This applies to platform applications, web application components, any client side components, as well as all the functionality wrapped in web services which can be managed within the concept of managing templates as described above.
The concept of preserving original arrangement can also be extended to the system itself, so as to describe in Archival Metadata how all the components are structurally linked, creating in essence a schema for the ERA itself.
While the invention has been described in connection with what are presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the invention. Also, the various embodiments described above may be implemented in conjunction with other embodiments, e.g., aspects of one embodiment may be combined with aspects of another embodiment to realize yet other embodiments.
This application claims the benefit of U.S. Applications 60/802,875, filed May 24, 2006, and 60/797,754, filed May 5, 2006, each of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
60/797,754 | May 2006 | US
60/802,875 | May 2006 | US