System and method for managing records through establishing semantic coherence of related digital components including the identification of the digital components using templates

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a reference model of an overall archives system;

FIG. 2 is a chart demonstrating challenges and solutions related to certain illustrative aspects of the present invention;

FIG. 3 illustrates the notional life cycle of records as they move through the ERA system, in accordance with an example embodiment;

FIG. 4 illustrates the ERA System Functional Architecture from a notional perspective, delineating the system-level packages and external system entities, in accordance with an example embodiment;

FIG. 5 illustrates a digital component extractor model according to the present invention;

FIG. 6 illustrates an XML Schema as a template for content and structure of a record;

FIG. 7 illustrates an instance of the template of FIG. 6; and

FIG. 8 illustrates an XSL template for defining the presentation of the instance of FIG. 7.

DETAILED DESCRIPTION

The following description includes several examples and/or embodiments of computer-driven systems and/or methods for carrying out automated information storage, processing and/or access. In particular, the examples and embodiments are focused on systems and/or methods oriented specifically for use with the U.S. National Archives and Records Administration (NARA). However, it will be recognized that, while one or more portions of the present specification may be limited in application to NARA's specific requirements, most if not all of the described systems and/or methods have broader application. For example, the implementations described for storage, processing, and/or access to information (also sometimes referred to as ingest, storage, and dissemination) can also apply to any institution that requires and/or desires automated archiving and/or preservation of its information, e.g., documents, email, corporate IP/knowledge, etc. The term “institution” includes at least government agencies or entities, private companies, publicly traded corporations, universities and colleges, charitable or non-profit organizations, etc. Moreover, the term “electronic records archive” (ERA) is intended to encompass a storage, processing, and/or access archives for any institution, regardless of nature or size.

As one example, NARA's continuing fulfillment of its mission in the area of electronic records presents new challenges and opportunities, and the embodiments described herein that relate to the ERA and/or asset catalog may help NARA fulfill its broadly defined mission. The underlying risk associated with failing to meet these challenges or to realize these opportunities is the loss of evidence that is essential to sustaining a government's or an institution's needs. FIG. 2 relates specific electronic records challenges to the components of the OAIS Reference Model (ingest, archival storage, access, and data management/administration), and summarizes selected relevant research areas.

At Ingest—the ERA needs to identify and capture all components of the record that are necessary for effective storage and dissemination (e.g., content, context, structure, and presentation). This can be especially challenging for records with dynamic content (e.g., websites or databases).

Archival Storage—Recognizing that in the electronic realm the logical record is independent of its media, the four illustrative attributes of the record (e.g., content, context, structure, and presentation) and their associated metadata, still must be preserved “for the life of the Republic.”

Access—NARA will not fulfill its mission simply by storing electronic records of archival value. Through the ERA, these records will be used by researchers long after the associated application software, operating system, and hardware all have become obsolete. The ERA also may apply and enforce access restrictions to sensitive information while at the same time ensuring that the public interest is served by consistently removing access restrictions that are no longer required by statute or regulation.

Data Management—The amount of data that needs to be managed in the ERA can be monumental, especially in the context of government agencies like NARA. Presented herewith are embodiments that are truly scalable solutions that can address a range of needs—from a small focused Instance through large Instances. In such embodiments, the system can be scaled easily so that capacity in both storage and processing power is added when required, and not so soon that large excess capacities exist. This will allow for the system to be scaled to meet demand and provide for maximum flexibility in cost and performance to the institution (e.g., NARA).

Satisfactorily maintaining authenticity through technology-based transformation and re-representation of records is extremely challenging over time. While there has been significant research about migration of electronic records and the use of persistent formats, there has been no previous attempt to create an ERA solution on the scale required by some institutions such as NARA.

Migrations are potentially lossy transformations, so techniques are needed to detect and measure any actual loss. The system may reduce the likelihood of such loss by applying statistical sampling that is, for example, based on human judgment, backed up with appropriate software tools, and/or institutionalized in a semi-automatic monitoring process.

Table 1 summarizes the “lessons learned” by the Applicants from experience with migrating different types of records to a Persistent Object Format (POF).

TABLE 1

Type of record
Current Migration Possibilities

E-mail
The Dutch Testbed project has shown that e-mail can be successfully migrated to a POF. An XML-based POF was designed by Tessella as part of this work. Because e-mail messages can contain attached files in any format, an e-mail record should be preserved as a series of linked objects: the core message, including header information and message text, and related objects representing attachments. These record relationships are stored in the Record Catalog. Thus, an appropriate preservation strategy can be chosen and applied to each file, according to its type.

Word processing documents
Simple documents can be migrated to a POF, although document appearance can be complex and may include record characteristics. Some documents can also include other embedded documents which, like e-mail attachments, can be in any format. Documents can also contain macros that affect “behavior” and are very difficult to deal with generically. Thus, complex documents currently require an enhanced preservation strategy.

Adobe's Portable Document Format (PDF) often has been treated as a suitable POF for Word documents, as it preserves presentation information and content. The PDF specification is controlled by Adobe, but it is published, and PDF readers are widely available, both from Adobe and from third parties. ISO is currently developing, with assistance from NARA, a standard version of PDF specifically designed for archival purposes (PDF/A). This format has the benefit that it forces some ambiguities in the original to be removed. However, both Adobe and Microsoft are evolving towards using native XML for their document formats.

Images
TIFF is a widely accepted open standard format for raster images and is a good candidate in the short to medium term for a POF. For vector images, the XML-based Scalable Vector Graphics format is an attractive option, particularly as it is a W3C open standard.

Databases
The contents of a database should be converted to a POF rather than being maintained in the vendor's proprietary format. Migration of the contents of relational database tables to an XML or flat file format is relatively straightforward. However, in some cases, it is also desirable to represent and/or preserve the structure of the database. In the Dutch Digital Preservation Testbed project, this was achieved using a separate XML document to define the data types of columns, constraints (e.g., whether the data values in a column must be unique), and foreign key relationships, which define the inter-relationships between tables. The Swiss Federal Archives took a similar approach with their SIARD tool, but used SQL statements to define the database structure.

Major database software vendors have taken different approaches to implementing the SQL “standard” and add extra non-standard features of their own. This complicates the conversion to a POF. Another difficulty is the Binary Large Object (BLOB) datatype, which presents similar problems to those of e-mail attachments: any type of data can be stored in a BLOB and in many document-oriented databases, the majority of the important or relevant data may be in this form. In this case, separate preservation strategies may be applied according to the type of data held.

A further challenge with database preservation is that of preserving not only the data, but the way that the users created and viewed the data. In some cases this may depend on stored queries and stored procedures forming part of the database; in others it may depend on external applications interacting with the database. To preserve such “executable” aspects of the database “as a system” is an area of ongoing research.

Records with a high degree of “behavioral” properties (e.g., virtual reality models)
For this type of record, it is difficult to separate the content from the application in which it was designed to operate. This makes these records time-consuming to migrate to any format. Emulation is one approach, but this approach is yet to be fully tested in an archival environment. Migration to a POF is another approach, and more research is required into developing templates to support this.

Spreadsheets
The Dutch Testbed project examined the preservation of spreadsheets and concluded that an XML-based POF was the best solution, though it did not design the POF in detail. The structured nature of spreadsheet data means that it can be mapped reliably and effectively to an XML format. This approach can account for cell contents, the majority of appearance-related issues (cell formatting, etc.), and formulae used to calculate the contents of some cells.

The Testbed project did not address how to deal with macros: most spreadsheet software products include a scripting or programming language to allow very complex macros to be developed (e.g., Visual Basic for Applications as part of Microsoft Excel). This allows a spreadsheet file to contain a complex software application in addition to the data it holds. This is an area where further research is necessary, though it probably applies to only a small proportion of archival material.

Web sites
Most Web sites include documents in standardized formats (e.g., HTML). However, it should be noted that there are a number of types of HTML documents, and many Web pages will include incorrectly formed HTML that nonetheless will be correctly displayed by current browsers. The structural relationship between the different files in a Web site should be maintained. The fact that most Web sites include external as well as internal links should be accounted for in designing a POF for Web sites. The boundary of the domain to be archived should be defined and an approach decided on for how to deal with links to files outside of that domain.

Many modern Web sites are actually applications where the navigation and formatting are generated dynamically from executed pages (e.g., Active Server Pages or Java Server Pages). The actual content, including the user's preferences on what content is to be presented, is managed in a database. In this case, there are no simple Web pages to archive, as different users may be presented with different material at different times. This situation overlaps with our discussion above of databases and the applications which interact with them.

Sound and video
For audio streams, the WAV and AVI formats are the de facto standards and therefore a likely basis for POFs. For video, there are a number of MPEG formats in general use, with varying degrees of compression. While it is desirable that only lossless compression techniques are used for archiving, if lossy compression was used in the original format, the lost information cannot be recaptured in a POF.

For video archives in particular, there is the potential for extremely large quantities of material. High quality uncompressed video streams can consume up to 100 GB per hour of video, so storage space is an issue for this record type.

It is currently not possible to migrate a number of file formats in a way that will be acceptable for archival purposes. One aspect is to encourage the evolution and enhancement of third-party migration software products by providing a framework into which such commercial off-the-shelf (COTS) software products could become part of the ERA if they meet appropriate tests.

When an appropriate POF cannot be identified to reduce the chances of obsolescence, the format may need to be migrated to a non-permanent but more modern, proprietary format (this is known as Enhanced Preservation). Even POFs are not static, since they still need executable software to interpret them, and future POFs may need to be created that have less feature loss than an older format. Thus, the ERA may allow migrated files to be migrated again into a new and more robust format in the future. Through the Dutch Testbed Project, the Applicants have found that it is normally better to return to the original file(s) whenever such a re-migration occurs. Thus, when updating a record, certain example embodiments may revert to an original version of the document and migrate it to a POF accordingly, whereas certain other example embodiments may not be able to migrate the original document (e.g., because it is unavailable, in an unsupported format, etc.) and thus may instead, or in addition, migrate the already-migrated file. In other words, in certain example embodiments, a new version of a record may be derived from an original version of the record if it is available or, if the original is not available, from any other already existing derivative version (e.g., of the original). As such, an extensible POF for certain example embodiments may be provided.
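By way of non-limiting illustration, the source-selection rule described above (prefer the original version, otherwise fall back to an existing derivative) might be sketched as follows. The class RecordVersion, its fields, and the function choose_migration_source are hypothetical names introduced only for this sketch and do not describe any particular embodiment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecordVersion:
    """One stored version of a record (all field names are hypothetical)."""
    version_id: str
    is_original: bool
    migratable: bool  # False if unavailable or in an unsupported format


def choose_migration_source(
    versions: Sequence[RecordVersion],
) -> Optional[RecordVersion]:
    """Prefer a migratable original; otherwise use any migratable derivative."""
    for version in versions:
        if version.is_original and version.migratable:
            return version
    for version in versions:
        if not version.is_original and version.migratable:
            return version
    return None  # no version of the record can be migrated
```

In this sketch, a caller would migrate whichever version is returned to the new POF and would treat a None result as a record requiring some other preservation strategy.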

In view of the above aspects of the OAIS Reference Model, the ERA may comprise an ingest module to accept a file and/or a record, a storage module to associate the file or record with information and/or instructions for disposition, and an access or dissemination module to allow selected access to the file or record. The ingest module may include structure and/or a program to create a template to capture content, context, structure, and/or presentation of the record or file. The storage module may include structure and/or a program to preserve authenticity of the file or record over time, and/or to preserve the physical access to the record or file over time. The access module may include structure or a program to provide a user with ability to view/render the record or file over time, to control access to restricted records, to redact restricted or classified records, and/or to provide access to an increasing number of users anywhere at any time.

FIG. 3 illustrates the notional life cycle of records as they move through the ERA system, in accordance with an example embodiment. Records flow from producers, who are persons or client systems that provide the information to be preserved, and end up with consumers, who are persons or client systems that interact with the ERA to find preserved information of interest and to access that information in detail. The Producer also may be a “Transferring Entity.”

During the “Identify” stage, producers and archivists develop a Disposition Agreement to cover records. This Disposition Agreement contains disposition instructions, and also a related Preservation and Service Plan. Producers submit records to the ERA System in a Submission Information Package (SIP). The transfer occurs under a pre-defined Disposition Agreement and Transfer Agreement. The ERA System validates the transferred SIP by scanning for viruses, ensuring the security access restrictions are appropriate, and checking the records against templates. The ERA System informs the Producer of any potential problems, extracts metadata (including descriptive data, described in greater detail below), creates an Archival Information Package (or AIP, also described in greater detail below), and places the AIP into Archival Storage. At any time after the AIP has been placed into Archival Storage, archivists may perform Archival Processing, which includes developing arrangement, description, finding aids, and other metadata. These tasks will be assigned to archivists based on relevant policies, business rules, and management discretion. Archival processing supplements the Preservation Description Information metadata in the archives.
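As a purely illustrative sketch, the three validation steps named above (virus scanning, checking access restrictions, and checking records against templates) could be arranged as a sequence of checks whose findings are gathered for reporting to the Producer. Every function name, every dictionary key, and the dict-based model of a SIP below is an assumption made for this example; each check is a stand-in for whatever tool an actual embodiment would use.

```python
from typing import Callable, List

# A check inspects a SIP (modeled here as a plain dict) and returns problems found.
Check = Callable[[dict], List[str]]


def check_viruses(sip: dict) -> List[str]:
    # Stand-in only: a real system would invoke a virus scanner here.
    return [f"virus detected in {name}" for name in sip.get("flagged_files", [])]


def check_access_restrictions(sip: dict) -> List[str]:
    # Stand-in: the SIP's restriction level must be one the agreement allows.
    allowed = sip.get("allowed_restrictions", [])
    level = sip.get("restriction")
    return [] if level in allowed else [f"restriction {level!r} not permitted"]


def check_against_template(sip: dict) -> List[str]:
    # Stand-in: every record must carry the attributes its template requires.
    required = sip.get("template_attributes", [])
    problems = []
    for record in sip.get("records", []):
        for attr in required:
            if attr not in record:
                problems.append(f"record {record.get('id')} lacks {attr}")
    return problems


def validate_sip(sip: dict, checks: List[Check]) -> List[str]:
    """Run each check in order and gather every problem to report."""
    problems: List[str] = []
    for check in checks:
        problems.extend(check(sip))
    return problems
```

Collecting all problems, rather than stopping at the first, reflects the description that the Producer is informed of any potential problems; an embodiment could equally stop early.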

At any time after the AIP has been placed into Archival Storage, archivists may perform Preservation Processing, which includes transforming the records to authentically preserve them. Policies, business rules, Preservation and Service Plans, and management discretion will drive these tasks. Preservation processing supplements the Preservation Description Information metadata in the archives, and produces new (transformed) record versions.

With respect to the “Make Available” phase, at any time after the AIP has been placed into Archival Storage, archivists may perform Access Review and Redaction, which includes performing mediated searches, verifying the classification of records, and coordinating redaction of records where necessary. These tasks will be driven by policies, business rules, and access requests. Access Review and Redaction supplements the Preservation Description Information metadata in the archives, and produces new (redacted) record versions. Also, at any time after the AIP has been placed into Archival Storage, Consumers may search the archives to find records of interest.

FIG. 4 illustrates the ERA System Functional Architecture from a notional perspective, delineating the system-level packages and external system entities, in accordance with an example embodiment. The rectangular boxes within the ERA System boundary represent the six system-level packages. The Ingest system-level package includes the means and mechanisms to receive the electronic records from the transferring entities and to prepare those electronic records for storage within the ERA System. The Records Management system-level package includes the services necessary to manage the archival properties and attributes of the electronic records and other assets within the ERA System, and provides the ability to create and manage new versions of those assets. Records Management includes the management functionality for disposition agreements, disposition instructions, appraisal, transfer agreements, templates, authority sources, records life cycle data, descriptions, and arrangements. In addition, access review, redaction, and selected archival management tasks for non-electronic records, such as the scheduling and appraisal functions, are also included within the Records Management service.

The Preservation system-level package includes the services necessary to manage the preservation of the electronic records to ensure their continued existence, accessibility, and authenticity over time. The Preservation system-level service also provides the management functionality for preservation assessments, Preservation and Service Level plans, authenticity assessment, and digital adaptation of electronic records. The Archival Storage system-level package includes the functionality to abstract the details of mass storage from the rest of the system. This abstraction allows this service to be appropriately scaled and allows new technology to be introduced independent of the other system-level services according to business requirements. The Dissemination system-level package includes the functionality to manage search and access requests for assets within the ERA System. Users have the capability to generate search criteria, execute searches, view search results, and select assets for output or presentation. The architecture provides a framework to enable the use of multiple search engines offering a rich choice of searching capabilities across assets and their contents.

The Local Services and Control (LS&C) system-level package includes the functional infrastructure for the ERA Instance including a user interface portal, user workflow, security services, external interfaces to the archiving entity and other entities' systems, as well as the interfaces between ERA Instances. All external interfaces are depicted as flowing through LS&C, although the present invention is not so limited.

The ERA System contains a centralized monitoring and management capability called ERA Management. The ERA Management hardware and/or software may be located at an ERA site. The Systems Operations Center (SOC) provides the system and security administrators with access to the ERA management Virtual Local Area Network. Each SOC manages one or more Federations of Instances based on the classification of the information contained in the Federation.

Also shown are the three primary data stores for each Instance:

- 1. Ingest Working Storage—Contains transfers that remain until they are verified and placed into the Electronic Archives;
- 2. Electronic Archives—Contains all assets (e.g., disposition agreements, records, templates, descriptions, authority sources, arrangements, etc.); and
- 3. Instance Data Storage—Contains a performance cache of all business assets, operational data and the ERA asset catalog.

This diagram provides a representative illustration of how a federated ERA system can be put together, though it will be appreciated that the same is given by way of example and without limitation. Also, the diagram describes a collection of Instances at the same security classification level and compartment that can communicate electronically via a WAN with one another, although the present invention is not so limited. For example, FIG. 5 is a federation of ERA instances, in accordance with an example embodiment. The federation approach is described in greater detail below, although it is important to note here that the ERA and/or the asset catalog may be structured to work with and/or enable a federated approach.

The ERA's components may be structured to receive, manage, and process a large number of assets and collections of assets. Because of this volume, it would be advantageous to provide an approach that scales to accommodate the same. Beyond the storage of the assets themselves, a way of understanding, accessing, and managing the assets may be provided to add meaning and functionality to the broader ERA. To serve these and/or other ends, an asset catalog including related, enabling features may be provided.

In particular, to address the overall problems of scaling and longevity, the asset catalog and storage system federator may address the following underlying problems, alone or in various combinations:

- Capturing business objects that relate to assets that are particular to the application storing the assets (e.g., in an archiving system, such business objects may include, for example, disposition and destruction information, receipt information, legal transfer information, appraisals and archive description, etc.), with each new business use of the design potentially defining unique business objects that are needed to control its assets and execute its business processes;
- Maintaining arbitrary asset attributes to be flexible in accommodating unknown future attributes;
- Employing asset and other identifiers that are immutable so that they remain useful indefinitely and, therefore, enable them to be referenced both within the archives and by external entities with a reduced concern for changes over time;
- Supporting search and navigation through the extreme scale and diversity of assets archived;
- Handling obsolescence of assets that develops over time;
- Accommodating redacted and other derivative versions of assets appropriate for an archive system;
- Federating (e.g., integrating independent parts to create a larger whole) multiple, potentially heterogeneous, distributed, and independent archives systems (e.g., instances) to provide a larger scale archive system;
- Supporting a distributed implementation necessary for scaling, site independence, and disaster recovery considerations where the distribution of assets and associated catalogs may change over time but remain visible to all sites;
- Employing a search architecture and catalog format that allows exploitation of multiple, possibly commercial search engines for differing asset data types and across instances of archives in a federation, as future needs may dictate;
- Accommodating multiple, heterogeneous, commercial storage subsystems among and within the instances in a federation of archives to achieve extreme scaling and adapt to changes over time;
- Supporting a variety of data handling requirements based on, for example, security level, handling restrictions and ownership, in a manner that performs well and remains manageable for an extremely large number of assets and catalog entries;
- Supporting storage of any kind of electronic asset;
- Supporting transparent data location and migration and storage subsystem upgrades/changes; and/or
- Supporting reconstruction of the catalog and archives with little or no information other than the original catalog and archived bit streams (e.g., for the purposes of disaster recovery).

Electronic records are manifested, in some way, as electronic data files. There are several requirements for managing the relationship between electronic records and data files. These requirements include, but are not limited to: 1) ensuring that all data files stored in the system are associated with the records they constitute; 2) specifying the relationship of each ingested data file with an electronic record; 3) specifying the relationship of each transformed data file to an electronic record; and 4) verifying the data files associated with electronic records contained in a transfer.
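The four requirements above might be illustrated, under stated assumptions, by a simple association table that ties each data file to a record together with a relationship kind. The type alias Associations, the kinds "ingested" and "transformed", and both function names are hypothetical and are chosen only to mirror requirements 1) through 3); the sketch deliberately leaves open how the association itself is modeled.

```python
from typing import Dict, Iterable, Set, Tuple

# (record_id, file_name) -> relationship kind, e.g. "ingested" or "transformed".
# The key layout and the kind names are assumptions for this sketch.
Associations = Dict[Tuple[str, str], str]


def unassociated_files(
    stored_files: Iterable[str], associations: Associations
) -> Set[str]:
    """Requirement 1): return stored files that are not tied to any record."""
    associated = {file_name for (_record_id, file_name) in associations}
    return set(stored_files) - associated


def files_for_record(record_id: str, associations: Associations) -> Dict[str, str]:
    """Requirements 2) and 3): map each file of a record to its relationship kind."""
    return {
        file_name: kind
        for (rid, file_name), kind in associations.items()
        if rid == record_id
    }
```

A verification step in the spirit of requirement 4) could, for example, reject a transfer for which unassociated_files returns a non-empty set.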

The relationship between electronic records and data files appears simple at first glance, but is in reality somewhat complex, particularly when considering the relationship between an individual electronic record and data files, as is required by requirements 2) and 3) above. Although it is tempting to think of electronic records as being directly composed of data files, this is incorrect, as explained in more detail below.

The present invention solves this complexity through an intermediate layer called a digital component extractor, which establishes a bridge between electronic records and data files. This bridge allows archivists and transferring entities to model the true semantic relationship between individual electronic records and data files.

The concept of a record originates in the archival and records management domains, where a record represents a “unit of recorded information”. As used herein, the term “record” means a unit of recorded information created, received, and maintained as evidence or information by an organization or person, in pursuance of legal obligations or the transaction of business.

This definition has a conceptual basis, in the sense that records are recognized and understood by humans to represent information. It is necessary when discussing electronic records to distinguish the archival and records management term “record” from the computer science concept of the same name. The computer science concept of “record” formally represents a matrix-tuple in linear algebra, which is analogous to a row in a database table. The present invention uses the unqualified term “record” to indicate the archival and records management concept, and uses the qualified term “tuple record” to indicate the computer science concept. As used herein, the term “tuple record” means a matrix-tuple (defined by linear algebra), which is a finite function that maps each field name to a certain value.
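For illustration only, a tuple record in the sense defined above (a finite function from field names to values) can be modeled as a read-only mapping. The helper make_tuple_record and the field names used in the usage lines are hypothetical.

```python
from types import MappingProxyType
from typing import Any, Mapping


def make_tuple_record(**fields: Any) -> Mapping[str, Any]:
    """Build a read-only mapping from field names to values."""
    return MappingProxyType(dict(fields))


# Usage: one row of a hypothetical database table.
row = make_tuple_record(name="J. Doe", rank="Sergeant")
field_names = sorted(row)  # the finite domain of the function
rank = row["rank"]         # applying the function to one field name
```

A single archival record may draw on many such tuple records, or on none, which is why the two senses of “record” are kept apart.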

Archivists and records managers typically manage numerous records. The requirements discussed above require the system to manage not only records (in the plural), but also individual records (in the singular). The requirement to manage both individual and plural records presents several questions, including, but not limited to: 1) what defines the exact extent of an individual record? and 2) where precisely does an individual record start and where precisely does it end?

The answers to these questions must be precisely specified in the context of electronic records, where individual electronic records are managed independently.

Given the conceptual nature of records, a conceptual approach to defining the exact extent of a particular individual record is needed. A record can be said to exhibit a characteristic known as strong “semantic coherence,” which is implied by the “unit of recorded information” phrase in the definition of a record. As used herein, the term “semantic coherence” is defined as a conceptual meaning that is closely related through connections and consistency, and holds together firmly as parts of the same mass.

Semantic coherence covers a scale, from weak (no coherence) to strong (high coherence), and the exact point on the scale for any particular set of information will involve subjective (archival) judgment. A record represents conceptual meaning that “sticks together” strongly enough on the semantic coherence scale to be considered an individual record.

Consider the following examples of semantic coherence:

EXAMPLE 1

Consider a record of a particular veteran's military service. Information about that individual's service dates, ranks, and defined benefits is strongly logically connected. Is the same information for a different individual the same record? No, because the logical connection for information about one particular individual is very strong whereas the logical connection for information across individuals is weaker.

EXAMPLE 2

Consider again a record of a veteran's military service. Now consider information about a battle plan for a particular military engagement in which the individual participated. Is the battle plan part of the individual's military service record? No. While the battle plan is itself a record (and is loosely connected to the individual's service record), its meaning is not consistent with that of the service record, and it is therefore a separate record.

Put another way, strong semantic coherence is the characteristic that allows a distinction between one particular record and another particular record.

With paper records, archivists often do not identify individual records, due to time and resource constraints. Instead, archivists typically manage records in the aggregate. With electronic records, archivists may have the capability and desire to identify individual electronic records as standard practice.

Each individual record has an attribute that defines its particular “record type.” As used herein, the term “record type” refers to the abstract form of the records, such as letter, memo, greeting card, or portrait, etc. As such, each record type represents a distinctive class of electronic records defined by their form and by their function or use. Consider the following example of record types:

EXAMPLE 3

A parish church will typically maintain many different types of electronic records, including baptismal records, deeds to parish properties, ledgers of the parish financial accounts, minutes of parish meetings, and official parish correspondence. Each of these different record types has a distinct intellectual form. For example, baptismal records almost always list at least the name of the person baptized, the date and place of birth, and the date and place of the baptism. In contrast, financial account ledger records might include a chart of accounts with debit/credit entries. It would be rather surprising to find an infant's birth date in a financial ledger.

The abstract form of a record type is specified by a “record type template.” As used herein, a “record type template” is a template that identifies specific attributes for a specific type of record. The record type template specifies the essential characteristics of the record, which are used to ensure authenticity.

Referring again to Example 3, the record type template for baptismal records would identify the information expected in that type of record, such as the name of the person baptized, date and place of birth, etc. FIG. 5 illustrates the relationship between a record and a record type template. A record type template specifies the form of a record.
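As a minimal sketch of how a record type template might be applied, the following Python fragment models a template as a set of expected attribute names and reports which of them a given record lacks. The class name `RecordTypeTemplate`, the attribute names, and the sample values are hypothetical illustrations and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordTypeTemplate:
    """Hypothetical model: a record type plus the attributes expected in it."""
    record_type: str
    required_attributes: frozenset

    def missing_attributes(self, record: dict) -> set:
        """Return the expected attributes that the record does not supply."""
        return set(self.required_attributes) - set(record)


# Attribute list follows the baptismal record description in Example 3.
BAPTISMAL = RecordTypeTemplate(
    record_type="baptismal record",
    required_attributes=frozenset(
        {"name", "birth_date", "birth_place", "baptism_date", "baptism_place"}
    ),
)

# Sample values are invented for illustration only.
record = {
    "name": "Jane Doe",
    "birth_date": "1901-03-02",
    "birth_place": "Springfield",
    "baptism_date": "1901-04-14",
}

print(sorted(BAPTISMAL.missing_attributes(record)))  # ['baptism_place']
```

A check of this kind could flag a record that does not match the form its template expects, such as a baptismal record with no place of baptism.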

The Record Type Template also specifies the essential characteristics of the record, which are used to ensure authenticity as documented in co-pending, commonly assigned U.S. Application (Attorney Docket No 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.

Electronic records are accumulated and organized into “record aggregates” to facilitate organization and archival processing. As used herein, the term “record aggregate” means an intellectual aggregation of documentary materials that arises because the materials result from the same accumulation or filing process, the same function, or the same activity; have a particular form; or because of some other relationship arising out of their creation, receipt, or use; or because the aggregate was required for the purposes of archival arrangement. Record aggregates may be composed of other record aggregates, or records.

Record aggregates can themselves be accumulated and organized into higher order record aggregates. Consider the following example of record aggregates:

EXAMPLE 4

An archivist might place military service records into an aggregate for the branch of the military (e.g., Army) which itself is within an aggregate for the Department of Defense, which itself is within an aggregate for the Federal Government.

Record aggregates may follow standard levels: record groups, collections, series, file units, and items. Each record aggregate has name and title attributes which help identify it. Record aggregates may be composed of other record aggregates, or electronic records. FIG. 5 illustrates the relationship between electronic records and record aggregates.

Record aggregates may either be homogeneous, i.e., they contain electronic records of the same record type, or heterogeneous, i.e., they contain electronic records of different record types.
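The nesting of record aggregates and the homogeneous/heterogeneous distinction can be sketched as a small recursive data structure. The class names, field names, and sample titles below are assumptions made for illustration; the sketch treats an empty aggregate as homogeneous, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class Record:
    title: str
    record_type: str


@dataclass
class RecordAggregate:
    name: str
    members: List[Union["RecordAggregate", Record]] = field(default_factory=list)

    def record_types(self) -> Set[str]:
        """Collect the record types of every record nested in this aggregate."""
        types: Set[str] = set()
        for member in self.members:
            if isinstance(member, Record):
                types.add(member.record_type)
            else:
                types |= member.record_types()
        return types

    def is_homogeneous(self) -> bool:
        """True when all contained records share a single record type."""
        return len(self.record_types()) <= 1


# Example 4: Army within Department of Defense within Federal Government.
army = RecordAggregate("Army", [
    Record("Service record A", "military service record"),
    Record("Service record B", "military service record"),
])
federal = RecordAggregate(
    "Federal Government", [RecordAggregate("Department of Defense", [army])]
)

# Example 3: a parish holding several record types.
parish = RecordAggregate("Parish", [
    Record("Baptism entry", "baptismal record"),
    Record("Account ledger", "financial ledger"),
])

print(federal.is_homogeneous())  # True
print(parish.is_homogeneous())   # False
```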

Like electronic records, record aggregates have a degree of semantic coherence—they are organized according to principles of original order and provenance, which ensures that related electronic records are aggregated together. However, the semantic coherence that binds together a record aggregate is somewhat weaker than the semantic coherence that binds together a particular individual record. Put another way, an individual record within an aggregate has an independent identity because its semantic coherence is “strong enough” to be considered a record.

Computer software applications operate on data files, and data files represent the atomic unit of recorded information for computers. Where electronic records are conceptual in nature, data files are clearly physical. As used herein, the term “data file” means: 1) a collection of data that is stored together and treated as a unit by a computer software application; and 2) related data (e.g., numeric, textual, and/or graphic information) and fields that are organized in a strictly prescribed form and format. This definition includes two characteristics of data files, which are described in more detail below.

The first characteristic is that data files typically require interpretation by a computer software application, which the OAIS model calls “access software.” The OAIS definition for “access software” is a type of software that presents part of or all of the information content of an Information Object in forms understandable to humans or systems.

While it is conceivable that a person might look at all the individual bits of a data file to try to make sense of it, people generally use access software to present the information in some usable manner. The access software performs some kind of “presentation processing” to accomplish this. “Presentation processing” is defined as the software processing algorithms (including transformation, consolidation, tabulation, formatting, rendering, querying, filtering, interpretation, etc.) which access software employs to present the information contained in data files in a form understandable to humans.

Presentation processing covers a scale, from low (little to no processing required) to high (complex processing required), and the exact point on the scale for any particular set of information will involve subjective judgment. Presentation processing often involves presenting data files visually, but could also include presenting data files audibly or through any other human sensory perception.

Some data files are “eye readable” with minimal presentation processing. “Eye readable” is defined as data files whose information is inherently understandable to humans through visual inspection using access software that supports minimal presentation processing.

Only the simplest of data files are eye readable and most data files are completely unintelligible without a high degree of presentation processing. Using access software specifically suited to presenting a certain class of data files is necessary when the access software performs a high degree of software processing because without this access software, the information in the data files would be incomprehensible. Consider the following examples:

EXAMPLE 5

A fixed-length tabular dataset might be composed of one data file that structures tabular data into a regular row/column format that can easily be read and understood by a person. In this case, using access software might be optional.
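To illustrate how little presentation processing such a file needs, the sketch below slices fixed-width rows into named fields. The column layout, field names, and sample rows are hypothetical; the point is only that the raw text is already eye readable and the processing is a simple slice per column.

```python
# Hypothetical column layout: (field name, start offset, end offset).
LAYOUT = [("name", 0, 12), ("rank", 12, 20), ("year", 20, 24)]

# Invented sample data; each row is exactly 24 characters wide.
DATA_FILE = (
    "DOE JOHN    SERGEANT1944\n"
    "ROE RICHARD PRIVATE 1945\n"
)


def parse_fixed_width(text, layout):
    """Slice each non-empty line into fields according to the layout."""
    return [
        {name: line[start:end].strip() for name, start, end in layout}
        for line in text.splitlines()
        if line
    ]


rows = parse_fixed_width(DATA_FILE, LAYOUT)
print(rows[0])  # {'name': 'DOE JOHN', 'rank': 'SERGEANT', 'year': '1944'}
```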

EXAMPLE 6

A single web page might be composed of dozens of individual data files. For example, the web page might include multiple Hyper-Text Markup Language (HTML) data files, multiple Cascading Style Sheet (CSS) data files, client-side JavaScript script files, and multiple image files in various formats, such as Graphics Interchange Format (GIF) and Portable Network Graphics (PNG).

While a person could look through the individual bytes in each of these individual files, doing so would not provide an accurate sense of the data files' information content. This is because the access software, a web browser, actually performs a great deal of software processing to apply style sheets to transform and render content, more software processing to render images, and more software processing to render the behavior contained in the client-side scripts. This kind of software processing cannot easily be imagined or replicated by a person, so using access software is required.

EXAMPLE 7

Many data file formats are either undocumented, or are essentially incomprehensible to a person. For example, Microsoft Word's native binary (DOC) data file format is incompletely documented (due to the fact that it is proprietary) and is incomprehensible to a person who might look at the individual bytes within the data file. Using access software for these kinds of data files is required.

Historically, data files created in the earlier days of computing required little presentation processing, but as computers, software, data, and algorithms have continually increased in complexity over time, the amount of required presentation processing has also increased.

The second characteristic is that data files have a prescribed form and format. The above examples reference several data file formats, including Hyper-Text Markup Language (HTML) and Microsoft Word's native binary (DOC). This prescribed form and format is specified by a “data file type template.” As used herein, the term “data file type template” means a set of specifications about a data type that governs its format and behaviors.

The “specifications” in the above definition are essentially the instructions required by the access software to perform presentation processing.

Data files are often aggregated to facilitate management and presentation processing. In the web page example (Example 6), the web page is composed of many individual data files, which is known as a “data file set.” The term “data file set” means one or more data files that are logically related for purposes of presentation processing by access software.

Data file sets can either be “explicit,” or “implicit.” “Explicit” data file sets are defined by information contained in the data files, whereas “implicit” data file sets are defined through inscrutable software processing algorithms. Consider these examples:

EXAMPLE 8

Consider again the example of a web page. When an HTML data file refers to a CSS style sheet data file, it does so explicitly by data file name. This name can be resolved to find the CSS data file.
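A minimal sketch of resolving an explicit data file set follows, using Python's standard `html.parser` module to collect the file names an HTML data file refers to. The function name `explicit_data_file_set`, the set of tags inspected, and the sample page are assumptions for illustration; a real web page may reference files in many other ways.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collect file names that an HTML data file names explicitly."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("href"):
            self.references.append(attributes["href"])
        elif tag in ("img", "script") and attributes.get("src"):
            self.references.append(attributes["src"])


def explicit_data_file_set(html_name, html_text):
    """Return the HTML file name followed by the files it refers to."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return [html_name] + collector.references


# Invented sample page.
PAGE = """<html><head>
<link rel="stylesheet" href="site.css">
<script src="menu.js"></script>
</head><body><img src="logo.png"></body></html>"""

print(explicit_data_file_set("index.html", PAGE))
# ['index.html', 'site.css', 'menu.js', 'logo.png']
```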

EXAMPLE 9

Consider an example of a set of database tables that include multiple data files for different kinds of information. One data file might contain simple data, another might contain binary data, and yet another data file might contain index information. The relationship between these data files is implicit, meaning it is not specified within the data files. Only the database application software defines these relationships as part of its presentation processing.

FIG. 5 illustrates the relationship between data files, data file type templates, data file sets, and access software.

As discussed above, electronic records are conceptual and data files are physical. Electronic records are manifested in some way as electronic data files, but the manner in which the electronic records are manifested must first be determined.

First, the options to describe the relationship between electronic records and data files should be considered. An individual record may be composed of:

- One entire data file
- Multiple entire data files
- A portion of one data file
- Portions of multiple data files

All of these options may apply, as explained in the following examples, which extend the example of the parish church (Example 3).

EXAMPLE 10

The parish church maintains each baptismal record as a separate word processing document data file, and its financial ledger as a separate spreadsheet data file. In this case, there is a one-to-one correspondence between a record and each data file.

EXAMPLE 11

The parish church maintains two separate spreadsheet data files for its financial ledger record, one spreadsheet for the balance statement and a second spreadsheet for the profit/loss statement. In this case, one record is composed of multiple data files.

EXAMPLE 12

The parish church has a sophisticated content management software application to manage all of its documents. The content management application stores all documents (including baptismal records, correspondence, financial ledgers, etc.) in one single database data file. In this case, one record is composed of a portion of one data file.

EXAMPLE 13

Again, the parish church has a sophisticated content management software application to manage all of its documents. The content management application stores all documents in one single database data file and all metadata about the documents in a separate database data file. In this case, one record is composed of portions of multiple data files.

In Examples 10-13, the intellectual form, content, and number of electronic records remains fixed, while the relationship of those electronic records to data files varies, depending on the particulars of how the parish church manages and uses its data files at a specific point in time.

The reason that the relationship varies between a record and data files is that a record has strong semantic coherence, while data files may not have strong semantic coherence. A particular data file might contain many different kinds of information, or even bits and pieces of information, which sometimes cannot be eye readable without significant presentation processing and access software. In other words, semantic coherence is not a requirement for data files per se—the semantic coherence is realized by the presentation processing and access software and the human understanding gained through using that software.

The relationship between electronic records and data files, then, is potentially many-to-many at a portion level—a record might be composed of one or more portions of data files, and data files might contain one or more portions of electronic records.

Based on Examples 10-13, it should be appreciated that the gap between electronic records (conceptual view) and data files (physical view) must be bridged. As the InterPARES I Preservation Task Force concluded, “Digital data inscribed on a physical medium do not have the form of a record. It is necessary to transform the inscribed bits into the form of the record.” (“Preserving Electronic Records,” Presentation on the work of the InterPARES I Preservation Task Force, Jun. 19, 2002)

The present invention provides a solution to the gap between electronic records and data files by adding a logical view which transforms between the conceptual and physical views. To perform this task, the present invention provides a “digital component extractor.” As used herein, the term “digital component extractor” is defined as a software component that extracts digital components from a data file set, guided by a set of instructions. A “digital component” is defined herein as a set of digital information that exhibits strong semantic coherence and is expressed as a bit stream.

The purpose of the digital component extractor is to extract digital components from data files in a data file set that together comprise a record. FIG. 5 illustrates the model, which bridges the gap between electronic records and data files.

One implication of this model is that electronic records are composed of digital components (which exhibit strong semantic coherence) and not data files (which can exhibit any range of semantic coherence, including none whatsoever). Another implication is that digital component extractors are instructed as to how to extract digital components from data file sets.

Digital component extractors establish the map between data files and electronic records, and because this map is many-to-many, the exact method by which digital component extractors extract digital components varies. Consider the following examples:

EXAMPLE 14

If there is a one-to-one correspondence between a record and a data file, the digital component extractor simply needs to return the specified data file as the digital component. For example, a digital component extractor for a record that corresponds to a single word processing document data file would simply return that data file as the digital component.

EXAMPLE 15

If a record is composed of portions from one data file, the digital component extractor includes an algorithm to extract portions of the specified data file. For example, a digital component extractor for a record that corresponds to an e-mail archive data file would extract individual e-mails as digital components.

EXAMPLE 16

If a record is composed of portions from more than one data file, the digital component extractor includes an algorithm to extract portions of the specified data files. For example, a digital component extractor for a record that corresponds to a document spread across multiple database tables (and data files) in a content management software application would perform appropriate queries on those database tables to extract the digital component.

Put another way, digital component extractors contain the instructions necessary to extract digital components from data file sets.
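The three cases in Examples 14-16 can be sketched as functions that map a data file set to a list of bit streams. This is an illustrative sketch only: the type aliases, function names, record separator, and `key|value` line format are all assumptions, and a real extractor for an e-mail archive or a database would need a format-specific parser or query.

```python
from typing import Callable, Dict, List

# Assumption: a data file set is modeled as file name -> file content.
DataFileSet = Dict[str, bytes]
Extractor = Callable[[DataFileSet], List[bytes]]


def whole_file_extractor(name: str) -> Extractor:
    """Example 14: the data file itself is the digital component."""
    return lambda files: [files[name]]


def delimited_extractor(name: str, delimiter: bytes) -> Extractor:
    """Example 15: each delimited portion of one data file is a component."""
    def extract(files: DataFileSet) -> List[bytes]:
        return [part for part in files[name].split(delimiter) if part]
    return extract


def joined_extractor(content_name: str, metadata_name: str, key: bytes) -> Extractor:
    """Example 16: combine the matching portions of two data files."""
    def extract(files: DataFileSet) -> List[bytes]:
        parts = []
        for name in (content_name, metadata_name):
            for line in files[name].splitlines():
                if line.startswith(key + b"|"):
                    parts.append(line)
        return [b"\n".join(parts)]
    return extract


# Invented sample data file set.
files = {
    "letter.doc": b"Dear parishioner ...",
    "archive.txt": b"mail one\x1email two\x1e",
    "documents.db": b"42|Baptism of J. Doe\n43|Ledger 1901",
    "metadata.db": b"42|type=baptismal record\n43|type=financial ledger",
}

print(whole_file_extractor("letter.doc")(files))
print(delimited_extractor("archive.txt", b"\x1e")(files))
print(joined_extractor("documents.db", "metadata.db", b"42")(files))
```

Treating every extractor as the same kind of callable keeps the caller independent of how a particular record maps onto its data files.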

Table 2 documents the approaches for specifying digital component extractors, and their advantages and disadvantages.

TABLE 2

| Approach | Advantages | Disadvantages |
|---|---|---|
| The transferring entity defines the digital component extractors early in the records lifecycle, as the records are still in active use | The transferring entity defines semantic coherence early, which ensures that the information contained in the data files is accessible | Requires up-front planning and investment by the transferring entity, plus a change in how the transferring entity manages information |
| The transferring entity (with assistance from the archivist) defines the digital component extractors after-the-fact, as part of preparing to transfer the electronic records to ERA | The transferring entity (with assistance from the archivist) generally has the subject area domain knowledge and technical knowledge to properly define semantic coherence | Requires a large time and resource investment at the exact point (records management offices) at which transferring entities are overburdened |
| The ERA system itself imputes digital component extractors from record type templates and data type templates | The system can make reasonable assumptions about the digital component extractors in an automated manner | A human might make better assumptions than the automated ones, based on subjective judgment. Also, the system might not always be able to perform this imputation (for example, if key information is missing) |
| An archivist defines the digital component extractors after-the-fact, during archival processing | The archivist generally has the subject area domain knowledge and technical knowledge to properly define semantic coherence | Requires a large time and resource investment from the archivist, which may not scale to meet the electronic record archive's expected ingest volumes |
| The electronic record archive system itself imputes semantic coherence and therefore digital component extractors from the data file content | The system can apply linguistic and pattern matching algorithms to determine appropriate digital component extractors in an automated manner | This is an area of on-going computer science research, and at this time this requires further development. |

It would be efficient for transferring entities to establish intellectual control over the semantic coherence of their electronic records as they develop their information systems, but this will not always happen. It would also be efficient if transferring entities, with assistance from the archivist, at least defined their electronic records before the point of transfer, but again this will not always happen, because this is a burden on records officers. The system of the present invention imputes digital component extractors from templates as discussed below, and this generally will be acceptable. In the cases where none of these approaches work, the ERA must allow archivists to establish intellectual control over the electronic records at an item level through defining the digital component extractors.

Generally, ERA imputing the digital component extractors from the relevant templates will work quite well. Consider this example:

EXAMPLE 17

The record type template indicates a particular set of records is correspondence, and the data file template indicates the data file is in Microsoft Outlook (PST) format. A reasonable set of digital component extractors can be imputed that extract individual e-mails into separate digital components. Each digital component represents an individual e-mail, which exhibits strong semantic coherence.

In some rare cases, there may be no workable digital component extractors, because they are not defined by either the transferring entity or archivist, and the ERA system cannot impute reasonable alternatives. Consider this example:

EXAMPLE 18

The record type template indicates a particular set of records is geospatial information, and the data file template indicates that the data file is in an unknown proprietary format that is not human readable and not documented. ERA cannot impute a reasonable set of digital component extractors because it is not aware of the data type format.

In the case where there are no workable digital component extractors, the ERA of the present invention will create a default set of digital component extractors, known as “placeholder digital component extractors,” which are defined as a set of digital component extractors that assume each data file is a single digital component.

The levels of available preservation, access, and authenticity services that the ERA of the present invention can provide may be constrained for electronic records with placeholder digital component extractors, so these should be the exception rather than the norm. In other words, placeholder digital component extractors are only consistent with the most basic level of service in ERA.
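The selection order described above (an extractor defined by a person, then one imputed from templates, then a placeholder) can be sketched as follows. The registry `IMPUTED`, the function names, and the blank-line message separator are hypothetical; in particular, `split_messages` is only a stand-in and does not parse any real e-mail archive format.

```python
from typing import Callable, Dict, List, Optional, Tuple

DataFileSet = Dict[str, bytes]
Extractor = Callable[[DataFileSet], List[bytes]]


def placeholder_extractor(files: DataFileSet) -> List[bytes]:
    """Assume each data file is a single digital component."""
    return [files[name] for name in sorted(files)]


def split_messages(files: DataFileSet) -> List[bytes]:
    """Hypothetical stand-in for a real e-mail archive parser."""
    messages: List[bytes] = []
    for name in sorted(files):
        messages.extend(m for m in files[name].split(b"\n\n") if m)
    return messages


# Hypothetical registry keyed by (record type, data file format).
IMPUTED: Dict[Tuple[str, Optional[str]], Extractor] = {
    ("correspondence", "e-mail archive"): split_messages,
}


def select_extractor(
    record_type: str,
    file_format: Optional[str],
    defined: Optional[Extractor] = None,
) -> Tuple[Extractor, bool]:
    """Return (extractor, is_placeholder) using the fallback order above."""
    if defined is not None:
        return defined, False
    imputed = IMPUTED.get((record_type, file_format))
    if imputed is not None:
        return imputed, False
    return placeholder_extractor, True


# Example 17: templates are sufficient to impute an extractor.
extractor, is_placeholder = select_extractor("correspondence", "e-mail archive")
print(is_placeholder)  # False

# Example 18: unknown format, so the placeholder is used.
extractor, is_placeholder = select_extractor("geospatial information", None)
print(is_placeholder)  # True
```

Returning the `is_placeholder` flag alongside the extractor is one way a system could record that only the most basic level of service applies to the resulting record.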

All of the entities modeled by the present invention, such as electronic records, record aggregates, digital components, data files, etc., must be identifiable and resolvable. An approach to identifiers is more fully documented in co-pending, commonly assigned U.S. Application (Attorney Docket 4870-9), filed Apr. 26, 2007, entitled SYSTEM AND METHOD FOR AN IMMUTABLE IDENTIFICATION SCHEME IN A LARGE SCALE COMPUTER SYSTEM.

All identifiers within the ERA must exhibit the following characteristics:

- The identifier must resolve to the entity which it identifies
- The identifier must be guaranteed unique across the ERA identifier namespace
- The identifier for a particular entity must be immutable
- The identifier system must scale to ten teraobjects

An approach to generating identifiers according to the present invention involves using a cryptographic hash algorithm (such as SHA-256) based on the initial content of the thing being identified. This approach meets the required constraints.

It should be noted that some entities have an identity which is independent of their content. For example, the identity of a record is independent of the content of the digital components and/or data files that make up any particular version of that record. New versions of electronic records can arise from redaction and preservation activities, and each record version will have its own independent identifier that is related back to the record.

In these cases, the identifier will be generated from the content of the entity when it is first created within ERA and immutable thereafter. Thus, the identifier for electronic records would be generated and assigned when the record is created within ERA based on the content of the first version's digital components, and that identifier would be immutable thereafter.
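A minimal sketch of this identifier approach follows, using SHA-256 from Python's standard `hashlib` module. The `kind` prefix that keeps a record identifier distinct from the identifier of its first version, the length prefix on each component, and the class and field names are all assumptions made for illustration and are not taken from the described system.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def identifier_for(kind: str, components: List[bytes]) -> str:
    """Hash the entity kind and its initial content with SHA-256."""
    digest = hashlib.sha256(kind.encode("utf-8"))
    for component in components:
        # Length-prefix each component so component boundaries affect the hash.
        digest.update(len(component).to_bytes(8, "big"))
        digest.update(component)
    return digest.hexdigest()


@dataclass(frozen=True)
class Record:
    identifier: str  # assigned once at creation; frozen prevents reassignment
    version_identifiers: List[str] = field(default_factory=list)

    @classmethod
    def create(cls, first_version: List[bytes]) -> "Record":
        """Derive the record identifier from the first version's components."""
        return cls(
            identifier=identifier_for("record", first_version),
            version_identifiers=[identifier_for("version", first_version)],
        )

    def add_version(self, components: List[bytes]) -> str:
        """Register a new version; the record identifier is left unchanged."""
        version_id = identifier_for("version", components)
        self.version_identifiers.append(version_id)
        return version_id


record = Record.create([b"original text"])
record_id = record.identifier
version_id = record.add_version([b"redacted text"])

print(record.identifier == record_id)   # True
print(len(record.version_identifiers))  # 2
```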

An approach to preservation and authenticity issues is more fully documented in co-pending, commonly assigned U.S. application (Attorney Docket 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.

The notion of digital components and digital component extractors has some interesting implications for preservation. The InterPARES I Preservation Task Force states “It is impossible to preserve an electronic record. It is only possible to preserve the ability to reproduce an electronic record.” (“Preserving Electronic Records”, Presentation on the work of the InterPARES I Preservation Task Force, Jun. 19, 2002.) A record's digital components, along with access software, allow reproduction of the electronic record. As such, the preservation strategy of the present invention ensures the digital component extractors produce digital components that authentically represent the record. This means that digital component extractors must honor the essential characteristics associated with the record (and which are specified in the record type template).

The process of redaction involves deleting specific content from a record to produce a new version of the record, and the new version of the record typically has reduced access restrictions.

In the electronic record context, digital content is contained in both data files and digital components, so in theory redaction (deleting digital content) could occur in either place. In practice, most redaction tools redact content from data files, so the present invention will support this approach. This means that redaction will occur against data files, which will produce a new version of the data files, and the digital component extractors will produce new digital components from these redacted data files. This process will result in a new version of the record that is composed of redacted digital components that have been extracted from redacted data files.

Like records, original order and arrangement are conceptual and not physical. Thus, order and arrangement both apply to records, but not to data files. The order of data files is essentially arbitrary and meaningless from an archival context, since data files exhibit low semantic coherence.
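The redaction flow can be sketched in a few lines: redact the data files, then run the same extractor again to obtain the components of the new record version. The function names, the line-per-component extractor, the `[REDACTED]` marker, and the sample content are hypothetical, and replacing content with a marker rather than removing it outright is an assumption of this sketch.

```python
from typing import Dict, List

DataFileSet = Dict[str, bytes]


def redact_data_files(
    files: DataFileSet, restricted: bytes, marker: bytes = b"[REDACTED]"
) -> DataFileSet:
    """Return new versions of the data files; the originals are not modified."""
    return {name: data.replace(restricted, marker) for name, data in files.items()}


def extract_lines(files: DataFileSet) -> List[bytes]:
    """Hypothetical extractor: each non-empty line is one digital component."""
    components: List[bytes] = []
    for name in sorted(files):
        components.extend(line for line in files[name].splitlines() if line)
    return components


# Invented sample data file.
original_files = {"service.txt": b"Name: J. Doe\nSSN: 000-00-0000\nRank: Sergeant\n"}
redacted_files = redact_data_files(original_files, b"000-00-0000")

# Each record version is the extractor output for one version of the data files.
versions = [extract_lines(original_files), extract_lines(redacted_files)]
print(versions[1])
```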

It is possible that electronic records might have no meaningful original order, in the same way paper records might have no meaningful original order. In these cases, the present invention will follow the advice of Frank Boles in “Disrespecting Original Order” to maintain records in a state of simple usability. (Boles, F., “Disrespecting Original Order”, The American Archivist, Vol. 45 No. 1, pp. 26-32, 1982.) Simple usability for electronic records implies dynamic sorting, filtering, and querying capabilities.

It is possible that the digital component extractors of the present invention will be executed to produce a physical representation of a digital component. In this case, a digital component would be a bit stream serialized as a managed file within the system. It is also possible that the digital component extractors will be executed on-demand to produce a transient digital component, as needed. In this case, a digital component would be a transient in-memory bit stream. The present invention allows for both options, and the decision on which to use will be a matter of policy and design.

Templates play a large part in NARA's vision of the ERA both as a means to manage electronic records, in respect to scheduling, and as a means to preserve records, in respect to defining preservation formats and processing.

Because there are many potential applications of templates, and because templates are sometimes described by examples of documents that conform to the templates rather than the template itself, there is a need to define what templates are and how they are used.

As discussed in more detail below, the present invention utilizes a taxonomy of templates and the relationships between templates and instances of templates to identify and manage records. The present invention also utilizes the relationship between hierarchical templates and hierarchical information using a matrix. Furthermore, the present invention provides for managing templates.

It is helpful to begin with an example of templates and instances of templates, and to provide an illustrative listing of some kinds of templates that might be used within the ERA system of the present invention.

According to the present invention, the use of templates may be associated with all of the following:

- To describe the structure and content of record life cycle documents that the system will help create and manage. This includes templates for Transfer Agreements, Disposition Agreements, Preservation Plans, etc.
- To describe the presentation of documents.
- To define the relationship between assets within the archive (such as the original order of records) and within transfers of records to the archive.
- To describe the structure and content of archival metadata, the contextual information which, together with the digital objects it describes, forms the records. This includes archival description elements and life cycle data elements.
- To describe components and resources within the system itself. Instances of these templates include data type format templates, templates that describe digital adaptation processes, and resources such as Authorities Sources.
- To describe the operation of the ERA system itself. Instances of these templates define operations such as work flow processes that orchestrate the use of ERA system services.

It can therefore be seen that templates are being used according to the present invention to:

- Describe the content and structure of a document—what data elements it should contain and any relationships between those data elements
- Describe the content and structure of the metadata that describes a document.
- Describe how a document should be presented to a user, how its content would be laid out on a screen or a printed page, and, when appropriate, the choreography of the presentation of different digital objects
- Serve as a manifest to list all the documents contained within some collection of documents.
- Serve as a catalog of documents describing the relationships between them.
- Serve as components within the ERA system, providing processing instructions for operations that take place, such as the orchestration of work flows or digital adaptation processing.
- Describe components of the ERA system, such as specific data type formats.

Some of these uses of templates have been described with reference to instantiations of the templates and some have been described with reference to the templates themselves. It is necessary to distinguish between templates and instances of templates.

Using XML technologies as an example, the following describes templates, and instances of documents that conform to or are generated by those templates, that might be used in the preservation and presentation of a document displayed on a web page.

The first template is an XML schema that defines the structure of the record catalog which lists the digital objects that are part of the web page and their hierarchical relationships. An instance of that template is a selection from the record catalog for the page in question.

Referring to FIG. 6, the next template might be an XML schema that defines the content and structure of the document that is to be displayed on the page. Each data element in the document is defined. The relationship(s) of each data element to other data elements are also defined.

Referring to FIG. 7, an instance of the template of FIG. 6 is an XML document (the textual content of the document) that conforms to that schema and which includes the data elements and content of the type defined in the schema. The instance has data elements described in the schema that hold values, which is also consistent with the schema.
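The distinction between a template and an instance of that template can be sketched with Python's standard `xml.etree.ElementTree` module. This is only a simplified stand-in: the element names, the `TEMPLATE` dictionary, and the sample document are hypothetical and are not taken from FIG. 6 or FIG. 7, and the check below is not XML Schema validation, which would require a schema-aware validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: root element name and required children.
TEMPLATE = {"root": "record", "required": ["title", "creator", "date"]}

# Hypothetical instance: the same elements, now holding values.
INSTANCE = """<record>
  <title>Minutes of parish meeting</title>
  <creator>Parish secretary</creator>
  <date>1901-05-01</date>
</record>"""


def conforms(xml_text: str, template: dict) -> bool:
    """Check that the root element and every required child are present."""
    root = ET.fromstring(xml_text)
    if root.tag != template["root"]:
        return False
    return all(root.find(name) is not None for name in template["required"])


print(conforms(INSTANCE, TEMPLATE))                          # True
print(conforms("<record><title>x</title></record>", TEMPLATE))  # False
```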

Referring to FIG. 8, the next template might be an XSL template that defines the presentation of that XML instance in HTML on the web page (or in some other format such as PDF). The XSL template may be a style sheet, or other type of template, and can be used to describe how an XML instance that conforms to an XML schema will be presented or displayed, for example as HTML or a PDF file. The template can also be used to transform an XML document into a variety of other formats, as well as into a different XML document.

Other types of templates may orchestrate a sequence of pages. The instantiation of that template is the web page, which is the record that is being preserved.

Additional templates may be involved in defining the behavior of a web application, including templates that define the work flow within the application, templates that define the orchestration of pages within the application and templates that describe the animation of items on a page.

Table 3 provides an overview of some of the types of templates that may occur in the ERA of the present invention. Although each example has been mapped to an appropriate XML syntax that might be used to create the template, it should be appreciated that the present invention is not limited to the use of any particular format. It should also be appreciated that the list of templates in Table 3 is not intended to be exhaustive. There are many possible applications for templates and there are other XML technologies, and non-XML technologies, which may be used.

TABLE 3

Application of Template	Indicative XML Syntax	Examples

1. Record Structure Templates
Structure of Records; Record Catalog entries	XML Schema, METS	Record Catalog; Submission Information Package

2. Lifecycle Documents
Structure and content of Life Cycle documents	XML Schema	Transfer Agreement; Disposition Agreement; Preservation Plan
Layout of documents on screen or paper	XSL, XSL-FO	Presentation of documents

3. Archival Metadata (information specific to a record or a part of a record)
Structure and content of Archival Description	XML Schema	Origin, Provenance, Content, Context, etc.
Structure and content of Life cycle Data	XML Schema	Additions to life cycle data

4. System Components (an information component of the system, or description of a component of the system)
Structure of Authority Sources and Thesauri	XML Schema	Authority Sources
Structure and content of Persistent Object Formats (POF) *(1)	XML Schema	Persistent Formats where content is primarily words, numbers, vectors etc.
	BSDL	Persistent Formats where content is primarily images, sound, etc.
Digital Adaptation Instructions, non-exhaustive list *(2)	XSL/T	Data type specific processing templates to transform from one data type to another
Presentation of multimedia records	SMIL	Templates to define interactions between multiple digital items in multimedia presentations

5. System Metadata
Description and versioning of templates	XML Schema	Disposition Agreement template

6. Identity & Rights
Structure and content of User Profiles	XML Schema	User profiles
Authorization Requests/Responses	SAML	Authorization of users
Access Restrictions & Rights	XACML	Definition of access privileges for specific records

7. Service Architecture
Work flow Processes	BPEL	Orchestration of services involved in business processes, such as managing a FOIA request
Services	WSDL	Inputs and outputs of individual services

Templates may be used to define the relationships between records in the archives, such as defining the original order of records, the structure of the record catalog, and the structure of transfers to the archives or the delivery of copies to users (Submission Information Packages and Dissemination Information Packages).

Capturing the original order of a record represents a case where a template can be used within a template. The structure of the Record Catalog can be described in a template that defines the information elements that make up an entry in the catalog. The content of some of those information elements may be other templates, or they may become values in the instantiation of an object that conforms to another template.

Templates may be used to define the content and structure of records schedules and other Life Cycle Documents.

Templates may be used to define the structure of record description, and the elements of information that compose the metadata of records.

A template for Archival Metadata, which includes description and Life cycle data, will define which elements of information must be present, what type of information they should contain, and how they are related to each other.

Templates may be used as inputs to processes that transform digital objects in the archive, including templates that may be used to define the presentation of assets to users.

The System component templates cover the widest variety of use of templates. This includes defining persistent object formats, defining the information needed by a processor to render those formats in a current format, defining the choreography and behaviors of objects in aggregate multimedia records, etc.

The System Components will be constantly evolving, adding new templates as new digital technologies evolve. Each type of system component will have its own family of templates.

Templates may be used to define the structure of component description. The ERA system will archive itself and be self-describing. Templates will define elements of information needed for components to be self describing.

Templates may also be used to define the nature and rights of entities and the access restrictions on assets in the archive.

A records-centric access model will define restrictions and rights in relation to records using the internal structure of the records themselves. Templates will define the instructions on records and create the framework for aligning identity—role—authorization to protect the records.

Templates may further be used to describe system services and orchestrate services within work flow processes.

The Service Architecture describes the arrangement and delivery of services in the ERA system of the present invention, including the work flow processes and the functionality at each step in the process. Templates, expressed for example in Business Process Execution Language (BPEL), may be used to describe the orchestration of functional services and, at a lower level, to describe the inputs and outputs of each individual functional service, using for example Web Services Description Language (WSDL).

A hierarchical scheme according to the present invention may be implemented for managing templates. The introduction of hierarchy to the management of templates adds another level of abstraction. A template abstracts from a specific instance to the general case. Such a template is associated to a single type of object. With hierarchy, another layer of abstraction may be added that can be applied to any of: 1) the template, 2) the content which it controls, or 3) both.

As an object subject to a hierarchical arrangement, the template becomes a mirror of the organization of objects into increasingly larger aggregate structures, which is a method of organization common to the ERA system of the present invention as a whole.

Templates can have a hierarchical connotation either because: (a) the template itself can only be instantiated with reference to a hierarchy of templates which collectively define its content, or (b) the object the template describes can only be instantiated with reference to a hierarchy of digital items or conceptual arrangements of digital items.

In the first case (a), instantiating the template requires retrieving elements from within different templates within a hierarchy. For example, Life Cycle Data document templates (Transfer Agreements, Disposition Agreements, etc) will have their own specific information elements but will also likely share a set of information elements common to all Life Cycle Data documents.

The template hierarchy might look like:

ERA.xsd (elements common to the ERA, such as identifiers)

- Life_Cycle_Documents.xsd (elements common to all Life Cycle documents)
  - Transfer_Agreement.xsd (e.g. SF-258 specific elements)
  - Disposition_Agreement.xsd (e.g. SF-115 specific elements)
  - Preservation_Plan.xsd (elements specific to this template).

In XML Schema, this may be implemented by having each template in each child level of the template hierarchy begin with an <include/> instruction that incorporates in the child template all the data elements described in its parent, which in turn will <include/> all the data elements in its parent, etc.
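
The way a child template accumulates the elements of its ancestors can be sketched as below. This is an illustration under stated assumptions rather than an XML Schema implementation: a dictionary stands in for the schema files, the element names (identifier, approval_date, transferring_agency) are hypothetical, and the resolve_elements helper is invented for the example.

```python
# Hypothetical model of the hierarchy above: each template names its parent
# and the elements it adds itself. Element names are illustrative only.
TEMPLATES = {
    "ERA.xsd": {"parent": None, "elements": ["identifier"]},
    "Life_Cycle_Documents.xsd": {
        "parent": "ERA.xsd",
        "elements": ["approval_date"],
    },
    "Transfer_Agreement.xsd": {
        "parent": "Life_Cycle_Documents.xsd",
        "elements": ["transferring_agency"],
    },
}

def resolve_elements(name, templates):
    """Collect the elements of the named template and all of its ancestors."""
    chain = []
    while name is not None:
        chain.append(name)
        name = templates[name]["parent"]
    elements = []
    for template_name in reversed(chain):  # parents first, as with nested includes
        elements.extend(templates[template_name]["elements"])
    return elements

print(resolve_elements("Transfer_Agreement.xsd", TEMPLATES))
```

In this sketch a document cannot be checked against Transfer_Agreement.xsd alone, because part of its required content is defined only in the ancestor templates.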

In the second case (b), to instantiate a document that conforms to a template requires retrieving elements of information from hierarchically organized assets within the archive.

For example, the template for archival metadata may include elements of information, some of which are associated to a record catalog item that represents the concept of the entire record (the parent or root element of the record), while other elements of information are associated to individual digital items that are components of the record.

To create a document that represents the archival metadata for a specific digital item, and which conforms to the archival metadata template, requires retrieving all the information elements from each level in the record's internal hierarchy from that digital item up to the record's “root”.

For example, suppose that the family of a noted physicist donates her personal papers to NARA. The record hierarchy might look like:

Curie Collection

- Family Papers
- Professional Papers
  - Research Activities
    - Reagents

Metadata that describes the <Origin> of the record will likely be associated with the highest level in the record hierarchy, the “//Curie Collection” level, as the description of <Origin> applies to all the documents in that collection.

Metadata that describes the <Digital Object Type> of a specific document will be associated with a specific document, such as “//Curie Collection/Professional Papers/Research Activities/Reagents”.

To create an instance of the metadata for the “//Reagents” document requires the accretion of the metadata for itself and all its ancestors as we traverse the record hierarchy up to the collection level.
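
The accretion of metadata along the path from a digital item to the root of its record can be sketched as follows. The metadata values are placeholders, the parent_of and accrete_metadata helpers are hypothetical, and the rule that a value held nearer to the item takes precedence over one held higher up is an assumption of this sketch, not something the description specifies.

```python
# Hypothetical metadata held at individual levels of the record hierarchy.
# The values are placeholders for illustration.
METADATA = {
    "Curie Collection": {"Origin": "donated by the family"},
    "Curie Collection/Professional Papers/Research Activities/Reagents": {
        "Digital Object Type": "text document",
    },
}

def parent_of(path):
    """Return the parent path, or None when the path is the root."""
    head, sep, _ = path.rpartition("/")
    return head if sep else None

def accrete_metadata(path, metadata):
    """Merge metadata from the item up to the root of the record hierarchy."""
    merged = {}
    while path is not None:
        for key, value in metadata.get(path, {}).items():
            merged.setdefault(key, value)  # assumption: nearer levels win
        path = parent_of(path)
    return merged

REAGENTS = "Curie Collection/Professional Papers/Research Activities/Reagents"
print(accrete_metadata(REAGENTS, METADATA))
```

The resulting document for the Reagents item carries both its own Digital Object Type and the Origin inherited from the collection level.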

The possible intersections of templates and hierarchies can be presented in a matrix as shown in Table 4. Along one axis are the templates; either derived from a hierarchy or self-contained. Along the other axis are the conforming content, again either derived from a hierarchy or self-contained.

The matrix below illustrates where some types of templates may fall in the matrix.

TABLE 4

Template Axis / Content Axis	Content Self-Contained: An object that conforms to the template is a self-contained object in its own right and conformance can be tested without reference to the hierarchy to which it belongs.	Content Hierarchal: The creation of an object that conforms to the template is achieved by retrieving all references to it from each layer in the hierarchy. The conforming object accretes its content as it traverses the hierarchal tree and is only conforming at the end of the accretion process.
Template is Hierarchical: The template is an aggregation of template elements from a hierarchy of templates. Document conformance cannot be tested without including elements from the hierarchy.	Life Cycle Document templates, where template is Life Cycle Document + generic Life Cycle Elements	Archival metadata, the schema for metadata may be instantiated by aggregating schemas within a hierarchy of metadata schemas, and the conforming metadata document may be created from the aggregation of all metadata elements traversing a record hierarchy.
Template is Self-Contained: The template is a self-contained object. Document conformance can be tested without reference to any other template.	System metadata, such as persistent format definitions. Service Architecture templates; both the hierarchy of BPEL managing WSDL, and within WSDL the aggregation of generic WSDL and the web service specific elements described in XML Schema	n/a

In a self-describing system, each template is both a functional component of the system and a record in the system. As a record in the system, the template is treated the same as any other record, with its own metadata, life cycle management, and preservation. The ERA system of the present invention may be regarded, therefore, as an aggregate record, with its own hierarchy of documents, so that part of the ERA record hierarchy might look like:

ERA

System

Templates

System

Workflow

DispositionWorkflow.bpel (instance of BPEL template)

AddDescriptionService.wsdl (instance of WSDL template)

Each instance of a system component, including templates, has its own archival metadata (metadata that describes a record). This latter metadata makes the component self describing.

For example, a WSDL file is an instance of the template for defining a service and a BPEL file is an instance of the template that defines a work flow.

The archival metadata of the WSDL file will include information such as:

- What does it do?
- What work flow does it belong to?
- What version is this, is it the current version?
- How does it work—inputs, outputs?
- Where did the code originate?
- Are there any intellectual rights associated to this web service?
- What is the actual code?

This sort of information could be included in the WSDL file as comments (or <Documentation/> elements) but would not be very manageable as a result. The system would not be able to apply its record management functionality, which is based on archival metadata held exterior to the digital object the metadata describes, to its own templates.

To make description of the system components manageable, they should be described using the same archival metadata templates as for any record.

While there will be a defined template for a service in the ERA (such as the XML Schema for WSDL), the present invention may use another template, the Archival Metadata schema, as the template to describe the service as a component of the system.

As templates evolve, the life cycle data elements in their description capture that evolution, such as the version. When a change to a template changes the behavior of the system, the earlier version of the template is preserved as a record so that the previous behavior of the system can be understood.

Templates will evolve as ERA evolves. As such, templates, as records in ERA, will be versioned and managed. Life cycle data elements or records will include the version of the templates they use. Versioning will allow new templates to be introduced without creating problems with validation. Whether life cycle content that is subject to validation against templates should be updated as templates evolve will be a policy decision applied to each template.

Each process to update a template may be a standard work flow in the ERA, and described in its own template, which will include appropriate approval and authorization steps as determined in policy.

Templates, as records, will have their own fixity information to ensure their integrity, and the life cycle data of objects modified by templates will record which version of which template was used.
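
Versioning and fixity for templates can be sketched as below. The description does not name a fixity algorithm, so the use of SHA-256 is an assumption of this sketch, and the TemplateRegistry class, its methods, the record name and the template contents are hypothetical.

```python
import hashlib

def fixity(content: bytes) -> str:
    """Compute a fixity value for template content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()

class TemplateRegistry:
    """Hypothetical registry holding one fixity value per template version."""

    def __init__(self):
        self._digests = {}  # (name, version) -> fixity value

    def add(self, name, version, content):
        key = (name, version)
        if key in self._digests:
            # Earlier versions are preserved as records, never overwritten.
            raise ValueError(f"{name} version {version} is already registered")
        self._digests[key] = fixity(content)
        return key

    def verify(self, name, version, content):
        """Check supplied content against the fixity recorded for that version."""
        return self._digests[(name, version)] == fixity(content)

registry = TemplateRegistry()
registry.add("Transfer_Agreement.xsd", "1.0", b"<schema>v1</schema>")
registry.add("Transfer_Agreement.xsd", "1.1", b"<schema>v1.1</schema>")

# Life cycle data records which version of which template was used.
life_cycle_entry = {"record": "example-record", "template": ("Transfer_Agreement.xsd", "1.0")}
print(registry.verify(*life_cycle_entry["template"], b"<schema>v1</schema>"))
```

Because the life cycle entry names a specific version, content validated under version 1.0 can still be checked against that version after version 1.1 is introduced.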

The concept of managing templates can be extended to apply to every component of the system. Each software component of the ERA system should be described and held in the ERA. This applies to platform applications, web application components, any client side components, as well as all the functionality wrapped in web services which can be managed within the concept of managing templates as described above.

The concept of preserving original arrangement can also be extended to the system so as to describe in Archival Metadata how all the components are structurally linked—creating in essence a schema for the ERA itself.

While the invention has been described in connection with what are presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the invention. Also, the various embodiments described above may be implemented in conjunction with other embodiments, e.g., aspects of one embodiment may be combined with aspects of another embodiment to realize yet other embodiments.

Provisional Applications (2)

	Number	Date	Country
	60797754	May 2006	US
	60802875	May 2006	US
