This invention relates to library services and methods and apparatus for maintaining a database of content location and reuse rights for that content. Works, or “content”, created by an author are generally subject to legal restrictions on reuse. For example, most content is protected by copyright. In order to conform to copyright law, content users often obtain content reuse licenses. A content reuse license is actually a “bundle” of rights, including rights to present the content in different formats, rights to reproduce the content in different formats, rights to produce derivative works, etc. Thus, depending on a particular reuse, a specific license to that reuse may have to be obtained.
Many organizations use content for a variety of purposes, including research and knowledge work. These organizations obtain that content through many channels, including purchasing content directly from publishers and purchasing content via subscriptions from subscription resellers. In these latter cases, reuse licenses are provided by the publishers or resellers. However, in many other cases, users must search to discover the location of content. In order to ensure that their use is properly licensed, these organizations often engage the services of a license clearinghouse in order to locate the content and obtain any required reuse license.
The license clearinghouse, in turn, maintains a database of metadata that references the content and, in some cases, maintains copies of the content itself. The metadata indicates where the content can be obtained and the license rights that are available. With this database, a user can search for metadata that references the desired content, select a location for obtaining the content and pay a license fee to the license clearinghouse to obtain the appropriate reuse license. The user then obtains the content from the selected location and the license clearinghouse distributes the collected license fee to the proper parties.
In order to keep the metadata database current, license clearinghouses constantly receive new metadata and content material from several different sources, such as the Library of Congress, the Online Computer Library Center (OCLC), the British Library or various content publishers. Often, metadata that references the same content is obtained from several different sources.
In addition, even though some metadata is equivalent in the sense that it references the same content, certain metadata may be preferred. For example, metadata that references content which is available from the license clearinghouse and for which licenses are also available from the license clearinghouse is preferred over metadata that references content where the license must be obtained from a third party. Some sources, such as the Library of Congress, the British Library or OCLC, are considered authoritative and thus metadata that references content in these sources is preferred over metadata that references content that can be obtained from other sources, such as publishers.
It is desirable to provide the most preferred metadata to a user who is searching the database. Thus, the database metadata entries must be compared with each other to determine which entries will be returned as the results of a search. While a method using straightforward comparison can be successful with relatively small databases, it quickly becomes prohibitively time-consuming with large scale databases. For example, if the metadata representing every work is compared to the metadata representing every other work in the database, for a database with n works, the number of pairwise comparisons is n*(n−1)/2. Therefore, for a database containing 25 million works, approximately 312.5 trillion comparisons are required to determine the preferred database entries. Similarly, for a database with 75 million works, approximately 2.8125 quadrillion comparisons are required.
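By way of illustration only, the comparison counts quoted above can be checked with a few lines of Python. The function name pairwise_comparisons is a hypothetical name chosen for this sketch and is not part of the described system.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n entries: n*(n-1)/2."""
    return n * (n - 1) // 2


# Database sizes taken from the text above.
for works in (25_000_000, 75_000_000):
    print(f"{works:,} works -> {pairwise_comparisons(works):,} comparisons")
```

The exact counts fall slightly below the rounded figures of 312.5 trillion and 2.8125 quadrillion because n*(n−1)/2 is marginally smaller than n*n/2.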
Consequently, some mechanism is required that manages different versions of a work so that a most preferred version is presented to a user and new material can be entered within a reasonable time.
In accordance with the principles of the invention, the database entries are clustered with a clustering algorithm and then comparisons between entries are made only between the entries in each cluster. Once a set of entries has been determined to be equivalent, the entry with the most preferred metadata is marked as preferred so that it is indexed and displayed as the result of a search. When an entry must be edited or when license rights must be assigned to an entry, a composite master entry is constructed from those entries in the set that contain preferred metadata and stored in the database. The composite master entry is then marked as the preferred data entry so that it is subsequently made available for searches and display to the user.
In one embodiment, data entries that are determined to be equivalent are assigned the same publication ID and stored in the database. Later, when a master entry is required, all entries with publication IDs that are the same as that entry are retrieved and the master entry is constructed from the retrieved entries.
In another embodiment, equivalent entries are ranked by a quality level that is based on the publication source. Fields in the master entry are filled with corresponding data available in the data entry with the highest quality level. For fields that remain unfilled because no corresponding data is available in the data entry with the highest quality level, if corresponding data is available in the data entry with the next highest quality level, that data may be used to fill these fields.
In still another embodiment, the master field filling process is continued until a predetermined required number of fields are filled.
In yet another embodiment, the master field filling process is continued until as many fields are filled as possible.
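The field filling embodiments above might be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: each entry is assumed to be a dictionary with a numeric "quality" level and a "data" mapping of field names to values, and build_master and its required parameter are hypothetical names. When required is given, filling stops once at least that many fields are filled; otherwise filling continues until every field is filled or the entries are exhausted.

```python
from typing import Optional


def build_master(entries, fields, required: Optional[int] = None):
    """Fill master fields from equivalent entries, highest quality first."""
    master = {}
    # Visit the equivalent entries in descending order of quality level.
    for entry in sorted(entries, key=lambda e: e["quality"], reverse=True):
        for field in fields:
            value = entry["data"].get(field)
            # Only fill fields that are still empty in the master entry.
            if field not in master and value not in (None, ""):
                master[field] = value
        # Embodiment: stop once a predetermined number of fields is filled.
        if required is not None and len(master) >= required:
            break
        # Embodiment: otherwise continue until as many fields as possible are filled.
        if len(master) == len(fields):
            break
    return master


equivalents = [
    {"quality": 500, "data": {"title": "moby dick", "author": "Melville", "pages": "635"}},
    {"quality": 900, "data": {"title": "Moby Dick", "author": None}},
]
print(build_master(equivalents, ["title", "author", "pages"]))
```

In this usage, the title comes from the entry with quality level 900, and the author and page fields fall through to the entry with quality level 500.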
Bibliographic data entries can come from many sources and each source has its own data format. In many cases, the same data comes from multiple sources. In the inventive system, all data that is loaded is stored in association with its source. Each source and the data entries associated with that source have assigned to them a “quality” level chosen from a predetermined hierarchy. As mentioned above, the highest quality is assigned to sources/data entries that reference content which is available from the license clearinghouse and for which licenses are also available. The next hierarchy levels are assigned to sources which are considered authoritative, such as the Library of Congress and the British Library. The lowest levels in the hierarchy are assigned to other sources, such as publishers.
The validation process, which is discussed in more detail below, entails processing each entry into a standard form and checking for duplicate entries. The validated entries are then posted to the document repository in step 110 and the process finishes in step 112. Then, as described below, for each unique record, either the highest quality version or a composite entry created from information in equivalent entries is produced as the results of a search or in an index.
Next, in step 204, the MARC XML data is transformed to an XML format 308 that is used in the staging database 314. As indicated in
In addition, data records occasionally represent more than one version or “manifestation” of a single work. In the inventive system, metadata representing each manifestation is stored because a manifestation is the level at which copyright is assigned. Consequently, in step 404, when data containing more than one ID number of the same type in a single record is received from a source, it represents more than one manifestation, so that record is split into multiple manifestations. This is illustrated in
The data in each data record is now further processed. In
In parsing and validation step 408, other, more complex, data values that sources represent in various ways are normalized. A simple example is a publication date. Dates can be represented in a wide variety of ways, so the publication date is extracted by parsing the entry, and converted into a single format. This parsing is performed by the parser 522 and the exact form of the parsing depends on the source and the format of the data entry. In general, all date fields are subjected to this kind of processing, including author birth date and author death date. Similarly, the technique for representing the page count of a work also varies widely among sources, and even within each source, so the page numbers must be parsed out of the data entry and normalized into a standard format by parser 522. These converted values are also stored.
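As a hedged illustration of the normalization described above, the following sketch converts a few date spellings into one format and extracts a page count. The list of accepted formats, the regular expression and the function names are assumptions made for this example; the actual parser 522 depends on the source and on the format of the data entry.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical (input format, output format) pairs; a real source may need others.
DATE_FORMATS = (
    ("%Y-%m-%d", "%Y-%m-%d"),
    ("%d %B %Y", "%Y-%m-%d"),
    ("%B %d, %Y", "%Y-%m-%d"),
    ("%m/%d/%Y", "%Y-%m-%d"),
    ("%Y", "%Y"),  # year only: keep year precision rather than inventing a day
)


def normalize_date(raw: str) -> Optional[str]:
    """Return the date in a single normalized format, or None if unparseable."""
    text = raw.strip()
    for in_fmt, out_fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, in_fmt).strftime(out_fmt)
        except ValueError:
            continue  # try the next known representation
    return None


def normalize_page_count(raw: str) -> Optional[int]:
    """Extract the first number followed by a page marker such as 'p.' or 'pages'."""
    match = re.search(r"(\d+)\s*(?:pages?|pp?\.?)", raw, re.IGNORECASE)
    return int(match.group(1)) if match else None


print(normalize_date("March 3, 1999"), normalize_page_count("xii, 345 p."))
```

Returning None for unrecognized input leaves the decision about rejecting or flagging the entry to a later validation step.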
Validation involves examining the data to ensure that it is readable and falls within certain limits. For example, certain characters, such as control characters, that might cause readability problems are removed from the data fields. Checks are also made to determine that the data will fit into its assigned location in the repository, that the data type is correct, and that the data value is not too large. Some data fields (for example, date fields) are range checked to make sure they are within a reasonable range. Certain data tables in the repository database require entries in selected rows (for example, titles). The existence of the required data in the staging database is checked in step 410. Finally, in step 412, duplicate data is eliminated from each data entry. This processing is performed by the validator 524.
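The validation checks described above could be sketched as shown below. The field names, the length limits, the year range and the function names are all hypothetical values chosen for illustration; the limits used by validator 524 are not given in this description.

```python
import unicodedata

# Hypothetical limits; the real repository schema is not specified here.
MAX_LENGTH = {"title": 500, "publisher": 200}
REQUIRED_FIELDS = ("title",)
YEAR_RANGE = (1000, 2100)


def strip_control_chars(value):
    """Drop Unicode control characters (category Cc), including tabs and newlines."""
    return "".join(ch for ch in value if unicodedata.category(ch) != "Cc")


def validate(entry):
    """Return a cleaned copy of the entry and a list of validation errors."""
    cleaned, errors = {}, []
    for field, value in entry.items():
        if isinstance(value, str):
            value = strip_control_chars(value).strip()
        elif isinstance(value, list):
            value = list(dict.fromkeys(value))  # remove duplicates, keep order
        cleaned[field] = value
    for field in REQUIRED_FIELDS:
        if not cleaned.get(field):
            errors.append(f"missing required field: {field}")
    for field, limit in MAX_LENGTH.items():
        if isinstance(cleaned.get(field), str) and len(cleaned[field]) > limit:
            errors.append(f"{field} exceeds {limit} characters")
    year = cleaned.get("publication_year")
    if year is not None:
        if not isinstance(year, int):
            errors.append("publication_year has the wrong data type")
        elif not YEAR_RANGE[0] <= year <= YEAR_RANGE[1]:
            errors.append("publication_year is out of range")
    return cleaned, errors


print(validate({"title": "Moby\x00 Dick\x07", "publication_year": 1851,
                "subjects": ["Whaling", "Whaling", "Sea"]}))
```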
The data records in the staging database each have a fixed format with predetermined fields which accept data. Some or all of the fields may contain data as a result of the processing described above in connection with
In step 414, a matching routine is run by the matcher 526 to determine whether the new data entry is “equivalent” to one or more data entries already stored in the repository. This routine is executed each time a new data entry is loaded into the staging database. However, it may also be executed when existing data entries are edited. In this manner equivalence is always determined. When a new data entry is received from a source, a decision must first be made whether to add the new entry or to update an existing entry already in the repository database. Where possible, a key value assigned by the source is used to make this determination. If the key value of the received data entry differs from the key values of data entries already stored in the repository database, then the received entry is assumed to be a new entry; otherwise, an existing entry is updated. Where it is not possible to use the key value, the equivalence routine is run on the data entries associated with the source in the repository database to determine whether the received entry is new or equivalent to an existing entry.
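The add-or-update decision described above may be illustrated with the following sketch. It assumes, purely for the example, that the repository is an in-memory dictionary keyed by (source, key) pairs and that the equivalence routine is supplied as a callable; load_entry and is_equivalent are hypothetical names.

```python
import uuid


def load_entry(repository, source, entry, key=None, is_equivalent=None):
    """Add the entry or update an existing one; return 'add' or 'update'."""
    if key is not None:
        # A source-assigned key decides directly between add and update.
        action = "update" if (source, key) in repository else "add"
        repository[(source, key)] = entry
        return action
    # No usable key: run the equivalence routine against entries from this source.
    if is_equivalent is not None:
        for (src, existing_key), existing in repository.items():
            if src == source and is_equivalent(entry, existing):
                repository[(src, existing_key)] = entry
                return "update"
    # Nothing equivalent was found, so store under a generated (hypothetical) key.
    repository[(source, uuid.uuid4().hex)] = entry
    return "add"


repo = {}
same_title = lambda a, b: a["title"] == b["title"]
print(load_entry(repo, "LOC", {"title": "Emma"}, key="42"))
print(load_entry(repo, "LOC", {"title": "Emma"}, is_equivalent=same_title))
```

In the usage shown, the first call adds a new entry and the second, which has no key, is resolved to an update through the equivalence callable.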
As mentioned above, due to the large number of data entries in the repository database, it is not possible to compare the data in the fields of each new data entry to corresponding data in the fields of each existing data entry in order to make a determination of equivalency. Instead, in accordance with the principles of the invention, a clustering method is used to make the equivalency determination. One illustrative embodiment is shown in
As shown in
The data values in the primary data field are then extracted, as indicated by arrows 716 and 718, and applied to comparator 720 which compares the values as indicated in step 606. If the data values match as determined in step 614, the process proceeds to step 616 where a score calculator 722 calculates a total score for the pair of entries. The total score is calculated by examining, in both entries, each data field to which a match score has been assigned. When the data field values match, the assigned match score is added to the total score. If the values do not match, nothing is added to the total score. After the total score has been calculated, it is provided to a comparator 724 as indicated by arrow 726.
The comparator compares the total score to various predetermined thresholds 728. When the total score exceeds a predetermined equivalence threshold value (for example, 875), the pair of data entries is deemed equivalent. Similarly, if the total score exceeds a predetermined near-equivalence threshold value (for example, 675), the pair of entries is deemed to be near-equivalent.
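The scoring and threshold comparison described above can be sketched as follows. The threshold values 875 and 675 are the examples given in the text; the per-field match scores and the field names are hypothetical, since the assigned scores are not listed in this description. The sketch also assumes that a field missing from an entry never counts as a match.

```python
# Hypothetical match scores per data field (not taken from the description).
MATCH_SCORES = {"isbn": 500, "title": 300, "author": 150,
                "publication_year": 50, "publisher": 50}
EQUIVALENCE_THRESHOLD = 875        # example value from the text
NEAR_EQUIVALENCE_THRESHOLD = 675   # example value from the text


def total_score(a, b):
    """Add the assigned match score for every scored field whose values match."""
    return sum(
        score
        for field, score in MATCH_SCORES.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )


def classify(a, b):
    """Compare the total score against the two thresholds."""
    score = total_score(a, b)
    if score > EQUIVALENCE_THRESHOLD:
        return "equivalent"
    if score > NEAR_EQUIVALENCE_THRESHOLD:
        return "near-equivalent"
    return "distinct"


first = {"isbn": "X1", "title": "moby dick", "author": "melville", "publication_year": 1851}
second = {"isbn": "X1", "title": "moby dick", "author": "melville", "publication_year": 1992}
print(total_score(first, second), classify(first, second))
```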
Equivalent entries are marked by assigning to them the same publication ID, as set forth in step 620 and as indicated schematically by arrows 730 and 732 in
The exemplary clustering method is effective for bibliographic data entries. One skilled in the art would understand that other conventional clustering algorithms, such as dimensional reduction, can also be used. If information other than bibliographic information is included in the entries, then algorithms, such as latent semantic indexing, can be used as would be known to those skilled in the art.
After the entries have been marked or, alternatively, if no match is determined in step 614 or the total score is determined to be less than the near-equivalence threshold in step 618, the process proceeds to step 612 where a determination is made whether additional entries remain to be processed. If no entries remain to be processed, then the process finishes in step 610.
Alternatively, if in step 612, it is determined that additional entries remain to be processed, then the process proceeds to step 608 where the next entry is selected for processing and the process proceeds back to step 606. In this manner, all pairs of entries in the sorted list are compared for equivalence.
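One way to picture the clustering loop described above is the sketch below, which sorts the entries on a primary data field and compares only entries whose primary values match. It is an illustration under assumptions rather than the claimed method: the function name, the dictionary representation of entries, the issuing of a fresh publication ID to unmatched entries and the returned comparison count are all choices made for this example.

```python
from itertools import groupby


def assign_publication_ids(entries, primary_field, is_equivalent):
    """Assign shared publication IDs within clusters; return comparisons made."""
    primary = lambda e: str(e.get(primary_field, ""))
    next_id, comparisons = 1, 0
    # Sorting brings entries with the same primary field value together.
    for _, group in groupby(sorted(entries, key=primary), key=primary):
        seen = []
        for entry in group:
            for earlier in seen:
                comparisons += 1
                if is_equivalent(entry, earlier):
                    # Equivalent entries share the same publication ID.
                    entry["publication_id"] = earlier["publication_id"]
                    break
            else:
                # No equivalent in this cluster: issue a new ID (assumption).
                entry["publication_id"] = next_id
                next_id += 1
            seen.append(entry)
    return comparisons


works = [
    {"title": "moby dick", "isbn": "1"},
    {"title": "moby dick", "isbn": "1"},
    {"title": "emma", "isbn": "2"},
    {"title": "moby dick", "isbn": "9"},
]
made = assign_publication_ids(works, "title", lambda a, b: a["isbn"] == b["isbn"])
print(made, [w["publication_id"] for w in works])
```

For the four sample entries, comparing every entry with every other entry would take six comparisons, whereas restricting comparisons to entries with matching primary values takes fewer.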
When data entries are indexed, such as in connection with a search function, equivalents to a data entry are examined and the entry with the highest quality is selected. If two entries are equivalent and have the same quality level assigned, then both entries are indexed together. Highest quality entries are marked as preferred so that they will be displayed in search results. If a data entry with a higher quality level is later loaded into the repository database, that entry is then marked as preferred.
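The selection of preferred entries described above might be sketched as follows, assuming each entry is a dictionary carrying a "publication_id" shared by its equivalents and a numeric "quality" level. The function name mark_preferred and the sample quality values are hypothetical. Entries that tie at the highest quality level are all marked, which mirrors the statement that such entries are indexed together.

```python
from collections import defaultdict


def mark_preferred(entries):
    """Mark the highest quality entry (or entries) among each set of equivalents."""
    by_publication = defaultdict(list)
    for entry in entries:
        by_publication[entry["publication_id"]].append(entry)
    for group in by_publication.values():
        best = max(e["quality"] for e in group)
        for e in group:
            # Ties at the highest quality level are all marked as preferred.
            e["preferred"] = e["quality"] == best
    return [e for e in entries if e["preferred"]]


records = [
    {"work": 10, "publication_id": 1, "quality": 900},
    {"work": 17, "publication_id": 1, "quality": 500},
    {"work": 12, "publication_id": 2, "quality": 500},
]
print([r["work"] for r in mark_preferred(records)])
```

Running the function again after a higher quality entry is loaded moves the preferred mark to that entry.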
However, in one embodiment, when an entry is “used” in the sense that it must be edited or license rights are to be assigned to the underlying work, all entries equivalent to that entry are examined and a “master” entry is created and marked as equivalent to the other data entries by giving it the same publication ID. This master entry is then assigned the highest quality level that is available and is also marked as a preferred entry. Master entries are the only entries in the repository that are editable. When a user attempts to change a data entry that has no corresponding master entry, a new master entry is created from the entry and the user is allowed to edit the new master entry instead. The new master entry then is marked as preferred. In this manner, the inventive system presents a single logical view of the data because data entries in the repository that are equivalent to data entries with higher quality levels are hidden and never presented to a user. In another embodiment, the master entry is created at the time when the equivalent entries are determined.
If, in step 808, it is determined that all selected fields have been filled with data, then the process finishes in step 814. Alternatively, if it is determined in step 808 that all selected fields have not been filled, then the process proceeds to step 810 where a determination is made whether there are more data entries to be examined.
If in step 810 it is determined that no additional data entries remain to be examined, then all selected data fields in the master entry for which information is
This data entry arrangement 900 is shown schematically in
Each of entries 902 is associated with a source that generated the entry. As previously mentioned, the sources are arranged in a predetermined hierarchy by quality. For example, entries 904 and 906 are master entries created as described above. These entries have the highest quality level 930 (illustratively designated as 1000 in the example shown in
All of the entries are also subject to equivalency processing, schematically illustrated by block 936 which generates an equivalency list 938 that is also stored in the repository. As indicated in list 938, in the illustration, work number 10 is equivalent to work number 17; work number 12 is equivalent to work number 15 and work number 13 is equivalent to work number 18.
Lastly, the entries are subjected to a quality check so that only the highest quality unique entries are selected for display to the user. These works 942 are surfaced to the user whereas other works 944 that are equivalent to the highest quality
Whereas the following works would be hidden:
While the invention has been shown and described with reference to a number of embodiments thereof, it will be recognized by those skilled in the art that various changes in form and detail may be made herein without departing from the spirit and scope of the invention as defined by the appended claims.