The electronic distribution and consumption of books is changing the nature of publishing. Books can now be published and republished at very little expense. In many cases, a book can be published simply by submitting it electronically to an electronic distribution service.
However, electronic distribution services face a challenge in organizing and indexing the many book submissions that are received. This challenge is complicated by the fact that various versions of the same book may often be submitted. This may happen where a publisher offers a book in different versions, revisions, or editions. For example, an initial book offering may be followed by a version containing corrections and revisions. As another example, a subsequent edition of a book may contain a special foreword or other front/back matter. As yet another example, different publishers or sources may submit various versions of the same book, particularly where the book is in the public domain, or when publishing rights are held by different publishers in different jurisdictions.
Electronic book submissions may be accompanied by metadata such as title, authorship, and other information, and this information can sometimes be compared to determine whether two submissions are the same book. However, metadata such as this is not always definitive. Titles, for example, may not always be sufficient because they may not be entirely consistent from one version to another, and because different books may have similar titles. Furthermore, supplied metadata varies in its completeness and accuracy.
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
A book title set is a set of book products that represent the same or a very similar literary work. For instance, the literary work “Moby Dick” may be offered as a number of different versions, from a number of different publishers and sources. Even though there may be various differences between the versions, including formatting, graphics, front-matter, back-matter, and so forth, the versions are all considered to be part of the same title set because they share generally the same or very similar body text. Written and audio versions of a book, for example, may be considered part of the same title set.
Book distributors may wish to identify title sets for various reasons. For example, distributors may want to group members of a title set for convenient viewing and selection by customers. As another example, an online distributor may wish to identify user comments, ratings, and reviews regarding one book, and to associate them with all members of the same title set.
This disclosure describes techniques that can be used to efficiently determine whether books are part of the same book title set. The described techniques utilize multiple comparisons for this purpose. A first-pass comparison compares the metadata of an individual book with the metadata of other books to determine likely matching candidates. For example, a simple first-pass comparison may compare the title of the book with the titles of other books, identifying any such other books having similar titles.
A second-pass comparison then compares the content of the book with the content of any identified candidates. If the body text is identical or very similar to that of another book, the two books are deemed to be part of the same title set.
Each electronic reader 104 has a display upon which electronic content such as electronic books (eBooks) may be rendered. The terms “content,” “content item,” and “eBook” include essentially any form of electronic data that may be consumed on a device, including textual and verbal works comprising sequences of words such as digital books, audio books (which may also include video), electronic magazines, papers, journals, periodicals, documents, instructional materials, course content, and so on.
The electronic readers 104 may comprise various types of devices, including computers, handheld devices or other types of small, light-weight, portable devices upon which eBooks and other content can be rendered and conveniently viewed in a manner similar to viewing a paper book. Examples of handheld electronic readers include flat form-factor devices such as tablets, pads, smartphones, personal digital assistants (PDAs), etc.
In some embodiments, the electronic readers 104 may comprise dedicated-purpose eBook reader devices, having flat-panel displays and other characteristics that mimic the look, feel, and experience offered by paper-based books. For example, such eBook reader devices may have high-contrast flat-panel displays that appear similar to a printed page and that persist without frequent refreshing. Such displays may consume negligible amounts of power, so that the eBook reader devices may be used for long periods without recharging or replacing batteries. In some instances, these readers may employ electrophoretic displays.
In the example of
The network 106 may be any type of communication network, including a local-area network, a wide-area network, the Internet, a wireless network, a cable television network, a telephone network, a cellular communications network, combinations of the foregoing, etc. Services, sometimes referred to as “cloud-based” services, may be provided from the network 106. In
In the configuration illustrated by
In
The content service 108 might be implemented in some embodiments by an online merchant or vendor. Electronic books and other electronic content might be offered for sale by such an online merchant, or might be available to members or subscribers for some type of periodic or one-time fee. In some circumstances, eBooks or other content might be made available without charge. The content service 108 may include a virtual storefront or other type of online interface for interaction with consumers and/or devices. The content service 108 may expose a graphical, web-based user interface that can be accessed by human users to browse and obtain (e.g., purchase, rent, lease, etc.) content items such as eBooks. The content service may also expose programmatic interfaces or APIs (application program interfaces) that entities and devices can use to obtain digital content items and related services.
In the described embodiment, the content service 108 includes a content ingestion service 110. The content ingestion service 110 is a logical component to which publishers 112 can connect to submit electronic books 114 for publication and distribution via the content service 108. Most relevant to the discussion herein, the ingestion service 110 receives the eBooks 114 and compares them to the members of existing title sets to determine whether the submitted electronic books are or should be members of such existing title sets.
Note that the described comparisons may also be performed with respect to books that are already part of the publisher's library, rather than on books as they are newly received, in order to initially establish and identify title sets among previously submitted books.
Referring now to
The metadata comparison 204 may be performed in various ways, depending on the nature of available metadata. In general, metadata associated with received electronic books and with existing or previously catalogued books may include various data fields, including but not limited to title, authorship or author names, publisher name, publication date, copyright date, ISBN (International Standard Book Number), and so forth.
The metadata comparison 204 may comprise comparing one or more of the available metadata fields. For example, title and authorship may be compared.
In some cases, a metadata matching score may be calculated for each of the existing title sets. Such a metadata matching score may be based on the number of matching metadata fields, and may also be influenced by how closely individual fields match each other. Certain fields, such as title and authorship fields, may be normalized prior to comparison, by removing extraneous characters and converting to a single case (upper case or lower case).
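The normalization and scoring described above can be sketched as follows. This is a minimal illustration rather than the described implementation: the function names, the dictionary layout of a metadata record, the choice of `difflib.SequenceMatcher` as the per-field similarity measure, and the simple averaging of field scores are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize_field(value: str) -> str:
    """Convert to lower case, replace punctuation with spaces, collapse whitespace."""
    lowered = value.lower()
    cleaned = re.sub(r"[^\w\s]", " ", lowered)  # extraneous characters become spaces
    return " ".join(cleaned.split())


def metadata_matching_score(book: dict, title_set: dict,
                            fields=("title", "author")) -> float:
    """Average per-field similarity (0.0 to 1.0) over fields present in both records.

    The field names and the averaging scheme are hypothetical choices.
    """
    scores = []
    for field in fields:
        if field in book and field in title_set:
            a = normalize_field(book[field])
            b = normalize_field(title_set[field])
            scores.append(SequenceMatcher(None, a, b).ratio())
    return sum(scores) / len(scores) if scores else 0.0
```

With this sketch, records whose titles differ only in case or punctuation (for example "Moby Dick" and "MOBY-DICK") normalize to the same string and receive the maximum field score. A weighted combination could be substituted for the plain average if some fields are considered more reliable than others.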
In other cases, matching title sets may be found by simply submitting a query for existing title sets whose metadata matches the metadata of the electronic book 114. Such a query may return no matches, a single match, or multiple matches.
At 206, the ingestion service 110 determines whether action 204 found no matches, a single match, or multiple matches. If a single unambiguous match was found, an action 208 comprises adding the electronic book 114 to the existing title set whose metadata matched that of the electronic book 114. Execution then proceeds at 210 to the next electronic book 114—the actions of
If multiple matches are identified at 206, indicating a plurality of existing title sets of which the electronic book 114 may be a member, a second-pass content comparison 212 is performed. This comparison will be described below, with reference to
If no matches are identified at 206, execution simply proceeds at 210 to the next electronic book.
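The three outcomes of the decision at 206 can be sketched as a small dispatch function. The names `first_pass_dispatch`, `title_sets`, and `second_pass`, as well as the returned status strings, are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List


def first_pass_dispatch(book_id: str,
                        matching_sets: List[str],
                        title_sets: Dict[str, List[str]],
                        second_pass: Callable[[str, List[str]], None]) -> str:
    """Route a book according to how many title sets matched its metadata."""
    if len(matching_sets) == 1:
        # Single unambiguous match: add the book to that title set.
        title_sets[matching_sets[0]].append(book_id)
        return "added"
    if len(matching_sets) > 1:
        # Several candidates: defer to the content comparison.
        second_pass(book_id, matching_sets)
        return "second_pass"
    # No match: nothing to do for this book; move on to the next one.
    return "no_match"
```

In this sketch the caller supplies the second-pass routine as a callback, which keeps the metadata stage independent of whichever text comparison is used later.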
Note that in some embodiments, a title set may be represented by a normalized or canonical set of metadata. In these embodiments, the comparison of
Also note that a title set may comprise only a single book or book version. In other words, a book that has no other versions may itself be treated as a title set, and the comparison of
Note that the comparisons of
The second-pass comparison 212 includes iterating the steps within the dashed block of
For each one of the candidate title sets 302, an action 304 is performed, which comprises comparing the text of the electronic book 114 with the text of a member book of the candidate title set 302 to calculate a text matching score indicating similarity between the text of the received electronic book 114 and that of the member book.
In some embodiments, the comparison 304 may be limited to comparison against a single member of each title set, or against a canonical version of a book that represents the title set. Also, note that some title sets may contain only a single book.
The text comparison 304 may be performed in various ways. For example, a word comparison or word frequency comparison may be performed to determine whether certain words occur with the same or approximately the same frequency in both books. Alternatively, the edit distance between the two books may be calculated. Edit distance is a measure of the number of edits (such as insertions and deletions) that must be made in order to transform one text string into another. One type of edit distance is referred to as the Levenshtein distance, which also counts single-character substitutions as edits. In some cases, edit distances may be calculated for portions or sections of a book, and those edit distances may be averaged or otherwise combined to determine similarity between two books.
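The word frequency and edit distance comparisons mentioned above can be sketched as follows. The function names are hypothetical, cosine similarity is only one assumed way of comparing word frequencies, and the pairing of sections by position in `sectioned_similarity` is an assumption made to keep the example short.

```python
import math
import re
from collections import Counter


def word_frequency_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def sectioned_similarity(sections_a, sections_b) -> float:
    """Average of per-section similarities, each computed as 1 - distance / length."""
    pairs = list(zip(sections_a, sections_b))
    if not pairs:
        return 0.0
    total = 0.0
    for sa, sb in pairs:
        longest = max(len(sa), len(sb))
        total += 1.0 if longest == 0 else 1.0 - levenshtein(sa, sb) / longest
    return total / len(pairs)
```

The dynamic-programming edit distance above takes time proportional to the product of the two string lengths, which is one practical reason to compute it on portions or sections of a book rather than on entire texts.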
Other alternatives for comparing text of the books may include evaluating the degree of word or page alignment. For example, autocorrelation techniques can be used to determine whether pages or other portions of one book occur in the text of another book. Techniques such as this are described in a US Patent Application entitled “Electronic Book Pagination,” by inventors Jones and Berezhnyy, having Ser. No. 12/979,971, filed Dec. 28, 2010, which is hereby incorporated by reference. Other techniques for evaluating text alignment are disclosed in a US Provisional Application entitled “Aligning Content Items to Identify Differences”, by inventors Hamaker and Killalea, having Ser. No. 61/427,682, filed Dec. 28, 2010, which is hereby incorporated by reference.
In many embodiments, the text comparison 304 is performed on the bodies or body texts of the two books being compared. The body text is the primary literary work or content of a book, excluding front matter and back matter such as tables of contents, forewords, title pages, afterwords, appendices, indexes, etc. Furthermore, the text may be normalized prior to comparison, by removing illustrations and extraneous characters, and by converting all characters to upper case or lower case.
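Body text extraction and normalization might look like the sketch below. It assumes, purely for illustration, that a book is already available as a list of `(kind, text)` sections with labels such as `"foreword"` or `"body"`; the source does not say how front and back matter are detected, and the label names and function names are hypothetical.

```python
import re

# Hypothetical section labels treated as front or back matter.
FRONT_BACK_MATTER = {
    "table_of_contents", "foreword", "title_page",
    "afterword", "appendix", "index",
}


def body_text(sections) -> str:
    """Join the text of all sections that are not front or back matter."""
    return "\n".join(text for kind, text in sections
                     if kind not in FRONT_BACK_MATTER)


def normalize_text(text: str) -> str:
    """Lower-case the text, drop punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(cleaned.split())
```

For example, a book consisting of a title page, one body section reading "Call me Ishmael.", and an index would reduce to the normalized body string `call me ishmael`.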
At 306, if the text comparison 304 produces a text matching score higher than a predetermined text threshold, the electronic book 114 is grouped at 308 with the book that was the object of the comparison, and becomes part of the same title set. Otherwise, if the text comparison 304 produces a text matching score that is not higher than the text threshold, the actions are repeated as indicated by 310. The actions within block 300 are repeated for all of the candidate title sets 302.
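The loop over candidate title sets can be sketched as follows. The function name, the mapping from a title set identifier to one representative text, the default threshold of 0.9, and the use of `difflib.SequenceMatcher` as the default comparison are assumptions for this example. Stopping at the first candidate that exceeds the threshold is also an assumption; the description does not state whether iteration continues after a match.

```python
from difflib import SequenceMatcher


def second_pass(book_text: str, candidates: dict,
                text_threshold: float = 0.9, compare=None):
    """Return the id of the first candidate title set whose text matching
    score exceeds the threshold, or None if no candidate qualifies.

    `candidates` maps a title set id to the text of one member book.
    """
    if compare is None:
        # Assumed default: similarity ratio in [0.0, 1.0].
        compare = lambda a, b: SequenceMatcher(None, a, b).ratio()
    for set_id, member_text in candidates.items():
        if compare(book_text, member_text) > text_threshold:
            return set_id  # group the book with this title set
    return None
```

Passing a different `compare` function allows any of the text comparison techniques to be substituted without changing the loop, provided that higher scores indicate greater similarity.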
Note that the description above assumes that matching scores and thresholds are numerical and that greater scores indicate higher degrees of similarity. Different implementations may of course use different types of measurements and thresholds to indicate similarity.
In a very basic configuration, an example server 400 may comprise a processing unit 402 composed of one or more processors, and memory 404. Depending on the configuration of the server 400, the memory 404 may be a type of computer storage media and may include volatile and nonvolatile memory. Thus, the memory 404 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
The memory 404 may be used to store any number of functional components that are executable by the processing unit 402. In many embodiments, these functional components comprise instructions or programs that are executable by the processing unit 402, and that when executed implement operational logic for performing the actions attributed above to the content service 108 and the ingestion service 110. In addition, the memory 404 may store various types of data that are referenced by executable programs, including content items that are supplied to consuming devices such as electronic reader 104.
Functional components stored in the memory 404 may include an operating system 406 and a database 408 to store content items, annotations, title set data, eBook metadata, etc. Functional components of the server 400 may also comprise a web service component 410 that interacts with remote devices such as computers and media consumption devices.
The memory 404 may also include eBook ingestion logic 412, configured to implement the functionality described above with reference to
The server 400 may of course include many other logical, programmatic, and physical components, generally referenced by numeral 418.
Note that the various techniques described above are assumed in the given examples to be implemented in the general context of computer-executable instructions or software, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
Other architectures may be used to implement the described functionality, and are intended to be within the scope of this disclosure. Furthermore, although specific distributions of responsibilities are defined above for purposes of discussion, the various functions and responsibilities might be distributed and divided in different ways, depending on particular circumstances.
Similarly, software may be stored and distributed in various ways and using different means, and the particular software storage and execution configurations described above may be varied in many different ways. Thus, software implementing the techniques described above may be distributed on various types of computer-readable media, not limited to the forms of memory that are specifically described.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the claims. For example, the methodological acts need not be performed in the order or combinations described herein, and may be performed in any combination of one or more acts.
Number | Name | Date | Kind |
---|---|---|---|
4884974 | DeSmet | Dec 1989 | A |
5166664 | Fish | Nov 1992 | A |
5802204 | Basehore | Sep 1998 | A |
5907845 | Cox et al. | May 1999 | A |
5960464 | Lam | Sep 1999 | A |
5986690 | Hendricks | Nov 1999 | A |
6047093 | Lopresti et al. | Apr 2000 | A |
6074093 | Anderson | Jun 2000 | A |
6108688 | Nielsen | Aug 2000 | A |
6138129 | Combs | Oct 2000 | A |
6173251 | Ito et al. | Jan 2001 | B1 |
6218982 | Shirai et al. | Apr 2001 | B1 |
6377945 | Risvik | Apr 2002 | B1 |
6622624 | Divine et al. | Sep 2003 | B1 |
6658626 | Aiken | Dec 2003 | B1 |
6704733 | Clark et al. | Mar 2004 | B2 |
6741985 | Green | May 2004 | B2 |
6898307 | Harrington | May 2005 | B1 |
7020654 | Najmi | Mar 2006 | B1 |
7020663 | Hay et al. | Mar 2006 | B2 |
7028258 | Thacker et al. | Apr 2006 | B1 |
7702655 | Panelli et al. | Apr 2010 | B1 |
7823127 | Zeidman | Oct 2010 | B2 |
8051088 | Tibbetts | Nov 2011 | B1 |
8250065 | Chambers et al. | Aug 2012 | B1 |
8266115 | Park et al. | Sep 2012 | B1 |
8280640 | Levin et al. | Oct 2012 | B2 |
8316032 | Baluja | Nov 2012 | B1 |
8463790 | Joshi | Jun 2013 | B1 |
8495061 | Lifantsev | Jul 2013 | B1 |
8510312 | Thibaux et al. | Aug 2013 | B1 |
20020021838 | Richardson et al. | Feb 2002 | A1 |
20020049781 | Bengtson | Apr 2002 | A1 |
20020091584 | Clark et al. | Jul 2002 | A1 |
20020107735 | Henkin et al. | Aug 2002 | A1 |
20020123994 | Schabes | Sep 2002 | A1 |
20030032010 | Selifonov et al. | Feb 2003 | A1 |
20030093427 | Hsu et al. | May 2003 | A1 |
20030103238 | MacLean et al. | Jun 2003 | A1 |
20040068471 | Kato | Apr 2004 | A1 |
20040088165 | Okutani et al. | May 2004 | A1 |
20040194021 | Marshall et al. | Sep 2004 | A1 |
20040205540 | Vulpe et al. | Oct 2004 | A1 |
20040218205 | Irwin et al. | Nov 2004 | A1 |
20050060273 | Andersen et al. | Mar 2005 | A1 |
20050096938 | Slomkowski et al. | May 2005 | A1 |
20050097007 | Alger et al. | May 2005 | A1 |
20050131932 | Weare | Jun 2005 | A1 |
20050138551 | Elazar et al. | Jun 2005 | A1 |
20050160355 | Cragun et al. | Jul 2005 | A1 |
20050160356 | Albornoz et al. | Jul 2005 | A1 |
20050190397 | Ferlitsch | Sep 2005 | A1 |
20050192955 | Farrell | Sep 2005 | A1 |
20050196074 | Deere | Sep 2005 | A1 |
20050198070 | Lowry | Sep 2005 | A1 |
20050209989 | Albornoz et al. | Sep 2005 | A1 |
20060036593 | Dean et al. | Feb 2006 | A1 |
20060036934 | Fujiwara | Feb 2006 | A1 |
20060041590 | King | Feb 2006 | A1 |
20060150096 | Thacker et al. | Jul 2006 | A1 |
20060156226 | Dejean et al. | Jul 2006 | A1 |
20060173818 | Berstis et al. | Aug 2006 | A1 |
20060262340 | Lee | Nov 2006 | A1 |
20060277167 | Gross et al. | Dec 2006 | A1 |
20060294094 | King | Dec 2006 | A1 |
20070061582 | Ohmori et al. | Mar 2007 | A1 |
20070150443 | Bergholz et al. | Jun 2007 | A1 |
20070196015 | Meunier et al. | Aug 2007 | A1 |
20070217692 | Newcomer et al. | Sep 2007 | A1 |
20070217715 | Newcomer et al. | Sep 2007 | A1 |
20070274704 | Nakajima et al. | Nov 2007 | A1 |
20070280072 | Hsieh et al. | Dec 2007 | A1 |
20070286465 | Takahashi et al. | Dec 2007 | A1 |
20080019430 | Suzuki et al. | Jan 2008 | A1 |
20080027916 | Asai | Jan 2008 | A1 |
20080077570 | Tang | Mar 2008 | A1 |
20080114757 | Dejean et al. | May 2008 | A1 |
20080126335 | Gandhi et al. | May 2008 | A1 |
20080134023 | Aizawa | Jun 2008 | A1 |
20080141117 | King | Jun 2008 | A1 |
20080154943 | Dreyer et al. | Jun 2008 | A1 |
20080163039 | Ryan et al. | Jul 2008 | A1 |
20080209314 | Sylthe et al. | Aug 2008 | A1 |
20080229182 | Hendricks et al. | Sep 2008 | A1 |
20080235579 | Champion et al. | Sep 2008 | A1 |
20080243842 | Liang | Oct 2008 | A1 |
20080275871 | Berstis et al. | Nov 2008 | A1 |
20080294453 | Baird-Smith et al. | Nov 2008 | A1 |
20090012984 | Ravid et al. | Jan 2009 | A1 |
20090027419 | Kondo et al. | Jan 2009 | A1 |
20090049026 | Ohguro | Feb 2009 | A1 |
20090063557 | MacPherson | Mar 2009 | A1 |
20090144277 | Trutner et al. | Jun 2009 | A1 |
20090164312 | Nadig | Jun 2009 | A1 |
20090182728 | Anderson | Jul 2009 | A1 |
20090204893 | Nguyen et al. | Aug 2009 | A1 |
20090241054 | Hendricks | Sep 2009 | A1 |
20090254810 | Mitsui | Oct 2009 | A1 |
20090265321 | Grubb et al. | Oct 2009 | A1 |
20090310408 | Lee et al. | Dec 2009 | A1 |
20090313539 | Ota et al. | Dec 2009 | A1 |
20090324096 | Megawa | Dec 2009 | A1 |
20100114827 | Pearce | May 2010 | A1 |
20100166309 | Hull et al. | Jul 2010 | A1 |
20100198864 | Ravid et al. | Aug 2010 | A1 |
20100205160 | Kumar et al. | Aug 2010 | A1 |
20100220216 | Fishman et al. | Sep 2010 | A1 |
20100251089 | Cole et al. | Sep 2010 | A1 |
20100262454 | Sommer et al. | Oct 2010 | A1 |
20110029491 | Joshi | Feb 2011 | A1 |
20110078152 | Forman et al. | Mar 2011 | A1 |
20110119240 | Shapira | May 2011 | A1 |
20110153330 | Yazdani et al. | Jun 2011 | A1 |
20110231474 | Locker et al. | Sep 2011 | A1 |
20120036431 | Ito et al. | Feb 2012 | A1 |
20120060082 | Edala et al. | Mar 2012 | A1 |
20120121195 | Yadid et al. | May 2012 | A1 |
20120198330 | Koppel et al. | Aug 2012 | A1 |
20140298167 | Jones et al. | Oct 2014 | A1 |
Number | Date | Country |
---|---|---|
2009205248 | Sep 2009 | JP |
Entry |
---|
Non-Final Office Action for U.S. Appl. No. 13/050,829, dated May 8, 2012, Janna Hamaker et al., “Aligning Content Items to Identify Differences”, 20 pages. |
Office action for U.S. Appl. No. 13/050,829, dated Nov. 29, 2012, Hamaker et al.,“Aligning Content Items to Identify Differences”, 24 pages. |
Office action for U.S. Appl. No. 12/979,971, dated Aug. 8, 2013, Jones et al., “Electronic Book Pagination”, 32 pages. |
Office action for U.S. Appl. No. 12/980,015, dated Sep. 10, 2013, Weight et al., “Book Version Mapping”, 27 pages. |
Wikipedia, “Lookuptable”, at http://en.wikipedia.org/w/index.php?title=Lookup—table&oldid—333018082, retrieved on Aug. 12, 2013, 2009, 7 pages. |
Wikipedia, “Metadata”, at http://en.wikipedia.org/w/index.php?title=Metadata&oldid=333583065, retrieved on Aug. 2, 2013, 2009, 17 pages. |
Office action for U.S. Appl. No. 12/979,971, dated Apr. 18, 2013, Jones et al., “Electronic Book Pagination”, 30 pages. |
Final Office Action for U.S. Appl. No. 12/980,015, dated May 9, 2014, Christopher F. Weight, “Book Version Mapping”, 30 pages. |
Office Action for U.S. Appl. No. 12/979,971, dated Dec. 6, 2013, Derek T. Jones, “Electronic Book Pagination”, 13 pages. |
Office action for U.S. Appl. No. 13/050,829, dated Sep. 30, 2014, Hamaker et al., “Aligning Content Items to Identify Differences”, 30 pages. |
Office Action for U.S. Appl. No. 12/980,015, dated Jul. 7, 2015, Christopher F. Weight, “Book Version Mapping”, 32 pages. |
Office Action for U.S. Appl. No. 12/980,015, dated Jan. 21, 2016, Weight et al., “Book Version Mapping”, 40 pages. |
Office action for U.S. Appl. No. 12/980,015, dated Nov. 17, 2016, Weight et al., “Book Version Mapping”, 27 pages. |
Office Action for U.S. Appl. No. 14/308,914 dated Jul. 14, 2017, Jones et al, “Electronic Book Pagination”, 22 pages. |
Office Action for U.S. Appl. No. 14/308,914 dated Mar. 6, 2017, Jones et al, “Electronic Book Pagination”, 18 pages. |
Final Office Action for U.S. Appl. No. 12/980,015, dated May 19, 2017, Christopher F. Weight, “Book Version Mapping”, 30 pages. |
“How do I use the Migrate Comments Command?”, The Same Page, retrieved Feb. 17, 2017 from http://blogs.adobe.com/thesamepage/2009/05/how—do—i—use—the—migrate—comme.html, posted May 28, 2009, 2 pgs. |