1. Technical Field
This disclosure relates generally to displaying aligned passages of text in different languages on an ebook reader.
2. Background
Electronic books (ebooks) are becoming very popular. Ebooks, as with any digital content, can be conveniently purchased online and downloaded to client devices for users to access. A user reading an ebook written in his or her non-native language may come across a passage that the user does not fully understand. For example, a user who is a native Hebrew reader reading an English-language ebook may come across a passage in the English text that uses words new to the user. In this instance, to comprehend the text, the user might wish to refer to the passage in the user's native language. One solution is to perform machine translation of the passage to the user's native language. However, machine-translated text may be inaccurate or, at least, lack nuance present in the original text. This problem is compounded because the user is likely to be requesting translation of an especially complex passage. Thus, machine-translated text is not ideal in this situation.
A method, non-transitory computer-readable storage medium, and system for providing a reference passage corresponding to a reading passage of an ebook are described herein. One aspect of the method comprises grouping different-language instances of a same ebook into a group, the different-language instances of the ebook created by human translation of the ebook and including a reading-language instance and a reference-language instance of the ebook. The method further comprises aligning corresponding passages in the different-language instances of the ebook in the group. The method additionally comprises, in response to a request identifying a reading passage in the reading-language instance of the ebook, identifying a reference passage in the reference-language instance of the ebook aligned with the reading passage and sending information describing the identified reference passage in response to the request.
One aspect of the non-transitory computer-readable storage medium stores executable computer program instructions for providing a reference passage corresponding to a reading passage of an ebook. The computer program instructions comprise instructions for grouping different-language instances of a same ebook into a group, the different-language instances of the ebook created by human translation of the ebook and including a reading-language instance and a reference-language instance of the ebook. The computer program instructions further comprise instructions for aligning corresponding passages in the different-language instances of the ebook in the group. The computer program instructions additionally comprise instructions for, in response to a request identifying a reading passage in the reading-language instance of the ebook, identifying a reference passage in the reference-language instance of the ebook aligned with the reading passage and sending information describing the identified reference passage in response to the request.
One aspect of the computer system for providing a reference passage corresponding to a reading passage of an ebook comprises a non-transitory computer readable storage medium storing executable program code. The executable program code comprises code for grouping different-language instances of a same ebook into a group, the different-language instances of the ebook created by human translation of the ebook and including a reading-language instance and a reference-language instance of the ebook. The executable program code further comprises code for aligning corresponding passages in the different-language instances of the ebook in the group. The executable program code additionally comprises code for, in response to a request identifying a reading passage in the reading-language instance of the ebook, identifying a reference passage in the reference-language instance of the ebook aligned with the reading passage and sending information describing the identified reference passage in response to the request.
The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.
The Figures (FIGS.) and the following description describe certain embodiments by way of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality.
Generally, a user can purchase and download electronic books (ebooks) through a client device 102. When reading an ebook in a first language, the user may instruct the client device 102 to display an identified passage of the ebook in a second language. The client device 102 obtains the corresponding passage of text in the second language from the corpus server 110 and displays it to the user. In one embodiment, the text in the second language is produced by a human translator or via another technique that generates a high-quality translation. Thus, the text in the second language reflects the same tone and nuance as the text in the first language and may assist the user in comprehending the passage, particularly if the user is fluent in the second language but not fluent in the first language.
In one embodiment, the client devices 102 are electronic devices used by users to read ebooks. For example, the electronic devices can be dedicated ebook readers or other general or specific-purpose computing devices such as mobile telephones, or tablet, notebook, or desktop computers executing ebook reading applications. The ebook reading applications can be standalone applications or integrated into operating systems, web browsers or other software executing on the computing devices. While only two client devices 102A, 102B are illustrated in
A client device 102 and/or ebook reading application executing on the client device provides a graphic user interface (GUI) 104 (depicted by way of example in
When reading an ebook in the reading language, the user may use the GUI to select a portion of the text in the reading language. The selected portion of text in the reading language is referred to as the “reading passage” and may include, for example, a page, paragraph, sentence, or sentence fragment. The user may select the reading passage by, e.g., using a cursor, touch-screen gesture, or other technique. In response to selection of the reading passage, the GUI displays an associated “reference passage” with text in the reference language aligned with the reading passage. The reference passage is “aligned” in the sense that it corresponds to the reading passage selected by the user, except that the text of the reference passage is in the reference language.
The GUI of the client device 102 may display the reference passage in association with the reading passage in a variety of different ways. For example, the GUI may display the reference passage in a separate window offset from the reading passage, or may display the reference passage in a dual column adjacent to the reading passage.
The corpus server 110 includes one or more computers and provides ebook content including reading and reference passages to the client devices 102. The corpus server 110 may provide the ebook content in a variety of ways. In one embodiment, the corpus server 110 provides ebooks containing both reading and reference passages to the client devices 102 in a single interaction. For example, the corpus server 110 may provide an entire ebook in multiple languages for storage at a client device 102. In another embodiment, the corpus server 110 provides portions of ebooks and/or reference passages to the client devices 102 over multiple transactions. For example, the corpus server 110 may provide a chapter or page of an ebook in response to a request from a client device 102. Then, the corpus server 110 may provide a reference passage to a client device 102 in response to a request that identifies the corresponding reading passage.
The book repository 114 is in communication with the corpus server 110 and includes a database storing ebooks in a variety of languages. Depending upon the embodiment, the book repository 114 may be a relational or other type of database. The database may be local to or remote from the corpus server 110. The ebooks in the repository include text, images, and/or other content that form the ebooks. In addition, each ebook may have associated metadata that describe the ebook, such as describing the ebook's title, author, publication date, publisher, language, International Standard Book Number (ISBN), etc. The metadata may also describe the structure of content within the ebook, such as the pagination, chapter divisions, etc.
In one embodiment, the book repository 114 stores different-language instances of ebook titles. For example, the book repository 114 may store ebook instances of “Ulysses” by James Joyce in its original English language, and in foreign languages such as Spanish, French, and Hebrew. Further, in one embodiment, the texts of the foreign-language versions of the ebooks are composed manually by human translators of the original texts. Many ebooks are published in a variety of languages, and the foreign-language (i.e., non-native-language) versions of the ebooks are translated by human translation specialists.
The human-translated versions of the ebooks include the same tone, nuance, and other esthetic characteristics found in the native-language versions of the books. In order to capture these esthetic characteristics, the translator may deviate from literal translation when translating the books. Human translation is in contrast to machine translation in which it is more likely that the translated text is a literal translation of the original text.
The corpus server 110 includes an alignment engine 112 that aligns corresponding passages in different-language instances of ebooks. For a given ebook, such as “Ulysses”, the alignment engine 112 identifies the instances of the ebook in multiple different languages stored in the book repository 114 and aligns the corresponding passages in the different-language versions. When a request for a reference passage corresponding to a specified reading passage in an ebook is received from a client device 102, the alignment engine 112 identifies the reference passage corresponding to the reading passage to the corpus server 110.
The storage device 308 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 306 holds instructions and data used by the processor 302. The pointing device 314 is a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard 310 to input data into the computer 300. The graphics adapter 312 displays images and other information on the display device 318. The network adapter 316 couples the computer 300 to a network. Some embodiments of the computer 300 have different and/or other components than those shown in
The computer 300 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program instructions and other logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules formed of executable computer program instructions are stored on the storage device 308, loaded into the memory 306, and executed by the processor 302.
The book grouping module 402 groups together different-language instances of the same ebook contained in the book repository 114. Thus, for example, the book grouping module 402 may identify and group together (e.g., cluster) the English, French, and Hebrew instances of the novel “Ulysses” by James Joyce. The book grouping module 402 may group the ebooks using a variety of different techniques.
In one embodiment, the book grouping module 402 groups the ebooks using metadata associated with the ebooks. The book grouping module 402 examines the metadata associated with the various ebooks in the repository 114 to identify the different-language instances of the same ebooks. For example, different translations of a given ebook may share the same metadata, such as book title, author, publisher, series title, and publishing date.
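By way of illustration only, the metadata-based grouping described above might be sketched as follows. The record type `EbookRecord`, its field names, and the choice of author and title as the grouping key are assumptions made for this example; the description does not prescribe a particular schema or key, and the sketch assumes the title metadata is shared across translations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EbookRecord:
    # Hypothetical metadata record; field names are illustrative only.
    ebook_id: str
    language: str
    author: str
    title: str  # assumed to be shared across translations


def group_by_metadata(records):
    """Group records that share author and title, ignoring letter case."""
    groups = defaultdict(list)
    for record in records:
        key = (record.author.casefold(), record.title.casefold())
        groups[key].append(record)
    # Keep only groups that actually contain more than one language.
    return [
        group for group in groups.values()
        if len({record.language for record in group}) > 1
    ]
```

In practice a grouping key would likely combine several of the metadata fields named above (e.g., publisher or series title) to reduce false matches.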
In another embodiment, the book grouping module 402 performs a textual analysis of the ebooks in the repository 114 to identify different-language instances of the same ebooks. For this embodiment, the book grouping module 402 identifies a basis language, e.g., English. The book grouping module 402 then uses machine translation to translate ebooks in the repository 114 that are not already in the basis language to that language. The book grouping module 402 next analyzes the ebook texts in the basis language to cluster the ebooks based on textual similarity. For example, the book grouping module 402 may cluster together ebooks having a threshold measure of textual similarity. Instances of the same ebook that are in different languages will tend to have similar texts when machine translated to the same basis language. Therefore, clustering based on textual similarity forms clusters of instances of the same ebook. The book grouping module 402 accordingly identifies the ebooks within a given cluster as being different-language instances of the same ebook.
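One possible, non-limiting sketch of clustering by textual similarity follows. It assumes the ebook texts have already been machine translated into the basis language, uses Jaccard overlap of word sets as the similarity measure, and uses a simple greedy clustering pass; the measure, the clustering strategy, the function names, and the threshold value are all assumptions of this example rather than features of the described embodiment.

```python
import re


def word_set(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity of the two texts' word sets, in [0, 1]."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_by_similarity(basis_texts, threshold=0.5):
    """Greedily cluster ebook ids whose basis-language texts are similar.

    basis_texts maps ebook id -> text already in the basis language.
    Each ebook joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for ebook_id, text in basis_texts.items():
        for cluster in clusters:
            representative = basis_texts[cluster[0]]
            if jaccard(text, representative) >= threshold:
                cluster.append(ebook_id)
                break
        else:
            clusters.append([ebook_id])
    return clusters
```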
The passage alignment module 404 aligns passages of text in different-language instances of an ebook. “Alignment” refers to identifying a passage of text in one language of an ebook that generally corresponds to an equivalent passage of text in another language of the ebook. That is, the text in the first language of the ebook has the same or a similar meaning as the text in the second language, subject to variations introduced due to translation.
In one embodiment, the passage alignment module 404 performs the alignment by using machine translation to translate different-language instances of an ebook into a same basis language. The same machine translations generated by the book grouping module 402 may be used by the passage alignment module 404. During this translation, the passage alignment module 404 maintains data describing the mapping between the text in the original language (i.e., the non-basis language version) of the ebook and the translated basis-language text. Thus, for each passage of the basis language text, the passage alignment module 404 can identify the location of the passage in the original language text from which the basis language text was generated.
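As a minimal sketch of how such a mapping might be maintained, the following example translates an ebook passage by passage and records, for each basis-language passage, the character offsets of the original passage it came from. The `translate` callable stands in for the machine translation module 406; the `TranslatedPassage` type and the use of blank lines as passage boundaries are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslatedPassage:
    # Hypothetical record tying basis-language text to its source span.
    basis_text: str
    source_start: int  # offset of the passage in the original text
    source_end: int    # exclusive end offset in the original text


def translate_with_mapping(original_text, translate, separator="\n\n"):
    """Translate each passage and remember where it came from.

    `translate` is any callable mapping original-language text to
    basis-language text (a stand-in for machine translation).
    """
    passages = []
    position = 0
    for chunk in original_text.split(separator):
        start, end = position, position + len(chunk)
        if chunk.strip():  # skip empty passages
            passages.append(TranslatedPassage(translate(chunk), start, end))
        position = end + len(separator)
    return passages
```

With the offsets recorded, slicing the original text with `source_start` and `source_end` recovers the original-language passage for any basis-language passage.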
The passage alignment module 404 compares the basis language versions of the ebook instances in order to identify highly-similar passages. The passage alignment module 404 may compare each basis language passage with the version of the passage originally in the basis language in order to identify highly-similar passages. For example, if the basis language is English, the passage alignment module 404 may separately compare the basis language passages translated from the French, Spanish, and Hebrew versions of “Ulysses” with the original English language version of “Ulysses” in order to identify passages in the foreign-language texts that are highly-similar to the English-language passages. Alternatively, the passage alignment module 404 may compare each basis language passage with each other basis language passage to identify highly-similar passages.
In one embodiment, “highly-similar” is determined by comparing passages (e.g., sentences, paragraphs) using a similarity metric that produces a score indicating the amount of similarity between the passages. The score may be based, for example, on the number of words or characters in common, the orders in which the words and/or characters appear, and weights assigned to certain words and/or characters. The passages having a score above a threshold are considered “highly-similar.” The passage alignment module 404 records these highly-similar passages as being aligned.
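The scoring and thresholding described above might be sketched as follows. This example uses `difflib.SequenceMatcher` from the Python standard library over word sequences, which reflects both shared words and their order; it does not apply per-word weights. The choice of metric, the function names, and the threshold value are assumptions of this example.

```python
from difflib import SequenceMatcher


def similarity(passage_a, passage_b):
    """Score in [0, 1] reflecting shared words and their ordering."""
    words_a = passage_a.lower().split()
    words_b = passage_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


def align_passages(basis_passages, translated_passages, threshold=0.6):
    """Pair each basis passage with its best match above the threshold.

    Returns (basis_index, translated_index, score) tuples; passages
    with no sufficiently similar counterpart are left unaligned.
    """
    aligned = []
    for i, base in enumerate(basis_passages):
        best_j, best_score = None, 0.0
        for j, candidate in enumerate(translated_passages):
            score = similarity(base, candidate)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            aligned.append((i, best_j, best_score))
    return aligned
```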
In one embodiment, the passage alignment module 404 uses the metadata describing the ebook structures when identifying highly-similar passages. The passage alignment module 404 uses the metadata to reduce the amount of basis-language text to compare when identifying highly-similar passages. For example, the passage alignment module 404 may use metadata describing chapters in order to compare basis language passages within the same chapter of an ebook. Generally, chapter divisions are expected to remain the same across instances of ebooks in different languages. Therefore, by comparing basis language passages from the same chapter of different ebook instances, the passage alignment module 404 increases the likelihood that highly-similar passages do, in fact, correspond to the same passages in the ebook instances.
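To illustrate how chapter metadata can shrink the comparison space, the following sketch enumerates only those passage pairs that fall in the same chapter. The representation of a passage as a `(chapter, passage_id)` pair and the function name are assumptions of this example.

```python
from collections import defaultdict


def candidate_pairs_by_chapter(passages_a, passages_b):
    """Yield (id_a, id_b) pairs restricted to matching chapters.

    Each input is a list of (chapter, passage_id) tuples, a
    hypothetical representation of the structural metadata.
    """
    by_chapter = defaultdict(list)
    for chapter, passage_id in passages_b:
        by_chapter[chapter].append(passage_id)
    for chapter, id_a in passages_a:
        for id_b in by_chapter.get(chapter, []):
            yield id_a, id_b
```

For two instances with three passages each, an unrestricted comparison examines nine pairs, whereas the chapter-restricted comparison may examine far fewer, depending on how the passages are distributed across chapters.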
The passage alignment module 404 stores alignment data describing the locations of the aligned passages. The alignment data indicate the locations of passages in a given instance of an ebook that, when translated to the basis language, align with basis-language passages in specified locations of other-language instances of the same ebook. For example, the alignment data may specify the locations of passages in the Hebrew-language instance of “Ulysses” that, when translated to English (the basis language), align with specified passages of the English-language instance of “Ulysses”. The alignment data may also specify locations of passages in other language instances of “Ulysses” that align with specific passages of the English-language instance. Thus, the alignment data may be used to align passages in any language instance with passages in any other language instance of the ebook.
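One way such alignment data might be organized is sketched below: each non-basis-language location is stored against the basis-language location it aligns with, so that any two languages can be related by pivoting through the basis language. The class name `AlignmentIndex`, its methods, and the use of integers as passage locations are assumptions of this example.

```python
class AlignmentIndex:
    """Hypothetical store of alignment data for one ebook."""

    def __init__(self, basis_language):
        self.basis_language = basis_language
        self._to_basis = {}    # (language, location) -> basis location
        self._from_basis = {}  # (language, basis location) -> location

    def add(self, language, location, basis_location):
        """Record that `location` in `language` aligns with `basis_location`."""
        self._to_basis[(language, location)] = basis_location
        self._from_basis[(language, basis_location)] = location

    def lookup(self, reading_language, location, reference_language):
        """Return the aligned location in the reference language, or None."""
        if reading_language == self.basis_language:
            basis_location = location
        else:
            basis_location = self._to_basis.get((reading_language, location))
        if basis_location is None:
            return None
        if reference_language == self.basis_language:
            return basis_location
        return self._from_basis.get((reference_language, basis_location))
```

Pivoting through the basis language means that only one set of alignments per non-basis language needs to be stored, rather than one per language pair.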
The machine translation module 406 provides machine translation of text, such as ebook passages, on behalf of other modules in the alignment engine 112. In one embodiment, the machine translation module 406 receives an input of text in one language, performs substitution of words, and applies grammar rules to produce an output of the same text in another language. The machine translation module 406 may interact with an external machine translation resource to perform the translations, such as the GOOGLE TRANSLATE service provided by GOOGLE INC. The machine translation module 406 may be used, for example, to translate text into the basis language on behalf of the book grouping module 402 and the passage alignment module 404.
The client interface module 408 interacts with the client devices 102 to provide aligned passages. In one embodiment, the client interface module 408 receives a request for an aligned passage from a client device 102. The request includes passage identification information identifying a reading passage for which an aligned reference passage is requested. To this end, the request may identify one or more of the ebook, the reading language, the reference language, and the location of the reading passage within the ebook. The request may also include related information such as an identifier of the user of the client device, an identifier of the client device, and/or any other information that is necessary or desired.
In response to receiving a request, the client interface module 408 uses the passage identification information, in combination with the alignment data stored by the passage alignment module 404, to identify the aligned reference passage. The client interface module 408 responds to the request by sending the client device 102 reference passage information describing the aligned reference passage. In one embodiment, the client interface module 408 retrieves the text of the reference passage from the reference-language ebook instance in the book repository 114 and provides that text as the reference passage information. In another embodiment, the client interface module 408 provides the location in the reference-language ebook instance at which the aligned reference passage is located to the client device 102 and the client device uses this information to obtain the reference passage.
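The request handling described above, including both response variants (returning the reference text or returning only its location), might be sketched as follows. The `PassageRequest` type, the dictionary-based alignment data and repository, and the response fields are assumptions of this example and are not a defined wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassageRequest:
    # Hypothetical passage identification information.
    ebook: str
    reading_language: str
    reference_language: str
    location: int


def handle_request(request, alignment, repository, include_text=True):
    """Resolve a request to reference passage information, or None.

    alignment maps (ebook, reading language, location, reference
    language) to a location in the reference-language instance.
    repository maps (ebook, language) to a {location: text} dict.
    """
    key = (request.ebook, request.reading_language,
           request.location, request.reference_language)
    reference_location = alignment.get(key)
    if reference_location is None:
        return None
    response = {
        "language": request.reference_language,
        "location": reference_location,
    }
    if include_text:
        # First variant: send the text of the reference passage itself.
        instance = repository[(request.ebook, request.reference_language)]
        response["text"] = instance[reference_location]
    return response
```

When `include_text` is false, the response carries only the location, and the client device is expected to obtain the passage itself, corresponding to the second variant described above.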
In step 502, the client device 102 receives a selection of a reading passage in a reading language for which the user requests an aligned reference passage in a reference language. The reading passage is contained within an ebook. The client device 102 may receive the selection in response to a gesture or other input by the user. The client device 102 then determines (step 504) the position of the selected reading passage in the ebook. The client device 102 then identifies the corresponding reference passage by, e.g., sending (step 506) a request for the reference passage to the corpus server 110. The request includes passage identification information identifying the position of the selected reading passage. In response, in step 508, the client device 102 receives from the corpus server 110 reference passage information describing the aligned reference passage. The client device 102 then obtains, if necessary, and presents (step 510) the reference passage to the user. For example, the client device 102 may display the reference passage in a pop-up window or in a dual column view. The reference passage contains a human-generated translation of the text in the reading passage and may, therefore, assist the user in comprehending the reading passage.
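As one illustrative way of performing the position determination of step 504, the following sketch maps a character offset within the ebook text to the index of the passage containing it, using a sorted list of passage start offsets. The function name and the offset-based representation of positions are assumptions of this example; an actual client may track positions differently.

```python
from bisect import bisect_right


def passage_index(passage_starts, selection_offset):
    """Return the index of the passage containing the selected offset.

    passage_starts is a sorted list of character offsets at which
    each passage begins (a hypothetical position representation).
    """
    if not passage_starts or selection_offset < passage_starts[0]:
        raise ValueError("selection lies before the first passage")
    return bisect_right(passage_starts, selection_offset) - 1
```

The resulting index can then serve as the passage identification information included in the request sent in step 506.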
In step 602, the alignment engine 112 groups together different-language instances of ebooks into clusters, so that a single cluster contains different-language instances of the same ebook. This clustering may be performed by using machine translation to translate ebooks in the repository 114 into a basis language, and clustering the basis-language ebooks based on textual similarity. For a cluster containing different-language ebook instances, in step 604, the alignment engine 112 aligns corresponding passages across the ebook instances. As described above, the alignment can be achieved by machine-translating the text of the ebook instances in the cluster into a common basis language, and comparing the basis language versions of the texts to identify highly-similar passages. The alignment engine 112 stores alignment data indicating locations of aligned passages in the different-language ebook instances.
In step 606, the alignment engine 112 receives a request for a reference passage in a reference language from a client device 102. The request includes passage identification information identifying the location of a reading passage in an instance of an ebook in a reading language. In response to the request, in step 608, the alignment engine 112 uses the passage identification information to identify an aligned passage in the reference language that corresponds to the reading passage. In step 610, the alignment engine 112 sends reference passage information describing the aligned reference passage to the client device 102. The reference passage information can include the text of the reference passage, and/or information the client device 102 can use to obtain the reference passage.
The foregoing description of embodiments of the invention has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the present invention.