The present invention relates to Internet browsing technologies, and more particularly, Internet browsing technologies on electronic mobile readers, or e-readers.
In recent years, technologies such as E-Ink®, made by the E-Ink Corporation of Cambridge, Mass., have enabled mobile devices to closely simulate the experience of reading a real book. As a result, electronic mobile readers, or e-readers, have become very popular. For example, the first offering of the Kindle® e-reader made by Amazon Inc. of Seattle, Wash., was sold out in five and a half hours.
However, e-readers are currently limited in their functionality, especially in displaying web pages. For example, animated Flash® content and images render very poorly on dedicated reading device screens, such as those of E-Ink® based devices. Further, complex web pages are difficult to display on a low-resolution e-reader screen. Moreover, banner ads, navigation bars, and text boxes are very often irrelevant to a user's reading experience on an e-reader.
Some web sites offer a mobile version of their content. For example, cnn.com has made its mobile version available at m.cnn.com. The mobile version is normally simpler and more text-centric than the corresponding version intended for a computer used regularly at a single location. However, only a small number of web sites have mobile editions. Even for those web sites that provide mobile editions, not all of the web site's content is available in the mobile edition.
As a result, today's e-readers do not support internet browsing effectively. For example, the nook™ e-reader made by Barnes and Noble, Inc. of New York, N.Y., and the Reader™ e-reader made by the Sony Corporation of Tokyo, Japan, simply do not have web browsing capabilities. Amazon's Kindle® e-reader has a mobile web browser, but it fails to display most complex websites (e.g., yahoo.com) in a user friendly manner.
It is, therefore, an object of the present invention to optimize web content for display on a mobile device, particularly on an electronic mobile reader (“e-reader”).
Broadly, the present invention is directed to a method and apparatus for receiving web content and converting it into a format that can be displayed on a mobile e-reader. One aspect of the invention is directed to a method for processing a web page comprising the steps of:
Advantageously, the subset of page units corresponds to the page units to be displayed according to a predetermined specification, as will be described more fully below.
Another aspect of the invention is directed to a mobile device capable of displaying an assembled page comprising a decomposer for decomposing a web page into a plurality of page units; a filter for filtering at least one page unit and producing thereby a subset of the plurality of page units; an assembler for assembling the subset of page units; and a display for displaying the assembled subset of page units as an assembled page. One such apparatus is an e-reader having a computing system architecture as would be understood by a person having ordinary skill in the art.
Yet another embodiment of the invention is directed to a system that includes a mobile device and a web site and a communications link therebetween. In such a system, the computers that are part of the system, and/or the mobile device can include devices, programs, connections, functions, and functionality such as, but not limited to, a display, a central processing unit, random access memory, read only memory, a bus controller, an interrupt controller, mass storage, removable media, fixed disk drive, keyboard, mouse, audio and/or video transducer, audio and/or video controller, network adapter, web server, local area network, wide area network, process scheduling, memory management, networking, I/O services, communications adapter, interface device, and a connection to a network over a medium (such as a tangible medium, including but not limited to optical or hard-wire communications lines, or a wireless medium, including but not limited to microwave, infrared, or other transmission techniques).
In one embodiment, an original web page is decomposed into page units. One or more filters are then applied to the page units. The subset of page units not removed by the selected filter(s) is then assembled into an assembled page that is displayed on a device.
The present invention also is particularly useful for devices that have a limited ability to display all elements of a web page, such as those found in e-readers and mobile tablets which do not display moving images well.
In the specification, the singular forms include plural references unless the context clearly dictates otherwise. Unless defined otherwise all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Other features, objects, and advantages of the present invention will be apparent to a person of ordinary skill in the art from the following detailed description made with reference to the drawings annexed in which:
The present invention is directed to optimizing web content for display on a mobile device such as an e-reader. Although the following description describes an embodiment for use with an e-reader, it should be understood that this invention is applicable to and can be used with any mobile or handheld device.
As used herein, a “page unit” is a fragment of the information of a web page as normally displayed in a browser. A web page can be decomposed into the smaller page units, such as, but not limited to, text units, picture units, hyperlink units, and multimedia units. These page units can be processed later and stored separately. As non-limiting examples, a page unit can be the result of a web search, image search, or a news search. Advantageously, page units that the manufacturer and/or the user consider to be irrelevant or “spam” can be filtered and/or discarded.
In one preferred embodiment of the present invention, a web page is decomposed into page units. A filter is then applied to the page units and page units identified as unwanted are filtered out. The remaining subset of page units can then be assembled to create an assembled page. This assembled page may be cleaner and/or smaller than the original web page.
Decomposition in accordance with the present invention concerns processing an original web page by decomposing it into page units. The content and visual information of each page unit is then collected. In one embodiment, a web page is written in Hyper Text Markup Language (“HTML”). Page units for such a web page include, but are not limited to, a link, a text label, a table, and an image. Each page unit optionally can be tagged with geographic information, which indicates the location of that page unit in the original web page as displayed in a browser.
A given page unit can also contain other, smaller page units. For example, a table unit might contain multiple table row units. An original web page can be decomposed into page units by applying an HTML rendering or layout engine (such as WebKit or Gecko) to the original web page. An original web page can also be decomposed into page units by parsing the HTML text using XML/DOM into a Document Object Model (“DOM”) such that the objects can be manipulated by an application program.
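As a non-limiting illustration of the parsing approach, the following minimal sketch decomposes HTML text into nested page units using Python's standard `html.parser` module in place of a full rendering engine. The `PageUnit` class, the `UNIT_KINDS` tag mapping, and the `decompose` function are hypothetical names introduced here for illustration only and are not part of the described embodiment.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to page-unit kinds (an assumption).
UNIT_KINDS = {"a": "link", "img": "image", "table": "table",
              "tr": "table_row", "p": "text"}
# Tags that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


@dataclass
class PageUnit:
    kind: str
    attrs: dict
    text: str = ""
    children: list = field(default_factory=list)


class Decomposer(HTMLParser):
    """Builds a tree of page units; tags not in UNIT_KINDS are skipped."""

    def __init__(self):
        super().__init__()
        self.root = PageUnit("page", {})
        self.stack = [self.root]   # currently open page units
        self.open_tags = []        # tag names that opened stack[1:]

    def handle_starttag(self, tag, attrs):
        kind = UNIT_KINDS.get(tag)
        if kind is None:
            return
        unit = PageUnit(kind, dict(attrs))
        self.stack[-1].children.append(unit)
        if tag not in VOID_TAGS:
            self.stack.append(unit)
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close a unit only when the tag matches the innermost open unit.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text += data.strip()


def decompose(html):
    parser = Decomposer()
    parser.feed(html)
    parser.close()
    return parser.root
```

In this sketch a table unit contains table row units, which in turn contain link or image units, mirroring the nesting described above. A production decomposer would more likely rely on a layout engine so that rendered positions are also available.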
Filters can be selected by a person of ordinary skill in the art as a matter of design choice according to device specifications, user preferences and/or characteristics of the page units. The device specifications can identify the types of page units to be displayed in an assembled page. For example, dynamic content, such as Flash® or video, cannot be displayed properly on screens using existing technology, such as the screens of E-Ink® based devices. Thus, in one preferred embodiment, filters can be selected to identify page units containing such dynamic content.
A user preference can also identify the types of page units to be displayed in an assembled page. For example, a user might not want to display navigation ads on an e-reader. Thus, a setup option can be provided that the user can actuate and select what types of page units will be displayed for a given set of circumstances. This setup option results in a specification to be applied to the web page.
Generally, every web page can present information differently. The original web page can be a factor in determining the types of page units to assemble. For example, merchandise or commercial information may not be filtered on an e-commerce-related web page, while it may be filtered on a web page for news. Thus, a specification can be set to classify an original web page based on the category of content extracted from the original web page and/or the page units. The accuracy of the information extracted can be greatly increased if the type of web site is known in advance.
The page units can be filtered based on an application of the selected filter or filters. Not all of the page units can be or should be displayed on the client device. Animated Flash® content and images render very poorly on screens using existing technology, such as E-Ink® screens. Preferably, during this step, the selected filters are applied to remove page units that should not be displayed according to the criteria defined by the device or the user of the device, as the case may be.
Various filters can be used in accordance with the present invention. In one embodiment, a device-based filter, which is triggered by a specification of the reading device, is applied.
In another embodiment, a content-based filter, which is based on the properties of the original web page and/or page units, is applied. A content-based filter can be applied based on the classification result of the original page and/or page units. As a non-limiting example, a page unit that contains merchandise information from a web page classified as “e-commerce” may not be filtered out, while a page unit that contains advertisement information from a web page classified as “news” can be filtered out. As another non-limiting example, the footer and header information (according to the geographic information) can be filtered, i.e., removed, as they do not provide much reading value.
For each filter, the input and the output are both page units.
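Because each filter takes page units as input and produces page units as output, filters can be chained in any order. The following is a minimal sketch of that property, assuming page units are represented as plain dictionaries; `drop_kinds`, `drop_if`, and `run_filters` are hypothetical helper names, not part of the described embodiment.

```python
from functools import reduce


def drop_kinds(*kinds):
    """Return a filter that removes page units of the given kinds."""
    def apply(units):
        return [u for u in units if u["kind"] not in kinds]
    return apply


def drop_if(predicate):
    """Return a filter that removes page units matching the predicate."""
    def apply(units):
        return [u for u in units if not predicate(u)]
    return apply


def run_filters(units, filters):
    """Apply each filter in turn; every stage maps page units to page units."""
    return reduce(lambda remaining, f: f(remaining), filters, units)
```

For example, `run_filters(units, [drop_kinds("flash", "video"), drop_if(lambda u: u["kind"] == "image" and u["width"] > 600)])` would first remove dynamic content and then remove wide images.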
After a filter is applied to the page units, the remaining subset of page units are assembled. These page units preferably can be assembled in accordance with the specification of the target reading device and/or end user preferences. In one embodiment, the geographic information of the page units is altered to achieve maximal readability in the device. In this regard, the device can map the original web content geographic information to the mobile device display geographic information, considering the relative positioning of the page unit in the original web page and in the assembled web page for display on the mobile device. In other words, the geographic information is used to assemble the subset of page units to present the content of the subset of page units in a meaningful way to the reader.
In an alternate embodiment, the content information and the geographic properties of the subset of page units from the original web page are used to assemble the layout of the subset of page units.
In another embodiment, the subset of page units can be converted into semantic data files with all the geographic location information removed. The semantic information is then used to assemble the layout based on the semantic information of the subset of page units. In one embodiment, the semantic information is encoded in Extensible Markup Language format.
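One possible way to produce such semantic data is sketched below using Python's standard `xml.etree.ElementTree` module. The element names, the dictionary representation of page units, and the `to_semantic_xml` function are assumptions made for illustration; the specification does not prescribe a particular schema.

```python
import xml.etree.ElementTree as ET


def to_semantic_xml(units):
    """Serialize page units to XML, omitting geographic fields such as x and y."""
    root = ET.Element("page")
    for unit in units:
        element = ET.SubElement(root, unit["kind"])
        element.text = unit.get("text", "")
        if "href" in unit:
            element.set("href", unit["href"])
        # Coordinates are deliberately not copied into the output.
    return ET.tostring(root, encoding="unicode")
```

A layout step on the device could then read the XML and position each element according to its type alone.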
In accordance with a preferred embodiment of the present invention, a page decomposition process will now be described with reference to
As illustrated in
Three types of open source HTML rendering engines (Gecko, WebKit, and Lynx) were deployed within a server farm. The server farm consisted of eight servers. Each server hosted ten instances of each rendering engine and each rendering engine could handle one request at a time.
When a request was received, the input URL was first sent to three engine instances simultaneously, and each engine fetched the page content and converted the downloaded page into a DOM data structure. The rendered results were then aggregated to produce the final DOM, which was used to generate the assembled page content.
In accordance with a preferred embodiment of the present invention, a filter selection process will now be described with reference to
The Filter Selection Process 220 includes the Classification Layer 222 and the Filter Selection Layer 224. The Classification Layer 222 classifies the Original Web Page 200 and/or the page units. This classification contributes to deciding which filters to use to generate the subset of page units. To classify the Original Web Page 200, the features of the page units were extracted. Each feature is a numeric or string value. For example, the title of a page is a feature which was extracted from the text field of the title unit. The number of images is a feature that was computed by summing the number of all the image units on the Original Web Page 200. The classification of the Original Web Page 200 was determined by computing the statistics about the features and by applying classifiers such as a naive Bayesian classifier. In this example, the Original Web Page 200 was classified into multiple categories such as news, blog, or discussion forum. Each classification was associated with a set of specific filters.
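The feature extraction and classification described above can be sketched as follows. This is a simplified illustration only: it assumes page units are dictionaries, uses the title words as the sole classification feature, and implements a small multinomial naive Bayes classifier with add-one smoothing. The names `extract_features`, `NaiveBayes`, and the training labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(units):
    """Collect a string feature (title) and a numeric feature (image count)."""
    title = next((u["text"] for u in units if u["kind"] == "title"), "")
    num_images = sum(1 for u in units if u["kind"] == "image")
    return {"title": title, "num_images": num_images}


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        total = sum(self.class_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            return score

        return max(self.class_counts, key=log_posterior)
```

In use, the title feature would be lowercased and split into words before being passed to `classify`. Numeric features such as the image count could be incorporated by bucketing them into tokens, which is left out here for brevity.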
The Filter Selection Layer 224 takes the classification of the page, the device specification, and the user preferences into consideration to generate a series of filters to be used in the Filter Process 230.
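A minimal sketch of such a selection step follows. The filter names, the `CATEGORY_FILTERS` table, and the keys of the device specification and user preference dictionaries are all assumptions chosen for illustration.

```python
# Hypothetical association of page classifications with filter names.
CATEGORY_FILTERS = {
    "news": ["ads", "header_footer"],
    "blog": ["ads"],
    "e-commerce": ["header_footer"],
}


def select_filters(category, device_spec, user_prefs):
    """Combine device, classification, and user inputs into an ordered filter list."""
    selected = []
    # Device-based filters come from the device specification.
    if not device_spec.get("supports_animation", False):
        selected.append("dynamic_content")
    if "max_image_width" in device_spec:
        selected.append("wide_images")
    # Content-based filters come from the page classification.
    selected += CATEGORY_FILTERS.get(category, [])
    # User preferences can add further filters.
    selected += user_prefs.get("hide", [])
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(selected))
```

The returned names would then be resolved to concrete filter functions and handed to the filter process.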
In accordance with a preferred embodiment of the present invention, a filter process will now be described with reference to
The Filter Process 230 includes the Filter Layer 232. The Filter Layer 232 generates a subset of page units by applying one or more filters to the page units. Each filter is applied to the page units so that unwanted page units will not be displayed on the Assembled Page 250. In one embodiment, a page unit can be filtered multiple times. In another embodiment, once a page unit has been filtered out by a first filter, a second filter need not be applied to the filtered-out page unit. In another embodiment, a page unit is filtered once.
The device-based filter was applied to page units to filter out page units that could not be properly displayed on a specific target device. For example, all images having a width larger than 600 pixels were filtered out by a device-based filter designed to be used with a Kindle® e-reader, because such images cannot be displayed on the Kindle® screen without distortion.
With this device specification, the device-based filter can discard images in the DOM whose width is greater than 600 pixels. Alternatively, or in addition, the operation may change the width property of the image so that the whole image fits on the screen specified for the device.
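The two behaviors just described, discarding an over-wide image or shrinking it to fit, can be sketched as follows. The dictionary representation of page units and the `apply_device_filter` function are hypothetical; the proportional scaling of the height is an assumption added so that the resized image keeps its aspect ratio.

```python
def apply_device_filter(units, max_width=600, resize=False):
    """Discard images wider than max_width, or shrink them when resize is True."""
    result = []
    for unit in units:
        too_wide = unit["kind"] == "image" and unit.get("width", 0) > max_width
        if too_wide and not resize:
            continue  # drop the image entirely
        if too_wide:
            scale = max_width / unit["width"]
            unit = dict(unit, width=max_width)  # copy; leave the input untouched
            if "height" in unit:
                unit["height"] = round(unit["height"] * scale)
        result.append(unit)
    return result
```

With `resize=False` an 800-pixel-wide image is removed; with `resize=True` it is kept at 600 pixels wide with its height scaled by the same factor.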
DOM items are then removed based on the device specification entry 320. The Filter Layer 232 then checks if all of the specification entries have been processed 330. If all of the specifications have not been processed, the Filter Layer 232 returns to step 310. If all of the specifications have been processed, the Filter Layer proceeds to the next phase of the algorithm whereby the page unit is properly formatted for assembly. The page unit is resized 340, its layout is rearranged 350 and paginated 360. The page unit is then returned 370.
The content-based filter was applied to page units to identify page units containing the patterns of advertisements to be filtered out. The content-based filter was also applied to filter page units containing Flash® and animated GIF images. The content-based filter was also applied to filter page units located in certain positions in the web page that would be difficult for the user to notice.
Content-based filtering using a geographic score takes into account the fact that users tend to pay more attention to content located “above the fold” and in the center of the screen. “Above the fold” refers to the upper half of the front page of a traditional printed newspaper. As a result, many web sites put the most relevant information within the above-the-fold and center area. Higher geographic scores are assigned to DOM items within this area. For other parts of the web page, the relevance of the information tends to decrease as the content lies farther from the focus center.
G=|w/2−x|*α−(y/h)*β
wherein w is the screen width, h is the screen height, x is the x coordinate of the item, y is the y coordinate of the item, α is the horizontal relative factor, and β is the vertical relative factor. The DOM item is then scored 420, rearranged 430, and returned 440. For each page unit, its geographic score is the sum of the geographic scores of its children.
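The scoring can be sketched as follows, implementing the expression for G exactly as written and summing child scores for a page unit that has children. The `Item` class and function names are hypothetical, and the values chosen for α and β (including their signs) are left to the implementer; the specification does not fix them.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    x: float = 0.0
    y: float = 0.0
    children: list = field(default_factory=list)


def geographic_score(x, y, w, h, alpha, beta):
    """G = |w/2 - x| * alpha - (y/h) * beta, as given in the text."""
    return abs(w / 2 - x) * alpha - (y / h) * beta


def unit_score(item, w, h, alpha, beta):
    """A unit with children scores as the sum of its children's scores."""
    if item.children:
        return sum(unit_score(c, w, h, alpha, beta) for c in item.children)
    return geographic_score(item.x, item.y, w, h, alpha, beta)
```

For a 600 by 800 screen with α = 1 and β = 2, an item at (300, 0) scores 0 and an item at (100, 400) scores 199, so a parent holding both scores 199.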
Once the geographic score is evaluated, the DOM nodes are rearranged in a recursive fashion 555-580 and returned 430. The rearranging process begins at the root node for the DOM tree 555. For each DOM node, the children are sorted based on the descending order of the geographic score 575. The nodes of the children are rearranged and then each child node is visited recursively based on the same algorithm.
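The recursive rearrangement can be sketched as below: starting at the root, the children of each node are sorted in descending order of geographic score, and each child is then visited in turn. The `Node` class and `rearrange` function are hypothetical names, and the scores are assumed to have been computed beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    score: float = 0.0
    children: list = field(default_factory=list)


def rearrange(node):
    """Sort children by descending score, then recurse into each child."""
    node.children.sort(key=lambda child: child.score, reverse=True)
    for child in node.children:
        rearrange(child)
    return node
```

Because Python's sort is stable, children with equal scores keep their original document order.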
In accordance with a preferred embodiment of the present invention, a page assembly process will now be described with reference to
With reference to
Example 5, as illustrated in
Example 6, as illustrated in
The embodiment illustrated in
The foregoing description, including embodiments and examples, is for illustrative purposes and is not intended to limit the invention to the precise form disclosed. Persons skilled in the art are capable of appreciating other embodiments from the scope and spirit of the foregoing teaching.
The present application claims the benefit of Provisional Patent Application No. 61/337,729 entitled “Optimizing Web Content Display On An Electronic Book Reader,” filed on Feb. 11, 2010, which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
61337729 | Feb 2010 | US