Content such as newspapers and magazines is increasingly accessible from web portals. A user can visit a web site and select individual links to articles. Currently, some services use RSS feed mechanisms to provide web content to users directly, such as blog entries, news headlines, audio, and video, in a standardized format. However, these RSS feeds depend on the web content owner for deployment. In addition, these RSS feeds are available for only a small part of the web content available on the internet.
In the following description, like reference numbers are used to identify like elements. Furthermore, the drawings are intended to illustrate major features of exemplary embodiments in a diagrammatic manner. The drawings are not intended to depict every feature of actual embodiments nor relative dimensions of the depicted elements, and are not drawn to scale.
A “computer” is any machine, device, or apparatus that processes data according to computer-executable instructions, including machine readable instructions, that are stored on a computer-readable medium either temporarily or permanently. A “software application” (also referred to as software, an application, computer software, a computer application, a program, and a computer program) is a set of machine readable instructions that an apparatus, e.g., a computer, can interpret and execute to perform one or more specific tasks. A “data file” is a block of information that durably stores data for use by a software application.
The term “computer-readable medium” refers to any medium capable of storing information that is readable by a machine (e.g., a computer). Storage devices suitable for tangibly embodying these instructions and data include, but are not limited to, all forms of non-volatile computer-readable memory, including, for example, semiconductor memory devices, such as EPROM, EEPROM, and Flash memory devices, magnetic disks such as internal hard disks and removable hard disks, magneto-optical disks, DVD-ROM/RAM, and CD-ROM/RAM.
The term “web page” refers to a document that can be retrieved from a server over a network connection (including a wireless network) and viewed in an application, including a web browser application.
As used herein, the term “includes” means includes but is not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
Content such as newspapers and magazines is increasingly accessible from web portals. A user can visit a web site and select individual pages with articles to read. The user experience may not be satisfactory since the web pages often include a large amount of auxiliary content, including advertisements. Often, the article of interest may be distributed across multiple web pages, with additional advertisements displayed on each. Also, it can be tedious for a user to click on and follow a large number of links to read through various articles, as it may require traversing multiple web pages to view all the user-desired content.
To facilitate a user's access to web content, a system and method is described that allows a user to annotate topics of interest directly from web portals. A system and method herein enables automatic extraction of content that is of interest to a user, and delivery of that content of interest to the user's devices.
The extracted content can be delivered in various formats, for example according to a user preference. The extracted content may be delivered as a Portable Document Format (PDF) document, as a web page (for example, based on a markup language file), or in an electronic book format (including an ebook or other electronic book accessible by an electronic reader). Non-limiting examples of applicable markup language files include an HTML file based on a variation of the markup language, including XHTML and HTML5, and a markup language embedded in or called from HTML including Cascading Style Sheets (CSS) and JavaScript. In an example, the extracted content is delivered in an electronic book format, including as an EPUB® file (a *.epub file). In an example, the extracted content may be delivered as a link in an electronic transmission (such as email), and the user gains access to the body of the extracted content by following the link.
In a non-limiting example implementation, the extracted content is delivered to a portable device, including a smartphone, a tablet, a slate, or other touch-based device or other hand-held device, a laptop, a notebook, or other portable computer-based device. In a non-limiting example implementation, the extracted content is delivered to a computer-based viewing device that may be part of a booth, a kiosk, a pedestal or other type of physical support.
In an example, the extracted content is considered delivered to a designated destination if a user utilizes a device (including a portable device and a computer-based viewing device) to access and/or view the extracted content, including by following a link.
In some examples, the content delivery system 10 outputs the results from operation of content delivery system 10 by storing them in a data storage device (including, in a database) or rendering them on a display (including, in a user interface generated by a software application). Example displays include the display screen of a portable device, including a smartphone, a tablet, a slate, or other touch-based device or other hand-held device, a laptop, a notebook, or other portable computer-based device. Other example displays include the display screen of a computer-based viewing device that may be part of a booth, a kiosk, a pedestal or other type of physical support.
In an example, a system and method described herein is configured to allow a user to access personalized content that is aggregated from multiple web sources and delivered to the user at the user's destination of choice. The system can include a client-based component for setting up the web content selections. The system can include a server-based component for analyzing the selections. The server-based component can be used to fetch the web content selections and to deliver the web content selections to the designated destination.
Referring now to
The block diagram of
In block 202, at least one module performs the operations to receive input indicative of the user's selection from a content portal. The functionality can be performed by a client-based component. An implementation provides a user with access to a content portal and facilitates use of an interface of the client-based component so that the user can indicate the selections of interest from the content. For example, the selections of interest can be a section of the web page that includes links to the articles of interest. The client-based component provides a user with a tool for use in indicating the selections of interest of the web content.
In an example, the client-based component presents a tool 305 that a user can use to select a section of a web page 300, served from a content portal, which includes links to the articles of interest. In the illustration of
In an example, for selecting the section(s) of interest on a web page, the client-based component can present a content selector tool that allows a user to highlight, drag-and-drop, or draw a rectangle or other shape around, clip, or in some other manner indicate the section(s). In another example, the selection can be performed, for example, using a client browser plug-in.
The client-based component returns the user-specified information to another component of content delivery system 10 for storage and processing to facilitate content delivery. Non-limiting examples of information returned to the other component of content delivery system 10 include the uniform resource locator (URL) of the content portal and information that describes the user-selected region of the web page. Non-limiting examples of information that describes the user-selected region of the web page include a document object model (DOM) tree annotated with selected nodes or an XPath description (where XPath, XML Path Language, is a query language that is used for selecting nodes from an XML document).
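As a minimal sketch of the kind of information the client-based component might return, the following Python example pairs a content portal URL with an XPath description of the user-selected region and resolves that description against a page. The `RegionSelection` class, the `links_in_region` helper, the sample markup, and the `portal.example` URL are hypothetical and used only for illustration. The page is assumed to be well-formed XHTML so that the standard-library XML parser can read it.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class RegionSelection:
    """Hypothetical record returned by the client-based component."""
    portal_url: str    # URL of the content portal
    region_xpath: str  # XPath description of the user-selected region


def links_in_region(page: ET.Element, selection: RegionSelection) -> List[str]:
    """Return the href of every <a> element inside the selected region."""
    region = page.find(selection.region_xpath)
    if region is None:
        return []
    return [a.get("href") for a in region.iter("a") if a.get("href")]


# Illustrative, well-formed page with a navigation block and a headlines block.
PAGE = ET.fromstring(
    "<html><body>"
    "<div id='nav'><a href='/about'>About</a></div>"
    "<div id='headlines'>"
    "<a href='/news/1'>Story one</a><a href='/news/2'>Story two</a>"
    "</div>"
    "</body></html>"
)

selection = RegionSelection(
    portal_url="https://portal.example/",
    region_xpath=".//div[@id='headlines']",
)
article_links = links_in_region(PAGE, selection)
print(article_links)
```

In this sketch only the links inside the selected region are returned; the navigation link outside the region is ignored.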
In an example, the operations described in connection with block 202 can be performed on more than one web page. In this example, user input is received which indicates the selection of the sections of links in at least one content portal that point to the articles of interest for each of the web pages.
In block 204, an interface of the client-based component presents a field that requests the user specify a destination for delivery of the extracted content. The extracted content can be delivered to the specified destination through a number of different mechanisms. Non-limiting examples of destinations that the extracted content can be delivered to include a repository that the user creates on a server, an application (including a mobile application) distributed to and installed on the user's portable device, a printer connected to the internet that the user has access to, a retail print fulfillment center that the user specifies, and an email account. In a non-limiting example implementation of block 204, an application can be created and sent to an account that the user has with an electronic print center, which can then be downloaded to the user's printer to facilitate delivery of the extracted content to the user's printer.
The interface of the client-based component can also present a field that requests the user specify a content delivery schedule, including delivery dates and delivery times.
The interface of the client-based component can also present a field that requests the user specify the format in which the extracted content is delivered. The user may specify that the extracted content is delivered as a portable document format (PDF) document, as a web page (for example, based on a markup language file), or in an electronic book format (including an ebook or other electronic book accessible by an electronic reader). Non-limiting examples of applicable markup language files include an HTML file based on a variation of the markup language, including XHTML and HTML5, and a markup language embedded in or called from HTML including Cascading Style Sheets (CSS) and JavaScript. In an example, the extracted content is delivered in an electronic book format, including as an EPUB® file (a *.epub file). In another example, the user may specify that the extracted content is delivered as a link in an electronic transmission (such as email) or a web page and the user gains access to the body of the extracted content by following the link.
In an example where the operations of block 202 are performed on more than one web page, user input is received in block 204 which indicates the user-specified content delivery schedule, delivery destinations, and the format in which the extracted content is to be delivered for each web page. The delivery schedules, delivery destinations and formats for delivery of the extracted content can be specified as the same for content extracted from all web pages, different for content extracted from each different web page, or the same for content extracted from some web pages and not others.
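The choice between shared and per-page delivery settings can be sketched as follows. This is an illustrative Python example only: the `DeliveryPreference` record, the `resolve_preferences` helper, the destination strings, and the example URLs are assumptions introduced here and are not part of the described system.

```python
from dataclasses import dataclass
from datetime import time
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DeliveryPreference:
    """Hypothetical per-page delivery settings collected in block 204."""
    destinations: Tuple[str, ...]     # e.g. an email account or a printer
    delivery_times: Tuple[time, ...]  # times of day at which to deliver
    output_format: str                # e.g. "pdf", "epub", "html"


def resolve_preferences(
    page_urls: List[str],
    default: DeliveryPreference,
    overrides: Dict[str, DeliveryPreference],
) -> Dict[str, DeliveryPreference]:
    """Give each page its own settings if specified, else the shared default."""
    return {url: overrides.get(url, default) for url in page_urls}


shared = DeliveryPreference(("email:reader@example.com",), (time(7, 0),), "pdf")
sports = DeliveryPreference(("printer:home",), (time(7, 0), time(18, 0)), "epub")

resolved = resolve_preferences(
    ["https://news.example/", "https://sports.example/", "https://tech.example/"],
    default=shared,
    overrides={"https://sports.example/": sports},
)
for url, pref in resolved.items():
    print(url, pref.output_format, pref.destinations)
```

Here two pages share the same schedule, destination, and format, while one page carries its own settings, which corresponds to the "same for some web pages and not others" case.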
In an example where the operations of blocks 202 and 204 are performed on more than one web page, interface 400 allows a user to complete fields 405, 410 and 415 for each of the web pages. As illustrated in
In an example, the client-based component can be a browser plug-in, or an extension to a computer application. In another example, the client-based component can be a stand-alone program.
In an example, a user gains benefit of use of a system implementing functionality 200 by installing the client-based component on a user's client device, including a portable device or a computer-based viewing device.
The block diagram of
In an example, the operations described hereinbelow in connection with blocks 222, 224 and 226 can be performed on more than one web page. In this example, information indicative of user input is received in block 222 for each of the web pages. In block 224, content extraction rules are generated based on the analysis of the document structure of each of the web pages that includes the links pointing to the articles of interest. In block 226, content delivery is organized for delivery of the extracted content from each of the web pages. One or more content delivery templates 228 can be developed that include the content extraction rules generated in block 224. For example, a single content delivery template can be generated for extracting content from all of the web pages, or different content delivery templates can be generated for extracting content from the web pages, in some combination. The content delivery organization information from block 226 is used to configure the content delivery template 228 so that, when implemented, the extracted content from the web pages is delivered in the specified format to the specified destinations according to the specified schedule.
In block 224, a component of content delivery system 10 processes the user input from block 222. Using the region selection information received in block 222, the structure of the web page is analyzed and content extraction rules are generated. Non-limiting examples of systems and methods to implement algorithms that can be used for generating the extraction rules in block 224 are described in international application no. PCT/CN2009/075545 (publication no. WO2011/072434). In brief, the generated content extraction rules facilitate extracting web content in a web page by identifying paragraphs in the web content based on line-break node determination. A range of text-body associated with the identified paragraphs is identified using a maximum scoring subsequence. The identified text-body is refined using a heuristic rule of substantially horizontal alignment. The generated content extraction rules facilitate extracting one or more titles and one or more images associated with the web content. Other non-limiting example systems and methods to implement algorithms that can be used for generating the extraction rules in block 224 are described in international application no. PCT/CN2009/075117 (publication no. WO2011/063561). In brief, the example systems and methods extract content from a target web page (where the links of interest point to) by selecting data of interest in a source web page (the web page including the links of interest) and trying to locate corresponding data in a target web page by determining similarities in the DOM tree representations of the source and target web pages. The content extracting rules can be generated by defining a set of DOM trees that include the DOM tree of the source web page and a truncated DOM tree of the target web page, the truncated tree including all matched paths and all unmatched branches comprising a data node for which an alignment cost does not exceed a predefined threshold.
Using the extraction rules includes, for data residing in a node of a path of a subsequent target web page DOM tree matching the node in the matched path of the source web page DOM tree or the truncated target web page DOM tree, extracting the data. The extraction rules can be stored, e.g., on a server. In an example, extraction rules can be associated with an account created by the user.
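The maximum scoring subsequence step mentioned in the summary of the first referenced application can be illustrated with a generic sketch. The Python example below is not taken from that application: it assumes each identified paragraph has already been given a numeric score (positive for text-rich paragraphs, negative for link-heavy or boilerplate paragraphs, with the scores here invented for illustration) and uses the standard single-pass maximum-sum contiguous run technique to pick the text-body range.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end_exclusive, total) of the contiguous run of
    paragraph scores with the highest total."""
    if not scores:
        raise ValueError("at least one paragraph score is required")
    best_start, best_end, best_total = 0, 1, scores[0]
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; start a new run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical paragraph scores: navigation, byline, body text, footer links.
paragraph_scores = [-2, -1, 3, 4, -1, 5, -6, 1]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(start, end, total)  # 2 6 11
```

With these assumed scores, paragraphs 2 through 5 form the text-body range; the mildly negative paragraph at index 4 is kept because the paragraphs around it outweigh it.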
In a non-limiting example implementation of block 224, the web page document structure of a web page is analyzed to locate the positions of links in the DOM tree. Content extraction rules are derived to extract the regions containing these links. These content extraction rules can be stored on the server and associated with the user's account.
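One simple way to picture this example implementation is sketched below in Python. It is an assumption-laden illustration, not the method of the referenced applications: each link is located by the chain of ancestor elements above it, and the derived "rule" is simply the longest ancestor path shared by the links the user selected. The `LinkLocator`, `derive_rule`, and `apply_rule` names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple

Path = Tuple[str, ...]
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # have no end tag


class LinkLocator(HTMLParser):
    """Record, for every link, the chain of ancestor elements above it."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self.links: List[Tuple[Path, str]] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append((tuple(self._stack), attr["href"]))
        if tag not in VOID_TAGS:
            label = tag + ("#" + attr["id"] if attr.get("id") else "")
            self._stack.append(label)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i].split("#")[0] == tag:
                del self._stack[i:]
                break


def locate_links(html: str) -> List[Tuple[Path, str]]:
    parser = LinkLocator()
    parser.feed(html)
    parser.close()
    return parser.links


def derive_rule(html: str, selected_hrefs: Set[str]) -> Path:
    """Longest ancestor path shared by all user-selected links."""
    paths = [p for p, href in locate_links(html) if href in selected_hrefs]
    if not paths:
        raise ValueError("none of the selected links were found")
    rule = paths[0]
    for p in paths[1:]:
        n = 0
        while n < min(len(rule), len(p)) and rule[n] == p[n]:
            n += 1
        rule = rule[:n]
    return rule


def apply_rule(html: str, rule: Path) -> List[str]:
    """Collect links whose ancestor path starts with the stored rule."""
    return [href for p, href in locate_links(html) if p[: len(rule)] == rule]


TEMPLATE = (
    '<html><body><div id="nav"><a href="/about">About</a></div>'
    '<div id="top"><ul><li><a href="{0}">A</a></li>'
    '<li><a href="{1}">B</a></li></ul></div></body></html>'
)
today = TEMPLATE.format("/n/1", "/n/2")
tomorrow = TEMPLATE.format("/n/3", "/n/4")

rule = derive_rule(today, {"/n/1", "/n/2"})
print(rule)
print(apply_rule(tomorrow, rule))
```

Because the rule describes where the links sit in the document structure rather than the links themselves, the same stored rule picks up the new links when the portal content changes.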
In an example implementation of content delivery system 10 to deliver content, the content extraction rules generated in block 224 are used to analyze the web page and to analyze the links in the content portal of the regions indicated by the user.
The block diagram of
In an example, the operations described hereinbelow in connection with blocks 252, 254 and 256 can be performed on more than one web page. In this example, in block 252, extraction rules are applied to extract the content of interest according to the pre-set schedule for each of the web pages. In block 254, the extracted content from each of the web pages is composed according to the format that the user specified. The content extracted from the web pages can be composed into a single final document, or multiple documents, as specified by the user. In block 256, the composed content is delivered to the specified delivery destinations in the specified format(s) to provide the user with the personalized content 258 at the scheduled content delivery time(s).
In an example implementation, the functionality of blocks 252, 254 and 256 is used for run-time execution of content delivery to provide the personalized content 258. The content extraction rules are applied to web pages (consistent with block 252). Web content is fetched and the extracted web content is delivered to designated destinations according to set schedules (consistent with block 256). The schedules can be set and the destinations can be designated by a user. Article extraction technology can be applied to extract content from web pages. Non-limiting examples of article extraction technology are described in U.S. patent application Ser. No. 13/052,622, which describes systems and methods that can be used for determining the uniform resource locator associated with a printer friendly version of a webpage and retrieving the content. The extracted content can be composed to a layout structure (consistent with block 254). In an example, the extracted content can be composed to a layout structure specified by a user. In another example, the extracted content can be composed to an automated layout structure generated by a layout system. The composed content is delivered to designated destinations according to set schedules.
In an example, a component of content delivery system 10 applies the content extraction rules to the web page and converts information indicative of the extracted content into an RSS feed.
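A minimal sketch of such a conversion, using only the Python standard library, is shown below. The `build_rss` function, the item dictionary keys, and the sample titles and URLs are assumptions made for illustration; the element names (`rss`, `channel`, `item`, `title`, `link`, `description`) follow the RSS 2.0 format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_rss(title: str, link: str, description: str,
              items: List[Dict[str, str]]) -> str:
    """Serialize extracted articles as an RSS 2.0 document."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    for entry in items:
        item = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(item, tag).text = entry[tag]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical extracted content for two articles of interest.
extracted = [
    {"title": "Story one", "link": "https://portal.example/news/1",
     "description": "First extracted article."},
    {"title": "Story two", "link": "https://portal.example/news/2",
     "description": "Second extracted article."},
]
feed_xml = build_rss("My headlines", "https://portal.example/",
                     "Articles selected from the content portal", extracted)
print(feed_xml)
```

The resulting feed can be produced for portals whose owners do not publish an RSS feed of their own, since it is built from the extracted content rather than supplied by the content owner.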
In example implementations of functionality 250, content extraction rules are applied to fetch the content of interest from the user-selected content portal. The content portal includes links to the articles of interest. The articles that the links point to may change on a daily basis, or even at regular intervals throughout the day. As a result, the articles that are linked in the user-selected content portal also may change on a daily basis, or even at regular intervals throughout the day. Thus, the content of interest fetched when the system retrieves content from the content portal at a first time point may differ from the content fetched when the system retrieves content at a second time point, since the links in the user-selected content portal may change. The content extraction rules generated in block 224 are configured to fetch content at the user-indicated frequency based on the links in the user-selected content portal. In an example implementation of blocks 252, 254 and 256, the web page document structure for a new web page is analyzed at the scheduled time point, and the updated links for the articles of interest are collected from the user-selected content portal. Technology is applied to extract article content from the articles accessed by the links, the extracted content is composed according to a layout and the composed content is delivered to the user-specified destinations.
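The run-time sequence described above (collect the current links, extract the articles, compose, deliver) can be sketched as a small pipeline. In this Python illustration every stage is passed in as a function, and the in-memory portal snapshots, article texts, and `outbox` list are hypothetical stand-ins for network fetching, article extraction, layout, and delivery.

```python
from typing import Callable, Dict, List


def run_scheduled_delivery(
    portal_url: str,
    collect_links: Callable[[str], List[str]],
    extract_article: Callable[[str], str],
    compose: Callable[[List[str]], str],
    deliver: Callable[[str], None],
) -> str:
    """One scheduled run: collect links, extract, compose, deliver."""
    links = collect_links(portal_url)                      # block 252
    articles = [extract_article(link) for link in links]   # block 252
    document = compose(articles)                           # block 254
    deliver(document)                                      # block 256
    return document


# Stand-in data: the portal links differ between two time points.
PORTAL_SNAPSHOTS: Dict[str, List[str]] = {
    "morning": ["/a/1", "/a/2"],
    "evening": ["/a/2", "/a/3"],
}
ARTICLES = {"/a/1": "First article.", "/a/2": "Second article.",
            "/a/3": "Third article."}
outbox: List[str] = []


def collector_at(time_point: str) -> Callable[[str], List[str]]:
    return lambda portal_url: PORTAL_SNAPSHOTS[time_point]


morning = run_scheduled_delivery(
    "https://portal.example/", collector_at("morning"),
    ARTICLES.__getitem__, "\n\n".join, outbox.append)
evening = run_scheduled_delivery(
    "https://portal.example/", collector_at("evening"),
    ARTICLES.__getitem__, "\n\n".join, outbox.append)
print(morning)
print(evening)
```

The two runs use the same pipeline and the same portal URL but deliver different documents, because the links collected at each scheduled time point differ.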
An example implementation of the functionality of 252, 254 and 256 of
As illustrated in the example implementation of
A system and method according to a principle described herein can provide a superior reading experience to a user by collecting content in one place without requiring the user to click through multiple links manually. A system and method herein can be applied to much of the content of a web page. The content selection can be more direct from the perspective of the user, since the mark-up to indicate the section including the articles of interest on the web page is done directly from the content portal.
Referring now to
In an example, a method for receiving user input for use in configuring content delivery to the user can be performed based on more than one web page. In this example, the method includes displaying at least one interface for receiving user input that indicates the selection of the sections of links in content portals of web pages that point to the articles of interest, and displaying at least one interface for receiving user input that indicates specified content delivery schedules, delivery destinations, and formats in which the extracted content is to be delivered. The user input received, including information indicative of user-selected sections of the web pages and specified content delivery schedules, delivery destinations, and delivered content formats, is stored to a memory. The delivery schedules, delivery destinations and formats for delivery of the extracted content can be specified as the same for content extracted from all web pages, different for content extracted from each different web page, or the same for content extracted from some web pages and not others.
Referring now to
In an example, a method for generating content extraction rules and content delivery template(s) for use in content delivery can be performed based on more than one web page. In this example, the method includes receiving information indicative of user-selected sections of the web pages that include links to the articles of interest, specified content delivery schedules, and delivery destinations, and generating content extraction rules based on the user-selected sections of the content portals of the web pages. The method also includes organizing the content delivery based on the specified content delivery schedules and delivery destinations, and generating at least one content delivery template based on the content extraction rules and the content delivery organization. A single content delivery template can be generated for extracting content from all of the web pages, or different content delivery templates can be generated for extracting content from the web pages, in some combination.
Referring now to
In an example, a method for generating content extraction rules and content delivery template(s) for use in content delivery can be performed based on more than one web page. In this example, the method includes applying content extraction rules to extract the content of interest pointed to by links in the user-selected sections of the web pages according to a specified schedule(s), and composing the extracted content according to the format(s) that the user specified. The method also includes delivering the composed content to specified delivery destinations at the scheduled content delivery time(s) to provide a user with personalized content. The content extracted from the web pages can be composed into a single final document, or multiple documents, as specified by the user.
Interactions may be made with the computer system 110 (e.g., by entering commands or data) using one or more input devices 120 (e.g., a keyboard, a computer mouse, a microphone, joystick, or a touch pad). Information may be presented through a user interface that is displayed to a user on the display 121 (implemented by, e.g., a display monitor or display screen), which is controlled by a display controller 124. The display controller may be implemented by, e.g., a video graphics card. The display 121 can be a display screen of a portable viewing device or computer-based viewing device. The computer system 110 may include peripheral output devices, such as speakers and a printer. In an example where computer system 110 is, e.g., a desktop computer or a laptop computer, the computer system 110 may include a network interface card (NIC) 126 that facilitates connection with one or more remote computers.
As shown in
Content delivery system 10 may include one or more discrete data processing components, each of which may be in the form of any one of various commercially available data processing chips. In some implementations, the content delivery system 10 is embedded in the hardware of any one of a wide variety of digital and analog computer devices, including desktop, workstation, server computers, portable devices, and computer-based viewing devices. In some examples, the content delivery system 10 executes process instructions (e.g., machine-readable code, such as computer software) in the process of implementing the methods that are described herein. These process instructions, as well as the data generated in the course of their execution, are stored in one or more computer-readable media. Storage devices suitable for tangibly embodying these instructions and data include all forms of non-volatile computer-readable memory, including, for example, semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices, magnetic disks such as internal hard disks and removable hard disks, magneto-optical disks, DVD-ROM/RAM, and CD-ROM/RAM.
The principles set forth herein extend equally to any alternative configuration in which content delivery system 10 has access to web content 12. As such, alternative examples within the scope of the principles of the present specification include examples in which the content delivery system 10 is implemented by the same computer system, examples in which the functionality of the content delivery system 10 is implemented by multiple interconnected computers (e.g., partially on a server in a data center and partially on a user's client machine), and examples in which the content delivery system 10 communicates with portions of computer system 110 directly through a bus without intermediary network devices.
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form described. Many modifications and variations are possible in light of the above teaching.
Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific examples described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.
As an illustration of the wide scope of the systems and methods described herein, the systems and methods described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US11/54150 | 9/30/2011 | WO | 00 | 2/11/2014 |