Glossary
The following definitions are offered for purposes of illustration, not limitation, in order to assist with understanding the discussion that follows.
HTML: HTML stands for HyperText Markup Language, the authoring language used to create documents on the World Wide Web. HTML defines the structure and layout of a Web document by using a variety of tags and attributes. For further description of HTML, see e.g., “HTML 4.01 Specification”, a World Wide Web consortium recommendation dated Dec. 24, 1999, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at www.w3.org/TR/REC-html40).
HTTP: HTTP is the acronym for HyperText Transfer Protocol, which is the underlying communication protocol used by the World Wide Web on the Internet. HTTP defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands. For example, when a user enters a URL in his or her browser, this actually sends an HTTP command to the Web server directing it to fetch and transmit the requested Web page. Further description of HTTP is available in “RFC 2616: Hypertext Transfer Protocol—HTTP/1.1,” the disclosure of which is hereby incorporated by reference. RFC 2616 is available from the World Wide Web Consortium (W3C), and is available via the Internet (e.g., currently at www.w3.org/Protocols/). Additional description of HTTP is available in the technical and trade literature, see e.g., Stallings, W., “The Backbone of the Web,” BYTE, October 1996, the disclosure of which is hereby incorporated by reference.
Network: A network is a group of two or more systems linked together. There are many types of computer networks, including local area networks (LANs), virtual private networks (VPNs), metropolitan area networks (MANs), campus area networks (CANs), and wide area networks (WANs) including the Internet. As used herein, the term “network” refers broadly to any group of two or more computer systems or devices that are linked together from time to time (or permanently).
RSS: RSS is short for RDF Site Summary or Rich Site Summary, an XML format for syndicating Web content. A Web site that wants to allow other sites to publish some of its content creates an RSS document and registers the document with an RSS publisher. A user that can read RSS-distributed content can use the content on a different site. Syndicated content includes such data as news feeds, events listings, news stories, headlines, project updates, excerpts from discussion forums or even corporate information.
URL: URL is an abbreviation of Uniform Resource Locator, the global address of documents and other resources on the World Wide Web. The first part of the address indicates what protocol to use, and the second part specifies the IP address or the domain name where the resource is located.
Winsock: Windows Sockets 2 (Winsock) is a Microsoft-provided interface that enables programmers to create advanced Internet, intranet, and other network-capable applications to transmit application data across the wire, independent of the network protocol being used. With Winsock, programmers are provided access to advanced Microsoft Windows networking capabilities such as multicast and Quality of Service (QOS). Winsock follows the Windows Open System Architecture (WOSA) model; it defines a standard service provider interface (SPI) between the application programming interface (API), with its exported functions and the protocol stacks. It uses the sockets paradigm that was first popularized by Berkeley Software Distribution (BSD) UNIX. It was later adapted for Windows in Windows Sockets 1.1, with which Windows Sockets 2 applications are backward compatible. Winsock programming previously centered around TCP/IP. Some programming practices that worked with TCP/IP do not work with every protocol. As a result, the Windows Sockets 2 API adds functions where necessary to handle several protocols. For further information regarding Winsock, see e.g., “Winsock Reference”, available from Microsoft Corporation, the disclosure of which is hereby incorporated by reference. A copy of this documentation is available via the Internet (e.g., currently at msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/winsock-reference.asp).
XML: XML stands for Extensible Markup Language, a specification developed by the World Wide Web Consortium (W3C). XML is a pared-down version of the Standard Generalized Markup Language (SGML), a system for organizing and tagging elements of a document. XML is designed especially for Web documents. It allows designers to create their own customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. For further description of XML, see e.g., “Extensible Markup Language (XML) 1.0”, (2nd Edition, Oct. 6, 2000) a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at www.w3.org/TR/REC-xml).
Referring to the figures, exemplary embodiments of the invention will now be described. The following description will focus on the presently preferred embodiment of the present invention, which is implemented in desktop and/or server software (e.g., driver, application, or the like) operating in an Internet-connected environment running under an operating system, such as the Microsoft Windows operating system. The present invention, however, is not limited to any one particular application or any particular environment. Instead, those skilled in the art will find that the system and methods of the present invention may be advantageously embodied on a variety of different platforms, including Macintosh, Linux, Solaris, UNIX, FreeBSD, and the like. Therefore, the description of the exemplary embodiments that follows is for purposes of illustration and not limitation. The exemplary embodiments are primarily described with reference to block diagrams or flowcharts. As to the flowcharts, each block within the flowcharts represents both a method step and an apparatus element for performing the method step. Depending upon the implementation, the corresponding apparatus element may be configured in hardware, software, firmware, or combinations thereof.
Basic System Hardware and Software (e.g., for Desktop and Server Computers)
The present invention may be implemented on a conventional or general-purpose computer system, such as an IBM-compatible personal computer (PC) or server computer.
CPU 101 comprises a processor of the Intel Pentium family of microprocessors. However, any other suitable processor may be utilized for implementing the present invention. The CPU 101 communicates with other components of the system via a bi-directional system bus (including any necessary input/output (I/O) controller circuitry and other “glue” logic). The bus, which includes address lines for addressing system memory, provides data transfer between and among the various components. Description of Pentium-class microprocessors and their instruction set, bus architecture, and control lines is available from Intel Corporation of Santa Clara, Calif. Random-access memory 102 serves as the working memory for the CPU 101. In a typical configuration, RAM of sixty-four megabytes or more is employed. More or less memory may be used without departing from the scope of the present invention. The read-only memory (ROM) 103 contains the basic input/output system code (BIOS)—a set of low-level routines in the ROM that application programs and the operating systems can use to interact with the hardware, including reading characters from the keyboard, outputting characters to printers, and so forth.
Mass storage devices 115, 116 provide persistent storage on fixed and removable media, such as magnetic, optical or magnetic-optical storage systems, flash memory, or any other available mass storage technology. The mass storage may be shared on a network, or it may be a dedicated mass storage. As shown in
In basic operation, program logic (including that which implements methodology of the present invention described below) is loaded from the removable storage 115 or fixed storage 116 into the main (RAM) memory 102, for execution by the CPU 101. During operation of the program logic, the system 100 accepts user input from a keyboard 106 and pointing device 108, as well as speech-based input from a voice recognition system (not shown). The keyboard 106 permits selection of application programs, entry of keyboard-based input or data, and selection and manipulation of individual data objects displayed on the screen or display device 105. Likewise, the pointing device 108, such as a mouse, track ball, pen device, or the like, permits selection and manipulation of objects on the display device. In this manner, these input devices support manual user input for any process running on the system.
The computer system 100 displays text and/or graphic images and other data on the display device 105. The video adapter 104, which is interposed between the display 105 and the system's bus, drives the display device 105. The video adapter 104, which includes video memory accessible to the CPU 101, provides circuitry that converts pixel data stored in the video memory to a raster signal suitable for use by a cathode ray tube (CRT) raster or liquid crystal display (LCD) monitor. A hard copy of the displayed information, or other information within the system 100, may be obtained from the printer 107, or other output device. Printer 107 may include, for instance, an HP Laserjet printer (available from Hewlett Packard of Palo Alto, Calif.), for creating hard copy images of output of the system.
The system itself communicates with other devices (e.g., other computers) via the network interface card (NIC) 111 connected to a network (e.g., Ethernet network, Bluetooth wireless network, or the like), and/or modem 112 (e.g., 56K baud, ISDN, DSL, or cable modem), examples of which are available from 3Com of Santa Clara, Calif. The system 100 may also communicate with local occasionally-connected devices (e.g., serial cable-linked devices) via the communication (COMM) interface 110, which may include a RS-232 serial port, a Universal Serial Bus (USB) interface, or the like. Devices that will be commonly connected locally to the interface 110 include laptop computers, handheld organizers, digital cameras, and the like.
IBM-compatible personal computers and server computers are available from a variety of vendors. Representative vendors include Dell Computers of Round Rock, Tex., Hewlett-Packard of Palo Alto, Calif., and IBM of Armonk, N.Y. Other suitable computers include Apple-compatible computers (e.g., Macintosh), which are available from Apple Computer of Cupertino, Calif., and Sun Solaris workstations, which are available from Sun Microsystems of Mountain View, Calif.
A software system is typically provided for controlling the operation of the computer system 100. The software system, which is usually stored in system memory (RAM) 102 and on fixed storage (e.g., hard disk) 116, includes a kernel or operating system (OS) which manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O. The OS can be provided by a conventional operating system, such as Microsoft Windows NT, Microsoft Windows 2000, Microsoft Windows XP, or Microsoft Windows Vista (Microsoft Corporation of Redmond, Wash.), or an alternative operating system, such as the previously mentioned operating systems. Typically, the OS operates in conjunction with device drivers (e.g., “Winsock” driver—Windows' implementation of a TCP/IP stack) and the system BIOS microcode (i.e., ROM-based microcode), particularly when interfacing with peripheral devices. One or more application(s), such as client application software or “programs” (i.e., set of processor-executable instructions), may also be provided for execution by the computer system 100. The application(s) or other software intended for use on the computer system may be “loaded” into memory 102 from fixed storage 116 or may be downloaded from an Internet location (e.g., Web server). A graphical user interface (GUI) is generally provided for receiving user commands and data in a graphical (e.g., “point-and-click”) fashion. These inputs, in turn, may be acted upon by the computer system in accordance with instructions from OS and/or application(s). The graphical user interface also serves to display the results of operation from the OS and application(s).
The above-described computer hardware and software are presented for purposes of illustrating the basic underlying computer components that may be employed for implementing the present invention. For purposes of discussion, the following description will present examples in which it will be assumed that there exists one Internet-enabled computer, such as a “server” (e.g., Web server), that provides information content to clients (e.g., desktop computers, laptop computers, mobile devices, and the like). The present invention, however, is not limited to any particular environment or device configuration. In particular, a client/server distinction is not necessary to the invention, but is used to provide a framework for discussion. Instead, the present invention may be implemented in any type of system architecture or processing environment capable of supporting the methodologies of the present invention presented in detail below.
Current Mobile Browsers
Today there are two major groups of browsers for the handheld market. Both approaches require a “server” to pre-process Web pages for a reduced page markup browser. The first group of mobile browsers, referred to herein as the “client-server” approach, includes the following browser products: Opera-mini browser, AvantGo, and Minimo (reduced version of Mozilla/Firefox). The second group of “server” solutions relies on a server to transpose HTML into a modified form of the original Web page. Examples of this group include Squeezer (used by Askjeeves/Moreover). Squeezer divides a page into a serial stream of areas and unfortunately delivers a lot of unwanted content to the handheld device.
In accordance with the present invention, a third alternative is provided: the RSS Builder. The RSS Builder creates RSS feeds “on demand” for the user from pages that do not have RSS feeds. Benefits of this approach include:
Users are primarily concerned about obtaining the information they want, and are relatively unconcerned about what browser they use. In servicing these users, RSS and “feeds” are better suited to a small form factor than any “reduced set” HTML. Small “feature phones” will be around a long time before being replaced by more expensive “smart phones,” so it is important to address the needs of users of these devices.
In accordance with the present invention, a user with a “feature phone” may easily retrieve content from practically anywhere on the Web, despite the fact that such a phone lacks any sort of Web browser capability. The basic approach is illustrated in
Today, the world may be divided into low-end “feature phones” and high-end “smart phones.” The feature phone device may be thought of as a server-based “thin-client.” The high-end devices, on the other hand, are able to run sophisticated software, such as Sybase Content Capture Technology software (available from Sybase, Inc. of Dublin, Calif.). Sybase Content Capture Technology is a sophisticated toolset that can extract, aggregate and integrate information quickly and easily to provide a unique, targeted view of data. Delivering HTML Web content to both devices poses a difficult problem because of the limited display characteristics of the devices. The “RSS Builder” of the present invention provides a solution to this problem.
In basic operation, the RSS Builder receives a URL (of a target Web page) as input and returns an XML “RSS Feed” of that page as output. The approach has several advantages and possible applications. The approach allows the implementation of a “Discovery” feature that allows the user to search for RSS feeds by entering a URL. If no RSS feeds are available for that URL, the RSS Builder can return to the user an RSS feed that is created in real time. Using a “feature phone,” the user can enter a URL or select a URL from a favorites list, which retrieves an RSS feed that is generated by the server. Using a “lightweight” RSS Reader (i.e., portion of the RSS Builder that is deployed to end-user devices), one can reach the “vast majority” of Web content using a modest feature phone (e.g., modest processing capability). The small size of the RSS Builder makes it possible to place a subset of RSS retrieval functions inside the SIM Card of an inexpensive feature phone. The full-featured (i.e., server side) RSS Builder can also reside as part of a server configuration and send the results of its content retrieval to any mobile device.
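The output side of this URL-in, feed-out operation can be illustrated with a minimal sketch. It assumes the titles and links have already been extracted from the target page by some earlier parsing step; the function name build_rss, the example.com URLs, and the generated description text are hypothetical and are not part of the described system.

```python
import xml.etree.ElementTree as ET


def build_rss(page_url, page_title, items):
    """Return an RSS 2.0 document (as a string) for one page.

    items is a list of (title, link) pairs, assumed to have been
    extracted from the target page by an earlier parsing step.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = page_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Feed generated from " + page_url
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical page and one extracted headline.
feed = build_rss(
    "http://example.com/news",
    "Example News",
    [("First headline", "http://example.com/news/1.htm")],
)
print(feed)
```

Each extracted anchor or article becomes one item element, so a small RSS reader only needs to display titles and follow links.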
Carriers and manufacturers that are concerned with “spectrum bandwidth” requirements of their handheld devices can replace the existing HTML browser with the lightweight RSS Reader, thereby allowing their users to reach the content they desire using less “connect time” than when using a browser. Delivery time to download and render dynamic RSS feeds is also much faster than trying to “transpose” HTML to a small display format. In this manner, the RSS Builder of the present invention can extend the life of millions of feature phones.
Users of Sybase mFolio's desktop Web application can use “ultra-personalization” to reach their desired content on the Web. (Sybase mFolio is designed to take advantage of the increasing computational power of today's convergent devices, such as smartphones and PDAs, and helps carriers to offer any viewpoint of regular Web content on the handset easily, without additional coding.) Using the RSS Builder system, one can create an RSS feed of areas and articles for each mFolio “content page category” (sports, news, schedule, and the like) and then simply click on an RSS title to retrieve the article or area content. For example, the user may have several news articles on a single page tab called “World News.” In the handheld device, the user is now presented with a “World News” feed installed on his or her handheld. Clicking on World News, the user sees the title of each of his or her aggregated articles. Clicking on an article's title invokes an “article capture” feature, which returns the article text. From start to finish, no browser is needed. Adoption of this dynamic RSS may complement “viewpoint” capture features for high end phones. With the RSS Builder of the present invention, one may implement a range of solutions that spans any device with an RSS reader, or a simple browser that can display RSS feeds.
In accordance with the present invention, RSS search is also supported. Here, an ordinary Web search (e.g., “Google search”) may be rendered into an RSS feed, which in turn goes to any results page and returns a corresponding RSS feed for that page. In the case of Google searches, Google allows a user to set preferences via a personalization page to deliver a format friendly to mobile devices. However, Google is not able to help the user view content from sources which are themselves returned from a source. The RSS Builder of the present invention, in contrast, translates search results for a mobile device as well as translates pages listed in the search results for format on a mobile device. In this manner, the RSS Builder of the present invention provides mobile users complete access to the information that they really want.
If desired, the RSS search may be further divided based on result type. For example, Google results can be parsed by the RSS Builder so that users see only results of the kind they desire. A user could, for instance, request a search for “articles only” or “headlines only.” In response to such a request, the RSS Builder may examine each result page and limit the corresponding RSS feed to only those pages that meet the user's needs (i.e., desired type).
In accordance with the present invention, an HTML to RSS conversion methodology is provided. In this regard, a portion of the methodology may be implemented using existing Sybase Content Capture Technology, including:
FEParser: used to extract “visible text” and “anchor text.”
Article capture: used to extract an article from a page.
CCL: used as the href for each RSS bullet; it points back to the source area, article, or page containing an article.
Additionally, new “text attributes” are defined to assist with identifying more than 70 text styles on Web pages, as illustrated on
Page Pattern Recognition
When a page is requested by the mobile user, the page is parsed and each area on the page is surveyed to determine the number of “information” objects within that area. The results of the page survey are used to categorize the page into one of several strategies. Once the page category is defined, the page is parsed again to generate the best possible RSS output for that particular Web page.
Strategies will not only determine the RSS parser (RSSBuilderParser) being used but also several other aspects. For example, the particular strategy will change the tags of the final output depending on the needs of the desired page. Style sheet information may also change depending on the display/browser requirements of the mobile or handheld device. The particular strategy will also determine the XSLT transform to apply when style sheets are used, as well as determining the navigation tags to take the user back to “Home” or to drill down to the next area or news group on a page.
FEPageMetrics
Page metrics are determined by a new FEPageMetrics class, which is designed as a “drop-into” component, for example for use in Sybase Content Capture Technology (lightweight content integration engine portion of the above-mentioned toolset). In operation, FEPageMetrics is passed a URL as part of a CachedURL object. In response, it retrieves the page and surveys the “most important” characteristics of the page. FEPageMetrics includes a “BuildFinal” method which returns a report of the page. For instance, the following is a sample report for CNN.com:
As shown, the page results are broken down into “Page Construction”, “Page Content” and “Page Layout” categories. In this manner, the page results or metrics may be used by the system to give a good identification of the underlying page type.
FEPageTerms
“Page Terms,” essentially comprising short lists of terms, are employed to help identify a particular page strategy. For example, “Page Terms” and corresponding strategies may be defined for “News Page,” “Finance Page,” “Catalog,” “Blog,” and “Navigation,” as follows:
News Page
World News, Sports, Top Stories, Technology, Politics, Health, Travel, Education, Law, Entertainment, World
Finance Page
Business News, Markets, Quotes, Latest News, Companies, Technology, Finance
Catalog
Price, Lot, Description, Testimonials, Products, Shopping, Account, Hot
Blog
Trackback, Comments, Posted by
Navigation
More . . . , Next . . .
During system operation, each term list is loaded into a hashtable when the FEPageTerms (object) is initialized. As the page is parsed, runs of text less than a preset amount (e.g., 15 characters) are examined to reduce the amount of CPU time necessary to count special terms. The terms are saved in text files to be loaded by the FEPageTerms object and can be easily localized for languages other than English. In the currently preferred embodiment, the approach taken is to not build an exhaustive list of terms that might be helpful, but instead build the smallest list possible that will help identify a page type.
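The term lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the FEPageTerms implementation: the term lists are abbreviated and held in memory rather than loaded from text files, the function names are hypothetical, and a term that appears in more than one list would map to only one category in this simplified table.

```python
# Abbreviated, hypothetical term lists; the text describes loading
# these from text files so they can be localized.
PAGE_TERMS = {
    "news": ["World News", "Sports", "Top Stories", "Politics"],
    "finance": ["Business News", "Markets", "Quotes", "Companies"],
    "blog": ["Trackback", "Comments", "Posted by"],
}
MAX_RUN_LENGTH = 15  # only runs shorter than this are examined


def build_term_table(page_terms):
    """Map each lower-cased term to its page category (a hash table)."""
    table = {}
    for category, terms in page_terms.items():
        for term in terms:
            table[term.lower()] = category
    return table


def count_terms(text_runs, table, max_len=MAX_RUN_LENGTH):
    """Count, per category, the short text runs that match a known term."""
    counts = {}
    for run in text_runs:
        run = run.strip()
        if len(run) >= max_len:
            continue  # long runs are skipped to save CPU time
        category = table.get(run.lower())
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts


table = build_term_table(PAGE_TERMS)
runs = ["Markets", "Quotes", "Sports", "A long paragraph of body text here"]
print(count_terms(runs, table))  # {'finance': 2, 'news': 1}
```

The length check runs before the hash lookup, so body text never reaches the table at all.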
Navigation
In accordance with the present invention, dynamic navigation tags are added to the RSS content that is delivered to the mobile device. This provides the end user with a substantially improved means for navigating content, allowing the user to browse the Web without a full HTML browser. In this manner, the system of the present invention not only provides an HTML to RSS bridge for mobile devices but also the means to use a very small RSS reader as a “browser.”
As previously discussed, the results of the page survey are used to categorize a given target page into one of several strategies. The particular page strategy that a given page is categorized as determines the particular parser (i.e., particular version of RSSBuilderParser) that is applied to the page. In the currently preferred embodiment, the following page strategies are defined:
Anchor Page: more than 50% of the content is from anchor tags. (HTML uses the <a> (anchor) tag to create a link to another document.)
Blog page: anchors followed by predictable runs of text with a clear pattern.
Aggregation page: anchors with short summary of source page.
Article: some anchors but the main feature is a “high score” text article.
Search results: lists of anchors with short description.
Photo pages: collection of photo pages from album sites.
mFolio page: photo pages comprising a collection of capture descriptions and CCL (Content Collection Language) statements that have been created by a desktop user of Sybase mFolio. (Sybase mFolio is a standalone, embeddable mobile application software and mobile data solution that delivers highly focused, device-optimized mobile media browser and syndicated content to end users.)
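As an illustration of the first strategy test listed above (more than 50% of content from anchor tags), the following sketch measures the share of visible characters that fall inside anchor elements. Measuring “content” by character count is an assumption made here, the class and function names are hypothetical, and script or style text is not filtered out in this simplified version.

```python
from html.parser import HTMLParser


class AnchorRatioParser(HTMLParser):
    """Tally visible characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.anchor_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def is_anchor_page(html, threshold=0.5):
    """Return True when anchor text exceeds the given share of all text."""
    parser = AnchorRatioParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return False
    return parser.anchor_chars / parser.total_chars > threshold


sample = '<p>Hi</p><a href="/a">Story one</a><a href="/b">Story two</a>'
print(is_anchor_page(sample))  # True: 18 of 20 visible characters are anchor text
```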
The following presents specific sample URLs and corresponding rules for the various page strategies.
Example URL:
http://cnnfn.com (illustrated in
Rules:
Each anchor will have a RSS bullet title.
RSS title will have the original HREF of the cnnfn page article.
Ignore ads if possible.
Rank titles in RSS by anchor title style. For example, place large-font titles at the top of the RSS feed.
Ignore anchors to index pages such as “cnn.com/sports/” but instead favor content (e.g., cnn.com/news/1223334.htm).
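Two of the rules above (favoring content links over index pages, and ranking titles by style) can be sketched as follows. The sketch assumes that an index page is recognizable by a URL path that is empty or ends in a slash, and that each anchor carries a numeric font size; both are simplifying assumptions, and the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_index_link(href):
    """Treat a link as an index page when its path is empty or ends with '/'."""
    path = urlparse(href).path
    return path == "" or path.endswith("/")


def rank_anchors(anchors):
    """Drop index links, then order the rest by font size, largest first.

    anchors is a list of (title, href, font_size) tuples; font_size is a
    hypothetical stand-in for the anchor title style.
    """
    content = [a for a in anchors if not is_index_link(a[1])]
    return sorted(content, key=lambda a: a[2], reverse=True)


anchors = [
    ("Sports", "http://cnn.com/sports/", 12),
    ("Small story", "http://cnn.com/news/2.htm", 10),
    ("Big story", "http://cnn.com/news/1223334.htm", 18),
]
for title, href, size in rank_anchors(anchors):
    print(title, href)
```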
The XML generated by the system of the present invention may be compared with the XML from the same page that is generated by the target Web site (e.g., cnnfn.com). A high degree of overlap of articles in both their native XML form and generated XML form provides an indication that good results are being obtained on sites without an XML feed.
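One simple way to quantify such overlap is to compare the item links of the two feeds. The sketch below is only one possible measure, chosen here for illustration: it reports the fraction of native-feed links that also appear in the generated feed. The function names and sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET


def item_links(rss_xml):
    """Collect the <link> text of every <item> in an RSS document."""
    root = ET.fromstring(rss_xml)
    return {
        link.text.strip()
        for link in root.findall(".//item/link")
        if link.text
    }


def overlap(native_xml, generated_xml):
    """Fraction of native-feed links that also appear in the generated feed."""
    native = item_links(native_xml)
    generated = item_links(generated_xml)
    if not native:
        return 0.0
    return len(native & generated) / len(native)


native_feed = (
    "<rss><channel>"
    "<item><link>http://example.com/a.htm</link></item>"
    "<item><link>http://example.com/b.htm</link></item>"
    "</channel></rss>"
)
generated_feed = (
    "<rss><channel>"
    "<item><link>http://example.com/a.htm</link></item>"
    "<item><link>http://example.com/c.htm</link></item>"
    "</channel></rss>"
)
print(overlap(native_feed, generated_feed))  # 0.5
```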
Example URL:
http://www.realclearpolitics.com/articles/2006/04/the_congresswoman_and the_admi.html (illustrated in
Rules:
Looking for an “area” enclosed within a table with a large run of text.
The “article text run” should be the dominant run on the page. In other words, there generally should not be more than one article on a page. Styles for title and date timestamp are used to identify more than one article.
The article should not have “many” embedded anchors.
Looking for text styles that suggest an area such as 1 title style and something akin to a “body text” style.
A “date” style is also helpful.
Embedded tables are helpful in identifying advertisements which are not included in the returned article.
There is a “minimum text length” that will define an article.
Example URL:
http://www.horsepigcow.com/index.html (illustrated in
Rules:
Very similar to an article page but with a repeated pattern of date, title, and text run.
Should also have repeated patterns of blog terms such as “posted by”, “trackback” and “comments.”
If a desired page is parsed and the system cannot with some degree of certainty identify the type of page, then the page is denoted as “unknown.” Unknown pages may be processed as follows. From the page contents present, the system can deliver a “best guess” of what is most important on that page, such as a list of anchors or “visible text.” In the currently preferred embodiment, the system displays a “Show More” option (e.g., displayed at the bottom of the mobile device screen) that the user may invoke to instruct the system to deliver another page part to the mobile device. In the event that the foregoing is not possible (e.g., because the system cannot define the best part of the page to deliver), the system may display additional options (e.g., “Show Links”, “Show Text”, and “Show Images”) allowing the mobile device user to select individual portions of the unknown page type.
RSSBuilder Servlets
In the currently preferred embodiment, server-side program logic is implemented via RSSBuilder servlets. For deployment, for example, the RSSBuilder servlets can be easily added to Sybase Content Integrator (available from Sybase, Inc. of Dublin, Calif.) by modifying the web.xml file with the servlet names and moving the servlets and related parser classes to the “classes” folder in core. (Sybase Content Integrator is an intuitive toolset built for software vendors that immediately adds content extraction, aggregation, and transformation functionality to existing applications so users have access to clean, targeted data.)
In an exemplary invocation, a servlet is called with the following parameters (arguments):
RSSSearch: localhost:8080/core/rsssearch?q=[search words]
RSSPlayback: localhost:8080/core/rssplayback?a=[url of desired page to translate to RSS]
RSSDumpReport: localhost:8080/core/rssdump?a=[url of desired page to translate to RSS]
The RSSSearch parameter specifies a search query and a target search site (e.g., Google) to perform the query; the results are translated into an RSS feed. The RSSPlayback parameter specifies the translation of any URL passed to it into an RSS feed. The RSSDumpReport parameter specifies the determination of the “page metrics” of any URL passed to it.
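A client calling these servlets needs to percent-encode the search words or target URL before placing them in the query string. The following sketch builds the three request URLs; the paths and the q and a parameter names come from the invocations listed above, while the http scheme and the helper function names are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Host, port, and path prefix as given in the text; the scheme is assumed.
BASE = "http://localhost:8080/core"


def rss_search_url(words):
    """URL that asks RSSSearch to run a query and return an RSS feed."""
    return BASE + "/rsssearch?" + urlencode({"q": words})


def rss_playback_url(page_url):
    """URL that asks RSSPlayback to translate page_url into an RSS feed."""
    return BASE + "/rssplayback?" + urlencode({"a": page_url})


def rss_dump_url(page_url):
    """URL that asks RSSDumpReport for the page metrics of page_url."""
    return BASE + "/rssdump?" + urlencode({"a": page_url})


print(rss_search_url("world news"))
print(rss_playback_url("http://cnnfn.com"))
print(rss_dump_url("http://cnnfn.com"))
```

Encoding matters most for the a parameter, because the target URL may itself contain characters such as “?” and “&” that would otherwise be read as part of the servlet's own query string.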
The following description presents method steps that may be implemented using processor-executable instructions, for directing operation of a device under processor control. The processor-executable instructions may be stored on a computer-readable medium, such as CD, DVD, flash memory, or the like. The processor-executable instructions may also be stored as a set of downloadable processor-executable instructions, for example, for downloading and installation from an Internet location (e.g., Web server).
In accordance with the present invention, an improved method for converting HTML to RSS is provided by using a page pattern recognition approach. The present invention allows a user of a feature phone to enter a URL (or go to their favorites list) for a Web page of interest, whereupon the system of the present invention retrieves the Web page and examines every object on the page, in order to determine what type of page the Web page is.
As an additional feature of the present invention, the method may synthesize navigational links on-the-fly, based on its knowledge of the target page. As indicated at step 608, the above-described transformation is augmented by further processing the page to synthesize navigational links (i.e., links that were not present in the original page, as retrieved at step 602). In this manner, the present invention not only delivers information in a format suitable (i.e., viewable) for the feature phone, but also inserts navigational aids into the rendered page to provide the user with a means to navigate through that rendered content. For a rendered page that has an underlying page type of blog, for example, the method may synthesize a “Next” link that would retrieve the next blog comment. In a similar manner, a rendered article page may include a synthesized “Next” link that retrieves the next text block (e.g., 200-word text block) for the article. For an anchor page type (i.e., includes multiple anchors), the method may synthesize a “Home” link that navigates back to a base navigation page (determined based on page types). In this manner, the present invention provides the user with rich navigation capability without the feature phone itself requiring any additional software, such as browser software. As a result, the present invention provides the feature phone with browser-like capability without the requirement for an expensive processor and memory (ordinarily required for running browser software). Additionally, the approach economizes use of the screen “real estate” of the small device by avoiding the display of static glyphs (i.e., browser back, forward, and home glyphs), by instead using dynamically-generated navigational links whose appearance is controlled based on the actual design and content of the underlying Web page of interest to the user.
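The synthesized “Next” link for an article page can be illustrated with a short sketch that splits article text into fixed-size word blocks and builds a link to the following block only when one exists. The 200-word block size is the example given above; the block query parameter, the relative rssplayback link form, and the function names are hypothetical.

```python
from urllib.parse import urlencode

BLOCK_WORDS = 200  # example block size mentioned in the text


def article_block(text, block_index, words_per_block=BLOCK_WORDS):
    """Return (block_text, has_next) for the requested block of an article."""
    words = text.split()
    start = block_index * words_per_block
    end = start + words_per_block
    return " ".join(words[start:end]), end < len(words)


def next_link(page_url, block_index):
    """Build a hypothetical 'Next' link; the 'block' parameter is invented here."""
    query = urlencode({"a": page_url, "block": block_index + 1})
    return "rssplayback?" + query


# A 450-word stand-in article yields blocks of 200, 200, and 50 words.
article = " ".join("w%d" % i for i in range(450))
block, has_next = article_block(article, 0)
if has_next:
    print("Next:", next_link("http://example.com/a.htm", 0))
```

Because the link is emitted only when has_next is true, the last block of an article carries no “Next” entry, so the navigation reflects the actual content rather than a fixed set of glyphs.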
As described above, the FEParser is used during the HTML to RSS conversion to extract “visible text” and “anchor text.” Internally, FEParser forms the “base class” of all strategy page parsers, collectively referred to as FEParser.com. FEParser contains all necessary logic to parse and organize information and layout of the HTML page being parsed. Attributes of FEParser include the following:
FEPageMetrics is a descendant class that overrides FEParser and is designed to “drop-into” any distribution of the Content Capture software. FEPageMetrics is passed a URL as part of the CachedURL object, retrieves the page, and surveys the “most important” characteristics of the page. FEPageMetrics includes a “BuildFinal” method that returns a “Page Metrics” report of any page. A sample Page Metrics report for a typical page, for example, is as follows:
Javascript Includes: 0
Page Type
The results of a PageMetrics parse are used to classify the page being parsed into one of the major layout categories, TABLE or DIV. The differences between TABLE and DIV layouts are extensive. TABLE-based page layouts clearly outnumber DIV-based pages and are built using software applications that have been around for years. DIV-based pages are usually built with newer tools and have many advantages over the older TABLE page design. By identifying the type of page and applying the correct page strategy, the system can extract information from the target page accurately.
Page Type: TABLE
Once the page is identified as a “TABLE” type, the system starts looking at the frequency and patterns of tags on the page. To extract an article from the page, it is necessary to identify “content breaks” on the page that mark the beginning and end of each article. On blog pages, it is necessary to identify multiple “postings” on a page. This problem is made more difficult in that HTML tags such as H1 (Header 1), H2 (Header 2), etc. are not always used to define headers; also, other styles such as font tags are not always helpful in identifying patterns. To improve the accuracy of TABLE page parsing, it is necessary to look at headers and other content breaks, and also to keep count of what is being rendered between the tags.
Consider the page shown in
Page Type: DIV Tags
A much more difficult problem is the growing number of “DIV” tag pages, which have some of the old TABLE HTML tags but sometimes have very few HTML tags at all, so that a completely different approach is needed to identify articles or blog posts. DIV tags on a page usually have a ‘class name’ or ‘id’ as a tag attribute. The class name can be used to apply a page style that is defined on the page or within a cascading style sheet. Knowing the number of DIV tags and their names gives very little information by itself. Knowing exactly what a DIV tag will do within a browser with 100% accuracy would require matching the tag with the tag style and determining what the style is designed to do. This is not possible and is far beyond the scope of the current parser.
The system can, however, accurately identify articles and posts on a page with the following process. The system builds a hashtable of all DIV class names and keeps track of the content on the page between the start and end of each DIV tag of the same name. Information that is collected for each DIV start and end includes:
Number of visible text characters (not anchor text).
Number of anchors.
Number of images.
Counts of nested DIV tags.
At the end of the first PageMetrics parse, the system determines how many “repeating” DIV classes there are on a page and how many of the repeating DIV classes repeat with the same frequency.
Consider a page that has a repeating pattern:
DIV_NAME_AAAA 1140 characters
DIV_NAME_BBBB 40 characters
DIV_NAME_CCCC 1000 characters
DIV_NAME_DDDDD 40 characters
DIV_NAME_EEEEE 70 characters
DIV_NAME_AAAA 2130 characters
DIV_NAME_BBBB 30 characters
DIV_NAME_CCCC 2000 characters
DIV_NAME_DDDDD 40 characters
DIV_NAME_EEEEE 70 characters
DIV_NAME_AAAA 4120 characters
DIV_NAME_BBBB 30 characters
DIV_NAME_CCCC 4000 characters
DIV_NAME_DDDDD 40 characters
DIV_NAME_EEEEE 70 characters
Metrics:
DIV_NAME_AAAA repeats 3 times total visible text=7390
DIV_NAME_BBBB repeats 3 times total visible text=100
DIV_NAME_CCCC repeats 3 times total visible text=7000
DIV_NAME_DDDDD repeats 3 times total visible text=120
DIV_NAME_EEEEE repeats 3 times total visible text=210
The system can look at the metrics of the page and identify the following DIV tags:
DIV_NAME_AAAA=post or article main container
DIV_NAME_BBBB=title name
DIV_NAME_CCCC=body text
DIV_NAME_DDDDD=byline?
DIV_NAME_EEEEE=timestamp?
If necessary, the system looks for timestamp text patterns, such as date and time patterns.
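The bookkeeping behind the metrics above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the FEPageMetrics implementation: `DivMetricsSketch`, `DivStats`, and every method name are hypothetical, only visible text is tracked (anchor, image, and nested DIV counts are omitted for brevity), and the `likelyContainer` heuristic (the repeating class with the most visible text) is an assumption rather than a rule stated in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: names and the container heuristic are assumptions.
final class DivMetricsSketch {
    static final class DivStats {
        int repeats;       // how many times this DIV class occurred
        int visibleChars;  // total visible (non-anchor) characters inside it
    }

    // Keyed by DIV class name; insertion order is preserved for reporting.
    private final Map<String, DivStats> byClass = new LinkedHashMap<>();

    // Records one start/end pair of a DIV with the given class name.
    void record(String className, int visibleChars) {
        DivStats stats = byClass.computeIfAbsent(className, k -> new DivStats());
        stats.repeats++;
        stats.visibleChars += visibleChars;
    }

    int repeats(String className) {
        DivStats stats = byClass.get(className);
        return stats == null ? 0 : stats.repeats;
    }

    int totalVisibleChars(String className) {
        DivStats stats = byClass.get(className);
        return stats == null ? 0 : stats.visibleChars;
    }

    // Counts the DIV classes that repeat exactly the given number of times.
    int classesRepeating(int times) {
        int count = 0;
        for (DivStats stats : byClass.values()) {
            if (stats.repeats == times) {
                count++;
            }
        }
        return count;
    }

    // Assumed heuristic: the repeating class holding the most visible text
    // is the candidate post or article container.
    String likelyContainer() {
        String best = null;
        int bestChars = -1;
        for (Map.Entry<String, DivStats> entry : byClass.entrySet()) {
            DivStats stats = entry.getValue();
            if (stats.repeats > 1 && stats.visibleChars > bestChars) {
                best = entry.getKey();
                bestChars = stats.visibleChars;
            }
        }
        return best;
    }
}
```

Feeding the fifteen rows of the example page into `record` yields three repeats for each class and the same totals as the Metrics listing, with DIV_NAME_AAAA selected as the container candidate.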
The following description presents method steps of the present invention for determining page layout types. As in the case of HTML to RSS conversion, the operation may be implemented using processor-executable instructions for directing operation of a device under processor control. The processor-executable instructions themselves may be stored on a computer-readable medium, such as CD, DVD, flash memory, or the like, and may also be stored as a set of downloadable processor-executable instructions, for example, for downloading and installation from an Internet location (e.g., Web server).
If during the first page metrics parse, at step 801, the method identifies the article content with high accuracy, then it returns a “FirstPass” article result, as indicated at step 802. In that case, the determination is completed and the method concludes with “FirstPass” as the returned result. Otherwise, the method continues. The next easiest category to identify is “Photo Blog” pages, as these have a large number of images and little text. This case is tested at step 803. If “Photo Blog” is found, the method processes the photo blog page at step 804. Thereafter, the method concludes with “Photo Blog” as the returned result. Otherwise, the method continues. The method now attempts to identify TABLE or DIV tag based layouts, at steps 805 and 807, respectively. Upon finding a TABLE tag page at step 805, the method processes the TABLE tag at step 806, and thereafter concludes with “TABLE tag” as the returned result. Similarly, upon finding a DIV tag page at step 807, the method processes the DIV tag at step 808, and thereafter concludes with “DIV tag” as the returned result. Step 809 represents the fall-through or default case where no clear definition of the page type is discernible. In that case, the method returns a “best guess” of the page content. This may be done, for example, by displaying a block of “visible text” from the source page.
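The order of tests in steps 801 through 809 can be sketched as a simple cascade. The Java below is a hypothetical illustration: the `LayoutTypeSketch` class, the `Metrics` fields, and in particular the numeric thresholds and the predicates used to recognize photo blog, TABLE, and DIV pages are assumptions, since the text specifies only the order of the tests and their results.

```java
// Illustrative sketch only: field names, thresholds, and predicates are assumptions.
final class LayoutTypeSketch {
    enum Layout { FIRST_PASS, PHOTO_BLOG, TABLE_TAG, DIV_TAG, BEST_GUESS }

    // Minimal stand-in for the metrics gathered during the first parse.
    static final class Metrics {
        boolean articleFoundOnFirstPass;
        int imageCount;
        int visibleChars;
        int tableCount;
        int divCount;
    }

    static Layout determine(Metrics m) {
        // Steps 801-802: article content already identified with high accuracy.
        if (m.articleFoundOnFirstPass) {
            return Layout.FIRST_PASS;
        }
        // Steps 803-804: many images and little text (thresholds assumed).
        if (m.imageCount >= 10 && m.visibleChars < 500) {
            return Layout.PHOTO_BLOG;
        }
        // Steps 805-806: TABLE layout is tested before DIV layout.
        if (m.tableCount > 0 && m.tableCount >= m.divCount) {
            return Layout.TABLE_TAG;
        }
        // Steps 807-808: DIV layout.
        if (m.divCount > 0) {
            return Layout.DIV_TAG;
        }
        // Step 809: fall-through; the caller shows a block of visible text.
        return Layout.BEST_GUESS;
    }
}
```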
In accordance with the present invention, the method proceeds as follows. Using the previously captured page metrics, at step 901 the method looks to see whether the visible text is greater than 5000 characters. Here, the method is attempting to determine whether the page is probably a story, instead of simply a news page, a collection of headlines, or a collection of photos. If not, then the method proceeds along the logic set forth on the left-hand side of
The difference between the small and regular parsers is perhaps best illustrated by example. Consider, for instance, a small page and a regular page, where each has multiple headlines. The small page may have only two sentences of text on the page that are important followed by an image. On the regular page, one or two headlines may be followed by several paragraphs of text. When the method processes the regular page, as soon as the method has the first well-defined headline and the first well-defined paragraph (or two), the method may stop parsing the rest of the page because it has already attained all the information that it can display on the target device (e.g., mobile phone). Additionally, it is likely that the page designer did not include two or more articles of substantial length on a single page with just one title. Therefore, for example, in the case that the method encounters a lot of visible text (greater than 5000 characters) and H1 tags/weighting (i.e., yes at 901, and at 931), the method proceeds to step 941 to use the H1 (regular) parser.
Here, the method is not merely concerned with the presence or absence of H1 tags, but is instead concerned with whether H1 tags are the predominant feature for the page relative to other tags based on previously gathered page metrics (e.g., H1Count, H1Text, H1Len, and H1Ratio). Consider the following program logic (rule):
Here, the program logic tests the various metrics for determining what page type to return. Once the page type has been determined, the respective parser is invoked for performing page type-specific processing. The H1 parser, for example, extracts the heading (based on extracting information between the H1 tags) and the accompanying article text (based on extracting information between subsequent paragraph tags). This extracted information may be captured to a buffer (e.g., upon reaching a preset limit, such as 200 words), for matching the information that is to be sent to the ultimate target screen that is to receive it (e.g., mobile phone screen).
Processing by the small parser, in contrast, is more difficult. For example, if the H1 small parser were to wait for 200 words in the page, the parser will never find them. Therefore, the program logic at step 901 (i.e., visible text greater than five thousand characters) pre-screens the page, so that the small parsers may all operate on the assumption that the page does not include a lot of text and instead focus their attention on looking for other (“small text”) things, such as a photo blog page. For example in the case of the H1 small parser as shown at 911 at
It is possible that a given page may not employ H1 tags as a predominant feature. In fact, the typical Web page will not use the full complement of H1-4 heading tags. Therefore, in the case that an H1 tag is not the predominant feature for the page, the method determines whether H2 tags predominate on the page, by performing step 922 for a small text page (less than 5000 characters, such as a photo blog) and step 932 for a regular text page. If H2 tag weighting is to be applied, the method proceeds to the corresponding H2 parser (parser 942 for a regular text page, and parser 912 for a small text page). Again, tag-appropriate logic is embodied in each parser (e.g., see handletext handler below), so that the method can extract exactly the text and images that are appropriate for the given page. In this manner, the method may continue processing the page for H3 tags (tested at steps 923 and 933) and H4 tags (tested at steps 924 and 934), with appropriate parsers invoked (parsers 913 and 943, and parsers 914 and 944, respectively).
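The selection among the small and regular H1 through H4 parsers can be summarized with a short Java sketch. Only the 5000-character threshold and the H1-to-H4 testing order come from the text; the class `ParserSelectorSketch`, the ratio-based predominance test, the `minRatio` parameter, and the returned parser labels are hypothetical, because the actual rule combining metrics such as H1Count, H1Len, and H1Ratio is not reproduced here.

```java
// Illustrative sketch only: the predominance test and parser labels are assumptions.
final class ParserSelectorSketch {
    // Step 901: pages with more visible text than this use the regular parsers.
    static final int REGULAR_TEXT_THRESHOLD = 5000;

    // Returns the first heading level (1..4) whose ratio reaches minRatio,
    // testing H1 first, then H2, H3, and H4; returns 0 if none predominates.
    static int predominantHeading(double[] headingRatios, double minRatio) {
        for (int level = 1; level <= 4 && level <= headingRatios.length; level++) {
            if (headingRatios[level - 1] >= minRatio) {
                return level;
            }
        }
        return 0;
    }

    // Maps the text-size test and the heading level to a parser label.
    static String selectParser(int visibleChars, int headingLevel) {
        if (headingLevel < 1 || headingLevel > 4) {
            return "DEFAULT";
        }
        String size = visibleChars > REGULAR_TEXT_THRESHOLD ? "REGULAR" : "SMALL";
        return "H" + headingLevel + "_" + size;
    }
}
```

Under these assumptions, a page with 6000 visible characters and predominant H1 tags maps to the regular H1 parser, while a 300-character page dominated by H2 tags maps to the small H2 parser.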
Page type determination is performed by a “GetPageType( )” method of the PageMetrics class. The method looks at the page metrics and applies rules to return the specific strategy most likely to result in the highest quality article content being returned to the device. The method is implemented in Java syntax as follows:
The GetPageType method or routine is invoked by the PageMetrics class, where the system is looking at all of the metrics that have been collected during a first pass through the page. The method works through the important architectural features of the page (under exam) as follows. At line 4 of GetPageType, the method attempts to determine whether the source page is from a known blog source (e.g., www.bloglines.com). If yes (true), then the method sets a “rule fired” flag at line 6. If the page is from Flickr (Web site), then the method returns PAGEMETRICS_PAGETYPE_FLICKER. Otherwise, the method returns PAGEMETRICS_PAGETYPE_BLOG_GENERIC_H2. At line 12, the method examines the Frameset count for the page. Framesets divide a page up into small frames. Each frame is an important page layout feature that requires its own strategy. In the case that one or more framesets is present, the method sets the “rule fired” flag at line 14 and returns PAGEMETRICS_PAGETYPE_IFRAME at line 15. At line 17, the method examines whether the amount of visible text is less than 50 characters. If yes, the “rule fired” flag is set at line 18, and the method returns PAGEMETRICS_PAGETYPE_IMAGE at line 19. On the other hand, if the visible text is greater than 5000 (tested at line 21), then the method proceeds to test whether the length of the first body (line 23) is greater than 1000 characters. (First body refers to the first body of text, such as an article; it does not refer to HTML body tags.) If yes, the “rule fired” flag is set at line 25, and the method returns PAGEMETRICS_PAGETYPE_FIRSTPASS. A “first pass” page type is one where the system is able to extract good text (good headline and good body text) on the first pass. In this manner, the method may proceed to other rules for bracketing or identifying the page type.
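The rules walked through above can be approximated by the following Java sketch. It is a reconstruction from the prose only, not the original GetPageType listing, so its line numbers do not correspond to those cited in the text. The class `PageTypeSketch`, the method signature, the host set (including the assumed host name www.flickr.com), and the `UNDETERMINED` value standing in for the remaining rules are all hypothetical; the thresholds (50, 5000, and 1000 characters) and the PAGEMETRICS_PAGETYPE_* names are taken from the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: a reconstruction from the prose, not the original listing.
final class PageTypeSketch {
    enum PageType {
        PAGEMETRICS_PAGETYPE_FLICKER,
        PAGEMETRICS_PAGETYPE_BLOG_GENERIC_H2,
        PAGEMETRICS_PAGETYPE_IFRAME,
        PAGEMETRICS_PAGETYPE_IMAGE,
        PAGEMETRICS_PAGETYPE_FIRSTPASS,
        UNDETERMINED // assumption: stands in for the rules not described here
    }

    // Assumed host list; only www.bloglines.com is named in the text.
    private static final Set<String> KNOWN_BLOG_HOSTS =
            new HashSet<>(Arrays.asList("www.bloglines.com", "www.flickr.com"));

    boolean ruleFired;

    PageType getPageType(String host, int framesetCount,
                         int visibleTextLen, int firstBodyLen) {
        ruleFired = false;
        // Rule 1: page comes from a known blog source.
        if (host != null && KNOWN_BLOG_HOSTS.contains(host)) {
            ruleFired = true;
            return host.contains("flickr")
                    ? PageType.PAGEMETRICS_PAGETYPE_FLICKER
                    : PageType.PAGEMETRICS_PAGETYPE_BLOG_GENERIC_H2;
        }
        // Rule 2: one or more framesets are present.
        if (framesetCount > 0) {
            ruleFired = true;
            return PageType.PAGEMETRICS_PAGETYPE_IFRAME;
        }
        // Rule 3: almost no visible text, so treat the page as an image page.
        if (visibleTextLen < 50) {
            ruleFired = true;
            return PageType.PAGEMETRICS_PAGETYPE_IMAGE;
        }
        // Rule 4: lots of text and a substantial first body of text.
        if (visibleTextLen > 5000 && firstBodyLen > 1000) {
            ruleFired = true;
            return PageType.PAGEMETRICS_PAGETYPE_FIRSTPASS;
        }
        // Remaining rules are not covered by this sketch.
        return PageType.UNDETERMINED;
    }
}
```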
Each parser includes a handletext handler or routine. The particular strategy followed by a given handletext handler depends on the current tag weighting applied (i.e., whether H1, H2, H3, or H4 strategy). Depending on the particular parser, the logic of handletext can be simple or very complex, for example:
This handler or method serves to extract visible text. Lines 1-26 of the method employ simple logic to extract visible text, up to the point where a break character is encountered. Depending on flags set, the handler makes a determination whether the text is inside the page title or inside the page body. Specifically, lines 27-38 include logic for extracting the text from a title. Here, this logic operates when the handler is inside a title (i.e., the title has started but not yet ended). Once the title text (if any) has been processed, the method proceeds to lines 39-54 to extract text inside the page's body. If the last tag was a break and the handler is processing inside an anchor, then the handler simply returns at line 43. Otherwise, the handler proceeds to add the visible characters to a body (buffer) that it is building up in memory (line 47), provided that the method is not inside an HTML Option (i.e., menu item, tested at line 45). In this manner, the handler builds up a title (buffer) and a body (buffer). Depending on the parser, the handler may build up other buffers as well.
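The title and body branching described above can be sketched in Java as follows. This is a simplified, hypothetical illustration rather than the original handletext listing: the class `HandleTextSketch`, its flag names, and the whitespace handling are assumptions, and the break-character logic of the original is omitted. The sketch keeps only the decisions stated in the text: title text goes to a title buffer, body text is skipped when the last tag was a break inside an anchor, and text inside an HTML Option is not added to the body.

```java
// Illustrative sketch only: flag names and whitespace handling are assumptions.
final class HandleTextSketch {
    // State flags that a tag handler would set while walking the page.
    boolean inTitle;
    boolean inBody;
    boolean inAnchor;
    boolean inOption;
    boolean lastTagWasBreak;

    final StringBuilder title = new StringBuilder();
    final StringBuilder body = new StringBuilder();

    void handleText(String text) {
        if (text == null || text.trim().isEmpty()) {
            return; // nothing visible to extract
        }
        String visible = text.trim();
        if (inTitle) {
            // The title has started but not yet ended.
            title.append(visible);
            return;
        }
        if (!inBody) {
            return;
        }
        if (lastTagWasBreak && inAnchor) {
            return; // anchor text following a break is skipped
        }
        if (!inOption) {
            // Menu items (HTML Option) are excluded from the body buffer.
            if (body.length() > 0) {
                body.append(' ');
            }
            body.append(visible);
        }
    }
}
```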
While the invention is described in some detail with specific reference to a single-preferred embodiment and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives. For instance, those skilled in the art will appreciate that modifications may be made to the preferred embodiment without departing from the teachings of the present invention.
The present application is related to and claims the benefit of priority of the following commonly-owned, presently-pending provisional application(s): application Ser. No. 60/767,545 (Docket No. SYB/0127.00), filed Jun. 14, 2006, entitled “System and Method for Delivering Mobile RSS Content”, of which the present application is a non-provisional application thereof. The disclosure of the foregoing application is hereby incorporated by reference in its entirety, including any appendices or attachments thereof, for all purposes.
Number | Date | Country
---|---|---
60767545 | Jun 2006 | US