Extracting structured data from weblogs

Information

  • Patent Grant
  • 10180986
  • Patent Number
    10,180,986
  • Date Filed
    Monday, October 12, 2015
  • Date Issued
    Tuesday, January 15, 2019
Abstract
Methods and apparatus for extracting structured data from weblogs are disclosed. In some examples, the methods and apparatus include retrieving a feed referenced on a webpage of the weblog and, in response to determining that the feed does not contain a first portion of a weblog post, creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed, searching, via the processor, the weblog for the second portion of the weblog post, when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage, and modifying, via the processor, the representation based on information from within the node to reconstruct the weblog post.
Description
BACKGROUND

Weblogging or “blogging” has emerged in the past few years as a new grassroots publishing medium. Like electronic mail and the web itself, weblogging has taken off, and by some estimates the number of weblogs is doubling every 6 months. As of June 2006, BlogPulse estimates place the number of active weblogs at nearly 10 million, of which about 36% have had at least one post in the past 3 months. BlogPulse finds approximately 60,000 new weblogs each day. Statistics published by other blog search engines such as Technorati and PubSub are similar. However, these estimates may well exclude large numbers of non-English language weblogs.


A weblog is commonly defined as a web page with a set of dated entries, in reverse chronological order, maintained by its writer via a weblog publishing software tool. We can define each entry as a set of one or more time-stamped posts; an author may typically post several times a day. This is a matter of style, as some authors post at most once a day in an all-inclusive entry. Others prefer to micro-post, making each published item a separate post in the day's entry.


Due to the popularity of weblogs, there is a need for a method of searching individual posts within weblogs. The present invention addresses this need.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows a sample page from a weblog.





DETAILED DESCRIPTION
1. Overview

The present invention provides a process for segmenting weblogs into posts. Weblogs can facilitate communication and dissemination of content in any environment having two or more workstations in mutual communication. While weblogs are typically hosted by a server connected to the Internet, the concept can include other types of networks, such as local area networks (LANs), wide area networks (WANs), and public data networks, by which client workstations obtain data from a server workstation.


Each workstation may comprise a microcomputer such as a personal computer, for example, including a system bus that is connected to a central processing unit (CPU) and to memory, including read only memory (ROM) and random access memory (RAM). The system bus can be connected, via an appropriate interface known to persons skilled in the art, to various input/output devices, including additional nonvolatile data storage devices, video and audio adapters, keyboard, mouse, and other devices that provide input to the workstation or receive output from the workstation. The workstation can also include a data port for communicating with other constituents of a collaborative data processing environment. The data port may be a serial port for linking the workstation to a modem or a communications adapter for connecting the workstation to a LAN.


Each workstation also typically includes software programs that are stored on the data storage devices or retrieved from other parts of the collaborative data processing system and loaded into RAM and then into the CPU for execution. Among those programs is a client program that receives messages from, and transmits messages to, other workstations connected to the network.


Web search engines such as Google, Yahoo, and MSN Search index the entire content of a web page typically every few days. However, for weblogs, users want to be able to search over individual posts, and in near real-time. Weblog search portals such as Technorati, Feedster, PubSub and BlogPulse have gained in popularity over the past year and a half, as people begin to turn to weblogs to get up-to-the-minute breaking news and to get fresh angles on news stories.


In addition, marketers have awakened to the possibility of mining consumer sentiment from weblogs. In order to produce accurate analytics, it is first necessary to be able to identify individual weblog posts. Examples of consumer sentiment analytics are the buzz surrounding a product (number of mentions), the number of links to a company website, trends in the number of mentions and number of links, and the ratio of positive vs. negative mentions. N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo. Analyzing online discussion for marketing intelligence. In Proceedings WWW-2005, Chiba, Japan, 2005 (incorporated herein by reference).


Researchers as well are turning to blogs to gauge opinion and community structure. For example, Adamic and Glance recently analyzed the linking behavior of political bloggers during the 2004 U.S. Presidential Election and found that conservative bloggers link to each other more frequently and in a denser pattern than liberal bloggers. Adamic and N. Glance, The political blogosphere and the 2004 U.S. election: Divided they blog, In Proceedings WWW-2005 2nd Annual Workshop on the Weblogging Ecosystem, Chiba, Japan, 2005 (incorporated herein by reference). Marlow has studied structure and authority in weblogs using inter-post citation counts. Marlow. Audience, structure and authority in the weblog community, In International Communication Association Conference, New Orleans, La., 2004 (incorporated herein by reference). Adar et al. have explored how memes thread through the blogosphere from post to post. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose, Implicit structure and the dynamics of blogspace, In Proceedings WWW-2004 Workshop on the Weblogging Ecosystem, New York City, N.Y., 2004 (incorporated herein by reference). The Global Attention Profiles project tracks the attention that bloggers pay to different nations of the world, in comparison with selected mainstream media outlets.


To enable sophisticated analytics over weblogs, a blog search engine typically uses an indexing mechanism that indexes a weblog one post at a time, as opposed to one HTML page at a time. In order to index blogs one post at a time, the indexing system should be able to segment the weblog HTML into individual posts and extract metadata associated with the posts, such as the posting date, title, permalink, and author.


The present invention provides a method for segmenting weblogs into individual posts using a combination of weblog feeds (such as RSS and Atom) and model-based wrapper segmentation. RSS is a family of web feed formats, specified in XML and used for Web syndication. Web feeds provide web content or summaries of web content, together with links to the full versions of the content and other metadata, in a developer-friendly standardized format. RSS, in particular, delivers this information as an XML file called an RSS feed, webfeed, RSS stream, or RSS channel. In addition to facilitating syndication, web feeds allow a website's frequent readers to track updates on the site using an aggregator. From a user's perspective, web feeds allow Internet users to subscribe to websites that change or add content regularly. Atom is the name of a specific web feed format. From a technical perspective, Atom is an open standard that includes an XML-based web syndication format used by weblogs, news websites and web mail.


2. Definitions

The following definitions are used throughout this description:


Weblog or blog: a weblog is a website where an individual or group of individuals publishes posts periodically. The posts are usually displayed in reverse chronological order. Each post generally consists of: a date, a title, the body of the post, a permalink to the post, an author, and one or more categorizations.


Weblog entry: a post or a set of posts published on a specific day.


Post: an item published to a weblog at a specific time of day.


Weblog feed/syndication: weblogs may or may not make posts available via syndication using RSS or Atom feeds. Web feeds provide web content or summaries of web content together with links to the full versions of the content, and other metadata. Atom feeds are XML documents. In addition, there are several versions of the RSS standard in use.


Weblog host: a company or website that hosts weblogs for individuals. Examples of popular weblog hosts are: livejournal.com, xanga.com, spaces.msn.com, blogspot.com, and the family of per-country domain typepad hosts.


Weblog software: software that enables creation and publishing of weblog posts to a weblog host, or to a self-hosted weblog. Each weblog host has its own weblog software tool for publishing posts. In addition, there are a number of weblog software tools for publishing a self-hosted weblog, such as Typepad, Movable Type, and WordPress.


Weblog ping: A weblog ping is an XML-RPC mechanism that notifies a ping server, such as weblogs.com or blo.gs, that the weblog has changed (e.g., the author has written a new post). Many weblog software tools can be set (or are automatically pre-set) to ping centralized servers whenever the weblog is updated. Example ping servers are http://blo.gs/ping.php and http://rpc.technorati.com/rpc/ping/. Some ping servers accept “extended pings” that include both the URL and feed URL of the updated weblog.


Crawl: A web crawler (also known as a web spider or web robot) is a program which browses the World Wide Web in a methodical, automated manner. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.


Screen scraping: a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was nominally intended for human consumption, not machine interpretation. There are a number of synonyms for screen scraping, including data scraping, data extraction, web scraping, page scraping, and HTML scraping (the last three being specific to scraping web pages).


Wrapper: a program that performs screen scraping.


“Document Object Model” (DOM): a description of how an HTML or XML document is represented in an object-oriented fashion. DOM provides an application programming interface to access and modify the content, structure and style of the document.


Permalink: a term used in the world of blogging to indicate a URL which points to a specific blog entry.


XPath (XML Path Language): a terse (non-XML) syntax for addressing portions of an XML document.


3. Process for Extracting Posts from a Weblog

Here we describe a process for extracting individual posts from a weblog, according to an exemplary embodiment of the present invention. First we describe the typical layout of a weblog.


3.1. Modelling Weblogs



FIG. 1 shows the home page of a well-known weblog. Notice the extraneous content on the page: header, footer (not displayed) and sidebars (in this example, ads). The main content, however, is a sequence of entries in reverse chronological order, with each entry consisting of a sequence of posts, also in reverse chronological order.


A weblog can be described formally as follows:

    • Weblog: Entry+
    • Entry: Date Post+
    • Post: Title? Content Permalink? Author? Timestamp? Link to comments? Categories*


The ordering of the sub-elements for the Entry elements and the Post elements is typically not standardized across weblogs, although it is assumed to be fixed within a weblog.


Also, the model assumes that the entry dates are monotonically decreasing.
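
The formal model above can be written down as a small data structure. The following Python sketch is illustrative only; the class and field names (Post, Entry, Weblog, conforms_to_model) are hypothetical and do not come from the described system. It assumes that entry dates decrease strictly, since each entry covers one specific day.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Post:
    content: str                       # the only mandatory sub-element of a post
    title: Optional[str] = None        # Title?
    permalink: Optional[str] = None    # Permalink?
    author: Optional[str] = None       # Author?
    timestamp: Optional[str] = None    # Timestamp?
    comments_link: Optional[str] = None            # Link to comments?
    categories: List[str] = field(default_factory=list)  # Categories*


@dataclass
class Entry:
    day: date
    posts: List[Post]                  # Post+ : at least one post


@dataclass
class Weblog:
    entries: List[Entry]               # Entry+ : at least one entry


def conforms_to_model(blog: Weblog) -> bool:
    """Check Entry+, Post+ and decreasing entry dates (strictness is assumed)."""
    if not blog.entries or any(not e.posts for e in blog.entries):
        return False
    days = [e.day for e in blog.entries]
    return all(a > b for a, b in zip(days, days[1:]))
```

A segmenter could run such a check on its own output as a sanity test before indexing the extracted posts.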


3.2. Weblog Syndication


Many weblog publishing software tools also publish a feed in association with the weblog. The feed is updated whenever a new item is posted to the weblog. The feed is a “pull” mechanism, as is the weblog page. As a “pull” mechanism, the feed is accessed in order to find out if the weblog has been updated. However, feeds are designed to be read via a feed reader/aggregator (such as Bloglines, NewsGator, etc. or via an extension to a mail reader), which polls the feed on behalf of the user(s). Thus, the end user who reads feeds via a feed reader experiences weblogs as a “push” phenomenon: the newly published weblog posts are pushed to the user's screen.


Some weblog software tools have provided customization of the weblog's feed: the publication of the feed can be turned on or off, the feed can be updated whenever a new item is posted or modified, and the feed can be full content or partial content. Full vs. partial content is an important distinction. We define a full content feed as a feed that publishes the entire content of the post as viewable on the front page of the weblog. We define a partial content feed as a feed that publishes a summary of the post content available via the weblog.


With respect to feed publication, weblog software tools fall into three categories: (1) automatic generation of feeds (partial or full); (2) customized generation of feeds; or (3) no feed generation capability. In the last case, some tech-savvy bloggers will use custom software to create a feed and associate it with their weblog, or turn to a third-party feed generator to host a feed for the weblog (e.g., FeedBurner: http://www.feedburner.com/).


3.3. Segmenting Weblogs into Posts


This section describes our approach for segmenting weblogs into posts, according to an exemplary embodiment of the present invention. It would be costly to manually create individual wrappers for each weblog. However, weblogs tend to conform to a common model, as described in Section 3.1 above. Thus, we have focused on developing an approach that generalizes well over the majority of weblogs.


If a full content feed is available for a weblog, then the task of extracting posts from the weblog is a straightforward mapping of the XML format to an internal format. If a partial content feed exists for a weblog, then we use the partial content to guide the extraction process. If no partial content feed exists for a weblog, then we apply a model-based approach to extracting posts from the weblog page, taking advantage of regularities more or less common to most weblogs. Our work on model-based segmentation is similar to that of Nanno et al. Nanno, Automatic collection and monitoring of Japanese weblogs, In Proceedings WWW-2004 Workshop on the Weblogging Ecosystem, New York City, N.Y., 2004 (incorporated herein by reference).


Accordingly, here is an outline of the algorithm used for extracting posts from a weblog, according to an exemplary embodiment of the present invention:


1. Crawl home page of weblog.


2. Discover feed(s) associated with weblog.


3. For each feed:

    • (a) Determine if feed satisfies minimal requirements for proceeding. Our feed finder considers an item in the feed sufficient if it contains, at minimum, the following fields: date-posted AND (content OR description).
    • (b) If the feed is sufficient, classify the feed as full content or partial content.
    • (c) If feed is full content, then we map the data found in the feed into a representation for weblog posts.
    • (d) If feed is partial content, then use feed data to guide screen scraping of the weblog to construct a representation for weblog posts.
    • (e) If the feed has insufficient content, then try next feed associated with weblog.


4. If there are no feeds with sufficient full or partial content, then fall back on screen scraping of weblog. Screen scraping uses a model-based approach to segment the weblog page into posts using textual and HTML elements of the page as markers.
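
The outline above amounts to a dispatch over the discovered feeds with a final fallback. The Python sketch below shows only that control flow; every helper is passed in as a callable, and all names (extract_posts, is_sufficient, map_feed, and so on) are hypothetical placeholders rather than the functions of the actual implementation.

```python
from typing import Callable, Dict, Iterable, List, Optional

Posts = List[Dict]


def extract_posts(
    feeds: Iterable[Dict],
    is_sufficient: Callable[[Dict], bool],
    is_full_content: Callable[[Dict], bool],
    map_feed: Callable[[Dict], Posts],
    feed_guided: Callable[[Dict], Optional[Posts]],
    model_based: Callable[[], Optional[Posts]],
) -> Optional[Posts]:
    """Steps 3 and 4 of the outline; steps 1 and 2 are assumed already done."""
    for feed in feeds:
        if not is_sufficient(feed):        # 3(a), 3(e): try the next feed
            continue
        if is_full_content(feed):          # 3(b), 3(c): map the feed directly
            return map_feed(feed)
        posts = feed_guided(feed)          # 3(d): feed-guided screen scraping
        # If feed-guided segmentation fails, fall back on the model.
        return posts if posts else model_based()
    return model_based()                   # 4: no sufficient feed at all
```

Passing the strategies in as callables keeps the sketch independent of how each step is implemented.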


3.4. Feed Discovery


After reaching the home page of the weblog, the first step consists of discovering the feed(s) for the weblog. If the weblog update was collected from a ping server relaying extended pings, and if the accepted ping includes the feed URL for the weblog, then we have located the feed. Alternatively, if the weblog is hosted by a weblog host which publishes full content feeds for its weblogs, then we need only map the weblog URL to the feed URL.


Otherwise, the next step in discovering the feed(s) for a weblog is to use “RSS auto-discovery.” RSS auto-discovery is an agreed-upon standard for specifying the location(s) of a weblog feed(s) as metadata in the HTML for the weblog home page.
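
As a rough illustration, auto-discovery metadata conventionally takes the form of link elements with rel="alternate" and a feed media type in the page head. The Python sketch below collects such links with the standard library HTML parser. The function and class names are hypothetical, and the set of accepted media types is an assumption limited to the two most common values.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Media types accepted as feeds (assumed; other values exist in the wild).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate" type=feed-type> elements."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the weblog home page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def autodiscover_feeds(html: str, base_url: str) -> List[str]:
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

An empty result would correspond to the case in which auto-discovery fails and the hyperlinks in the page must be examined instead.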


If RSS auto-discovery fails to find a set of feeds for the weblog, the next step is to search for links to feeds from the body of the weblog. First, all hyperlinks are extracted from the weblog. Next, the set of extracted hyperlinks is filtered using a classifier to identify which one(s) belong to the set of feeds for the weblog. Currently, we use a set of heuristics to identify the feed(s) for a weblog from the extracted hyperlinks. The following is a non-exclusive list of criteria that can be used to identify the feed:

    • URLs that allow readers to subscribe to the feed in their RSS reader; these URLs match “?url=?” or “bloglines.com/sub/”
    • URLs with one of a set of common feed suffixes, including {“atom.xml”, “.xml”, “.rss”, “rdf”, . . . } AND matching the host name of the blog
    • URLs with a host with one of a set of common feed prefixes, including {“rxml”, “rss”, . . . } AND matching the domain name of the blog.
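
The criteria above could be approximated by a small URL predicate such as the following Python sketch. It is an assumption-laden illustration: the names are hypothetical, the suffix and prefix tuples contain only the values listed above, the “?url=?” subscription pattern is left out because its exact form is not specified, and the domain comparison naively takes the last two host labels.

```python
from urllib.parse import urlparse

SUBSCRIBE_MARKERS = ("bloglines.com/sub/",)
FEED_SUFFIXES = ("atom.xml", ".xml", ".rss", "rdf")
FEED_HOST_PREFIXES = ("rxml", "rss")


def domain_of(host: str) -> str:
    """Naive domain: the last two labels (wrong for hosts such as x.co.uk)."""
    return ".".join(host.lower().split(".")[-2:])


def looks_like_feed(url: str, blog_url: str) -> bool:
    link, blog = urlparse(url), urlparse(blog_url)
    link_host = (link.hostname or "").lower()
    blog_host = (blog.hostname or "").lower()
    # Criterion 1: subscription links for a feed reader.
    if any(marker in url.lower() for marker in SUBSCRIBE_MARKERS):
        return True
    # Criterion 2: common feed suffix AND same host name as the blog.
    if link.path.lower().endswith(FEED_SUFFIXES) and link_host == blog_host:
        return True
    # Criterion 3: common feed host prefix AND same domain as the blog.
    first_label = link_host.split(".")[0]
    return (first_label.startswith(FEED_HOST_PREFIXES)
            and domain_of(link_host) == domain_of(blog_host))
```

A trained classifier could replace these hand-written rules while keeping the same interface.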


3.5. Full Content vs. Partial Content Feeds


The multiple XML standards for weblog feeds (several versions of RSS and Atom) all satisfy the following minimal conditions:

    • The feed has the following top-level fields: weblog url, weblog title
    • The feed consists of a set of items (which for weblogs, correspond to posts). Each item may have the following fields: date-posted, permalink, post title, author, content, description


Our feed finder considers an item in the feed content to be sufficient if it contains, at minimum, the following fields: date-posted AND (content OR description). If no items in the feed contain sufficient content, the feed is rejected and weblog segmentation falls back upon model-based weblog segmentation (aka screen scraping).
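
The sufficiency rule can be stated directly in code. In this Python sketch a feed item is assumed to be a mapping already normalized to the generic field names used above ("date-posted", "content", "description"); the function names are hypothetical, and treating whitespace-only values as missing is an added assumption.

```python
from typing import Iterable, Mapping, Optional


def item_is_sufficient(item: Mapping[str, Optional[str]]) -> bool:
    """date-posted AND (content OR description), ignoring blank values."""
    def has(name: str) -> bool:
        return bool((item.get(name) or "").strip())

    return has("date-posted") and (has("content") or has("description"))


def feed_is_sufficient(items: Iterable[Mapping[str, Optional[str]]]) -> bool:
    """A feed is rejected only if no item at all is sufficient."""
    return any(item_is_sufficient(item) for item in items)
```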


The actual names of the fields depend on the feed standard being used. For example, for RSS v0.91, date-posted maps onto the XPath /item/title; content maps onto the XPath /item/description; and description maps onto the XPath /item/description. (There is no separate content field in the RSS v0.91 specification.)


Typically, the description field is used to provide a summary of the post (usually the first few lines) while the content field is used to provide either the full content of the post or a summary. Some feeds contain both, in which case, typically, the description contains the summary and the content contains the full post.


The feed classifier, which classifies the feed as full content or partial content, takes as input features of the content and description text, such as the presence or absence of HTML tags, the percentage of posts ending in ellipses, and the type of feed. Based on these features, it uses heuristics to decide whether or not the items in the feed are full content. Other features could be added, such as the variance in the length of the text.
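
A toy version of such a classifier is sketched below in Python, using two of the features named above. The decision rule and the 0.5 threshold are assumptions made for illustration, as is the idea that HTML markup suggests full content; the description does not specify how the features are combined, and the feed-type feature is omitted.

```python
import re
from typing import Sequence

TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")
# Trailing "...", "…", "[...]" or "[…]" is treated as a summary artifact.
ELLIPSIS_RE = re.compile(r"(\.\.\.|…|\[\s*(\.\.\.|…)\s*\])\s*$")


def is_full_content(texts: Sequence[str], max_ellipsis_ratio: float = 0.5) -> bool:
    """Guess whether item texts are full posts (heuristic, assumed rule)."""
    if not texts:
        return False
    ending_in_ellipsis = sum(bool(ELLIPSIS_RE.search(t.strip())) for t in texts)
    ellipsis_ratio = ending_in_ellipsis / len(texts)
    has_markup = any(TAG_RE.search(t) for t in texts)
    return has_markup and ellipsis_ratio < max_ellipsis_ratio
```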


If the feed is classified as full content, then we map the data found in the feed into our own internal representation for weblog posts, using an XML representation of the content of the post plus metadata. Elements in the XML representation include: weblog url, permalink, weblog title, post title, date posted, time posted, and content.


If the feed is not full content, then we create skeletal posts from the data in the feed. For each post, we fill in the following data: weblog url; date-posted; partial content; post title (if found); post author (if found); and permalink (if found).


3.6. Feed-guided Weblog Segmentation


The next step is to fill in the skeletal posts constructed from the feed by using the content of the weblog page itself. Missing from the skeletal posts is the full content of the post. To find the full content, the partial content is first processed to remove summarization artifacts (e.g., ending ellipsis). Then, we search for the partial content in the weblog. If the partial content is not found, then we will omit that particular post from our segmentation because not enough information can be located to construct the post. If we end up finding insufficient information on all posts, then we will fall back on model-based segmentation.


If the partial content matches text on the weblog home page, then we find the enclosing node for the matching text in the tidied XHTML for the weblog page. The Extensible HyperText Markup Language, or XHTML, is a markup language that has the same expressive possibilities as HTML, but a stricter syntax. The text inside the enclosing node is then used as the content for the post. If enclosing nodes for successive posts overlap, then we throw an error indicating that feed-guided segmentation has failed for the weblog, and, again, fall back on model-based segmentation.
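
The matching step can be illustrated with the Python standard library, assuming the page has already been tidied into well-formed XHTML. In this sketch the function names are hypothetical, whitespace is normalized before matching, and the enclosing node is taken to be the deepest element whose text contains the cleaned snippet, which is one possible reading of "enclosing node". Overlap detection between successive posts is not shown.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional


def normalize(text: str) -> str:
    """Collapse runs of whitespace so feed text and page text compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def strip_summary_artifacts(snippet: str) -> str:
    """Remove a trailing ellipsis such as '...', '…' or '[...]'."""
    return re.sub(r"\s*(\[\s*)?(\.\.\.|…)(\s*\])?\s*$", "", normalize(snippet))


def enclosing_node(root: ET.Element, snippet: str) -> Optional[ET.Element]:
    """Return the deepest element whose text contains the snippet."""
    if snippet not in normalize("".join(root.itertext())):
        return None
    for child in root:
        found = enclosing_node(child, snippet)
        if found is not None:
            return found
    return root


def full_content(xhtml: str, partial: str) -> Optional[str]:
    """Text of the enclosing node, or None if the snippet is not on the page."""
    snippet = strip_summary_artifacts(partial)
    if not snippet:
        return None
    node = enclosing_node(ET.fromstring(xhtml), snippet)
    return normalize("".join(node.itertext())) if node is not None else None
```

A return value of None corresponds to omitting the post from the segmentation, as described above.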


3.7. Model-based Weblog Segmentation


If there are no feeds with sufficient full or partial content, then we attempt to segment the weblog into posts using screen scraping of the weblog. Screen scraping uses a model-based approach to segment the weblog page into posts using textual and HTML elements of the page as markers.


Model-based weblog segmentation assumes that weblogs can be modeled as described in Section 3.1. Our approach then starts from a simplification of that model: (date ([title] content)+)+. This model assumes that dates appear first. This means that if we are able to extract the weblog entry dates, then we can use the dates as markers for the entries. Of course, a weblog page may have many other dates apart from the dates marking the entries: dates in the content of the posts; dates in the sidebars or in other non-weblog content included in the HTML page. However, as weblogs are produced by weblog software, we can expect certain regularities in the underlying DOM of the generated HTML. In particular, we expect the relative XPaths of the weblog entry dates to be identical. A relative XPath is an XPath that is defined relative to a location (XML node) in an XML document. In practice we've found that the relative XPaths of the entry dates are identical if we ignore certain elements in the XPath: /align/ and repeating /font/s.


The first step in our model-based segmentation algorithm consists of extracting all the dates from the tidied XHTML for the weblog page using a date extractor. The dates are sorted into ordered lists, one list for each unique relative XPath. The order within the list corresponds to the ordering of the dates within the DOM for the weblog page.
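
The grouping step can be sketched in Python as follows. This is a simplified illustration with hypothetical names: the date extractor recognizes only the MM/DD/YYYY format, only the text directly inside an element is scanned (not text following a child element), an index-free tag path from the root stands in for the relative XPath, and all font elements are dropped from the path rather than only repeated ones.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, List, Optional

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def parse_date(text: str) -> Optional[date]:
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:          # e.g. 13/45/2005 looks like a date but is not
        return None


def dates_by_path(xhtml: str) -> Dict[str, List[date]]:
    """Group dates by tag path; each list keeps document order."""
    groups: Dict[str, List[date]] = defaultdict(list)

    def walk(node: ET.Element, path: str) -> None:
        # Skip <font> in the path so stylistic wrappers do not split lists.
        here = path if node.tag == "font" else f"{path}/{node.tag}"
        for match in DATE_RE.finditer(node.text or ""):
            parsed = parse_date(match.group())
            if parsed is not None:
                groups[here].append(parsed)
        for child in node:
            walk(child, here)

    walk(ET.fromstring(xhtml), "")
    return dict(groups)
```

Each resulting list is a candidate for the list of entry dates.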


We then filter the lists according to a set of heuristics in order to identify which list corresponds to the actual weblog entry dates. The filtering process for the date lists can be performed using the following sequence of steps:

    • 1. Keep only lists whose dates all belong to the current year and/or the past year.
    • 2. Keep only non-singleton date lists.
    • 3. Keep only lists whose dates conform to a similar format (e.g. MM/DD/YYYY).
    • 4. Keep only lists whose dates decrease monotonically.
    • 5. Keep only lists with most recent dates (but not in the future).
    • 6. Keep only lists with longest date string representation.
    • 7. Keep only lists with the greatest number of dates.
    • 8. Keep only first list.
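
The filtering sequence can be read as a cascade: the first steps discard lists outright, and the later steps keep only the lists that score best on some measure. The Python sketch below follows that reading. The names are hypothetical, each candidate is assumed to carry both the raw date string and the parsed date, "similar format" is approximated by a crude character-class signature, and "decrease monotonically" is taken to allow equal neighbouring dates.

```python
import re
from datetime import date
from typing import Callable, List, Optional, Tuple

# One candidate list: (raw text, parsed date) pairs in page order.
DateList = List[Tuple[str, date]]


def shape(raw: str) -> str:
    """Crude format signature: digit runs become 9, letter runs become A."""
    return re.sub(r"[A-Za-z]+", "A", re.sub(r"\d+", "9", raw))


def keep_best(lists: List[DateList], key: Callable) -> List[DateList]:
    """Keep only the lists that maximize key."""
    best = max(key(lst) for lst in lists)
    return [lst for lst in lists if key(lst) == best]


def pick_entry_dates(lists: List[DateList], today: date) -> Optional[DateList]:
    hard_filters = [
        # 1. all dates in the current or the past year
        lambda lst: all(today.year - 1 <= d.year <= today.year for _, d in lst),
        # 2. non-singleton lists only
        lambda lst: len(lst) > 1,
        # 3. one date format throughout the list
        lambda lst: len({shape(raw) for raw, _ in lst}) == 1,
        # 4. dates never increase going down the page
        lambda lst: all(a[1] >= b[1] for a, b in zip(lst, lst[1:])),
        # 5 (first half). the newest date is not in the future
        lambda lst: lst[0][1] <= today,
    ]
    for keep in hard_filters:
        lists = [lst for lst in lists if keep(lst)]
    if not lists:
        return None                                   # segmentation fails
    lists = keep_best(lists, lambda lst: lst[0][1])   # 5. most recent dates
    lists = keep_best(lists, lambda lst: max(len(raw) for raw, _ in lst))  # 6.
    lists = keep_best(lists, len)                     # 7. most dates
    return lists[0]                                   # 8. first list
```

In this reading, step 6 is what separates verbose entry dates in the main column from terse dates in a sidebar that lists the same posts.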


One might think that after step 5 in the filtering process, we would be left with at most one list of dates. In practice, this is frequently not the case, because weblogs often have a sidebar with a dated list of recent posts which corresponds exactly to the full set of posts in the main part of the weblog. The last few filtering steps help correctly identify the weblog entry dates as opposed to the dates in the sidebar.


If we fail to find a conforming list of dates, then model-based segmentation fails. There are some known cases where our approach fails: when only one entry appears on the home page of the weblog; or when weblog software for some reason generates irregular XPaths for the dates and/or content. But in many cases, segmentation fails when the HTML page in question is not actually a weblog. Thus, our model-based segmentation algorithm has the additional functionality of serving as a classifier that identifies whether or not an HTML page is indeed a weblog.


Once we have identified the entry dates for the weblog, model-based segmentation proceeds as follows:


1. Segment weblog into entries, using dates as markers.


2. Segment each weblog entry into posts using post title markers.


3. For each post, identify permalink and author.


In step 1, we assume that all DOM nodes between subsequent entry dates form the weblog entry associated with the earlier date. The main difficulty is identifying the end of the last post. For this we use a set of heuristics to identify the end of the blog entry by looking for the start of the boilerplate weblog end template. Example end markers include the start of a sidebar, a copyright notice, a form, or a comment. Another heuristic for finding the end of the blog entry is to look for a node in the DOM whose XPath is analogous in structure to the XPath of the last node in the previous weblog entry.


In step 2, we attempt to use post titles to demarcate boundaries between posts for an entry. First, we iterate over the nodes of the entry searching for a node that matches one of our conditions for being a title node. These conditions include: class attribute of the node equals ‘title’ or ‘subtitle’ or ‘blogpost’, etc. Once we have found the first matching title, we then assume that all subsequent post titles will have the same relative XPath. Again, we assume that all DOM nodes between subsequent title nodes are associated with the earlier title.


If we are unable to find titles, then we treat the entire entry as a single post. In fact, we have found that the majority of bloggers do not post more than once per day.
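
Steps 1 and 2 can be pictured as a single pass over the page's nodes in document order. The Python sketch below assumes that an earlier stage has already labelled each node as a date marker, a title marker, an end marker, or ordinary text; that labelling, the tuple layout, and the function name are hypothetical simplifications of the DOM-based procedure.

```python
from typing import Dict, List, Tuple


def segment_nodes(nodes: List[Tuple[str, str]]) -> List[Dict]:
    """Group (kind, text) nodes into entries and posts.

    kind is one of "date", "title", "end" or "text" (assumed labels).
    """
    entries: List[Dict] = []
    for kind, text in nodes:
        if kind == "end":                  # boilerplate end template reached
            break
        if kind == "date":                 # a date marker opens a new entry
            entries.append({"date": text, "posts": []})
        elif not entries:                  # header content before first date
            continue
        elif kind == "title":              # a title marker opens a new post
            entries[-1]["posts"].append({"title": text, "content": []})
        else:
            posts = entries[-1]["posts"]
            if not posts:                  # no title found: untitled post
                posts.append({"title": None, "content": []})
            posts[-1]["content"].append(text)
    return entries
```

An entry with no title markers ends up as one untitled post, matching the fallback described above.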


The final post-processing step identifies the permalink and author from the content of each extracted post using common patterns for permalinks and author signatures. To find authors, we look for patterns like “posted by.” To find permalinks, we look for hrefs (hyperlinks) in the post content that match, for example, “comment” or “archive.” Some patterns are given higher priority than others for matching against permalinks.
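
A minimal Python sketch of this post-processing step follows. The regular expression, the pattern list, and the priority order (archive links before comment links) are assumptions chosen for illustration; the description names only "posted by", "comment" and "archive" as example patterns.

```python
import re
from typing import List, Optional

# "posted by NAME", where NAME stops before " at ", " on ", " @ ", "|" or ",".
AUTHOR_RE = re.compile(
    r"posted by\s+(.+?)(?=\s+(?:at|on|@)\s|\s*[|,]|\s*$)", re.IGNORECASE
)
# Earlier patterns take priority (assumed order).
PERMALINK_PATTERNS = ("archive", "comment")


def find_author(post_text: str) -> Optional[str]:
    match = AUTHOR_RE.search(post_text)
    return match.group(1) if match else None


def find_permalink(hrefs: List[str]) -> Optional[str]:
    """Return the first href matching the highest-priority pattern."""
    for pattern in PERMALINK_PATTERNS:
        for href in hrefs:
            if pattern in href.lower():
                return href
    return None
```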


A weakness of our current implementation of model-based wrapper segmentation is that it assumes that the date field comes first in a weblog entry. In fact, while most blogs exhibit the pattern date ([title] content)+, others use (title date content)+ or even ([title] content date)+. Our approach is still able to segment blogs exhibiting these less common patterns, although the segmentation associates the date with the incorrect content. That is, if we have a sequence of N posts (post 1 through post N), the date for post 1 will be associated with the content of post 2 and so on. In addition, we will fail to extract the content of post 1. We call this error a parity error.


4. Segmentation Statistics


We have implemented weblog segmentation as part of the BlogPulse weblog post collection, indexing and search system.


In tests of the model-based segmentation algorithm, we have found that the precision of this algorithm is about 90%; that is, about 90% of extracted posts have date, title and content fields that correspond to those of actual posts on the weblogs. The recall is approximately 70%; that is, we are able to extract posts from about 70% of true weblogs.









TABLE 1
Segmentation statistics for Apr. 13, 2005

Segmentation method          % of weblogs
Full content feed            78%
Feed-guided segmentation     11%
Model-based segmentation     11%










Table 1 shows the statistics for our segmentation process: the percentage of weblogs segmented using (1) full content feeds (78%); (2) feed-guided segmentation (11%); or (3) model-based segmentation (11%).


We have implemented our segmentation algorithm as part of the weblog post collection subsystem of BlogPulse. This enables BlogPulse to provide search over individual blog posts. Furthermore, the corpus of dated weblog posts serves as a data set for tracking trends over time, and for analyzing how memes spread through the blogosphere.


Having described the invention with reference to embodiments, it is to be understood that the invention is defined by the claims, and it is not intended that any limitations or elements describing the embodiments set forth herein are to be incorporated into the meanings of the claims unless such limitations or elements are explicitly listed in the claims. Likewise, it is to be understood that it is not necessary to meet any or all of the identified advantages or objects of the invention disclosed herein in order to fall within the scope of any claims, since the invention is defined by the claims and since inherent and/or unforeseen advantages of the present invention may exist even though they may not have been explicitly discussed herein.

Claims
  • 1. A method of extracting weblog posts from a weblog, the method comprising: retrieving a feed referenced on a webpage of the weblog; and in response to determining that the feed does not contain a first portion of a weblog post: creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed; filtering the representation of the weblog post to summarization artefacts; searching, via the processor, the weblog for the filtered representation of the second portion of the weblog post; when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage; extracting, via the processor, information from markup language contained within the node associated with the second portion of the webpage; and modifying, via the processor, the representation based on the information extracted from within the node to reconstruct the weblog post.
  • 2. A method as defined in claim 1, further including, in response to determining that the feed contains the first portion and the second portion, mapping the first and second portions into the representation of the weblog post.
  • 3. A method as defined in claim 1, wherein the first portion is at least one of a date for the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  • 4. A method as defined in claim 1, wherein the determining that the feed does not contain the first portion of the weblog post is based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  • 5. A method as defined in claim 1, further including, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post: extracting dates from the markup language of the webpage; sorting the extracted dates into ordered lists; filtering the ordered lists to determine which of the lists correspond to entry dates of the weblog post; segmenting the weblog into entries based on dates from the filtered list as markers for the entries; extracting the weblog post from the weblog entries based on post title markers; and identifying a permalink and an author of the weblog post.
  • 6. A method as defined in claim 5, wherein the filtering of the ordered lists includes: extracting lists whose dates belong to a current year or a past year; extracting non-singleton date lists; extracting lists whose dates conform to a similar format; extracting lists whose dates decrease monotonically; extracting lists with most recent dates; extracting lists with a longest date string representation; and extracting lists with a greatest number of dates.
  • 7. A method as defined in claim 1, further including screen scraping the weblog.
  • 8. An apparatus for extracting weblog posts from a weblog, the apparatus comprising: a web crawler to retrieve a feed referenced on a webpage of the weblog; a feed classifier to determine whether the feed contains a first portion of a weblog post; and a wrapper to, in response to determining that the feed does not contain a first portion of a weblog post: filter the representation of the weblog post to summarization artefacts; create a representation of the weblog post based on a second portion of the weblog post included in the feed; search the weblog for the filtered representation of the second portion of the weblog post; when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage; extract information from markup language contained within the node associated with the second portion of the webpage; and modify the representation based on the information extracted from within the node to reconstruct the weblog post, at least one of the web crawler, the feed classifier, or the wrapper implemented by a logic circuit.
  • 9. An apparatus as defined in claim 8, wherein the wrapper is to, in response to determining that the feed contains the first portion and the second portion, map the first and second portions into the representation of the weblog post.
  • 10. An apparatus as defined in claim 8, wherein the first portion is at least one of a date for the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  • 11. An apparatus as defined in claim 8, wherein the feed classifier is to determine whether the feed contains the first portion of the weblog post based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  • 12. An apparatus as defined in claim 8, further including a date extractor to, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post: extract dates from the markup language of the webpage; sort the extracted dates into ordered lists; filter the ordered lists to determine which of the lists correspond to entry dates of the weblog post; segment the weblog into entries based on dates from the filtered list as markers for the entries; extract the weblog post from the weblog entries based on post title markers; and identify a permalink and an author of the weblog post.
  • 13. An apparatus as defined in claim 12, wherein to filter the ordered lists, the date extractor is to: extract lists whose dates belong to a current year or a past year; extract non-singleton date lists; extract lists whose dates conform to a similar format; extract lists whose dates decrease monotonically; extract lists with most recent dates; extract lists with a longest date string representation; and extract lists with a greatest number of dates.
  • 14. An apparatus as defined in claim 8, wherein the wrapper is to screen scrape the weblog.
  • 15. A computer readable storage hardware device or storage disc comprising instructions, that when executed, cause a machine to at least: retrieve a feed referenced on a webpage of the weblog; and in response to determining that the feed does not contain a first portion of a weblog post: create a representation of the weblog post based on a second portion of the weblog post included in the feed; filter the representation of the weblog post to summarization artefacts; search the weblog for the filtered representation of the second portion of the weblog post; when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage; extract information from markup language contained within the node associated with the second portion of the webpage; and modify the representation based on the information extracted from within the node to reconstruct the weblog post.
  • 16. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to, when the feed contains the first portion and the second portion, map the first and second portions into the representation of the weblog post.
  • 17. A hardware device as defined in claim 15, wherein the first portion is at least one of a date of the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  • 18. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to determine the feed does not contain the first portion of the weblog post based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  • 19. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post: extract dates from the markup language of the webpage; sort the extracted dates into ordered lists; filter the ordered lists to determine which of the lists correspond to entry dates of the weblog post; segment the weblog into entries based on dates from the filtered list as markers for the entries; extract the weblog post from the weblog entries based on post title markers; and identify a permalink and an author of the weblog post.
  • 20. A hardware device as defined in claim 19, wherein to filter the ordered lists, the instructions, when executed, cause the machine to: extract lists whose dates belong to a current year or a past year; extract non-singleton date lists; extract lists whose dates conform to a similar format; extract lists whose dates decrease monotonically; extract lists with most recent dates; extract lists with a longest date string representation; and extract lists with a greatest number of dates.
  • 21. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to screen scrape the weblog.
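For illustration only, the date-list filtering recited in claims 6, 13, and 20 might be sketched as a cascade of filters over candidate date lists. This is a minimal sketch and not the claimed implementation. The names `DateHit`, `DateList`, and `select_entry_date_lists` are hypothetical, as is the `fmt` label used to compare formats. The sketch further assumes that "a current year or a past year" means no date lies in a future year, that "decrease monotonically" tolerates equal adjacent dates, and that the last three criteria act as successive tie-breakers; none of these readings is stated in the claims.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass(frozen=True)
class DateHit:
    raw: str     # date string exactly as it appeared in the markup
    value: date  # parsed calendar date
    fmt: str     # hypothetical label for the pattern that matched, e.g. "long"


DateList = List[DateHit]


def _non_increasing(lst: DateList) -> bool:
    # Reverse chronological order; equal adjacent dates are tolerated (assumption).
    return all(a.value >= b.value for a, b in zip(lst, lst[1:]))


def select_entry_date_lists(candidates: List[DateList], today: date) -> List[DateList]:
    """Narrow candidate date lists to those most likely to mark weblog entries."""
    # Yes/no filters: a list failing any of them is discarded.
    predicates: List[Callable[[DateList], bool]] = [
        lambda lst: all(h.value.year <= today.year for h in lst),  # current or past year
        lambda lst: len(lst) > 1,                                  # non-singleton
        lambda lst: len({h.fmt for h in lst}) == 1,                # similar format
        _non_increasing,                                           # decreasing dates
    ]
    for keep in predicates:
        candidates = [lst for lst in candidates if keep(lst)]

    # Ranking filters, applied as tie-breakers until one list remains (assumption).
    rankers = [
        lambda lst: max(h.value for h in lst),     # most recent dates
        lambda lst: max(len(h.raw) for h in lst),  # longest date string
        len,                                       # greatest number of dates
    ]
    for score in rankers:
        if len(candidates) <= 1:
            break
        best = max(score(lst) for lst in candidates)
        candidates = [lst for lst in candidates if score(lst) == best]
    return candidates
```

Under these assumptions, a list of long-form entry dates would be preferred over a sidebar list of short-form dates carrying the same most recent date, because its date strings are longer.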
CROSS-REFERENCE TO RELATED APPLICATIONS

This patent arises from a continuation of U.S. patent application Ser. No. 11/454,301, which was filed on Jun. 16, 2006, which claims the priority of U.S. Provisional Patent Application Ser. No. 60/691,200, which was filed Jun. 16, 2005. U.S. patent application Ser. No. 11/454,301 and U.S. Provisional Patent Application Ser. No. 60/691,200 are incorporated by reference in their entirety.

Non-Patent Literature Citations (149)
Entry
Lada Adamic and Natalie Glance, The Political Blogoshpere and the 2004 U.S. Election: Divide the Blog, pp. 36-43 (Year: 2005).
Natalie Glance, Indexing Weblogs One Post at a Time, (Year: 2005).
Nanno et al., Automatically Collecting, Monitoring and Mining Japanese Weblogs, p. 320-321 (Year: 2004).
Glance et al., BlogPulse: Automated Trend Discovery for Weblogs, (Year: 2004).
Nakashima et al, Information Filtering for the Newspaper, Communications, Computers and Signal Processing, 10 Years PACRIM 1987-1997—Networking the Pacific Rim, vol. 1, Aug. 20, 1997 (4 pages).
Nanno et al., “Automatic Collection and Monitoring of Japanese Weblogs,” WWW2004 Workshop on the Weblogging Ecosytem: Aggregation, Analysis and Dynamics, New York NY, May 17-22, 2004 (7 pages).
Needel, “Word of Mouth Research Case Study: The Trans Fat Issue—Analysis of online consumer conversation to undersand how the Oreo lawsuit impacted word-of-mouth on trans fats,” Buzzmetrics, Aug. 16, 2004 (35 pages).
NetCurrent's archived website, Jun. 22, 2000 and Sep. 18, 2000, retrieved from <http://web.archive.org/web/2000622024845/www.netcurrents.com/report/sample.html>, retreived on Jan. 17, 2005 (16 pages).
Pang et al., “Thumbs Up? Sentiment Classification Using Machine Learning Techniques,” EMNLP '02 Proceedings of the ACL-02 conference on Empirical Methods in Natural Language Processing, vol. 10, 2002 (8 pages).
Reguly, “Caveat Emptor Rules on the Internet,” The Globe and Mail (Canada): Report on Business Column, p. B2, Apr. 10, 1999 (2 pages).
Reinartz, “Customer Lifetime Value Analysis: An Integrated Empirical Framework for Measurement and Explanation,” Dissertation, University of Houston, Apr. 1999 (68 pages).
Thomas, “International Marketing,” International Textbook Company, 1971 (p. 148).
Trigaux, “Cyberwar Erupts Over Free Speech Across Florida, Nation,” Knight-Ridder Tribune Business News, May 29, 2000 (4 pages).
Tull et al., “Marketing Research Measurement and Method,” MacMillan Publishing Company, 1984 (pp. 102, 103, 114, 115, 200, 201, 256).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/695,016, dated Jul. 21, 2004 (27 pages).
United States Patent and Trademark Office, “Requirement for Restriction/ Election,” issued in connection with U.S. Appl. No. 9/695,016, dated Jan. 26, 2005 (6 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/695,016, dated May 19, 2005 (13 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 9/695,016, dated Nov. 2, 2005 (13 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 9/686,516, dated May 10, 2006 (34 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 9/686,516, dated Jun. 29, 2005 (21 pages).
United States Patent and Trademark Office, “Requirement for Restriction/Election,” issued in connection with U.S. Appl. No. 9/686,516, dated Oct. 19, 2004 (6 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 9/686,516, dated Jan. 24, 2007 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/686,516, dated Jan. 28, 2005 (19 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/686,516, dated Nov. 22, 2005 (25 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 9/796,961, dated Feb. 24, 2003 (4 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/372,191, dated Dec. 23, 2008 (6 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/710,743, dated Jul. 21, 2010 (6 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/710,743, dated Jan. 12, 2009 (17 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/710,743, dated Jan. 8, 2010 (21 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/710,743, dated Jul. 29, 2008 (12 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/710,743, dated Aug. 7, 2009 (18 pages).
United States Patent and Trademark Office, “Pre-Brief Appeal Conference Decision,” issued in connection with U.S. Appl. No. 11/710,743, dated Jul. 20, 2009 (2 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/710,742, dated Jul. 26, 2010 (10 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/710,742, dated Jun. 8, 2009 (10 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/710,742, dated Jan. 30, 2009 (10 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/710,742, dated Aug. 7, 2008 (12 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/710,742, dated Jan. 5, 2010 (5 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/710,742, dated Oct. 3, 2007 (10 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/710,742, dated Dec. 8, 2008 (10 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/245,542, dated Oct. 8, 2008 (5 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/245,542, dated Aug. 21, 2008 (4 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/245,542, dated Dec. 16, 2008 (4 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/245,542, dated Jun. 12, 2008 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/245,542, dated Oct. 9, 2007 (11 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/651,661, dated May 19, 2009 (28 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/651,661, dated Jun. 4, 2008 (11 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/651,661, dated Aug. 3, 2007 (14 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/651,661, dated Dec. 4, 2008 (10 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 12/955,586, dated Jan. 28, 2015 (34 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 12/955,586, dated Dec. 5, 2013 (25 pages).
United States Patent and Trademark Office, “Advisory Action,” issued in connection with U.S. Appl. No. 12/955,586, dated Jan. 21, 2016 (2 pages).
United States Patent and Trademark Office, “Examiner's Answer to Appeal Brief,” issued in connection with U.S. Appl. No. 12/955,586, dated Mar. 1, 2016 (60 pages).
United States Patent and Trademark Office, “Examiner's Answer to Appeal Brief,” issued in connection with U.S. Appl. No. 12/955,586, dated Aug. 4, 2014 (33 pages).
United States Patent and Trademark Office, “Requirement for Restriction/Election,” issued in connection with U.S. Appl. No. 12/955,586, dated Jun. 6, 2013 (6 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 12/955,586, dated Aug. 20, 2013 (26 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 9/879,220, dated Dec. 1, 2006 (8 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 9/879,220, dated Apr. 28, 2005 (16 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/879,220, dated Dec. 2, 2004 (15 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 9/879,220, dated Mar. 28, 2006 (19 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 10/801,758, dated Jan. 6, 2010 (8 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 10/801,758, dated Apr. 29, 2009 (41 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 10/801,758, dated Oct. 1, 2008 (40 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 10/801,758, dated Mar. 7, 2008 (31 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/239,695, dated Sep. 21, 2006 (9 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/239,695, dated Dec. 13, 2006 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/239,695, dated Apr. 7, 2006 (9 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/239,632, dated Sep. 21, 2006 (8 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/239,632, dated Dec. 18, 2006 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/239,632, dated Apr. 5, 2006 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/364,169, dated Jun. 11, 2007 (8 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/364,169, dated Apr. 23, 2008 (10 pages).
United States Patent and Trademark Office, “Advisory Action,” issued in connection with U.S. Appl. No. 11/364,169, dated Jul. 10, 2008 (4 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/517,418, dated Aug. 19, 2008 (9 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/897,984, dated May 22, 2008 (17 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/897,984, dated Apr. 21, 2009 (18 pages).
United States Patent and Trademark Office, “Advisory Action,” issued in connection with U.S. Appl. No. 11/897,984, dated Jan. 21, 2009 (3 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/897,984, dated Nov. 13, 2008 (20 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/897,984, dated Oct. 7, 2009 (10 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/517,417, dated Dec. 17, 2008 (6 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/517,417, dated May 29, 2009 (6 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/239,696, dated Sep. 12, 2007 (8 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/239,696, dated Sep. 22, 2006 (11 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/239,696, dated Nov. 21, 2007 (4 pages).
United States Patent and Trademark Office, “Advisory Action,” issued in connection with U.S. Appl. No. 11/454,301, dated Jun. 25, 2009 (3 pages).
United States Patent and Trademark Office, “Examiner's Answer to Appeal Brief,” issued in connection with U.S. Appl. No. 11/454,301, dated Mar. 12, 2012 (17 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/454,301, dated Apr. 7, 2009 (17 pages).
United States Patent and Trademark Office, “Final Office Action,” issued in connection with U.S. Appl. No. 11/454,301, dated Mar. 25, 2011 (16 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/454,301, dated Sep. 18, 2008 (17 pages).
United States Patent and Trademark Office, “Non-Final Office Action,” issued in connection with U.S. Appl. No. 11/454,301, dated Jul. 13, 2010 (16 pages).
United States Patent and Trademark Office, “Notice of Allowance,” issued in connection with U.S. Appl. No. 11/454,301, dated Jun. 1, 2015 (8 pages).
United States Patent and Trademark Office, “Pre-Brief Appeal Conference Decision,” issued in connection with U.S. Appl. No. 11/454,301, dated Aug. 3, 2011 (2 pages).
United States Patent and Trademark Office, “Pre-Brief Appeal Conference Decision,” issued in connection with U.S. Appl. No. 11/454,301, dated May 5, 2010 (2 pages).
United States Patent and Trademark Office, “Patent Board Decision,” issued in connection with U.S. Appl. No. 11/454,301, dated Feb. 6, 2015 (10 pages).
Voorhees, “The TREC-8 Question Answering Track Report,” National Institute of Standards and Technology, 1999 (6 pages).
Wiebe et al., “Identifying Collocations for Recognizing Opinions,” Proceedings of ACL/EACL'01 Workshop on Collocation, Toulouse, France, Apr. 9, 2001 (9 pages).
Yang, “An Evaluation of Statistical Approaches to Text Categorization,” CMU-CS-97-127, Carnegie Mellon University, Pittsburgh, PA, Apr. 10, 1997 (12 pages).
zagat.com, retrieved from www.zagat.com as archived on Feb. 1999 (1 page).
zagat.com, retrieved from http://web.archive.org/web/199990418081713/www.zagat.com/partners.asp>, as archived on Apr. 29, 1999, retrieved Jun. 18, 2004 (34 pages).
Zufryden “New Film Website Promotion and Box-Office Performance,” Journal of Advertising Research, Jan.-Apr. 2000 (11 pages).
Adamic et al., “The Political Blogosphere and the 2004 U.S. Election: Divided They Blog,” Proceeding WWW-2005 2nd Annual Workshop on the Weblogging Ecosytem, Chiba, Japan, Mar. 4, 2005 (16 pages).
Adar et al., “Implicit Structure and the Dynamics of Blogspace,” Proceedings WWW-2004 Workshop on the Weblogging Ecosystem, New York, NY, Mar. 18, 2004 (8 pages).
Aliod et al., “A Real Word Implementation of Answer Extraction,” Department of Computer Science, University of Zurich, Winterthurerstr., Aug. 28, 1998 (6 pages).
Baumes et al., “SIGHTS: A Software System for Finding Coalitions and Leaders in a Social Network,” Intelligence and Security Informatics, May 23, 2007 (7 pages).
Bishop, “Arrow Question/Answering Systems,” Language Computer Corporation, 1999 (3 pages).
Bizrate's archived website, Jan. 1999, retrieved from <http://web.archive.org/web*/http://www.bizrate.com>, retrieved on Oct. 29, 2004 (22 pages).
Blum, “Empirical Support for Winnow and Weighted-Majority Algorithms: Results on a Calendar Scheduling Domain,” Machine Learning, vol. 26, No. 5, 1997 (5 pages).
Bournellis, “Tracking the Hits on Web Sites,” Communications International, vol. 22, Issue 9, Sep. 1995 (3 pages).
Business Wire, “Delahaye Group to Offer NetBench: High Level Web-Site Qualitative Analysis and Reporting; NetBench Builds on Systems Provided by I/PRO and Internet Media Services,” Business Wire, Inc., May 31, 1995 (3 pages).
Chaum, “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms,” Communications of the ACM, vol. 24, No. 2, Feb. 1981 (5 pages).
Chaum et al., “A Secure and Privacy-Protecting Protocol for Transmitting Personal Information Between Organizations,” Advances in Cryptology-CRYPTO ,86, Aug. 11, 1986 (53 pages).
Cohen, “Data Integration Using Similarity Joins and a Word-Based Information Representation Language,” ACM Transactions on Information Systems, vol. 18, No. 3, Jul. 2000 (34 pages).
Cohn et al., “Active Learning with Statistical Models,” Journal of Artificial Intelligence Research vol. 4, Mar. 1996 (17 pages).
D'Astous et al., “Consumer Evaluations of Movies on the Basis of Critics' Judgments,” Psychology & Marketing, vol. 16, No. 8, Dec. 1999 (18 pages).
Dagan et al, “Mistake-Driven Learning in Text Categorization,” EMNLP'97, 2nd Conference on Empirical Methods in Natural Language Processing, 1997 (9 pages).
Dialogic's archived website, May 12, 2000, retrieved from <http://web.archive.org/web/*/http://www.dialogic.com>, retrieved on Dec. 2, 2003 (34 pages).
Dillon et al., “Marketing Research in a Marketing Environment,” Third Edition, Times Mirror/Mosby College Publishing, 1987 (pp. 98, 286, 288).
Eliashberg et al., “Film Critics: Influencers or Predictors?”, Journal of Marketing, vol. 61, No. 2, 1997, retrieved from <http://search.proquest.com/printviewfile?accountid=14753>, retrieved on Sep. 27, 2012 (12 pages).
European Patent Office, “Intention to Grant,” issued in connection with European Patent Application No. 02744622.8, dated Jul. 28, 2008 (36 pages).
European Patent Office, “Supplementary European Search Report under Article 157(2)(a) EPC,” issued in connection with European Patent Application No. 02744622.8, dated Sep. 26, 2007 (3 pages).
EWatch's archived website, May 22, 1998, retrieved from <http://web.archive.org/web/19980522190526/www/ewatch.com/ind_main.html>, retrieved on Sep. 8, 2004 (50 pages).
Farber, “IP: eWatch and Cybersleuth,” Jun. 29, 2000, retrieved from <http://www.interesting-people.org/archives/interesting-people/200006/msg0090.html>, retrieved on Jan. 21, 2005 (4 pages).
Freund et al., “Selective Sampling Using the Query by Committee Algorithm,” Machine Learning, vol. 28, 1997 (36 pages).
Ginsburgh et al., “On the Perceived Quality of Movies,” Journal of Cultural Economics, vol. 23, No. 4, Nov. 1999 (15 pages).
Glance et al., “Analyzing Online Discussion for Marketing Intelligence,” Proceedings WWW-2005 2nd Annual Workshop on the Weblogging Ecosystem, Chiba, Japan, May 10-14, 2005 (2 pages).
Glance et al., “Deriving Marketing Intelligence from Online Discussion,” 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, Aug. 21-24, 2005 (10 pages).
Grefenstette et al., “Chapter 10—Validating the Coverage of Lexical Resources for Affect Analysis and Automatically Classifying New Words Along Semantic Axes,” Mar. 2006 (16 pages).
Harabagiu et al., “An Intelligent System for Question Answering,” University of Southern California, Southern Methodist University, 2000 (5 pages).
Harabagiu et al., “Experiments with Open-Domain Textual Question Answering,” Department of Computer Science and Engineering at Southern Methodist University, 2000 (7 pages).
Harabagiu et al., “Mining Textual Answers with Knowledge-Based Indicators,” Department of Computer Science and Engineering at Southern Methodist University, 1996 (5 pages).
Hornaday, “Two Come Out of the Blue; ‘There's Something About Mary’ and ‘Smoke Signals’ Make it Big; Film:,” The Sun, Aug. 23, 1998 (2 pages).
Housley et al., “Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,” Network Working Group, The Internet Society, Apr. 2002, retrieved from <http://www.ietf.org/rfc/rfc3280.txt>, retrieved on Apr. 30, 2009 (121 pages).
International Searching Authority, “International Preliminary Report on Patentability,” issued in connection with International Patent Application No. PCT/US2007/021035, dated Mar. 31, 2009 (9 pages).
International Searching Authority, “Written Opinion,” issued in connection with International Patent Application No. PCT/US2007/021035, dated Jul. 1, 2008 (8 pages).
International Searching Authority, “International Search Report,” issued in connection with International Patent Application No. PCT/US2007/021035, dated Jul. 1, 2008 (1 page).
International Searching Authority, “International Preliminary Report on Patentability,” issued in connection with International Patent Application No. PCT/IL2006/000905, dated Feb. 5, 2008 (4 pages).
International Searching Authority, “International Search Report and Written Opinion,” issued in connection with International Patent Application No. PCT/IL2006/000905, dated Jul. 2, 2007 (4 pages).
International Searching Authority, “International Search Report,” issued in connection with International Patent Application No. PCT/US2005/035321, dated May 8, 2007 (1 page).
International Searching Authority, “International Preliminary Report on Patentability,” issued in connection with International Patent Application No. PCT/US2005/035321, dated Jun. 19, 2007 (4 pages).
Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Machine Learning: ECML'98, Tenth European Conference on Machine Learning, 1998 (7 pages).
Khan et al., “Categorizing Web Documents Using Competitive Learning: An Ingredient of a Personal Adaptive Agent,” International Conference on Neural Networks, vol. 1, Jun. 9-12, 1997 (4 pages).
Katz, “From Sentence Processing to Information Access on the World Wide Web,” Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Feb. 27, 1997, retrieved from <http://www.ai.mit.edu/people/boris/webaccess/>, retrieved on Jul. 2, 2001 (20 pages).
Kleppner, Advertising Procedure, 6th Edition, Prentice-Hall, Inc., 1977 (p. 492).
Kotler, Marketing Management, Prentice Hall International Inc., 1997 (pp. 256-257, 617-619, 665-667).
Lenz et al., “Question Answering with Textual CBR,” Department of Computer Science, Humboldt University Berlin, D-10099, 1998 (12 pages).
Littlestone, “Learning Quickly When Irrelevant Attributes Abound: A New Linear Threshold Algorithm,” Machine Learning, vol. 2, 1988 (34 pages).
Marlow, “Audience, Structure and Authority in the Weblog Community,” International Communication Association Conference, MIT Media Laboratory, New Orleans, LA, 2004 (9 pages).
McLachlan et al., “The EM Algorithm and Extensions,” Wiley Series in Probability and Statistics, 1997 (302 pages).
McCallum et al., “Text Classification by Bootstrapping with Keywords, EM and Shrinkage,” Just Research and Carnegie Mellon University, Pittsburgh, PA, 1999 (7 pages).
Moldovan et al., “LASSO: A Tool for Surfing the Answer Net,” Department of Computer Science and Engineering at Southern Methodist University, 1999 (9 pages).
Related Publications (1)
Number Date Country
20160117390 A1 Apr 2016 US
Provisional Applications (1)
Number Date Country
60691200 Jun 2005 US
Continuations (1)
Number Date Country
Parent 11454301 Jun 2006 US
Child 14881071 US