The present invention relates to electronic documents, and particularly relates to a method and system of assigning a publication date for at least one electronic document.
Programmatically assigning publication dates, or posting dates, to electronic documents in a large, hierarchical, linked collection is challenging when the electronic documents contain both unstructured text and associated metadata that may include date information. For example, the electronic documents may be Web pages. A date associated with a Web page is not easily discerned programmatically because Web pages are unstructured and frequently modified.
1. Need for Assigning Publication Dates
The publication date associated with an electronic document is essential (1) to track trends in the subject matter of the electronic document and (2) to understand the context in which the electronic document was written. The publication date also gives a reader of the electronic document an indication of the currency of the content in the electronic document.
2. Challenge of Assigning Dates
An assigned date for an electronic document could be (a) the date when the electronic document was posted on a Web site, (b) the date when the content of the electronic document was written by the author, or (c) the “street date” of the publication (i.e. when the publication actually is first made available in paper form).
Even for electronic documents where dates can be assigned, date formats are not standardized and vary among (a) electronic documents, (b) sources of the electronic documents (i.e. Web sites), and (c) country sources. In addition, different types of dates (e.g. expiration dates, historical dates) may occur in electronic documents.
In addition, all-numeric date patterns may be ambiguous. A common form of ambiguous date pattern is a date pattern in which the month and day may be interchanged (i.e. it is not clear if the date is of the form mmddyy or ddmmyy (such as 09/08/04)). Other language-specific complexities exist as well. For example, in Japanese, there may be ambiguity with the year as well (e.g., “12.11.10” may be December 11, 1910 or Heisei Year 10 (1998), November 10).
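By way of illustration only, the month/day ambiguity described above can be shown with a short Python sketch. The function name `interpretations` and the two-digit-year pivot are assumptions made for this example and are not part of the described method.

```python
from datetime import date


def interpretations(text, pivot=50):
    """Return every valid calendar reading of an 'nn/nn/yy' string."""
    a, b, yy = (int(part) for part in text.split("/"))
    # Assumption: two-digit years below the pivot mean 20yy, otherwise 19yy.
    year = 2000 + yy if yy < pivot else 1900 + yy
    found = {}
    for label, month, day in (("mm/dd/yy", a, b), ("dd/mm/yy", b, a)):
        try:
            found[label] = date(year, month, day)
        except ValueError:
            pass  # this reading is not a real calendar date
    return found


# "09/08/04" has two valid readings; "25/12/04" has only one.
both = interpretations("09/08/04")
only_one = interpretations("25/12/04")
```

A string is ambiguous in the sense used here when the returned dictionary holds more than one entry.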
3. Prior Art Systems
Currently, prior art methods and systems of assigning a publication date to at least one electronic document fail to address this need. In a first prior art system, as shown in prior art
publication date of an electronic document from the metadata of the document. Therefore, a method and system of assigning a publication date for at least one electronic document is needed.
The present invention provides a method and system of assigning a publication date for at least one electronic document, where the publication date includes the year that the document was published, the month that the document was published, and the day that the document was published. In an exemplary embodiment, the method and system include (1) recognizing the publication date in the document by regular expression pattern matching, (2) if the publication date is ambiguous, resolving the ambiguous publication date, and (3) validating the publication date.
In an exemplary embodiment, the recognizing includes determining at least one candidate publication date from the document identifier of the document. In an exemplary embodiment, the determining includes (1) if only one candidate publication date is determined and the candidate publication date comprises a year, a month, and a day, assigning the candidate publication date as the publication date for the document, (2) if more than one candidate publication date is determined and if each of the more than one candidate publication date comprises a year, a month, and a day, assigning the most recent candidate publication date as the publication date for the document, and (3) if the candidate publication date specifies only a month and a year, (a) scanning the textual content of the document for a date whose month and year are the same as the month and year of the candidate publication date, (b) if a scanned date whose month and year are the same as the month and year of the candidate publication date is found, assigning the scanned date as the publication date for the document, and (c) if a scanned date whose month and year are the same as the month and year of the candidate publication date is not found, assigning an arbitrary day for the publication date for the document.
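The identifier rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the document identifier is a URL that carries dates in year-first numeric form, the dates already recognized in the text are supplied as a list, and the "arbitrary day" is taken to be the first of the month. The names `FULL`, `YEAR_MONTH`, and `date_from_identifier` are hypothetical.

```python
import re
from datetime import date

# Assumed URL date shapes: 2004/09/08, 2004-09-08, 20040908, and 2004/09.
FULL = re.compile(
    r"(?<!\d)((?:19|20)\d{2})[/_-]?(0[1-9]|1[0-2])[/_-]?(0[1-9]|[12]\d|3[01])(?!\d)"
)
YEAR_MONTH = re.compile(r"(?<!\d)((?:19|20)\d{2})[/_-]?(0[1-9]|1[0-2])(?!\d)")


def date_from_identifier(url, text_dates, default_day=1):
    """Apply rules (1) to (3) to the candidate dates found in a URL."""
    full = []
    for y, m, d in FULL.findall(url):
        try:
            full.append(date(int(y), int(m), int(d)))
        except ValueError:
            continue  # e.g. 2004/02/31 is not a calendar date
    if full:
        # Rules (1) and (2): a single candidate, or the most recent of several.
        return max(full)
    partial = YEAR_MONTH.search(url)
    if partial is None:
        return None
    year, month = int(partial.group(1)), int(partial.group(2))
    # Rule (3)(a)/(b): prefer a date from the text with the same month and year.
    for candidate in text_dates:
        if (candidate.year, candidate.month) == (year, month):
            return candidate
    # Rule (3)(c): otherwise fall back to an arbitrary (here: fixed) day.
    return date(year, month, default_day)
```

For example, a URL containing both `2004/09/08` and `20040910` yields 10 September 2004, while a URL containing only `2004-09` yields whichever September 2004 date appears in `text_dates`, or the first of the month if none does.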
In an exemplary embodiment, the recognizing includes determining the publication date from the textual content of the document. In an exemplary embodiment, the determining includes assigning the first date in the textual content as the publication date for the document. In an exemplary embodiment, the recognizing includes determining the publication date from the metadata of the document. In an exemplary embodiment, the determining includes, if the document is a static Web page and if the HTTP Last Modified date is present in the document, assigning the HTTP Last Modified date as the publication date for the document.
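The metadata case can be sketched as follows, assuming that the HTTP response headers are available as a dictionary and that a separate, unspecified check has already classified the page as static. The function name `date_from_metadata` and the `is_static` flag are hypothetical; `email.utils.parsedate_to_datetime` from the Python standard library is used to parse the header value.

```python
from email.utils import parsedate_to_datetime


def date_from_metadata(headers, is_static):
    """Use the HTTP Last-Modified header as the publication date of a static page."""
    value = headers.get("Last-Modified")
    if not is_static or value is None:
        return None
    try:
        return parsedate_to_datetime(value).date()
    except (TypeError, ValueError):
        return None  # header present but not a parseable HTTP date


found = date_from_metadata({"Last-Modified": "Wed, 08 Sep 2004 10:15:00 GMT"}, True)
```

The static-page condition matters because a dynamically generated page typically reports the time of generation rather than the time of publication.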
In an exemplary embodiment, the recognizing includes, for the regular expression pattern matching, using date patterns defined to support dates specified with textual month names. In an exemplary embodiment, the recognizing includes, for the regular expression pattern matching, using date patterns defined to support dates specified with numeric patterns.
In an exemplary embodiment, the resolving includes, if the publication date has an unambiguous date pattern, using the unambiguous date pattern in the regular expression pattern matching. In an exemplary embodiment, the resolving includes, if the document is fetched repeatedly and if the publication date has an ambiguous date pattern, (1) saving the publication date, (2) if the document is re-fetched and if the date pattern of the saved publication date matches the date pattern of the publication date of the re-fetched document, determining the portion of the publication date that has changed, (3) comparing the determined portion to the time period during which the document was re-fetched, (4) based on the comparing, determining the date pattern for the document, and (5) using the determined date pattern in the regular expression pattern matching.
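One possible reading of the re-fetch comparison is sketched below. The heuristic is an assumption made for this example and is not stated in the text above: if exactly one of the two leading fields changed between fetches, and it advanced by the number of days that elapsed between the fetches, that field is taken to be the day. The function name `infer_pattern` is hypothetical.

```python
from datetime import date


def infer_pattern(saved, saved_fetch, current, current_fetch):
    """Infer 'dd/mm/yy' or 'mm/dd/yy' from two fetches of the same document.

    saved and current are 'nn/nn/yy' strings; the *_fetch values are fetch dates.
    """
    old = [int(part) for part in saved.split("/")]
    new = [int(part) for part in current.split("/")]
    # Determine the portion of the date that has changed.
    changed = [i for i in (0, 1) if old[i] != new[i]]
    if len(changed) != 1:
        return None
    # Compare the change with the time period between the two fetches.
    elapsed_days = (current_fetch - saved_fetch).days
    delta = new[changed[0]] - old[changed[0]]
    if elapsed_days > 0 and delta == elapsed_days:
        return "dd/mm/yy" if changed[0] == 0 else "mm/dd/yy"
    return None


pattern = infer_pattern("09/08/04", date(2004, 8, 9), "10/08/04", date(2004, 8, 10))
```

The inferred pattern would then select the regular expression used for later fetches of the same document. Fetches that straddle a month boundary change both fields and are left unresolved by this sketch.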
In an exemplary embodiment, the resolving includes (1) tracking within a hierarchy of electronic documents the locations of the electronic documents having unambiguous date patterns and (2) if the publication date has an ambiguous date pattern, using the unambiguous date pattern associated with the tracked location of the document in the regular expression pattern matching. In an exemplary embodiment, the resolving includes, if the publication date has an ambiguous date pattern, (1) scanning the document for a month name corresponding to the publication date and (2) using a date pattern that conforms to the scanned month name and the publication date in the regular expression pattern matching.
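The month-name approach can be illustrated with the following Python sketch, which assumes an English month vocabulary and an ambiguous date of the form nn/nn/yy. The function name `pattern_from_month_names` is hypothetical.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def pattern_from_month_names(ambiguous, text):
    """Pick the reading of 'nn/nn/yy' whose month is named in the text."""
    a, b, _ = (int(part) for part in ambiguous.split("/"))
    words = set(re.findall(r"[a-z]+", text.lower()))

    def named(number):
        if not 1 <= number <= 12:
            return False
        name = MONTHS[number - 1]
        # Accept the full name or its three-letter abbreviation.
        return name in words or name[:3] in words

    first, second = named(a), named(b)
    if first and not second:
        return "mm/dd/yy"
    if second and not first:
        return "dd/mm/yy"
    return None  # neither or both months are named: still ambiguous


pattern = pattern_from_month_names("09/08/04", "Posted in August 2004 by staff")
```

Because "may" is also a common English verb, a practical vocabulary would probably need to treat that month more carefully than this sketch does.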
In an exemplary embodiment, the resolving includes, if the publication date has an ambiguous date pattern, (1) maintaining a list of default date patterns for a plurality of countries of origin of electronic documents and (2) if the country of origin of the document is determined and is in the list, using the default date pattern for the country of origin in the regular expression pattern matching.
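A default-pattern list of this kind might look like the following sketch. The table contents, the use of two-letter country codes as keys, and the function name `resolve_by_country` are assumptions made for illustration; how the country of origin is determined is outside the scope of the example.

```python
from datetime import datetime

# Hypothetical defaults, expressed as strptime formats.
DEFAULT_FORMATS = {
    "us": "%m/%d/%y",  # month first
    "gb": "%d/%m/%y",  # day first
    "de": "%d/%m/%y",  # day first
    "jp": "%y/%m/%d",  # year first
}


def resolve_by_country(text, country_code):
    """Interpret an ambiguous numeric date using the country's default pattern."""
    fmt = DEFAULT_FORMATS.get((country_code or "").lower())
    if fmt is None:
        return None  # country unknown or not in the list
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None  # the text does not fit the default pattern


us_reading = resolve_by_country("09/08/04", "us")
gb_reading = resolve_by_country("09/08/04", "gb")
```

The same string resolves to 8 September 2004 under the "us" entry and to 9 August 2004 under the "gb" entry.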
In an exemplary embodiment, the validating includes characterizing the publication date as a valid publication date if the day of the publication date is between 1 and 31, the month of the publication date is between 1 and 12, and the publication date is not more than a specified number of days in the future. In an exemplary embodiment, the beginning of the specified number of days is the HTTP Last Modified date of the document. In an exemplary embodiment, the beginning of the specified number of days is the date that the document was obtained. In an exemplary embodiment, the specified number of days ranges from 1 day to 10 days.
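The validation test can be sketched as follows. The function name `is_valid` and its parameters are hypothetical; `reference` stands for either the HTTP Last Modified date or the date the document was obtained. The calendar check (for example, rejecting 31 February) is an addition in this sketch that goes beyond the range checks stated above.

```python
from datetime import date, timedelta


def is_valid(year, month, day, reference, max_future_days=10):
    """Range-check a date and reject dates too far beyond the reference date."""
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return False
    try:
        candidate = date(year, month, day)
    except ValueError:
        return False  # assumption: impossible calendar dates are also rejected
    return candidate <= reference + timedelta(days=max_future_days)


fetched = date(2004, 9, 1)
ok = is_valid(2004, 9, 8, fetched)       # 7 days ahead of the fetch date
too_far = is_valid(2004, 9, 20, fetched)  # 19 days ahead of the fetch date
```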
In an exemplary embodiment, the recognizing includes (1) determining at least one candidate publication date from the document identifier of the document, (2) if the determining is unsuccessful, identifying the publication date from the textual content of the document, and (3) if the identifying is unsuccessful, noting the publication date from the metadata of the document. In an exemplary embodiment, the determining includes (1) if only one candidate publication date is determined and the candidate publication date comprises a year, a month, and a day, assigning the candidate publication date as the publication date for the document, (2) if more than one candidate publication date is determined and if each of the more than one candidate publication date comprises a year, a month, and a day, assigning the most recent candidate publication date as the publication date for the document, and (3) if the candidate publication date specifies only a month and a year, (a) scanning the textual content of the document for a date whose month and year are the same as the month and year of the candidate publication date, (b) if a scanned date whose month and year are the same as the month and year of the candidate publication date is found, assigning the scanned date as the publication date for the document, and (c) if a scanned date whose month and year are the same as the month and year of the candidate publication date is not found, assigning an arbitrary day for the publication date for the document.
In an exemplary embodiment, the identifying includes assigning the first date in the textual content as the publication date for the document. In an exemplary embodiment, the noting includes, if the document is a static Web page and if the HTTP Last Modified date is present in the document, assigning the HTTP Last Modified date as the publication date for the document.
The present invention also provides a method and system of assigning a publication date for at least one electronic document, where the publication date includes the year that the document was published and the month that the document was published. In an exemplary embodiment, the method and system include (1) recognizing the publication date in the document by regular expression pattern matching, (2) if the publication date is ambiguous, resolving the ambiguous publication date, and (3) validating the publication date.
In an exemplary embodiment, the recognizing includes determining at least one candidate publication date from the document identifier of the document. In an exemplary embodiment, the determining includes (1) if only one candidate publication date is determined, assigning the candidate publication date as the publication date for the document and (2) if more than one candidate publication date is determined, assigning the most recent candidate publication date as the publication date for the document.
The present invention provides a method and system of assigning a publication date for at least one electronic document, where the publication date includes the year that the document was published, the month that the document was published, and the day that the document was published. In an exemplary embodiment, the method and system include (1) recognizing the publication date in the document by regular expression pattern matching, (2) if the publication date is ambiguous, resolving the ambiguous publication date, and (3) validating the publication date.
Referring to
Recognizing the Publication Date
Determining the Publication Date from the Document Identifier of the Document
Referring next to
Referring next to
Referring next to
Determining the Publication Date from the Content of the Document
Referring next to
In an exemplary embodiment, two kinds of text are not scanned for the publication date: (a) anchor text used for annotating hyperlinks for Web pages, because dates found in anchor text describe the page that the link points to rather than the page containing the link, and (b) template or boilerplate text that occurs on all documents in a common node of a document hierarchy. Template text is found by existing algorithms such as those described in (1) L. Yi, B. Liu, X. Li, Eliminating Noisy Information in Web Pages for Data Mining, SIGKDD 2003 and (2) Z. Bar-Yossef and S. Rajagopalan, Template Detection via Data Mining and Its Applications, WWW 2002.
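The exclusion of anchor text can be illustrated with the Python standard library HTML parser. This sketch covers only the anchor-text case; template detection relies on the cited algorithms and is not attempted here. The class name `NonAnchorText` and the function name `scannable_text` are hypothetical.

```python
from html.parser import HTMLParser


class NonAnchorText(HTMLParser):
    """Collect page text, skipping everything inside <a> ... </a>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def scannable_text(html):
    """Return the text of a page that remains eligible for date scanning."""
    parser = NonAnchorText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = '<p>Posted 8 September 2004. <a href="/old">Archive for 1 May 2003</a></p>'
text = scannable_text(page)
```

In this example the date in the link text belongs to the archived page, so only "Posted 8 September 2004." is passed on for date recognition.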
Determining the Publication Date from the Metadata
Referring next to
Using Date Patterns
Referring next to
Referring next to
In an exemplary embodiment, recognizing step 210 includes (a) detecting abbreviated and full month names, (b) detecting dates in multiple languages by use of a static vocabulary of month names, and (c) detecting the day of the publication date in either cardinal form (e.g. 1, 2, 3) or ordinal form (e.g. 1st, 2nd, 3rd). In an exemplary embodiment, if the publication date includes only a month and year, then a fixed day of month is assigned (e.g. the first of the month).
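A minimal Python sketch of textual-month date patterns is given below. The vocabulary, the three pattern shapes, and the names `MONTHS`, `PATTERNS`, and `find_textual_date` are assumptions made for illustration; the two non-English entries only indicate how a static multilingual vocabulary might be extended.

```python
import re
from datetime import date

ENGLISH = ["january", "february", "march", "april", "may", "june", "july",
           "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(ENGLISH, start=1)}
MONTHS.update({name[:3]: number for name, number in list(MONTHS.items())})
MONTHS.update({"sept": 9, "août": 8, "märz": 3})  # illustrative extra entries

# Longest names first so "september" wins over "sept" and "sep".
NAME = "|".join(sorted(map(re.escape, MONTHS), key=len, reverse=True))
DAY = r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"  # cardinal or ordinal day
YEAR = r"(?P<year>\d{4})"
PATTERNS = [
    re.compile(rf"\b(?P<month>{NAME})\.?\s+{DAY},?\s+{YEAR}\b", re.I),   # Sept. 8th, 2004
    re.compile(rf"\b{DAY}\s+(?P<month>{NAME})\.?,?\s+{YEAR}\b", re.I),   # 8th September 2004
    re.compile(rf"\b(?P<month>{NAME})\.?,?\s+{YEAR}\b", re.I),           # September 2004
]


def find_textual_date(text, default_day=1):
    """Return the first date matched by the patterns, tried in listed order."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        fields = match.groupdict()
        # A month-and-year match has no day, so a fixed day is assigned.
        day = int(fields.get("day") or default_day)
        try:
            return date(int(fields["year"]), MONTHS[fields["month"].lower()], day)
        except ValueError:
            continue  # e.g. "February 30, 2004"
    return None


found = find_textual_date("Posted Sept. 8th, 2004 by staff")
```

The patterns are tried in the order listed rather than by position in the text, which is a simplification of this sketch.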
In an exemplary embodiment, a numeric pattern of the form nnnnnn (or nnnnnnnn) is considered as a candidate publication date only if it can be divided into a pattern of ddmmyy or mmddyy (or ddmmyyyy or mmddyyyy) where dd is less than or equal to 31, mm is less than or equal to 12, and yy (yyyy) is up to the current year.
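The division of an all-numeric run into date fields can be sketched as follows. The function name `split_numeric`, the lower bounds of 1 on day and month, and the rule for expanding two-digit years (a two-digit year greater than the current two-digit year is placed in the previous century) are assumptions of this example.

```python
def split_numeric(digits, current_year):
    """Return the (pattern, year, month, day) readings of a 6- or 8-digit run."""
    if len(digits) not in (6, 8) or not digits.isdecimal():
        return []
    a, b, year = int(digits[:2]), int(digits[2:4]), int(digits[4:])
    if len(digits) == 6:
        # Assumption: expand yy so that the result never exceeds the current year.
        century = current_year - current_year % 100
        year += century if year <= current_year % 100 else century - 100
    if year > current_year:
        return []
    readings = []
    if 1 <= a <= 31 and 1 <= b <= 12:
        readings.append(("ddmm", year, b, a))
    if 1 <= a <= 12 and 1 <= b <= 31:
        readings.append(("mmdd", year, a, b))
    return readings


ambiguous = split_numeric("090804", 2024)      # two readings
unambiguous = split_numeric("25122003", 2024)  # one reading
rejected = split_numeric("123456", 2024)       # no reading
```

A run that yields two readings is an ambiguous candidate and would be handed to the resolving step; a run that yields none is not treated as a date.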
Resolving Ambiguous Dates
Referring next to
Referring next to
Referring next to
(1) “www.name.com count of mm/dd/yy count of dd/mm/yy”
or
(2) “www.name.com/directory count of mm/dd/yy count of dd/mm/yy”.
In an exemplary embodiment, the counts are counts of unambiguous dates identified.
In addition, tracking step 432 includes collapsing a directory in the hierarchy upward when one date pattern is more than a t % majority in all subdirectories in the directory. For example, tracking step 432 would collapse
“www.name.com/topdirectory/directory1” and
“www.name.com/topdirectory/directory2”
if dd/mm/yy is an 80% majority in both directory1 and directory2. When an ambiguous date is identified and belongs to a node with a t % majority format, the date is interpreted according to the unambiguous date pattern that holds the majority at that node.
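The per-node counts, the upward collapse, and the lookup for an ambiguous date can be sketched in Python as follows. The class name `PatternTracker`, its method names, and the choice to search upward through parent nodes during lookup are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


class PatternTracker:
    """Count unambiguous date patterns per node of a site hierarchy."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold           # the t % majority, as a fraction
        self.counts = defaultdict(Counter)   # node path -> pattern counts

    def record(self, node, pattern):
        self.counts[node][pattern] += 1

    def majority(self, node):
        counts = self.counts.get(node)
        if not counts:
            return None
        pattern, n = counts.most_common(1)[0]
        return pattern if n / sum(counts.values()) > self.threshold else None

    def collapse(self, parent):
        """Merge the direct children of parent if all share one majority pattern."""
        prefix = parent + "/"
        children = [n for n in self.counts
                    if n.startswith(prefix) and "/" not in n[len(prefix):]]
        majorities = {self.majority(child) for child in children}
        if not children or len(majorities) != 1 or None in majorities:
            return False
        for child in children:
            self.counts[parent] += self.counts.pop(child)
        return True

    def lookup(self, node):
        """Return the majority pattern of the nearest tracked ancestor, if any."""
        while node:
            pattern = self.majority(node)
            if pattern:
                return pattern
            node = node.rpartition("/")[0]
        return None


tracker = PatternTracker(threshold=0.8)
for _ in range(9):
    tracker.record("www.name.com/topdirectory/directory1", "dd/mm/yy")
tracker.record("www.name.com/topdirectory/directory1", "mm/dd/yy")
for _ in range(5):
    tracker.record("www.name.com/topdirectory/directory2", "dd/mm/yy")
collapsed = tracker.collapse("www.name.com/topdirectory")
```

After the collapse, an ambiguous date found anywhere under `www.name.com/topdirectory`, including a directory with no recorded counts of its own, would be read as dd/mm/yy.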
Referring next to
Referring next to
Validating the Publication Date
Referring next to
Publication Date Including a Year and Month
The present invention also provides a method and system of assigning a publication date for at least one electronic document, where the publication date includes the year that the document was published and the month that the document was published. In an exemplary embodiment, the method and system include (1) recognizing the publication date in the document by regular expression pattern matching, (2) if the publication date is ambiguous, resolving the ambiguous publication date, and (3) validating the publication date.
Referring to
Recognizing the Publication Date
Determining the Publication Date from the Document Identifier of the Document
Referring next to
Determining the Publication Date from the Content of the Document
Referring next to
Determining the Publication Date from the Metadata
Referring next to
Using Date Patterns
Referring next to
In an exemplary embodiment, recognizing step 710 includes (a) detecting abbreviated and full month names, (b) detecting dates in multiple languages by use of a static vocabulary of month names, and (c) detecting the day of the publication date in either cardinal form (e.g. 1, 2, 3) or ordinal form (e.g. 1st, 2nd, 3rd). In an exemplary embodiment, if the publication date includes only a month and year, then a fixed day of month is assigned (e.g. the first of the month).
Conclusion
Having fully described a preferred embodiment of the invention and various alternatives, those skilled in the art will recognize, given the teachings herein, that numerous alternatives and equivalents exist which do not depart from the invention. It is therefore intended that the invention not be limited by the foregoing description, but only by the appended claims.