Metadata about a particular document, such as the author, title, and date, can be useful for several reasons. For example, search engines and document management systems can use metadata to allow the user to see when a document was authored, to contribute to relevance ranking, or to limit the search results to only documents having certain metadata, such as a date falling into a specified time range.
Unfortunately, the accuracy of the date metadata that is automatically set on documents tends to be very low. The date metadata that users typically want is the time at which the author finished writing the document, yet the date associated with documents often does not reflect this. There are several reasons for this low accuracy. One reason is that when documents are uploaded or copied to collaboration websites, the date metadata is changed from the last modification date to the upload date, which is rarely a significant or helpful date. Another common reason is that when other document metadata is changed (e.g., publication status), the last modified date can change even though no text in the document changed, and thus the date metadata does not reflect reality.
Various technologies and techniques are disclosed for calculating authorship dates for a document. A portion of the document in which to look for possible authorship dates is determined. The possible authorship dates are extracted from the portion of the document. A revised authorship date of the document is generated using a neural network.
In one implementation, a method for calculating a revised authorship date for a document is described. A possible authorship date is extracted from a document. Features are extracted for the possible authorship date. Weights are assigned to the features. An overall probability score is calculated for the features. When the overall probability score is above a pre-determined threshold, the possible authorship date is added to a list of possible authorship dates for the document. The extracting, assigning, calculating, and adding steps are repeated for a plurality of possible authorship dates. The revised authorship date is chosen from the list of possible authorship dates.
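The loop described above can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the helper callables `extract_features` and `score` and the default threshold of 0.5 are assumptions for the example.

```python
# Illustrative sketch of the extract/score/threshold/choose loop described
# above. The feature-extraction and scoring callables are hypothetical
# placeholders, and the 0.5 threshold is an assumed value.

def choose_authorship_date(candidate_dates, extract_features, score, threshold=0.5):
    """Return the candidate date with the highest probability score above
    the pre-determined threshold, or None if no candidate qualifies."""
    possible = []  # list of (date, overall probability score)
    for date in candidate_dates:
        features = extract_features(date)   # extract features for this date
        p = score(features)                 # overall probability score
        if p > threshold:                   # keep only likely authorship dates
            possible.append((date, p))
    if not possible:
        return None
    # Choose the revised authorship date, e.g. the highest-scoring candidate.
    return max(possible, key=lambda pair: pair[1])[0]
```

Any candidate whose score never clears the threshold is simply never added to the list, so the final choice is made only among plausible authorship dates.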
In another implementation, techniques for calculating an authorship date for a document when requested by a requesting application are described. A request is received from a requesting application for an authorship date for a document. The authorship date is calculated for the document using a neural network. The authorship date is sent back to the requesting application. One non-limiting example of a requesting application is a program that is displaying the document. Another non-limiting example of a requesting application includes a search engine. Yet another non-limiting example of a requesting application includes a content management application.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
FIGS. 7a-7b contain a diagrammatic view of exemplary features of one implementation that can be used to help determine whether a date should be included as a possible authorship date of a document.
The technologies and techniques herein may be described in the general context as an application that programmatically calculates an authorship date of a document, but the technologies and techniques also serve other purposes in addition to these. In one implementation, one or more of the techniques described herein can be implemented as features within any type of program or service that is responsible for calculating or requesting the authorship dates of documents.
In one implementation, techniques are described for calculating an authorship date of a given document programmatically, such as using a neural network like a single layer neural network (also called a perceptron model). A “single layer neural network” has a single layer of output nodes where the inputs are directly fed to the outputs through a series of weights. In this way, a single layer neural network is a simple kind of feed-forward network. In other words, the sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0), then the neuron fires and takes the activated value (typically 1); otherwise the neuron takes the deactivated value (typically −1).
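A single node of the perceptron model just described can be sketched in a few lines. This is a minimal illustration of the textbook behavior stated above (threshold of 0, activated value 1, deactivated value −1), not the disclosed implementation itself.

```python
# Minimal sketch of a single-layer neural network (perceptron) node, as
# described above: the neuron "fires" (returns 1) when the sum of the
# products of weights and inputs exceeds the threshold, else returns -1.

def perceptron_node(inputs, weights, threshold=0.0):
    """Compute the weighted sum of the inputs and apply the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else -1
```

Because the inputs feed directly to the output through a single layer of weights, there are no hidden layers to train, which is what makes this the simplest kind of feed-forward network.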
With respect to calculating an authorship date of a document, various features (the input criteria) can be evaluated using the neural network to determine how likely it is that each date being considered is the authorship date of the document. The resulting probability score generated for each possible date that is produced by the neural network can be used to choose the authorship date. In one implementation, the neural network is utilized by a date extraction system to determine an authorship date of a document upon request. A date extraction system utilizing a neural network is described in further detail herein.
Turning now to
In one implementation, some or all of these techniques can be used when a search engine or content management application has requested authorship date information for one or more documents. In another implementation, some or all of these techniques can be used when one or more files are being copied over a network using a file copy process to update the date metadata associated with the document so that it is more accurate. Some techniques for determining an authorship date of a document will now be described in further detail in
Once the window size selection process 254 has been performed, a rule-based candidate selection process 256 is then performed. In one implementation, candidate selection is conducted by using some rules of date expressions 258. In other words, these rules can specify the types of formats that will be searched for and considered as dates. Examples of formats within the document that may be considered as dates can include MM-DD-YYYY, MM-DD-YY, DD/MM/YYYY, DD/MM/YY, etc.
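A rule-based candidate selection step of this kind can be sketched with regular expressions, one per date format. The patterns below are illustrative stand-ins for the rules of date expressions 258; an actual implementation would likely use a larger and more careful rule set.

```python
import re

# Sketch of rule-based candidate selection: each regular expression stands
# in for one "rule of date expressions". The patterns shown are illustrative
# versions of the formats listed above, not the actual disclosed rules.
DATE_RULES = [
    re.compile(r"\b\d{2}-\d{2}-\d{4}\b"),   # MM-DD-YYYY
    re.compile(r"\b\d{2}-\d{2}-\d{2}\b"),   # MM-DD-YY
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),   # DD/MM/YYYY
    re.compile(r"\b\d{2}/\d{2}/\d{2}\b"),   # DD/MM/YY
]

def extract_date_candidates(text):
    """Return every substring of the text that matches a date rule."""
    candidates = []
    for rule in DATE_RULES:
        candidates.extend(rule.findall(text))
    return candidates
```

Every match becomes a candidate; deciding which candidate is actually the authorship date is left to the classification step that follows.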
After the rule-based candidate selection process 256 has been performed, a date classification process 260 is then performed. During the date classification process 260, a probability score is calculated for each extracted date by comparing the extracted date to various features within a neural network. The term "feature" as used herein is meant to include a criterion that is considered by the neural network and for which a result is assigned based upon an evaluation of that criterion. The use of features and a neural network to perform date classification is described in further detail in
Once all of the possible authorship dates are identified, some date normalization work can be performed to convert all date expressions into a uniform format. For example, "11/30/2007" and "30 November 2007" could each be converted into "Nov. 30, 2007". The revised authorship date of the document 264 can then be selected from the complete list of possible authorship dates, such as the one having the highest probability score from the neural network analysis. The process can be repeated for multiple documents when applicable, such as when a requesting application is asking for revised authorship dates for multiple documents. Each of these steps will now be described in further detail in
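The normalization step can be sketched by parsing each expression against a list of known input formats and re-emitting it in one uniform form. The particular input formats and the "Mon. DD, YYYY" output form below are assumptions for illustration.

```python
from datetime import datetime

# Sketch of date normalization: try each known input format in turn and
# emit a single uniform representation. The format list and the uniform
# output style are assumed for this example.
INPUT_FORMATS = ["%m-%d-%Y", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y"]

def normalize_date(expression):
    """Convert a date expression in any known format to 'Mon. DD, YYYY'."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(expression, fmt)
        except ValueError:
            continue  # this rule did not match; try the next format
        return parsed.strftime("%b. %d, %Y")
    return None  # expression did not match any known format
```

With all candidates in one uniform format, duplicate expressions of the same date collapse together, which simplifies choosing the highest-scoring candidate.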
Weights are then given to the features (stage 306) so that some features are given a higher priority than others. An overall probability score is then calculated for the date (stage 308), as is described in further detail in
The sum of all the multiplied values is passed to an activation function (f) 408 to produce an output. The activation function (f) 408 then produces a single probability score, which indicates how the particular date scored across all the various features (criteria) considered (i.e., how likely that date is to be the "authorship date" of the document). Numerous examples of criteria that can be evaluated to determine the likelihood that a given date is the authorship date are shown in
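The weighted-sum-plus-activation computation just described can be sketched as follows. The source does not specify which activation function is used, so a logistic sigmoid is assumed here because it maps the weighted sum into (0, 1), which reads naturally as a probability score.

```python
import math

def probability_score(feature_values, weights, activation=None):
    """Pass the weighted sum of feature values through an activation function.
    A logistic sigmoid is an assumption here (the source does not name the
    activation): it maps any weighted sum into (0, 1), so the output can be
    read as a probability score for the candidate date."""
    if activation is None:
        activation = lambda s: 1.0 / (1.0 + math.exp(-s))
    weighted_sum = sum(v * w for v, w in zip(feature_values, weights))
    return activation(weighted_sum)
```

Features that argue for the date push the weighted sum up and the score toward 1; features that argue against it push the score toward 0.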
FIGS. 7a-7b contain a diagrammatic view 450 of exemplary features of one implementation that can be used to help determine whether a date should be included as a possible authorship date of a document. An attribute ID 452 is shown, along with a feature ID 454 and a description 456. The attribute ID 452 is a unique identifier for a set of features being evaluated. Each attribute ID 452 can contain multiple feature IDs. For example, attribute ID 1001 (458) is shown with two feature IDs, 305 (460) and 306 (462). If the date being evaluated is a four-digit number, then the feature ID 305 (460) would evaluate to true, and the feature ID 306 (462) would evaluate to false. This is an example of a "true/false" feature set that can be evaluated.
Instead of or in addition to "true/false" feature sets, feature sets containing ranges or buckets of criteria that are being evaluated can also be used. Take attribute ID 2001 for example. Attribute ID 2001 has six different feature IDs assigned to it, starting with 5 (464) and ending with 10 (466). Feature ID 5 (464) may be used to hold a true evaluation for the number of characters in the previous line falling into the range of zero to ten. Feature ID 10 (466) may be used to hold a true evaluation for the number of characters in the previous line falling into the range of forty-five characters and higher. The features in between feature ID 5 (464) and feature ID 10 (466) may cover the ranges in between. The "true/false" feature sets and the "ranges or buckets" feature sets are just two non-limiting examples of the types of feature sets that can be used by the single layer neural network to evaluate how likely a given date being evaluated is to be the authorship date. These are provided for the sake of illustration, and any other type of feature that could be evaluated by a single layer neural network could also be used in other implementations.
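The two feature-set styles described above can be sketched as small encoding functions. The feature IDs (305, 306, and 5 through 10) come from the example above, but the exact bucket boundaries between the stated 0-10 and 45+ ranges are assumptions for this illustration.

```python
# Sketch of the two feature-set styles described above. Feature IDs 305/306
# and 5-10 follow the example in the text; the intermediate bucket
# boundaries are assumed, since only the first and last ranges are stated.

def encode_four_digit_feature(date_text):
    """'True/false' feature pair: is the candidate a four-digit number?"""
    is_four_digit = date_text.isdigit() and len(date_text) == 4
    return {305: is_four_digit, 306: not is_four_digit}

def encode_prev_line_length_feature(prev_line_length):
    """Bucket feature set over the previous line's character count:
    feature ID 5 covers 0-10 characters, feature ID 10 covers 45 and
    higher, and IDs 6-9 cover assumed ranges in between."""
    buckets = [(5, 0, 10), (6, 11, 20), (7, 21, 30), (8, 31, 40), (9, 41, 44)]
    features = {fid: lo <= prev_line_length <= hi for fid, lo, hi in buckets}
    features[10] = prev_line_length >= 45
    return features
```

Exactly one feature in each set evaluates to true for any given candidate, so each set contributes one active input (with its associated weight) to the weighted sum.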
As shown in
Additionally, device 500 may have additional features/functionality. For example, device 500 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in
Computing device 500 includes one or more communication connections 514 that allow computing device 500 to communicate with other computers/applications 515. Device 500 may also have input device(s) 512 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 511 such as a display, speakers, printer, etc. may also be included. These devices are well known in the art and need not be discussed at length here.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. All equivalents, changes, and modifications that come within the spirit of the implementations as described herein and/or by the following claims are desired to be protected.
For example, a person of ordinary skill in the computer software art will recognize that the examples discussed herein could be organized differently on one or more computers to include fewer or additional options or features than as portrayed in the examples.