Search engines seek to identify documents among a set of documents that are the most relevant to a user-specified text string called a search query, or simply a query. While it is technically possible for search engines to compare each query to the entirety of the document set, in practice they generally apply each query to a search index compiled for the search engine by reading and analyzing the documents of the set. The contents of the documents of the set are often collected for representation in indices by programs associated with the search engine called “crawlers.”
Many of the techniques used to construct and apply search indices are tailored toward matching the documents of the set that literally contain words and multi-word phrases included in the query.
The inventors have recognized significant disadvantages in the operation of conventional search engines. First, while conventional indices are sometimes constructed to include document attributes automatically inferred from the content of documents, in practice such inference proves limited and frequently inaccurate. Accordingly, queries that seek to match documents having particular attributes are often unsuccessful. Additionally, even where a conventional search engine provides some limited ability to infer the values of certain document attributes, its querying user interface often lacks support that would enable users to explicitly specify a particular value for a particular attribute.
Also, in typical cases, documents can be added to a document set and included in search results—such as by publishing them anywhere on the Internet—without being subject to any level of quality control, leading to the undetected inclusion of inaccurate, outdated, redundant, unclear, and/or otherwise unhelpful documents in search results.
In response to recognizing these disadvantages, the inventors have conceived and reduced to practice a software and/or hardware facility for searching against attribute values of documents that are explicitly specified as part of the process of publishing the documents (“the facility”). In some embodiments, the facility enables an editor to specify a manifest template identifying different kinds of document attributes; the manifest template is populated by the publisher of each document with the document's values for these attributes, to create an attribute manifest specifying the document attribute values of the document, also called its metadata. Instead of or in addition to reading the literal contents of the documents of the set, the crawler consumes the attribute manifests. The facility uses the index produced from this crawling to service queries that explicitly specify certain values of certain document attributes. In some embodiments, in one or more ways, the facility is particularly adapted to documents that contain, reference, and/or completely embody structured or unstructured data sets, such as healthcare data sets. For example, in some embodiments, the facility's crawler is designed to digest and faithfully index the contents of such data sets. In some embodiments, the crawler follows links in a document's manifest or in the contents of the document to data sets and other information resources associated with the document to index those data sets and other information resources in connection with the document.
In various embodiments, the document attributes that are available for inclusion in the manifest template—and therefore available to specify values for in the manifests of individual documents—include title, description, author identity, author contact information, owner identity, owner contact information, publication date, effective date, category, hierarchy node, type of included or associated data, source of included or associated data, lineage of included or associated data showing the path this data has taken to the document, examples of included or associated data, links or pointers to included or associated data, associated application programming interfaces, information about access, copying, or other use of the document, etc.
In some embodiments, the facility enables the augmentation of a document's manifest with various additional information. For example, in some embodiments, the facility provides a “vouching” process for approving the content of a document. When a particular person vouches for a document, the facility adds to the document's manifest an indication of this vouching that identifies the vouching person. This vouching establishes trust in meritorious documents and data sets, and encourages the use both of (1) these documents and data sets, and (2) a source of documents and data sets that explicitly surfaces this form of trust, i.e., the source operated by the facility.
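As a rough illustration of the vouching augmentation described above, the following sketch records a vouch in a manifest held as a Python dictionary. The dictionary representation, the `vouched_by` key, the `add_vouch` function, and the sample title are all assumptions made for illustration; the specification does not prescribe how the indication is stored.

```python
from datetime import datetime, timezone


def add_vouch(manifest: dict, person_id: str) -> dict:
    """Return a copy of the manifest with a vouch by person_id recorded.

    'vouched_by' is a hypothetical field name, not one defined by the facility.
    """
    updated = dict(manifest)
    vouches = list(updated.get("vouched_by", []))
    # Record each vouching person at most once.
    if all(v["person"] != person_id for v in vouches):
        vouches.append({
            "person": person_id,
            "vouched_at": datetime.now(timezone.utc).isoformat(),
        })
    updated["vouched_by"] = vouches
    return updated


original = {"TITLE": "Sample data set"}
manifest = add_vouch(original, "jdoe")
manifest = add_vouch(manifest, "jdoe")  # a repeated vouch is ignored
print([v["person"] for v in manifest["vouched_by"]])
```

Returning a copy rather than mutating the input keeps the unvouched manifest available, which may be convenient if the registry needs to compare versions.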
In some embodiments, the facility provides a certification process for specifying a certification level for a document, such as by a human certifier or an automatic certification process. In some embodiments, each certification level specifies a subset of the attributes; if the manifest for a document contains values for all of the attributes in one of these subsets, an automatic process qualifies the document for the corresponding certification level. In some embodiments, the facility enables the fields specified for each certification level to be separately specified by and for each organization using the facility. Such a certification system incentivizes document publishers to populate a document's manifest more fully with values for the attributes most valuable to document searchers. This certification level, too, is added to the document's manifest. By making these kinds of validation information available via the search process, an organization can enable the use of high-quality information in its decision-making processes.
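The automatic qualification rule described above can be sketched as a subset check. The level names, the attribute names, and the choice to report the last qualifying level in list order are assumptions for illustration; an organization using the facility would supply its own levels.

```python
# Hypothetical certification levels, listed from lowest to highest.
# Each level names the attributes whose values must be present.
CERTIFICATION_LEVELS = [
    ("BRONZE", {"TITLE", "DESCRIPTION"}),
    ("SILVER", {"TITLE", "DESCRIPTION", "AUTHOR", "PUBLICATION_DATE"}),
    ("GOLD", {"TITLE", "DESCRIPTION", "AUTHOR", "PUBLICATION_DATE", "LINEAGE"}),
]


def has_value(manifest, attribute):
    """True if the manifest carries a non-empty value for the attribute."""
    value = manifest.get(attribute)
    return value is not None and value != ""


def certification_level(manifest, levels=CERTIFICATION_LEVELS):
    """Return the last listed level whose attributes are all populated, or None."""
    qualified = None
    for name, required in levels:
        if all(has_value(manifest, a) for a in required):
            qualified = name
    return qualified


manifest = {
    "TITLE": "Sample data set",
    "DESCRIPTION": "Illustrative manifest",
    "AUTHOR": "A. Author",
    "PUBLICATION_DATE": "2020-01-01",
}
print(certification_level(manifest))  # SILVER
```

Passing a different `levels` list per organization mirrors the per-organization configuration mentioned in the text.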
In some embodiments, the facility makes information added to documents' manifests via any supported mechanism or process available for querying. In some embodiments, the facility constructs, based on the contents of the manifest template, a user interface for entering an attribute-specific query and exploring its results. In some embodiments, the facility allows a user to filter or sort a search result using any information in the manifests of the documents included in that search result.
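Filtering and sorting a search result by manifest information might look like the following sketch, which treats a search result as a list of manifest dictionaries. The function names, attribute names, and sample values are hypothetical.

```python
def filter_results(results, **criteria):
    """Keep result manifests whose attributes equal every given criterion."""
    return [m for m in results if all(m.get(k) == v for k, v in criteria.items())]


def sort_results(results, attribute, descending=False):
    """Sort result manifests by one attribute; manifests lacking it go last."""
    present = [m for m in results if attribute in m]
    missing = [m for m in results if attribute not in m]
    return sorted(present, key=lambda m: m[attribute], reverse=descending) + missing


results = [
    {"TITLE": "A", "TYPE": "STRUCTURED", "PUBLICATION_DATE": "2020-03-01"},
    {"TITLE": "B", "TYPE": "MIXED", "PUBLICATION_DATE": "2021-06-15"},
    {"TITLE": "C", "TYPE": "STRUCTURED", "PUBLICATION_DATE": "2019-11-30"},
    {"TITLE": "D", "TYPE": "STRUCTURED"},
]

structured = filter_results(results, TYPE="STRUCTURED")
newest_first = sort_results(structured, "PUBLICATION_DATE", descending=True)
print([m["TITLE"] for m in newest_first])  # ['A', 'C', 'D']
```

The dates are kept as ISO 8601 strings so that plain string comparison orders them chronologically.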
By operating in some or all of the ways described herein, the facility makes it possible for: an organization to specify document attributes that are available to describe and search for documents; a document's publisher to publish the document in customary ways, and explicitly describe it using values of the attributes specified by or for the organization; approvers and certifiers to weigh in on each document's level of quality, accuracy, helpfulness, currency, etc.; and/or a searching user to discover and explore documents whose attribute values match those specified by the searching user.
Additionally, the facility improves the functioning of computer or other hardware, such as by reducing the dynamic display area, processing, storage, and/or data transmission resources needed to perform a certain task, thereby enabling the task to be performed by less capable, capacious, and/or expensive hardware devices, and/or be performed with lesser latency, and/or preserving more of the conserved resources for use in performing other tasks. For example, by enabling the explicit specifying of attribute values, the facility relieves the index-builder of the processing resource burden of performing inference to predict those attribute values. Also, by fulfilling queries that more precisely specify a querying user's intentions about certain document attributes, the facility avoids the processing resource burden of processing follow-up queries entered by querying users when initial queries fail to satisfy their needs. Also, by surfacing higher-quality documents that are more responsive to a query, the facility reduces the network resources needed to retrieve larger numbers of documents identified in a query result, only to discover that they are unhelpful.
In some embodiments, the facility uses the manifest template to generate a visual user interface that can be used by a data producer or their representative to enter values of the supported document attributes in order to create a manifest file for a particular document.
Either periodically or continuously, a crawler 241 incorporated in a data discovery engine 240 (such as Apache Solr) reads the manifest files stored by the data discovery registry. In some embodiments, the crawler also reads the documents themselves in the document repository or repositories, and/or the data sets referenced by the manifests and/or contained in or referenced by the documents stored in the repositories. From the information collected by this crawling, the data discovery engine generates and/or updates a search index 242 that associates the identity of different documents with data read about them by the crawler, including document contents, as well as document attributes read from the manifest. When a searching user submits a search query to a search engine 243 of the data discovery engine, the query explicitly specifies values for one or more of the document attributes. The search engine applies the query against the search index to generate a search result, which it returns to the searching user. The searching user can review the search result, and select documents from it to retrieve and/or view from the document repositories in which they are stored. Additional details about this process are provided below.
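The crawl, index, and query flow described above can be illustrated with a small in-memory stand-in. This is not how Apache Solr or any particular engine is implemented; the functions, document identifiers, and attribute values below are assumptions, and manifests are supplied directly as dictionaries rather than read from a registry.

```python
from collections import defaultdict


def build_index(manifests):
    """Map each (attribute, value) pair to the ids of documents whose manifest carries it."""
    index = defaultdict(set)
    for doc_id, manifest in manifests.items():
        for attribute, value in manifest.items():
            # Values are lowercased so that matching is case-insensitive.
            index[(attribute, str(value).lower())].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents matching every attribute value in the query."""
    matches = None
    for attribute, value in query.items():
        ids = index.get((attribute, str(value).lower()), set())
        matches = set(ids) if matches is None else matches & ids
    return sorted(matches or [])


manifests = {
    "doc-1": {"TYPE": "STRUCTURED", "CATEGORY": "Claims"},
    "doc-2": {"TYPE": "STRUCTURED", "CATEGORY": "Pharmacy"},
    "doc-3": {"TYPE": "MIXED", "CATEGORY": "Claims"},
}

index = build_index(manifests)
print(search(index, {"TYPE": "structured", "CATEGORY": "Claims"}))  # ['doc-1']
```

Each query term narrows the candidate set by intersection, which is one simple way to honor a query that specifies values for several attributes at once.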
In act 302, the facility populates and submits a manifest for the data package. In some embodiments, the facility supports population of the document manifest in accordance with a document manifest template. In various embodiments, the manifest template is represented in different ways. As examples, the document manifest template may be a table that, for each included document attribute, specifies the attribute's name and data type or valid values; a document definition in a markup or data-interchange format such as XML or JSON; etc. Table 1 below shows a sample manifest template expressed in XML.
The template spans lines 1-121 of the table. The template defines its first attribute in lines 2-6, representing the document's title. In lines 3-5, the template specifies that the attribute's name is “TITLE,” its type is “TEXT,” and it is a required attribute—that is, each manifest must contain a value for it.
In lines 60-64, the manifest template defines a Data Store attribute whose value points to the storage location of the document/data package, which can be used by the crawler to (1) access the document/data package for indexing, and (2) refer to this document/data package in the index.
In various embodiments, the template can specify attributes of various types. One example is an attribute named “Type,” of a type called “Choice,” that is established in lines 32-42. In lines 36-39, the template specifies four different possible values of this document type attribute, from which one must be selected: “STRUCTURED,” “SEMISTRUCTURED,” “UNSTRUCTURED,” and “MIXED.”
In some embodiments, the template can specify that a particular document attribute—a “conditional attribute”—is to be used in a manifest only where a particular condition is satisfied. For example, in lines 43-54 the sample template specifies that an “Expire Date” attribute can be populated only if the value of a “Have Expiration” attribute is populated with the value true.
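The required, choice, and conditional attribute rules discussed above can be sketched as a manifest check. The template is represented here as a list of Python dictionaries rather than XML; that representation, the key names (`required`, `choices`, `condition`), the `BOOLEAN` and `DATE` type names, and the treatment of the Type attribute as required are assumptions made for illustration.

```python
# Hypothetical in-memory form of a manifest template.
TEMPLATE = [
    {"name": "TITLE", "type": "TEXT", "required": True},
    {"name": "Type", "type": "CHOICE", "required": True,
     "choices": ["STRUCTURED", "SEMISTRUCTURED", "UNSTRUCTURED", "MIXED"]},
    {"name": "Have Expiration", "type": "BOOLEAN", "required": False},
    # Conditional attribute: allowed only when "Have Expiration" is True.
    {"name": "Expire Date", "type": "DATE", "required": False,
     "condition": ("Have Expiration", True)},
]


def validate(manifest, template=TEMPLATE):
    """Return a list of problems found; an empty list means the manifest conforms."""
    problems = []
    for attr in template:
        name = attr["name"]
        present = name in manifest and manifest[name] not in (None, "")
        if attr.get("required") and not present:
            problems.append(f"missing required attribute: {name}")
        if not present:
            continue
        if "choices" in attr and manifest[name] not in attr["choices"]:
            problems.append(f"invalid value for {name}: {manifest[name]!r}")
        if "condition" in attr:
            other, expected = attr["condition"]
            if manifest.get(other) != expected:
                problems.append(f"{name} requires {other} to be {expected}")
    return problems


good = {"TITLE": "Sample data set", "Type": "STRUCTURED"}
bad = {"Type": "TABULAR", "Expire Date": "2025-01-01"}
print(validate(good))
for problem in validate(bad):
    print(problem)
```

Collecting every problem instead of stopping at the first lets a publisher correct a manifest in one pass.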
In some embodiments, the data producer uses the manifest template to generate a manifest for a new document and submits it programmatically to the data discovery registry, or causes it to be stored in a particular file system folder designated for the storage of manifests. In some embodiments, the facility uses the manifest template to generate a visual user interface designed to facilitate the population of a manifest for a new document by a user.
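One way a template could drive a data-entry interface is to map each attribute to a form control. The sketch below emits plain HTML; the type-to-control mapping, the dictionary form of the template, and the function names are assumptions, and the interface the facility actually generates may differ.

```python
from html import escape

# Hypothetical mapping from template attribute types to HTML input types.
INPUT_TYPES = {"TEXT": "text", "DATE": "date", "BOOLEAN": "checkbox"}


def render_field(attr):
    """Render one template attribute as a labeled HTML form control."""
    name = escape(attr["name"], quote=True)
    required = " required" if attr.get("required") else ""
    if attr["type"] == "CHOICE":
        options = "".join(f"<option>{escape(c)}</option>" for c in attr["choices"])
        control = f'<select name="{name}"{required}>{options}</select>'
    else:
        input_type = INPUT_TYPES.get(attr["type"], "text")
        control = f'<input type="{input_type}" name="{name}"{required}>'
    return f"<label>{name} {control}</label>"


def render_form(template):
    """Render a whole manifest template as an HTML form."""
    fields = "\n".join(render_field(a) for a in template)
    return f"<form>\n{fields}\n</form>"


template = [
    {"name": "TITLE", "type": "TEXT", "required": True},
    {"name": "Type", "type": "CHOICE",
     "choices": ["STRUCTURED", "SEMISTRUCTURED", "UNSTRUCTURED", "MIXED"]},
]
print(render_form(template))
```

Because the form is derived from the template, adding an attribute to the template would add a matching field without further interface work.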
Table 2 below shows a sample document manifest. The manifest in Table 2 has been generated using the user interface shown in
Those skilled in the art will appreciate that the acts shown in
In some embodiments, selection of certain portions of the document's visual indication in the query result causes the display of a result card containing more extensive information about that document.
The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet are incorporated herein by reference, in their entirety. Aspects of the embodiments can be modified, if necessary, to employ concepts of the various patents, applications and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.