The disclosed embodiments relate generally to information retrieval systems and methods, and more particularly to a system and method for harmonizing content relevancy across structured and unstructured data.
With the proliferation of corporate networks and the Internet, an ever-increasing amount of information is being made available in electronic form. Such information includes documents, graphics, video, audio, or the like. While corporate information is typically well indexed and stored in corporate databases within a corporate network, information on the Internet is generally highly disorganized.
Searchers looking for information typically make use of an information retrieval system. In corporate networks, such an information retrieval system typically consists of document management software, such as Applicant's QUANTUM™ suite, or iManage Inc's INFORITE™ or WORKSITE™ products. Information retrieval from the Internet, however, is typically undertaken using a search engine, such as YAHOO™ or GOOGLE™.
Generally speaking, these information retrieval systems extract keywords from each document in a network. Such keywords typically contain no semantic or syntactic information. For each document, each keyword is then indexed into a searchable data structure with a link back to the document itself. To search the network, a user supplies the information retrieval system with a query containing one or more search terms, which may be separated by Boolean operators, such as “AND” or “OR.” These search terms can be further expanded through the use of a thesaurus. In response to the query, which may have been expanded, the information retrieval system attempts to locate information, such as documents, that matches the searcher-supplied (or expanded) keywords. In doing so, the information retrieval system searches through its databases to locate documents that contain at least one keyword matching one of the search terms in the query (or its expanded version). The information retrieval system then presents the searcher with a list of document records for the documents located. The list is typically sorted based on document ranking, where each document is ranked according to the number of keyword to search term matches in that document relative to those for the other located documents. An example of a search engine that uses such a technique, where document relevancy is based solely on the content of the document, is INTELISEEK™. However, most documents retrieved in response to such a query have been found to be irrelevant.
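For purposes of illustration only, such keyword indexing and match-count ranking may be sketched as follows (a minimal example; the index structure, whitespace tokenization, and OR semantics are assumptions, not features of any particular prior art system):

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword in text.lower().split():
            index[keyword].add(doc_id)
    return index

def search(index, terms):
    """Rank documents by the number of search terms they match (OR semantics);
    relevancy here depends only on document content, not on the searcher."""
    hits = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term.lower(), ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
```

As the neurologist/student example below illustrates, such a ranking returns the same list to every searcher.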
In an attempt to improve precision, a number of advanced information retrieval techniques have been developed. These techniques include syntactic processing, natural language processing, semantic processing, or the like. Details of such techniques can be found in U.S. Pat. Nos. 5,933,822; 6,182,068; 6,311,194; and 6,199,067, all of which are incorporated herein by reference.
However, even these advanced information retrieval techniques have not been able to reach the level of precision required by today's corporations. In fact, a recent survey found that forty-four percent of users say that they are frustrated with search engine results. See Internet Usage High, Satisfaction Low: Web Navigation Frustrates Many Consumers, Berrier Associates, sponsored by Realnames Corporation (April 2000).
In addition, other advanced techniques have also proven to lack adequate precision. For example, GOOGLE™ and WISENUT™ rank document relevancy as a function of a network of links pointing to the document, while methods based on Salton's work (such as ORACLE™ text) rank document relevancy as a function of the number of relevant documents within the repository.
This lack of precision is at least partially caused by current information retrieval systems failing to take the personal profiles of the document creator, searcher, and any contributors into account. In other words, when trying to assess the relevancy of documents within a network, most information retrieval systems ignore the searcher who performs the query, i.e., most information retrieval systems adopt a one-size-fits-all approach. For example, when a neurologist and a high school student both perform a search for “brain AND scan,” an identical list of located documents is presented to both the neurologist and the high school student. However, the neurologist is interested in high-level documents containing detailed descriptions of brain scanning techniques, while the student is only interested in basic information on brain scans for a school project. As can be seen, a document query that does not take the searcher into account can retrieve irrelevant and imprecise results.
Moreover, not only should the profession of a searcher affect a search result, but also the expertise of the searcher within the search domain. For example, a medical doctor who is a recognized world expert would certainly assign different relevancy scores to the returned documents than, say, an intern would. This means that information retrieval systems should be highly dynamic and consider the current expertise level of the searcher and/or creator/s at the time of the query.
In addition, the current lack of precision is at least partially caused by the treatment of documents as static entities. Current information retrieval techniques typically do not take into account the dynamic nature of documents. For example, after creation, documents may be commented on, printed, viewed, copied, etc. To this end, document relevancy should consider the activity around a document.
Another problem encountered with conventional information retrieval techniques is the handling of structured and unstructured data. An example of unstructured data is free form text. An example of structured data is data organized into one or more fields having one or more restrictions. For example, an unrestricted field can include whatever values the owner wants to provide. A restricted field, however, can include content which is constrained to a controlled vocabulary, size, or other parameter. In conventional information retrieval systems, documents or other data objects are searched for keywords without considering whether the documents are structured or unstructured. Since structured documents may be more relevant to a searcher than unstructured documents, the nature of the document structure should be considered when determining its relevancy to a search request.
Therefore, a need exists in the art for a system and method for retrieving information that can yield a significant improvement in precision over that attainable through conventional information retrieval systems. Moreover, such a system and method should personalize information retrieval based on user expertise and whether the information is structured or unstructured.
A search request including one or more search terms is received from a requester. In response to the search request, one or more objects that fulfill the search request are located. A relevancy score is computed for each object based on whether the object includes structured or unstructured data. The relevancy scores enable the requester to determine the content relevancy of the located objects.
In some embodiments, a method for retrieving information includes receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data. The object can include structured data and the relevancy score can be determined at least in part on whether the structured data includes restricted or unrestricted fields. The restricted fields can be associated with a first modifier value and the unrestricted fields can be associated with a second modifier value. The first and second modifier values can be the same values or different values. The modifier values can be adjustable and expandable.
In some embodiments, an information retrieval system includes a processor and a memory coupled to the processor. The memory includes instructions, which, when executed by the processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
In some embodiments, a computer-readable medium includes instructions, which, when executed by a processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
In some embodiments, the relevancy score is determined based on the inclusion of structured or unstructured data and on one or more other factors, such as, for example, the expertise of users/searchers, creators and/or requesters.
The repository 104 is any storage device/s that is capable of storing data, such as a hard disk drive, magnetic media drive, or the like. The repository 104 is preferably contained within the information retrieval system 102, but is shown as a separate component for ease of explanation. Alternatively, the repository 104 may be dispersed throughout a network, and may even be located within the searcher device 108, creator device/s 106, and/or contributor device/s 112.
Each creator device 106 is a computing device operated by a creator who creates one or more documents. Each contributor device 112 is a computing device operated by a contributor who contributes to a document by, for example, adding to, commenting on, viewing, or otherwise accessing documents created by a creator/s. The searcher device 108 is a computing device operated by a searcher who is conducting a search for relevant documents created by the creator/s or contributed to by the contributor/s. The searcher, creator/s, and contributor/s are not limited to the above described roles and may take on any role at different times. Also, the searcher, creator/s, and contributor/s may browse the repository 104 without the use of the information retrieval system 102.
Memory 214 preferably includes an operating system 216, such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™ having instructions for processing, accessing, storing, or searching data, etc. Memory 214 also preferably includes communication procedures for communicating with the network 110 (
Memory 308 preferably includes an operating system 312, such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™ having instructions for processing, accessing, storing, or searching data, etc. Memory 308 also preferably includes communication procedures 314 for communicating with the network 110 (
The collection engine 316 comprises a keyword extractor or parser 318 that extracts text and/or keywords from any suitable document, such as an ASCII or XML file, Portable Document Format (PDF) file, word processing file, or the like. The collection engine 316 also preferably comprises a concept identifier 320. The concept identifier 320 is used to extract the document's important concepts. The concept identifier may be a semantic, syntactic, or linguistic engine, or the like. In a preferred embodiment, the concept identifier 320 is a semantic engine, such as TEXTANALYST™ made by MEGAPUTER INTELLIGENCE™ Inc. Furthermore, the collection engine 316 also preferably comprises a metadata filter 322 for filtering and/or refining the concept/s identified by the concept identifier 320. Once the metadata filter 322 has filtered and/or refined the concept/s, metadata about each document is stored in the repository 104. Further details of the processes performed by the collection engine 316 are discussed in relation to
The search engine 324 is any standard search engine, such as a keyword search engine, statistical search engine, semantic search engine, linguistic search engine, natural language search engine, or the like. In a preferred embodiment, the search engine 324 is a semantic search engine.
The expertise adjustment procedures 326 are used to adjust an object's intrinsic score to an adjusted score based on the expertise of the searcher, creator/s, and/or contributor/s. The content relevancy harmonizer 325 is used to adjust the objects field intrinsic score to an adjusted score based on whether the object includes structured and/or unstructured data. The expertise adjustment procedures 326 and the content relevancy harmonizer 325 are described in further detail below in relation to
A file collection 328(1)-(N) is created in the repository 104 for each object input into the system, such as a document or source. Each file collection 328(1)-(N) preferably contains: metadata 330(1)-(N), such as associations between keywords, concepts, or the like; content 332(1)-(N), which is preferably ASCII or XML text or the content's original format; and contributions 334(1)-(N), such as contributor comments or the like. At a minimum, each file collection contains content 332(1)-(N). The repository 104 also contains user profiles 336(1)-(N) for each user, i.e., each searcher, creator, or contributor. Each user profile 336(1)-(N) includes associated user activity, such as which files a user has created, commented on, opened, printed, viewed, or the like, and links to various file collections 328(1)-(N) that the user has created or contributed to. Further details of use of the repository 104 are discussed in relation to
The document, source, and/or other data is then sent to the information retrieval system 102 (
Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 406 in the repository 104 as part of a file collection 328(1)-(N) (
In a preferred embodiment, the concept identifier 320 (
At any time, contributors can supply their contributions, at step 416, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 418 and stored in the repository at step 420, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 403-414.
The search is preferably conducted to locate objects. Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like. A search for documents returns a list of relevant documents, and a search for experts returns a list of experts with expertise in the relevant field. A search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
The search is received at step 504 by the information retrieval system 102 (
The search engine 324 (
The intrinsic score is then adjusted to an adjusted score by the expertise adjustment procedures 326, at step 512. This adjustment takes the expertise of the creator/s, searcher, and/or contributor/s into account, as described in further detail below.
Once the intrinsic score has been adjusted to an adjusted score, a list of the located objects is sorted at step 514. The list may be sorted by any field, such as by adjusted score, intrinsic score, source, most recently viewed, creator expertise, etc. The list, preferably containing a brief record for each located object, is then transmitted to the searcher device 108 (
Preferred algorithms for adjusting the intrinsic score (step 512 of
Expertise Adjustment when Searching for Documents
Search term(s) entered by the searcher may or may not be extended to form a query. Such possible extensions, include but are not limited to, synonyms or stemming of search term(s). Once the intrinsic score has been calculated according to step 510 above, the adjusted score (RS_ADJ) for each located document is calculated as follows:
RS_ADJ = Intrinsic Document Score + Expertise Adjustment = IDS + EA   (1)
where the Intrinsic Document Score (IDS) is a weighted average between a Document Content Score (DCS) and a Comments Content Score (CCS).
IDS = a*DCS + (1−a)*CCS   (2)
with “a” being a number between 0 and 1 and determining the importance of the content of a document relative to the content of its attached comments.
The DCS and CCS are calculated by any suitable methodology or technique; existing search engine algorithms may be used to fulfill this task. Also note that the DCS and CCS are not influenced by the searcher who entered the query. In this embodiment, the DCS and CCS can be any number between 2 and 100. The Expertise Adjustment (EA) is calculated as follows:
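For illustration, the weighted average of formula (2) may be sketched as follows (the default value of "a" is an assumption for the example; any value in [0, 1] is permitted):

```python
def intrinsic_document_score(dcs, ccs, a=0.5):
    """IDS = a*DCS + (1-a)*CCS per formula (2), where "a" weights the
    document's own content against the content of its attached comments.
    DCS and CCS are assumed to already lie in the range [2, 100]."""
    assert 0.0 <= a <= 1.0, "a must be between 0 and 1"
    return a * dcs + (1 - a) * ccs
```

Setting a=1 (as in the worked example later in this description) makes the IDS depend on the document content alone.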
EA = DCE + CCE   (3)
where DCE is the Document Creator Expertise adjustment and CCE is the Comments Contributors Expertise adjustment. The DCE adjustment takes into account all activity performed by a given user and is computed as follows:
DCE = R1(DCS) * W1(RS_EXP_ABS)   (4)
where R1(DCS) determines the maximal amount of the expertise adjustment, or, in other words, the range for the alteration due to the expertise of the creator of the document. This depends on the level of the DCS. The range function is given by:
Extreme intrinsic scores, i.e., scores near 2 or 100, are less influenced than scores near the middle, i.e., scores near 50. The maximum possible change in a score is 20 when DCS=50 and linearly decreases to 10 when DCS=100 or 2.
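The range function itself is not reproduced in this excerpt; a linear reconstruction consistent with the stated endpoints (R1(50) = 20, R1(2) = R1(100) = 10) may be sketched as follows. The piecewise-linear form is an assumption based only on those endpoints:

```python
def r1(dcs):
    """Range function R1(DCS): the maximal expertise adjustment available
    for a document with intrinsic score DCS in [2, 100]. Reconstructed
    linearly from the stated endpoints: 20 at DCS=50, decreasing to 10
    at both extremes (DCS=2 and DCS=100)."""
    if dcs <= 50:
        return 10 + 10 * (dcs - 2) / 48   # rises from 10 at DCS=2 to 20 at DCS=50
    return 10 + 10 * (100 - dcs) / 50     # falls from 20 at DCS=50 to 10 at DCS=100
```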
W1(RS_EXP_ABS) determines what percentage of available range R1(DCS), positively or negatively, is considered for adjusting the intrinsic score. It is given by:
where RS_EXP_ABS denotes the absolute relevance score of a user, that is, the user expertise, be it searcher expertise, creator expertise, or contributor expertise. The calculation of RS_EXP_ABS occurs as follows:
RS_EXP_ABS = 3 * F(User contribution) * G(Company expertise) * H(Query specificity)   (7)
where F (User contribution) accounts for the relevancy of all contributions made by the user, considering all documents created, all comments contributed, and the user's definition of his or her folders within the software. These folders (private or public) constitute the user's personal taxonomy. G (Company expertise) accounts for the company expertise about the query, i.e., whether a few or most employees in a company have produced something relevant to the query. H (Query specificity) accounts for the specificity of the query within the repository, i.e., whether many or just a few file collections were created.
In detail:
where the first sum is over all relevant documents and the second sum is over all non-relevant documents that possessed a relevant comment, i.e., the comment was relevant but not the document. (DCS)i is the intrinsic document relevancy score attained for the i-th relevant document. Also, Wi,max is the user activity measure. Ci is calculated as follows:
and is the reward assigned to matching comments made on documents, relevant or not. A matched comment is not necessarily attached to a relevant document.
Wi,max accounts for the type of contribution (such as but not limited to creation, commenting, or highlighting). In short, Wi,max is the maximum of the following weights (if applicable): Wi,edit=1, if the user created or edited the i-th file collection,
Wi,comment = 0.5*Max(0, 7 − Mincomments(Level))/6,
if the user commented on the i-th file collection. Since these comments are organized in a threaded discussion, the weight also depends on how remote a comment is from the file collection itself. For example, a comment on a comment on a comment to the original file collection will receive a lesser weight than a comment on the original file collection. In the formula, Level measures how remote the comment is from the file collection. The least remote comment is taken into consideration as long as it is closer than six comments away from the parent file collection.
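A sketch of the Wi,max selection, assuming the comment weight decays with Level as described above (treating Level 1 as a comment directly on the file collection; that convention is an assumption for illustration):

```python
def contribution_weight(edited, comment_levels):
    """Wi,max: the maximum of the applicable weights for the i-th file
    collection. Creating or editing gives weight 1.0; commenting gives
    0.5*max(0, 7 - min_level)/6, so a comment's weight shrinks the more
    remote it is, reaching 0 once it is six or more comments away."""
    weights = []
    if edited:
        weights.append(1.0)               # Wi,edit
    if comment_levels:
        min_level = min(comment_levels)   # least remote comment counts
        weights.append(0.5 * max(0, 7 - min_level) / 6)  # Wi,comment
    return max(weights, default=0.0)
```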
The taxonomy in this preferred embodiment stands for folder names. Each user has built some part of the repository by naming folders, directories, or sub-directories. For example, creator 1 might have grouped his Hubble telescope pictures in a folder called “Space Images.” The term “Space Images” then becomes part of the user's taxonomy.
Within an organization or enterprise, some of the taxonomy (folder structure) has been defined by the organization or enterprise itself and has “no owners.” In this case, each folder has an administrator who bestows rights to users, such as the right to access the folder, the right to edit any documents within it, the right to edit only documents that the specific user created, or the right to view but not edit or contribute any document to the folder. Only the names of the folders that a user creates are part of his or her taxonomy.
where Log is the logarithmic function base 10; P is the total number of users; and E is the number of relevant experts. The number of relevant experts is calculated by determining how many unique creators and contributors either created or contributed to the located documents. IEF stands for Inverse Expertise Frequency.
This adjustment raises the adjusted scores when there are few relevant experts within the company.
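The G function itself is not reproduced in this excerpt; a reconstruction consistent with the worked example later in this description (P=100 users and E=10 relevant experts yielding G=2) is G = 1 + Log(P/E), sketched below. The "1 +" offset is an assumption inferred from that example:

```python
import math

def company_expertise(total_users, relevant_experts):
    """G(Company expertise), reconstructed as 1 + log10(P/E): the fewer
    relevant experts E among the P users, the larger the factor, raising
    adjusted scores when expertise is scarce within the company (IEF)."""
    return 1 + math.log10(total_users / relevant_experts)
```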
where Log is the logarithmic function base 10; NCO is the total number of content objects available in the database at the time of the query; and NCOR is the total number of relevant content objects for a given query. IWCOF stands for the Inverse Weighted Content Objects Frequency. Preferably, in this embodiment, NCO, NCOR and IWCOF are only calculated using non-confidential content objects.
IWCOF is similar to IEF as it adjusts the score by slightly raising the adjusted score when only a few relevant content objects are found in the database. Therefore, the absolute relevance score for a given user (or the user expertise) is:
Using the above equations, the intrinsic score is increased to an adjusted score if the creator of the content objects is more knowledgeable about the searched subject matter than the person who entered the query, i.e., if the creator expertise is higher than the searcher expertise. On the other hand, the intrinsic score is decreased to an adjusted score if the creator is less knowledgeable about the searched subject matter than the searcher, i.e., if the creator expertise is lower than the searcher expertise.
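This directional behavior may be sketched as follows. The tanh squashing for W1 is purely an illustrative assumption (the specification's W1 is given by formula (6), not reproduced in this excerpt); only the sign behavior is taken from the description above:

```python
import math

def expertise_weight(creator_exp, searcher_exp, scale=10.0):
    """Illustrative W1 in (-1, 1): positive when the creator is more
    expert than the searcher, negative otherwise, zero when equal.
    The scale and the tanh form are assumptions for the sketch."""
    return math.tanh((creator_exp - searcher_exp) / scale)

def document_creator_adjustment(available_range, creator_exp, searcher_exp):
    """DCE = R1(DCS) * W1(RS_EXP_ABS) per formula (4): the available
    range (from R1) scaled by the signed expertise weight."""
    return available_range * expertise_weight(creator_exp, searcher_exp)
```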
To calculate the Comments Contributors Expertise Adjustment (CCE) the following equation is used:
where
Once these adjustments have been computed, one has to ensure that the relevancy score from (1) is in the appropriate range and that, preferably in this embodiment, it is an integer. This is obtained as follows:
RS_ADJ = Min(100, Max(1, Round(RS_ADJ)))   (15)
where Round(d) rounds the number d to its nearest integer.
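Formula (15) may be sketched directly:

```python
def clamp_score(rs_adj):
    """Formula (15): round the adjusted score to the nearest integer
    and clamp it to the range [1, 100]."""
    return min(100, max(1, round(rs_adj)))
```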
Expertise Adjustment when Searching for Sources
Once the intrinsic score has been calculated according to step 510 above, the adjusted score for sources (RSS_ADJ) for each source is calculated as follows:
RSS_ADJ = intrinsic Source Content Score + expertise adjustment = SCS + R2(SCS) * W2(RS_EXP_ABS)   (16)
where SCS is the intrinsic Source Content Score, which is, preferably in this embodiment, defined as the maximum of all the intrinsic Document Content Scores (DCS) that were created from each source, i.e.,
SCS = MAX(DCS)   (17)
For example, multiple documents may have been saved as multiple file collections from a single Web-site.
R2(SCS) determines the maximal amount of the expertise adjustment, or, in other words, the range for the alteration due to the expertise of the creator of the document taken from the source, which depends on the level of the intrinsic source score, i.e., SCS. The range function is given by:
Extreme scores are less influenced than scores in the middle. The maximum possible change in a score is 20 when SCS=50 and linearly decreases to 10 when SCS=100 or 2.
W2(RS_EXP_ABS) determines what percentage of the available range for the expertise adjustment, R2(SCS), positively or negatively, is considered for adjusting the intrinsic score. It is given by:
where RS_EXP_ABS is the absolute relevance score of the expert (as defined previously). MAX(RS_EXP_ABS(Creator)) is the maximum of absolute expertise scores over all creators that have created file collections from this source. RS_EXP_ABS(Searcher) is the absolute relevance score of the searcher. In other words, the intrinsic score for the source is adjusted upward to an adjusted score if the maximum creator expertise of all creators for a particular source exceeds the searcher expertise. On the other hand, the intrinsic score for the source is lowered to an adjusted score if the creator expertise of all creators for a particular source is lower than the searcher expertise.
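For illustration, the source adjustment of formula (16) may be sketched as follows, with W2 reduced to its sign (a simplification; the actual W2 is a graded weight as described above) and R2 reconstructed linearly from its stated endpoints:

```python
def r2(scs):
    """Range function R2(SCS), reconstructed linearly: 20 at SCS=50,
    decreasing to 10 at SCS=2 and SCS=100 (an assumption from the
    stated endpoints)."""
    if scs <= 50:
        return 10 + 10 * (scs - 2) / 48
    return 10 + 10 * (100 - scs) / 50

def source_adjusted_score(document_scores, creator_expertises, searcher_expertise):
    """RSS_ADJ = SCS + R2(SCS)*W2: SCS is the maximum DCS over file
    collections created from the source (formula 17); the adjustment is
    upward when the best creator out-ranks the searcher, downward when
    the searcher out-ranks every creator."""
    scs = max(document_scores)
    delta = max(creator_expertises) - searcher_expertise
    w2 = (delta > 0) - (delta < 0)   # sign only; the spec's W2 is graded
    return scs + r2(scs) * w2
```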
Once this adjustment has been computed, one has to ensure that the relevancy score is in the appropriate range and that, preferably in this embodiment, it is an integer. This is obtained as follows:
RSS_ADJ = Min(100, Max(1, Round(RSS_ADJ)))   (20)
where Round(d) rounds the number d to its nearest integer.
In this way, the adjusted score for each document (RS_ADJ) or the adjusted score for sources (RSS_ADJ) is calculated based on the expertise of the searcher, creator/s, and/or contributor/s. Such adjusted scores provide a significant improvement in precision over that attainable through conventional information retrieval systems.
Expertise Adjustment when Searching for Peers
When users are looking for peers rather than experts, an adjusted relevancy score is calculated. Peers are other users who have a similar expertise or come from a similar, or the same, department as the searcher. The adjusted relevancy score uses the expertise values and adjusts them with respect to the searcher's expertise. This is similar to re-sorting the list with respect to the searcher, except that the values themselves are recalculated.
Once the expertise for each user has been determined, they are adjusted with respect to the searcher expertise. The adjusted relative or personalized relevancy score for an expert is defined by:
The adjusted relevancy score is a measure of the difference between two levels of expertise. The square root maps the difference to a continuous and monotone measure while diminishing the importance of differences when two experts are far apart. It is also asymmetric in the sense that it favors expertise above the searcher expertise. Finally, recall that |K| represents the absolute value of K (i.e., the difference).
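A sketch of such an asymmetric square-root measure follows. The baseline of 100 and the asymmetry factor are illustrative assumptions; only the square root, the monotone mapping of the difference, and the asymmetry favoring above-searcher expertise are taken from the description above:

```python
import math

def peer_relevancy(expert_score, searcher_score, asym=0.5):
    """Illustrative peer score: sqrt(|K|) maps the expertise difference K
    to a continuous, monotone penalty whose growth diminishes as two
    experts move far apart; expertise below the searcher's is penalized
    more than expertise above it (asymmetry)."""
    diff = expert_score - searcher_score
    penalty = math.sqrt(abs(diff))
    if diff < 0:
        penalty *= (1 + asym)   # below-searcher expertise penalized more
    return 100 - penalty
```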
An example of a method for personalizing information retrieval using the above formulae will now be described. It should, however, be appreciated that this example is described herein merely for ease of explanation, and in no way limits the invention to the scenario described. Table 1 sets out the environment in which a search is conducted. Furthermore, in this illustration, the factor a (from formula 2, determining the importance of the content of a document relative to its attached comments) has been arbitrarily set to 1.
For this example, 100 users have a total of 1000 file collections in the repository, yielding 10 relevant experts and 10 relevant file collections. There are also 10 comments that are found to be relevant. The enterprise in which the example takes place has four departments, namely marketing, engineering, finance, and legal. For ease of explanation, each employee's last name begins with the department in which they work.
Once the repository 104 (
Using formulae 7-12 above, the expertise of each searcher, creator, and/or contributor is then calculated. The calculations for F(User contribution) yield the results in Table 2 below.
Using formulae 10 and 11, G(Company Expertise) is calculated to be 2, while H(Query Specificity) is calculated to be 1.667. These values and the values in Table 2 are plugged into formula 7 to arrive at the following expertise values:
W1(RS_EXP_ABS) is then calculated using formula 6 (for different searcher expertise) to yield the following results:
DCE and CCE are then calculated using formulae 4, 5, 13, and 14 (for different searcher expertise) to yield the following results:
The Expertise Adjustment (EA) is then calculated according to formula 3 to yield the following results for EA:
Finally, the adjusted score (RS ADJ) for each located document is calculated using formula 1 to yield the following results:
In a similar manner, the adjusted scores are calculated when searching for sources as per tables 8-12 below.
As can be seen, the intrinsic score of each document and/or source is adjusted to an adjusted score based on the expertise of the users. In other words, a document and/or source that may have been less relevant is adjusted so that it is more relevant, or vice versa. In this way, the precision of document and/or source relevancy is improved.
Harmonizing Content Relevancy Across Structured and Unstructured Data
The document, source, and/or other data is then sent to the information retrieval system 102 (
The document is examined for structured data. If the document is a structured document (step 607), then for each field in the document, the keyword extractor or parser 318 (
Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 606 in the repository 104 as part of a file collection 328(1)-(N) (
In a preferred embodiment, the concept identifier 320 (
At any time during the process, contributors can supply their contributions, at step 616, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 618 and stored in the repository at step 620, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 603-614.
Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 706 in the repository 104 as part of a file collection 328(1)-(N) (
In a preferred embodiment, the concept identifier 320 (
At any time during the process, contributors can supply their contributions, at step 716, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 718 and stored in the repository at step 720, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 703-714.
A searcher using a searcher device 108 (
The search is preferably conducted to locate structured and unstructured objects. Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like. A search for documents returns a list of relevant documents, and a search for experts returns a list of experts with expertise in the relevant field. A search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
The search is received at step 804 by the information retrieval system 102 (
The search engine 324 (
After the objects are located, the search engine 324 determines if the located objects are structured or unstructured (step 810). If a located object is unstructured, then the search engine 324 calculates an intrinsic score at step 812 for the unstructured object. If more located objects are available (step 813), then step 812 is repeated until all the located objects have been processed. If all the located objects have been processed, the search engine 324 adjusts the intrinsic scores based on expertise (step 814) and sorts the structured and unstructured objects based on the adjusted scores (step 816). The sorted list is then transferred to the searcher (step 818), where it is received (step 820) and displayed to the searcher (step 822). Other than step 810, all of the foregoing steps were previously described with respect to
If a located object is a structured object, then the search engine 324 determines the fields of the object that match the search query and calculates field intrinsic scores for the matching fields (step 824). The Field ID stored in the repository 104 during collection is used to identify the matching fields.
After the field intrinsic scores are adjusted at step 814, the structured and unstructured objects are sorted (step 816) by their adjusted scores and transferred to the searcher (step 818), where they are received (step 820) and displayed (step 822) to the searcher.
Adjusting Field Intrinsic Scores To Harmonize Content Relevancy
In some embodiments, field intrinsic scores are adjusted differently based on the logical operator or operators used in the search query. These logical operators include but are not limited to: AND, OR and TMTB (“The More The Better”). Note that TMTB is an accumulate type operator that looks for objects that match as many keywords associated with a search query as possible. While the OR and AND operators are the most common logical operators used in search engine queries, it should be apparent that the formulae described below can be adapted to other types of operators, including proximity operators (e.g., ADJACENT, WITH, NEAR, FOLLOWED BY, etc.).
An example of a data source that generates documents including structured objects is a Sales Force Automation (SFA) system. A typical SFA system includes software and systems that support sales staff lead generation, contact, scheduling, performance tracking and other functions. SFA functions are normally integrated with base systems that provide order, product, inventory status and other information and may be included as part of a larger customer relationship management (CRM) system.
The structured objects generated by the SFA system include a mix of restricted and unrestricted fields. A restricted field is a field that is constrained by one or more parameters, such as a controlled vocabulary or a controlled size. An unrestricted field can include any value the owner wants to provide, such as free form text. For example, an SFA document could include records having the following four fields: SFAAccountName (the name of the account); SFAAccountDescription (a brief description of the account); SFAAccountIndustry (the industry to which the account belongs); and SFAAccountAttachment (documents that might be attached to the record). In this example, SFAAccountName and SFAAccountIndustry are examples of restricted fields and SFAAccountDescription and SFAAccountAttachment are examples of unrestricted fields.
To account for the differences in relevancy between restricted and unrestricted fields, each field is assigned a field modifier value, which is adjustable and expandable. For example, the field modifier value can be adjusted according to a profile of the user and/or a profile of the data source.
Table 13 below summarizes the SFA fields described above with some examples of corresponding field modifier values. Note that in this example the restricted fields are weighted more heavily based on the assumption that a keyword match to a restricted field may be more relevant. It should be apparent, however, that the field modifiers can be selected and adjusted, as necessary, depending on the search engine design.
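The field modifier scheme described above can be sketched as a simple lookup and multiply. Since Table 13's actual values are not reproduced in this text, the modifier values below are illustrative assumptions only; restricted fields are given larger modifiers, per the stated rationale.

```python
# Hypothetical field modifier table in the spirit of Table 13. The values
# are illustrative design choices, not those of the actual embodiment.
FIELD_MODIFIERS = {
    "SFAAccountName": 2.0,         # restricted field: weighted more heavily
    "SFAAccountIndustry": 2.0,     # restricted field
    "SFAAccountDescription": 1.0,  # unrestricted field
    "SFAAccountAttachment": 1.0,   # unrestricted field
}

def modified_field_score(field: str, intrinsic_score: float) -> float:
    """Weight a raw field intrinsic score by that field's modifier.

    Fields without an assigned modifier default to 1.0 (no adjustment).
    """
    return FIELD_MODIFIERS.get(field, 1.0) * intrinsic_score
```

For example, a raw score of 10 in the restricted SFAAccountName field would contribute twice as much as the same raw score in the unrestricted SFAAccountDescription field.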
For each operator, the general formula to calculate its adjusted intrinsic relevancy score is of the form:
AIR_Structured = f_operator(Field, Modifier parameters),  (22)
where AIRStructured is the adjusted intrinsic relevancy score for a structured object and the function depends on the operator. Specific formulae for the most common logical operators are described in turn below.
The “OR” Operator
The OR logical operator looks for the presence of keywords associated with the user's query. If a match on any one keyword is found in a document, the document is retrieved. In some embodiments, a formula for an OR operator can be determined by
AIR_Structured = MAX_k [ MAX_f ( FieldModifier_f × FieldIntrinsicScore_f,k ) ],  (23)
where k indexes the keywords in the query, f indexes the fields of the object, and the Field Intrinsic score is, for example in the case of an unrestricted field, the raw score provided by a semantic engine or a full-text engine for that given keyword. Note that formula (23) uses the maximal subscores to define the intrinsic value of the document. Another approach is to allow all fields to participate,
AIR_Structured = MAX_k [ λ × SUM_f ( FieldModifier_f × FieldIntrinsicScore_f,k ) ],  (24)
where λ is a normalizing parameter. Finally, the growth of the score within each keyword is preferably concave as scores accumulate and convex as they start accumulating. For example, as the score accumulates, the difference between two high scores (e.g., the scores 1000 and 1010) is less significant than the difference between two low scores (e.g., the scores 10 and 20). That is, for high scores the score function behaves like a utility function with respect to relevancy. On the other hand, the difference between a first pair of low scores (e.g., 0 and 2) could be less significant than the difference between a second pair of low scores (e.g., 10 and 12), due to the uncertainty and lack of relevancy associated with very low scores. An inverse logit function has this shape, giving
AIR_Structured = MAX_k [ InvLogit( θ × SUM_f ( FieldModifier_f × FieldIntrinsicScore_f,k ) ) ],  (25)
where θ is again a normalizing constant. In some embodiments, one or more parameters can be added, as necessary, to tune the behavior of the inverse logit function.
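The three OR-style aggregations just described (maximal subscores, a λ-normalized sum over all fields, and the inverse logit of the sum) can be sketched as follows. Because the numbered formulas are not fully reproduced in this text, the exact placement of λ and θ, the default values, and the helper names are illustrative assumptions; each function takes, per keyword, a list of field intrinsic scores already weighted by their field modifiers.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def or_score_max(keyword_field_scores):
    """Formula (23) style: best modified field score per keyword, best keyword overall."""
    return max(max(scores) for scores in keyword_field_scores)

def or_score_sum(keyword_field_scores, lam=0.1):
    """Formula (24) style: all fields participate via a lambda-normalized sum."""
    return max(lam * sum(scores) for scores in keyword_field_scores)

def or_score_logit(keyword_field_scores, theta=0.1):
    """Formula (25) style: inverse logit of the normalized sum.

    High-score differences are compressed (concave), matching the utility-like
    behavior described above for accumulating scores.
    """
    return max(inv_logit(theta * sum(scores)) for scores in keyword_field_scores)
```

Note how the inverse logit realizes the concavity argument: the gap between scores 1000 and 1010 barely moves the result, while the gap between 10 and 20 does.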
The AND Operator
The AND operator looks for objects that match all keywords associated with the search query. In some embodiments, a simple scoring mechanism is used that assumes all keywords are equal (i.e., they all need to be present). The relevancy is then associated with the least-scoring keyword. This leads to the following "logit" type of formula:
where the MAX calculations used in the OR formulas (23) through (25) are switched to MIN calculations.
In an alternative embodiment, an average "concept" relevancy score can be determined by averaging the per-keyword scores,
AIR_Structured = (1/N) × SUM_k [ MAX_f ( FieldModifier_f × FieldIntrinsicScore_f,k ) ],
where N is the number of keywords in the query.
Note that by averaging the relevancy score, an overall score is determined that does not overweight a single potential low score.
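The two AND variants can be sketched as follows; as above, the per-keyword input lists are assumed to already carry their field modifiers, and the helper names are illustrative.

```python
def and_score_min(keyword_field_scores):
    """AND: relevancy follows the least keyword (MIN over keywords of best field score)."""
    return min(max(scores) for scores in keyword_field_scores)

def and_score_avg(keyword_field_scores):
    """Alternative AND: average the per-keyword scores so a single low score
    does not dominate the overall relevancy."""
    per_keyword = [max(scores) for scores in keyword_field_scores]
    return sum(per_keyword) / len(per_keyword)
```

For the same object, the averaging variant sits between the weakest and strongest keyword scores, which is exactly why it avoids overweighting one low score.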
The TMTB Operator
The TMTB operator is an accumulation operator that tends to provide the highest scores for those objects that are relevant to more terms. It also, however, allows objects that do not match all concepts to still receive a high score if the matches that do occur are good. Using the "logit" style of formula for illustration gives:
where NQC is the number of concepts present in the query and FQC is the number of concepts that were matched to the object.
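Since the TMTB formula itself (equation 30) is not reproduced in this text, the sketch below only illustrates the stated behavior under assumptions: accumulated per-concept scores are passed through an inverse logit and scaled by FQC/NQC, so matching more concepts raises the score, while strong partial matches can still score well.

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tmtb_score(keyword_field_scores, nqc, theta=0.1):
    """TMTB sketch: accumulate scores of matched concepts, scaled by FQC/NQC.

    nqc is the number of concepts present in the query (NQC); the number of
    concepts actually matched to the object (FQC) is counted here.
    """
    matched = [max(scores) for scores in keyword_field_scores if scores and max(scores) > 0]
    fqc = len(matched)
    if fqc == 0:
        return 0.0
    return (fqc / nqc) * inv_logit(theta * sum(matched))
```

With this shape, an object matching both of two query concepts outscores an object matching only one of them equally well, as the operator intends.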
Field Extensions
In the above approaches, no field was an exclusion field. It is also possible, however, to require that at least one of the matches occur in a specific field. For example, in embodiments using the MAX/OR scheme described above, the formula would be
AIR_Structured = 1{Match ∈ Set} × MAX_k [ MAX_f ( FieldModifier_f × FieldIntrinsicScore_f,k ) ],
where 1{Match ∈ Set} is an indicator function that evaluates to one when at least one match falls within the specified set of fields, and to zero otherwise.
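The indicator gating can be sketched as follows for the MAX/OR scheme; the field names reuse the SFA example above, and the function name and input shape (per keyword, a mapping of field name to modified score) are illustrative assumptions.

```python
def or_score_with_required_fields(keyword_matches, required_fields):
    """MAX/OR score gated by an indicator function: the score is kept only if
    at least one keyword match occurs in the required field set; otherwise
    the object scores zero.

    keyword_matches: per keyword, a dict mapping field name -> modified score.
    """
    best = max(max(fields.values()) for fields in keyword_matches)
    in_required = any(
        field in required_fields and score > 0
        for fields in keyword_matches
        for field, score in fields.items()
    )
    return best if in_required else 0.0
```

For instance, requiring a match in SFAAccountName keeps the object's score, while requiring a match in a field with no hits zeroes it out.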
The process begins by initializing a keyword counter x to one (step 902) or any other suitable number. For the first keyword in a search query including N keywords, the fields in the located object containing keyword x matches are determined (step 904) and field intrinsic scores are computed (step 906). In some embodiments, prior to executing the matching step 904, search queries with N terms are matched (step 903) to M concepts (where M is less than or equal to N). If multiple terms in a search query are matched to the same concept, then only one term in the query will be used in the matching step 904 to reduce computation time. For example, if a searcher types “Car OR Car,” then only one unique concept (i.e., a type of vehicle) was specified by the searcher. Similarly, if a searcher types “A Car”, then only one unique concept was specified. Thus, in the above examples the keyword “Car” would be included in the matching step 904, and the second instance of the term “Car” and the term “A” would be excluded from the matching step 904. The foregoing additional steps would be used in, for example, the calculation of an adjusted intrinsic relevancy score based on the TMTB operator, as described with respect to equation 30.
In some embodiments, the calculation of field intrinsic scores can be based on the number of times that a search term appears in a particular field of the located object, or on a semantic analysis of the relationship between keywords and content. Next, the appropriate field modifiers are applied to the field intrinsic scores (step 908). In some embodiments, the modifiers are selected based on whether the fields with the keyword matches are restricted or unrestricted. For example, in the SFA system previously described, the fields SFAAccountName and SFAAccountIndustry are restricted and the fields SFAAccountDescription and SFAAccountAttachment are unrestricted.
After the field intrinsic scores are modified, there are several options for determining the relevancy score for the located object. A first option is to determine a maximum field intrinsic score from the set of modified field intrinsic scores for keyword x (step 910). A second option sums the modified field intrinsic scores (step 912). A third option applies an inverse logit function to the sum of the modified field intrinsic scores (step 914). Note that in some embodiments steps 912 and 914 may also include a normalizing step. In step 916, a check is made for more keywords. If there are more keywords, then the keyword counter is incremented and the process continues at step 904, where fields with matches for the next keyword (i.e., keyword x=x+1) are retrieved. If there are no more keywords, then the scores are adjusted according to the types of operators used in the query (e.g., AND, OR, TMTB, NEAR, ADJACENT, WITH, FOLLOWED BY, etc.) (step 918). If a logical OR operator is used with the keywords, then a relevancy score for the located object is determined from the maximum of the field intrinsic scores for keywords 1, . . . , N (step 920). If a logical AND operator is used with the keywords, then a relevancy score for the located object is determined from a minimum of the field intrinsic scores for keywords 1, . . . , N. If another type of operator is used (e.g., TMTB, etc.), then a relevancy score for the located object is determined using the appropriate formulas (step 924).
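The keyword loop and operator adjustment described above can be sketched end to end as follows; the function name, input shape, and default θ are illustrative assumptions, and only the OR and AND branches are implemented, with other operators deferred to their own formulas.

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def score_located_object(keyword_fields, modifiers, operator, combine="max", theta=0.1):
    """Sketch of the scoring process: for each keyword, determine matching
    fields and their intrinsic scores (steps 904-906), apply field modifiers
    (step 908), combine per keyword by max, sum, or inverse logit of the sum
    (steps 910-914), then adjust across keywords by operator (steps 918-924).

    keyword_fields: per keyword, a dict of field name -> raw field intrinsic score.
    modifiers: dict of field name -> field modifier value.
    """
    per_keyword = []
    for fields in keyword_fields:
        modified = [modifiers.get(f, 1.0) * s for f, s in fields.items()]  # step 908
        if combine == "max":                 # step 910: maximum modified score
            per_keyword.append(max(modified))
        elif combine == "sum":               # step 912: sum of modified scores
            per_keyword.append(sum(modified))
        else:                                # step 914: inverse logit of the sum
            per_keyword.append(inv_logit(theta * sum(modified)))
    if operator == "OR":                     # step 920: max over keywords
        return max(per_keyword)
    if operator == "AND":                    # AND branch: min over keywords
        return min(per_keyword)
    raise ValueError("other operators (e.g., TMTB) use their own formulas")  # step 924
```

For a two-keyword query where the first keyword matches a restricted field and the second only an unrestricted one, the OR score follows the stronger keyword and the AND score the weaker one.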
While the foregoing description and drawings represent preferred embodiments of the present invention, it will be understood that various additions, modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined in the accompanying claims. In particular, it will be clear to those skilled in the art that the present invention may be embodied in other specific forms, structures, arrangements, proportions, and with other elements, materials, and components, without departing from the spirit or essential characteristics thereof. The presently disclosed embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims, and not limited to the foregoing description. Furthermore, it should be noted that the order in which the process is performed may vary without substantially altering the outcome of the process.
This application is a continuation-in-part of U.S. patent application Ser. No. 10/172,165, filed Jun. 14, 2002, which application is incorporated by reference herein in its entirety.
 | Number | Date | Country
---|---|---|---
Parent | 10172165 | Jun 2002 | US |
Child | 10972248 | Oct 2004 | US |