1. Field
Embodiments of the present invention generally relate to automated document collection and classification systems and methods. Specific embodiments generally relate to systems and methods for automated document collection and classification to match researcher expertise with research funding opportunities and to match suitable collaborators for research projects.
2. Related Art
Researchers in a variety of organizations—academic, commercial, non-commercial, in the United States and worldwide—rely on Requests for Proposal (RFPs) to learn about research opportunities in outside organizations and sometimes even within their own organizations. Indeed, many research institutions derive most of their funding from projects they win by responding to RFPs. However, to respond to the RFPs, researchers must be aware of them and have a way to determine whether the potential funding opportunities match their interests and expertise. Researchers also need to know whether another individual or institution is conducting research on which he or she might collaborate. Maintaining this awareness is no small task considering that on any given day thousands of RFPs are active from the United States federal government alone and thousands more are issued by other governments, universities, foundations, and other funding sources.
Current practice in these institutions is that researchers or support staff sit at a computer terminal and direct internet browsers to websites that host a limited set of online databases, or they read feeds from such databases in e-mail. The reader selects a set of RFPs using filters such as the presence of certain keywords, the deadline for submitting a proposal or for completing the project, and the amount of funding. Once the search produces a set of RFPs, the reader manually reviews the text and uses human judgment to select for further consideration those that are most relevant to the individual or institution. Following this process, the researcher may then go through another step of identifying collaborators based on their interests and experience. Over time, the researcher or staff may revise this search strategy to improve the selection of terms and retrieve better matches for consideration.
The process is not just time consuming; it may also result in missed opportunities for the institution, for individual researchers within the institution, and even for the organization that issued the RFP. The database scan may omit keywords that are unexpectedly relevant. Alternatively, a relevant keyword, one of interest to a researcher, may be buried within the text and therefore not picked up by a high-level scan. Or, once the set of RFPs is selected for manual review, the researcher or staff person may run out of time before reaching an RFP of interest at the bottom of the stack.
Table A lists various acronyms and definitions of terms as discussed in the disclosure.
Various embodiments replicate the current human process in software, reducing the limitations of human error and time, in order to efficiently deliver relevant RFPs to researchers. RFP documents are collected automatically and matched to text-based researcher profiles using a matching process that applies algorithms emulating human judgment of semantic relevance. Various embodiments improve on the current process by collecting and evaluating RFPs, and detecting relevance to potential applicants' interests, more efficiently and thoroughly than might be done in the current human process. In various embodiments, based on feedback, the software may improve the algorithms emulating the more personalized judgments over time. In various embodiments, the software identifies potential collaborators for an RFP application by detecting other researchers whose experience is relevant to the RFP. Thus, various embodiments provide for a system and method that executes this process orders of magnitude more efficiently than the current practice.
Various embodiments are applicable to commercial and non-commercial enterprises seeking national or international RFPs, tenders, and even internal opportunities within the enterprise. In that case, researchers represent entities seeking the opportunities and collaborations, and RFPs represent the opportunities.
Various embodiments are directed to a system (and/or a method implemented therein) that replicates the process that is currently performed by humans. The system uses automated document collection, ordering, and classification to match researcher expertise with active grants and RFPs. This provides an opportunity to substantially reduce costs and improve results by applying information analytics to data that are currently available on the web and within organizational databases. Accordingly, various embodiments relate to a computer system that is designed to improve the process of matching researchers with relevant research projects and opportunities for collaboration as described in researcher profiles and the thousands of RFPs issued each year by governments, universities, foundations, and other funding sources.
The system automatically collects RFPs and other documents describing project opportunities and matches them to text-based researcher profiles using algorithms that emulate human judgments of semantic relevance. Based on feedback collected via the user interface, the software may improve algorithms emulating the more personalized judgments over time. Finally, the software identifies potential collaborators for an RFP application by detecting other researchers whose experience is relevant to the RFP. Thus, in various embodiments, the system executes the process orders of magnitude more efficiently than the current process.
A semi-automated search-and-retrieve strategy that presents a researcher with a list of documents sorted by similarity to his or her interests has the potential to streamline the process and make it more effective and efficient. The system identifies the RFPs most relevant to a researcher's interests, using semantic analysis methods to create an ordering of RFPs customized to each researcher's keywords. Various embodiments provide advantages over keyword search by accounting for synonymy and polysemy. Finally, the system includes an online interface designed so that researchers can not only browse opportunities that have been matched to their interests, but also navigate a network view of potential co-applicants and collaborators. Thus, a useful byproduct of various embodiments is that they enable researchers to identify collaborators for proposals that may be mutually interesting.
To use the system, documents are collected automatically and/or edited manually by researchers to create a personal profile of the researcher's interests and areas of expertise. The system picks up key words from reports, and text from past proposals the researcher has authored, for example. The system works in real time and scans several web-based and other databases to find funding opportunities that match the researcher's profile and then, using advanced statistical learning methods, creates a ranked list of opportunities and potential collaborators. Interactive user interfaces allow researchers to refine their profiles and searches to improve the performance of the system; i.e., produce project opportunities more relevant to their interests.
The data sources Z include, but are not limited to, RFP data sources Z1 and researcher data sources Z2. RFP data sources Z1 include websites of funding agencies, internal project descriptions, and other digital text sources signifying opportunities. These include databases such as the grants.gov archive and websites such as fedbizopps.gov. They might also include descriptions of other project opportunities that are not RFPs. Researcher data sources Z2 can come from organizational databases that maintain text about interests, past proposals, publications, and other data manually entered by researchers via a GUI.
The data sources Z are associated with RFP acquisition programs 1. The RFP data acquisition programs 1 are custom-coded programs written in Python and executed on a networked Linux operating system. They are “extract transform load” (ETL) programs that pull data from network sources that publish RFP data. The data can be transformed into the application's database schema. The programs and web sites from which an exemplary embodiment of the system 10 obtains active RFPs are listed in the Appendix.
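By way of non-limiting illustration, the following Python sketch shows the general shape of one such ETL pull. The URL, the CSS selector, and the database schema here are hypothetical placeholders, not the actual sources listed in the Appendix:

```python
# Illustrative sketch of an "extract transform load" (ETL) RFP pull.
# The listing URL, the CSS selector, and the table schema are hypothetical.
import sqlite3

import requests
from bs4 import BeautifulSoup

def pull_rfps(list_url: str, db_path: str = "rfp.db") -> int:
    """Extract RFP title/link pairs from a listing page and load them."""
    html = requests.get(list_url, timeout=30).text          # extract
    soup = BeautifulSoup(html, "html.parser")

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rfp_data (url TEXT PRIMARY KEY, title TEXT)"
    )
    count = 0
    for link in soup.select("a.rfp-link"):                  # hypothetical selector
        title = link.get_text(strip=True)                   # transform: clean text
        conn.execute(
            "INSERT OR REPLACE INTO rfp_data (url, title) VALUES (?, ?)",
            (link["href"], title),                          # load into schema
        )
        count += 1
    conn.commit()
    conn.close()
    return count
```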
The data sources Z are associated with researcher data acquisition programs 2. The researcher data acquisition programs 2 are written to seed researchers' profiles with information about their interests through direct queries to various databases. These databases include library lists of publications, from which keywords are abstracted, the employee directory database that holds researchers' contact information, past proposals, and curricula vitae. The researcher can also enter into the system his or her own list of interest areas or upload documents. Similar acquisition programs can be used to collect data from publications indexed in federated databases such as the ISI Web of Knowledge, an academic citation indexing and search service that is combined with web linking and provided by Thomson Reuters. The database Y may be located on a local server or computer and implemented in, for example, MySQL. The native data store of the system 10 holds the data extracted via the data acquisition programs (e.g., 1, 2). These include the text data inputs to calculations: programmatically acquired data and user input. Similarity calculations based on advanced statistical methods produce outputs stored in two different distance tables that represent the similarities between researchers and RFPs and between pairs of researchers.
The database Y may include, but is not limited to, RFP data table A, researcher data table B, researcher keyword table C, researcher ratings table D, researcher-RFP distance data table E, and researcher-researcher distance table F.
Researchers may use meta-data about grants stored in the RFP data table A to filter results. For instance, filtering may be applied based on the funding agency, date, and/or the like. Here, for example, prospective applicants can customize the result list by selecting various funding agencies. The researcher data table B contains basic information about researchers, such as login information, organizational status, preferences about funding sources, and/or the like.
The researcher keyword table C contains text acquired in the researcher data acquisition programs 2 from the researcher data sources Z2.
The researcher ratings table D stores information about how researchers implicitly (e.g., by monitoring mouse clicks) or explicitly (e.g., through direct entry of ratings via the rating program 6 described below) rate the relevance of RFPs.
The database Y may also include (or be associated with) a similarity/learning calculation module 3. The similarity/learning calculation module 3 performs calculations based on statistical and machine learning methods that transform text and ratings data into similarity metrics and/or predictions of researcher interest in new RFPs. Several methods for these calculations are stored as programs in the system 10 with results and calculations triggered by different events.
In some embodiments, there are generally three steps required to configure the similarity/learning calculation module 3. First, the similarity model is defined (i.e., how the features of the available data are to be represented and transformed in a way that can generate valid similarity metrics). Next, a method for updating and estimating model parameters (if any) is defined. Then, the similarity metrics that will be calculated from the model are defined. It should be noted that for many models, similarities can easily be calculated between content that was not part of the parameter estimation process. This process of incorporating new content for similarity calculation is often referred to as "folding-in" in the semantic analysis literature.
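These three steps can be pictured as a small programming interface. The following Python sketch is illustrative only; the class and method names are invented here (they are not those of the actual module 3), and a truncated SVD stands in for whichever similarity model an embodiment defines:

```python
# Illustrative skeleton of the three steps: (1) define the model,
# (2) estimate its parameters, (3) compute similarity metrics.
import numpy as np

class SimilarityModel:
    def __init__(self, n_dims: int):
        self.n_dims = n_dims                    # step 1: model definition

    def fit(self, tdm: np.ndarray) -> None:
        """Step 2: estimate parameters from a term-document matrix."""
        u, s, _ = np.linalg.svd(tdm, full_matrices=False)
        self.terms = u[:, : self.n_dims]
        self.scale = s[: self.n_dims]

    def fold_in(self, doc: np.ndarray) -> np.ndarray:
        """Project new content (term counts) into the fitted space."""
        return doc @ self.terms / self.scale

    def similarity(self, d1: np.ndarray, d2: np.ndarray) -> float:
        """Step 3: a cosine similarity metric between folded-in documents."""
        return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))
```

Because fold_in operates on any term-count vector, content that was not part of the parameter estimation can still be scored, which is the "folding-in" behavior noted above.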
The user interface X, which, for example, may be at a remote terminal or a local terminal connected to the server or computer having the database Y, is configured to gather RFP text from online sources and use semantic analysis to rank RFPs by relevance to each researcher's expertise.
The user interface X provides various views into the data system and enables users to rate the RFPs. The user interface X includes a basic login program 4 that retrieves stored information setting session parameters to the researcher's personalized values. A manual keyword entry program 5 allows researchers to modify their stored profiles by changing the words associated with their interests. A user RFP list and rating program 6 presents a list of RFPs ordered by calculated similarity to the logged-in researcher's interests, with links to full content and rating buttons that enable researchers to rate the relevance of each RFP in the result list. An agency filter program 7 restricts the results presented to a researcher by eliminating results from funding sources selected by that user.
Similarity measures between researchers and RFPs and similarity measures between each pair of researchers' keywords are stored in the researcher-RFP distance table E and the researcher-researcher distance table F. Network diagram program 8 renders these distances in interactive network diagram visualizations, for example, with nodes represented as circles connected by edges proportional to the distances between the nodes representing researchers or RFPs. The network diagram program 8 may include various parameters, such as the number of degrees of network separation to show, what type of information is shown in each such degree, and/or the like. In particular embodiments, the network diagram program 8 relays the calculations to an external diagram layout program, such as aiSee, to complete the rendering and layout.
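As a non-limiting illustration of driving such a diagram from the stored distance tables, the following Python sketch uses the open source networkx library in place of the aiSee renderer to derive a force-directed layout in which smaller distances pull nodes closer together:

```python
# Illustrative conversion of a distance table (tables E/F) into node
# positions for a network diagram; networkx stands in for the renderer.
import networkx as nx

def layout_from_distances(distances: dict[tuple[str, str], float]):
    """distances maps (node_a, node_b) -> distance between the nodes."""
    g = nx.Graph()
    for (a, b), d in distances.items():
        # spring_layout treats weight as attraction, so invert the distance
        g.add_edge(a, b, weight=1.0 / max(d, 1e-9))
    return nx.spring_layout(g, weight="weight", seed=0)    # node -> (x, y)
```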
Table 1 provides more information about these processes, whether they are executed by human intervention or machine-triggered programs, and how they correspond to
In various embodiments, the operation of defining analytic methods for similarity calculations generally has three steps: (1) defining the general model for similarity calculation; (2) identifying the mechanism for setting the parameters of such a model; and (3) defining the mechanism by which similarity between two instances of data objects can be calculated using the defined model. Creating the programs for these operations enables automatically triggered calculations of distance functions and updating of model parameters based on newly available data and feedback. The similarity metric itself may take a categorical form (such as "recommended," "not recommended," and/or the like) or a continuous form (a distance defined on the real scale). Programs for running similarity calculations execute the defined methods; these programs automatically update model parameters in addition to executing the similarity calculations. Each of these abstract operations is embodied in the use case described, triggered by scheduled events on the underlying operating system and/or activity in the user interface X. Updates to the stored data and filtering selections trigger recalculation of similarities and reordering of the data in the other screens of the interface.
The user interface X may be accessible by a user at a terminal device 12 (e.g., computer, cell phone, tablet, PDA, etc.). The user interface X provides, for example over a network (e.g., wide area network (e.g., Internet), local area network, or the like), the user at the terminal device 12 access to server 14 on which the database Y is located. Thus, in some embodiments, the terminal device 12 is remote from the server 14 (and/or the one or more servers 16). The server 14 may be coupled to one or more servers 16 or the like on which the data sources Z1, Z2 are located to allow the server 14 to communicate with the one or more servers 16, for example over a network (e.g., a wide area network, a local area network, or the like).
As shown in
The RFP data sources Z1 are accessed by the RFP acquisition programs 1. The RFP acquisition programs 1 may interact with the RFP data table A. In some embodiments, the RFP acquisition programs 1 are located on a same server (e.g., 14) as the database Y. In other embodiments, the RFP acquisitions programs 1 are located on a same server (e.g., 16) as the RFP data sources Z1. In yet other embodiments, the RFP acquisition programs 1 are located on a different server from the RFP data sources Z1 and the database Y.
The similarity/learning calculation module 3 may be based on data from at least (but not limited to) the RFP data table A, the researcher data table B, the researcher keyword table C, and the researcher ratings table D. The researcher ratings table D may be based on at least (but not limited to) the user RFP list and rating program 6. The similarity/learning calculation module 3 may provide data to at least (but not limited to) the researcher-RFP distance data table E and the researcher-researcher distance table F.
A user (e.g., at a remote terminal) typically begins the experience by opening an internet browser, such as Mozilla Firefox or the like, on a display of the remote terminal device 12. Users may be presented with a login screen (e.g., as shown in
In the main diagram 34 on the Researcher-Grants screen (
The right pane 32 of the screen contains information that changes as the user highlights the various nodes (e.g., by moving a mouse pointer). When the user highlights a rectangular researcher node, the researcher's profile, which may include, for example, his/her name and photo, appears in the right screen margin, along with, for example, links to the CV file, staff directory information, the researcher's email address, a link so that the researcher can be contacted about collaboration opportunities, and/or the like. When highlighting a node representing a funding opportunity, information such as (but not limited to) the grant agency, title, funding level, proposal due date, grant description, and/or the like is displayed. A list of keywords describing the expertise of each researcher may also be displayed.
Selecting an oval RFP node creates an RFP-centric researcher diagram, as exemplified in
With reference to
This view initially displays the logged-in user positioned in the center node, surrounded by researchers with expertise that most closely match that of the user, based upon matching of their profiles. As with the Researcher-Grant screen, highlighting any of the researcher nodes will display that researcher's information from the “researcher data” table in the database, in the pane to the right. The lengths of the edges connecting nodes in the network diagram are proportional to the distance between researchers so that researchers with the most similar keywords are positioned closest together in the diagram.
The same diagramming program can take a variety of parameters to change the "degrees" of the diagram, indicating other researchers whose interests are similar to those of the researcher in the center rather than RFPs whose content is similar to the researcher's interests. This enables researchers to navigate the organizational network based on how similar researchers' interests are.
The user can then select any of the outlying researcher nodes. This will cause the diagram to redraw and display the selected researcher in the center, surrounded by researchers with the most closely matched profiles to the researcher in the center. A list of grants ordered by similarity to the researcher's interests will be displayed in a Grant List screen (refer to
With reference to
In addition to the search field 62, this particular embodiment includes some other interactive features. A “more” button 64 expands the title field to include the summary content of the RFP. Selecting (e.g., clicking) the title text will open a new browser window to the network location of the RFP.
Importantly, prospective applicants can rate results to reflect whether a particular result is of interest to them in the "topically relevant" column 66. These ratings are graphically displayed as an icon shaped like a human thumb. The tip of the thumb extending downward toward the bottom of the screen indicates a negative rating. A thumb pointing to the top of the screen indicates a positive rating. A 50-pixel by 50-pixel box between the two indicates an intermediate, neutral rating. Users can select these icons to enter their ratings. These ratings are delivered to the database and used to refine the similarity calculation algorithms as indicated with respect to
With reference to
The researcher keyword table C in the database Y holds the words seen in a profile management screen 82 (e.g.,
With reference to
Perl is a common scripting language suited to a variety of purposes, including management of text files. Hypertext Preprocessor (PHP) is a widely-used open source general-purpose scripting language that is especially suited for web development, can be embedded into HTML, and shares many features with Perl. Python is also similar to Perl. JavaScript is a common language that can be rendered by common internet browsers to create client-side programs executed by the local machine's internet browser. MATLAB is a matrix-based programming language; information is available at http://www.matlab.com. For graphical rendering software, the system 10 may use aiSee, currently available at http://www.aisee.com. The system 10 may use an operating system such as Linux (e.g., Linux Red Hat 2.0) or the like. The network protocols and access programs are HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), or the like. The embodiment described and pictured in
Interactive data may be based on explicit content; for example researchers' publications data, ratings of RFPs, previous proposals, and/or the like. Implicit data from an interactive interface may also be collected; for example response timing, computer mouse activity, requests for more information, browser-based information about internet navigation history, and/or the like.
Furthermore, the user interface X may include a mixture of results that have been created by different algorithms and/or parameters. The interactive data may be used to tune and select algorithms and parameters, either adaptively or by human process intervention.
As implied above, the analytic models used for similarity calculations may have parameters that change based on data collected during interactions. For example, in the use case described below, the calculations are updated dynamically as a user adds more information about preferences—if a particular result is deemed irrelevant, similar results are also “demoted” in real time based on the algorithms used.
An embodiment may use any number of methods from a large universe to calculate similarities and update metrics. Matrix factorization methods are common examples of methods that reduce and rotate the dimensions so that similarity metrics do not overfit the data. Factorization can be thought of as creating a new model or spatial transformation that can be used to calculate the angle between vectors to measure similarity between points in a space, in this case a semantic space. Singular value decomposition (SVD) may be applied as discussed below.
Another approach includes a generalized method of factorization called multi-relational matrix factorization (MRMF). Lippert and colleagues describe this algorithm for jointly decomposing matrices of varied dimensionality to exploit correlations between an arbitrary number of data objects represented as matrices, potentially including data representations of characteristics such as linkages between object types and temporal dynamics of data.2 These approaches typically allow feedback data, such as ratings information, to be incorporated into the spatial rotations. MRMF is a generalized method. One of the most commonly applied methods that is a special case of MRMF is nonnegative matrix factorization (NMF) (William, 1971; Paatero, 1994, both of which are herein incorporated by reference in their entirety). Related approaches have recently received publicity in relation to the Netflix Prize, a contest for developing algorithms for recommending movies most similar to individuals' interests based on ratings histories.3 This type of recommendation based on similarity across users is often referred to as "collaborative filtering."

2 Lippert, C.; Weber, S. H.; Huang, Y.; Tresp, V.; Schubert, M. & Kriegel, H.-P. (2008), Relation-Prediction in Multi-Relational Domains using Matrix-Factorization, in 'NIPS 2008 Workshop: Structured Input-Structured Output', which is herein incorporated by reference in its entirety.

3 Robert Bell, Yehuda Koren, and Chris Volinsky. The BellKor 2008 solution to the Netflix Prize, December 2008. http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf, which is herein incorporated by reference in its entirety.
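As one concrete, illustrative stand-in for this family of factorization methods, the following Python fragment applies scikit-learn's NMF implementation to a toy term-document matrix; the dimensions and parameters are arbitrary:

```python
# Illustrative nonnegative matrix factorization (NMF) of a toy
# term-document matrix; toy sizes, not the system's actual data.
import numpy as np
from sklearn.decomposition import NMF

tdm = np.random.rand(1000, 60)           # m terms x n documents
model = NMF(n_components=50, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(tdm)             # m x 50 term-factor loadings
H = model.components_                    # 50 x n factor-document loadings
# documents can then be compared, e.g., by cosine similarity of H's columns
```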
In various embodiments, the system 10 embodies the human process described in the background in a tool created to match and improve the ability to identify and rank documents from a corpus based on similarity to personalized participant profiles. The opportunities in one of the embodiments relate to funding opportunities that may be available in different online databases or online web pages. The group of applicants in one of the embodiments may, for example, include university researchers in a particular college, department, etc. Further, the system 10 matches the participants with each other, both generally and in very specific contexts, and allows them to collaborate in solving a common challenge. The matching of participants occurs based on their commonality of general interests, or based on specific opportunities being pursued where complementary skills may be needed. After matching the participants, the system 10 facilitates collaboration by allowing participants to exchange relevant information provided by the participants (e.g., resume, webpage, etc.) and initiate communications (e.g., e-mail). The system 10 also allows for matching policy makers to policy-relevant literature, ranking of candidates for specific jobs based on resumes, and other matching based on information provided in text.
Customized extract, transform, load (ETL) programs are scheduled to run on a nightly basis to collect data from sources where funding opportunities are published. These data are collected through direct queries of internal databases, through downloads of RFP databases via FTP, or by programmatically downloading web pages from RFP sites that are based on templates with formatted fields corresponding to relevant data elements such as RFP title and RFP funding level (commonly called "web scraping"). The text in RFPs is statistically compared to text in applicants' profiles to calculate similarity between each prospective applicant's profile and the funding opportunity description. Since the texts of the documents used for matching have a very large vocabulary, a number of methods are used to project the text into a smaller-dimensional vocabulary space. This can be accomplished by using singular value decomposition, non-negative matrix factorization, MRMF, artificial neural networks, and/or the like. Profiles are generated based on both user input and source databases. In addition, recommendations for potential collaborators whose profiles are also available are generated using the same distance calculations. The data sources in the system 10 include multiple organizational databases containing researcher information, researcher keywords, and past publications, as well as the assembled database of funding opportunities translated from network data sources.
According to various embodiments, there are two types of data objects used to calculate similarity: (1) the ratings each user has assigned to each RFP, and (2) the terms contained in the text documents. Once defined, similarity calculation programs are executed as new data come in, in order to populate tables of distances between researchers and funding opportunities and between researchers and other researchers.
For the purposes of this description, the expression "document" refers to the text contained in a researcher's list of keywords, publications, and past proposals, or to the text of a funding opportunity. The documents are used to create a model of semantic space that will be used to calculate similarity between the documents in that space. The space is a projection of a Term-Document Matrix (TDM), an example of which is shown in Table 2. A raw term-document matrix has a column for each document and a row for each term, where a term is generally a feature of the document, in particular a word or a phrase contained in the document. Each cell represents a measure of how frequently a term appears in each document. The dimensions of this matrix are m×n, with n=n1+n2, where m is the total number of unique terms, tokens, or features of the RFPs and all researcher profiles, n1 the number of funding opportunities, and n2 the number of researchers. In some embodiments, each term may be down-weighted by its commonality, and the document length may be normalized.
$A^*_1\ (m \times n_1)$ = term-document matrix of funding opportunities;
$A^*_2\ (m \times n_2)$ = term-document matrix of researcher expertise content;
$A^*\ (m \times (n_1 + n_2))$ = combined term-document matrix.
The system 10 pre-processes documents before constructing the TDM, removing common words, conflating words that have common linguistic roots, and adding phrases, to improve the performance of the methods. Such pre-processing methods are described below.
The system 10 may implement a stoplist to exclude a list of common words like “is,” “have,” “it,” etc. from the analysis. A customized list may be used for this purpose.
The system 10 may also maintain a customized list of terms that are used as filters to eliminate documents from the TDM and result set. These terms may include “SBIR,” “Fellowship,” “Mentorship,” and/or the like.
In various embodiments, the system maintains a table of tokenized phrases in the table_app_tokenized_words table. Multi-word phrases that appear as keywords for more than three researchers are added to a library of tokens. If a tokenized phrase appears in a document, the count for each word is incremented, as well as the count for the token. Some examples of tokenized terms in the database include: Alcohol marketing; Life expectancy; Multiple imputation; Bosnian refugees; Urban youth; Mental disorder; Updated recommendations; Los Angeles County; Medicare managed care; and Chronic care. In practice, tokens do not have to represent text content. A token can represent any type of meta-data or feature associated with a weighted value in content of interest; for example, features could include tokens for funding agencies. Weights for funding agency tokens assigned to researchers would be proportional to past funding from that source; weights assigned to funding agency tokens in RFPs would be set according to the RFP's funding source(s). Other examples include tokens that represent co-authorship or citation ties between researchers. The purpose of weights is to indicate how strongly a particular token is associated with a researcher.
In various embodiments, the system 10 applies a Porter stemming algorithm (van Rijsbergen, 1980^4) to retain parity between the conceptual meaning of words like "screening" and "screen."

4 C. J. van Rijsbergen, S. E. Robertson and M. F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587), which is herein incorporated by reference in its entirety.
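Putting the stoplist, phrase tokenization, and stemming steps together, a minimal Python sketch of such a pre-processing pipeline might look as follows; the word lists are illustrative, not the system's customized lists:

```python
# Illustrative pre-processing: stoplist removal, phrase tokenization,
# and Porter stemming. STOPLIST and PHRASES are toy examples.
from nltk.stem.porter import PorterStemmer

STOPLIST = {"is", "have", "it", "the", "a", "of", "for", "in"}
PHRASES = {("life", "expectancy"), ("urban", "youth")}

def preprocess(text: str) -> list[str]:
    stemmer = PorterStemmer()
    words = [w for w in text.lower().split() if w not in STOPLIST]
    tokens = [stemmer.stem(w) for w in words]     # stem each single word
    for a, b in zip(words, words[1:]):
        if (a, b) in PHRASES:
            tokens.append(f"{a}_{b}")             # also count the phrase token
    return tokens

print(preprocess("Screening for life expectancy in urban youth"))
```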
Historically, several methods have been applied for weighting the cells in the TDM in order to adjust for how frequently terms appear within a document or globally over the entire collection of documents. For any given method created for similarity calculation, the domain expert would select the best weighting approach for his or her purposes and the characteristics of the data sources. Salton and Buckley give a thorough treatment of this topic.5

5 Salton, Gerard and Buckley, C. (1988). "Term-weighting approaches in automatic text retrieval". Information Processing & Management 24 (5): 513-523, which is herein incorporated by reference in its entirety.
The system 10 may implement the MATLAB TMG package (Berry, 1999^6; Kolda, 1997^7), which offers many programs for semantic analysis, clustering, and classification. Table B is a snippet of code used to invoke TMG to create a term-document matrix with the desired parameters and weight the entries as appropriate.

6 M. Berry and M. Browne, Understanding Search Engines, Mathematical Modeling and Text Retrieval, Philadelphia, Pa.: Society for Industrial and Applied Mathematics, 1999, which is herein incorporated by reference in its entirety.

7 T. Kolda, Limited-Memory Matrix Methods with Applications, Tech. Report CS-TR-3806, 1997, which is herein incorporated by reference in its entirety.
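Table B itself is not reproduced here; as an illustrative Python analogue of such a TMG invocation, the following fragment builds a weighted term-document matrix with logarithmic term frequencies, down-weighting by commonality (inverse document frequency), and document-length normalization:

```python
# Illustrative analogue of building a weighted TDM; toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "community alcohol marketing study",    # RFP texts (n1 columns)
    "medicare managed care evaluation",
    "alcohol policy and urban youth",       # researcher profiles (n2 columns)
]
vec = TfidfVectorizer(sublinear_tf=True, norm="l2")  # log tf, idf, normalized
tdm = vec.fit_transform(docs).T             # m terms x n documents (sparse)
print(tdm.shape, vec.get_feature_names_out()[:5])
```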
Because the number of terms can be large (10,000+) and the number of documents (e.g., RFPs) can be large, the data for assigning RFPs to researchers are very noisy and thus potentially prone to error. To deal with this, the system 10 implements a technique invented at Bellcore (see Deerwester, Dumais, Furnas, Landauer & Harshman, 1990^8) called "latent semantic indexing" (LSI). To extract the underlying semantic information from these documents, the system 10 needs to avoid basing researcher-RFP connections on the idiosyncrasies of individual documents and maintain only the most important underlying structure of the original TDM. LSI applies a singular value decomposition (SVD) to the TDM, A*, and selects the p most influential singular vectors to give a lower-rank approximation to the original term-document matrix. More specifically, SVD will approximate A* by the corresponding p-dimensional singular value decomposition into the product of three matrices,
$$A^*_{m \times n} \approx T_{m \times p} \times S_{p \times p} \times (V_{n \times p})',$$
where m = number of terms, n = number of documents ($n_1$ RFPs and $n_2$ researchers), and p is the number of singular values used for the decomposition, with $p \le \operatorname{rank}(A^*) \le \min(m, n)$. Here T has orthogonal column vectors referred to as the left singular vectors, and similarly V consists of orthogonal unit vectors known as the right singular vectors. S is a diagonal matrix of positive singular values in decreasing order.

8 S. Deerwester, S. T. Dumais, G. W. Furnas et al., "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September, 1990, which is herein incorporated by reference in its entirety.
The choice of p is somewhat empirical; the size of p depends on how close an approximation is desired and how different in magnitude the singular values are from each other. If p is chosen to be equal to the rank of A*, the decomposition reproduces A* exactly (corresponding to key-word searching). The rows of the T matrix represent terms, and the rows of the V matrix represent the documents ($n_1$ RFPs and $n_2$ researchers) in the same p-dimensional space.
Thus any document d (a column of A* representing a researcher or RFP) can be approximated by $\hat{d}$, a p-dimensional vector of the terms weighted by S:

$$\hat{d}_p = d_{1 \times m} \times T_{m \times p} \times S_{p \times p}^{-1}.$$
Projecting a new document that was not part of the original corpus is referred to as "folding-in." By reducing the dimensionality of the space to p, the aim is to eliminate noise from A* that is not informative about how different documents are related to one another, and to create a space composed of "concepts" or "factors" of weighted terms from the texts. A rough "rule of thumb" is to set p to 500 for medium-sized documents; that choice strikes a balance between noise reduction and efficacy.
The rows of V are document vector coordinates, so cosine similarities between any two documents can be computed as a similarity measure. Likewise, the rows of T are term vector coordinates, so cosine similarities between any two terms can be computed as a similarity measure.
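By way of illustration, the following Python sketch, with toy dimensions and numpy in place of the system's MATLAB/TMG code, computes the truncated SVD, folds a new document into p-space per the equation above, and compares document vectors by cosine similarity:

```python
# Illustrative LSI: truncated SVD, folding-in, and cosine similarity.
import numpy as np

A = np.random.rand(200, 30)                  # toy m x n term-document matrix
p = 10                                       # retained dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T, S, V = U[:, :p], np.diag(s[:p]), Vt[:p].T # A is approximated by T S V'

def fold_in(d: np.ndarray) -> np.ndarray:
    """d_hat = d T S^-1 projects a 1 x m document vector into p-space."""
    return d @ T @ np.linalg.inv(S)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_new = fold_in(np.random.rand(200))         # a document not in the corpus
print(cosine(d_new, V[0]))                   # compare with document 0's row of V
```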
The general case, an arbitrary query string q, can be represented as a frequency count of each of the terms in T present in that query and projected into this "p space" defined by S and T:

$$\hat{q}_p = q_{1 \times m} \times T_{m \times p} \times S_{p \times p}^{-1}.$$
At this point the query is analogous to any document row of V, and can be compared directly with a similarity measure. In this embodiment, users are also able to construct such queries in a search box in order to search for documents of interest.
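Continuing the illustrative sketch above, query handling folds the query's term counts into p-space and ranks the corpus by similarity; fold_in and cosine may be any implementations, such as those sketched earlier:

```python
# Illustrative query ranking: fold q into p-space, then sort documents.
def rank_documents(query_counts, V, fold_in, cosine):
    q_hat = fold_in(query_counts)                 # q_hat = q T S^-1
    scores = [cosine(q_hat, v_row) for v_row in V]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```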
If two documents (such as an RFP and a researcher's keywords) are similar, then the pattern of their term frequency vector will be similar. By taking the inner product of their term frequency vectors, a larger value is obtained than if they were dissimilar. The similarity is this inner product between two documents and is calculated as
$$\mathrm{sim}(d_1, d_2) = \hat{d}_{1,p} \, S_p^2 \, \hat{d}_{2,p}'.$$
This is used for ranking the RFPs $\hat{d}_{1,p}$ for a given researcher (represented by $\hat{d}_{2,p}$) in the absence of feedback (e.g., 3 in
The MATLAB code for the singular value decomposition and the construction of a lower-rank approximation to the original matrix is shown in Table C, which uses a singular value decomposition with 400 dimensions.
As described above, the cosine similarity is based on the normalized dot-product of the vectors of the TDM or reduced rank TDM. TMG provides the function VSM, short for “vector space model,” which can be used to calculate the normalized dot product. This calculation is conducted for every researcher and document. Using the TMG package in MATLAB, the calculation is implemented as shown in Table D.
The aim of the funding opportunity-researcher matching use case is to predict the funding opportunities of highest interest to researchers based on content. At the initial stage, without any feedback from any of the users, the system 10 uses the similarities between the researcher terms projected into p-space and document terms projected into p-space. However, feedback from users (in the form of users' personal RFP ratings and/or application history) may be used to customize the projection of researcher terms so that the distance between the customized projection and highly rated RFPs is minimized. Two key example models for incorporating this calibration data are described here. For this, the following notation is needed.
Let $A^* = [A^*_1, A^*_2]_{m \times (n_1+n_2)}$, where $A^*_1$ = RFPs, $A^*_2$ = researchers, and $n = n_1 + n_2$.
As above, A* is approximated by

$$A = [A_1\ A_2] = T S V', \quad T_{m \times p}, \quad S_{p \times p} = \operatorname{diag}(s_1, \dots, s_p),$$

$$T' = [t_1', \dots, t_m'], \quad T = [u_1, \dots, u_p], \quad V = V_{(n_1+n_2) \times p} = [v_1, \dots, v_p],$$

so that $A = \sum_{k=1}^{p} s_k u_k v_k'$.
Now, decompose V', the transpose of the V matrix, as

$$V' = [F, R]; \quad F = [f_1, \dots, f_{n_1}], \quad R = [r_1, \dots, r_{n_2}].$$
F and R represent the documents corresponding to the $n_1$ RFPs and $n_2$ researchers, and T' represents the m terms, respectively, in the same p-dimensional concept space.
$$T'T = I, \qquad V'V = FF' + RR' = [v_i' v_j] = I,$$

by the orthogonality of the u's and v's.
The similarity relationship between RFPs and researchers is given by

$$A_1' A_2 = F' S T' T S R = F' S^2 R = [c_{i,j}],$$

where $c_{i,j}$ is the weighted inner product of the ith RFP and the jth researcher, weighted by the squared singular values. Similarly, the researcher-to-researcher relationship is given by $A_2' A_2$.
Thus the similarity between the ith RFP and the jth researcher is

$$c_{i,j} = \sum_{k=1}^{p} s_k^2 f_{i,k} r_{j,k}, \quad i = 1, \dots, n_1,\ j = 1, \dots, n_2. \tag{1}$$
Initially, the similarity between researcher and RFP is based on $c_{i,j}$. As researchers reveal their preferences, feedback is used to modify $c_{i,j}$ and $r_{j,k}$ to $c^{new}_{i,j}$ and $r^{new}_{j,k}$, respectively. For these purposes, the following two methods are used, one based on a statistical learning model, and the other based on a nearest neighbor smoothing method. Descriptions of each are provided below. For this, more notation is needed. Let $y_{i,j} = 1$ if the jth researcher rated the ith RFP favorably, and $y_{i,j} = 0$ if he or she rated it unfavorably. Let $M_j$ = the set of RFPs for which preference is known for the jth researcher. We now learn in this content-based model, while keeping all RFPs, words, and researchers in the same p space, in the following two ways.
Method 1. A number of statistical models are considered. We assume that the $y_{i,j}$ are Bernoulli$(1, \gamma_{i,j})$, where $\gamma_{i,j} = \operatorname{Prob}\{y_{i,j} = 1\} = 1 - \operatorname{Prob}\{y_{i,j} = 0\}$. The model is $\operatorname{logit}(\gamma_{i,j}) = c^{new}_{i,j}$, i.e., $\gamma_{i,j} = \exp(c^{new}_{i,j})/(1 + \exp(c^{new}_{i,j}))$. Then

$$\text{Log Likelihood}(y_{i,j} \mid c^{new}_{i,j}) = \sum_{i,j} \{ y_{i,j} \log(\gamma_{i,j}) + (1 - y_{i,j}) \log(1 - \gamma_{i,j}) \}. \tag{2}$$
This would allow us to do a number of diagnostics in terms of how good the model fits, etc.
The model is now completely specified once $c^{new}_{i,j}$ is prescribed. Initially, without any feedback, we initialize with $c^{new}_{i,j} = c_{i,j} = \sum_{k=1}^{p} s_k^2 f_{i,k} r_{j,k}$ as above, and our ranking of RFPs for a given researcher and researcher-to-researcher ranking are based on it as before.
After observing $y_{i,j}$, we set $c^{new}_{i,j} = \alpha + \beta \sum_{k=1}^{p} s_k^2 f_{i,k} r_{j,k} (1 + \theta_{j,k})$, where $\alpha$, $\beta$, and $\theta_{j,k}$ are parameters to be estimated.
The model has the intuitive behavior that the jth researcher's kth component $r_{j,k}$ moves by an additive factor $\theta_{j,k} r_{j,k}$ toward the positively rated RFPs and away from the negatively rated RFPs.
Thus,

$$c^{new}_{i,j} = z_{i,j} + \beta \sum_{k=1}^{p} s_k^2 f_{i,k} r_{j,k} \theta_{j,k}, \tag{3}$$

where

$$z_{i,j} = \alpha + \beta c_{i,j}. \tag{4}$$
That is, each researcher has free parameters $\theta_{j,k}$, $k = 1, \dots, p$. In effect, there are $2 + n_2 p$ new parameters that can be estimated by the maximum likelihood method using the likelihood given in (2). The performance of this approach may not be satisfactory, since the number of parameters, $2 + n_2 p$, is large. To reduce this, we penalize the likelihood when the $\theta_{j,k}$ are non-zero. Toward this end, we use
$$\text{Regularized Log Likelihood} = \sum_{i,j} \{ y_{i,j} \log(\gamma_{i,j}(\theta)) + (1 - y_{i,j}) \log(1 - \gamma_{i,j}(\theta)) \} - \lambda \sum_{j,k} f(\theta_{j,k}), \tag{5}$$

where $\log(\gamma_{i,j}(\theta)/(1 - \gamma_{i,j}(\theta))) = z_{i,j} + \beta \sum_{k=1}^{p} s_k^2 f_{i,k} r_{j,k} \theta_{j,k}$ and $z_{i,j}$ is given in (4).
Here, f is a convex function: $f(x) = x^2$ would be similar to ridge regression, while $f(x) = |x|$ would give rise to a Lasso type of procedure. We use $f(x) = |x|$ with $\lambda$ determined by cross-validation.
For estimating this model, we use the coordinate descent algorithm as implemented in the glmnet package in R (Friedman, 2010^9). As the result of this model, after the feedback, one obtains a new p-dimensional representation of each researcher:

$$r_j^{(new)} = [r_{j,k}^{(new)}], \quad \text{where } r_{j,k}^{(new)} = r_{j,k}(1 + \theta_{j,k}). \tag{6}$$

9 Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22, which is herein incorporated by reference in its entirety.
Method 2 (Nearest Neighbor Method). This method is a simpler version of Method 1, but it does not fit a formal statistical model and consequently has fewer parameters. After observing $y_{i,j}$ for i in $M_j$, with cardinality $m_j$, we modify $r_{j,k}$ to $r_{j,k}^{(new)}$ by
$$r_{j,k}^{(new)} = r_{j,k}\,\theta_j + \frac{1 - \theta_j}{m_j} \sum_{i \in M_j} f_{i,k}\,(2 y_{i,j} - 1), \quad j = 1, \dots, n_2. \tag{7}$$

Again, $r_j^{(new)} = [r_{j,k}^{(new)}]$.
Thus, instead of $2 + p n_2$ parameters, we only have $n_2$ parameters. For researchers who have given no feedback, $\theta_j = 1$. We determine $\theta_j$ by cross-validation on the observed feedback. If the feedback data are very scant, then we reduce the number of parameters to 1 by assuming $\theta_j = \theta$ for all j.
In experiments with Method 2 on such scant data, a common θ is assigned for all researchers, determined by cross-validation on previously collected rating data. Iterating over each researcher's ratings with a positively rated test case "held out" in each repetition, similarity is calculated for the rated RFPs using various values of θ to generate a rank-ordered list. The value of θ that generates similarity measures giving the lowest average rank (greatest similarity) to the positively rated test cases is selected.
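A minimal Python sketch of Method 2, assuming a plain inner product as a stand-in for the similarity of equation (1), illustrates both the update of equation (7) and the hold-out selection of a common θ:

```python
# Illustrative Method 2: equation (7) update with a common theta chosen
# by holding out one positively rated RFP at a time.
import numpy as np

def update_researcher(r_j, F, rated, y, theta):
    """Equation (7): r_j is (p,), F is (p, n1), y maps RFP index -> 0/1."""
    shift = sum(F[:, i] * (2 * y[i] - 1) for i in rated) / len(rated)
    return theta * r_j + (1 - theta) * shift

def pick_theta(r_j, F, rated, y, thetas=np.linspace(0.0, 1.0, 11)):
    """Return the theta whose rankings place held-out positives highest."""
    sim = lambda a, b: float(a @ b)      # stand-in similarity measure
    best_theta, best_rank = 1.0, float("inf")
    positives = [i for i in rated if y[i] == 1]
    for theta in thetas:
        ranks = []
        for held in positives:
            rest = [i for i in rated if i != held]
            if not rest:
                continue
            r_new = update_researcher(r_j, F, rest, y, theta)
            order = sorted(rated, key=lambda i: -sim(r_new, F[:, i]))
            ranks.append(order.index(held))
        if ranks and np.mean(ranks) < best_rank:
            best_theta, best_rank = theta, float(np.mean(ranks))
    return best_theta
```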
As stated above, the same system may use different distance calculation algorithms in different contexts. In the case of these two example methods, the first method can be used when time and/or CPUs are available for calculation and model estimation. The second method calculates more quickly and thus can be applied in settings when speed in updating results is an important requirement.
Both of the above methods give a new p-dimensional representation $r_j^{(new)} = [r_{j,k}^{(new)}]$ of each researcher, by equations (6) and (7) respectively.
Given $r_j^{(new)}$, we can compute, after the feedback, $R^{(new)} = [r_1^{(new)}, \dots, r_{n_2}^{(new)}]$ and $A_2^{(new)} = T S R^{(new)}$. Recall that $A_1 = T S F$, with $F = [f_1, \dots, f_{n_1}]$.
Finally, $A^{(new)} = [A_1, A_2^{(new)}]$ is the new estimated TDM.
Experiments: In the pilot deployment, 52 researchers rated over 900 unique RFPs that were presented to them in the user interface X described above. Using these data, we also conducted offline calculations with the two different methods described above for calculating similarity between researchers and RFPs. Two important characteristics of information retrieval performance are precision, the fraction of retrieved RFPs that are relevant, and recall, the fraction of relevant RFPs that are retrieved, for a given list length or distance measure threshold.
These two methods serve as examples, and do not cover the full range of algorithms or models for optimizing recommendations given rated matches between content and users.
The benefits of the system 10 are targeted at the researcher. However, to the extent that the system 10 matches researchers with appropriate research projects, it also benefits research institutions, the organizations that issue RFPs, and, potentially, the quality of the research.
The system 10 was developed with the intention of adaptation to any application where objects with common features are to be matched and presented or visualized. Thus, many uses beyond researcher-RFP matching can be implemented using the same system. For example, RFPs might be substituted with journal articles or congressional bills and researcher expertise might be substituted with legal dockets in order to identify how research and laws affect court decisions. Various embodiments could involve real time alerts sent out by organizational units as twitter feeds of “tweet-able” moments in congressional sessions relevant to recipients' interests, or automatic updates of which research is featured on an organization's website in response to current events. Under the current software architecture, these substitutions only require adapting the data sources and outputs and, optionally, adding source-specific features as desired.
It is understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. Such hardware, software, firmware, or any combination thereof may be part of or implemented with any one or combination of the server 14 (refer to
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Listed below are the programs and web sites from which the system obtains active RFPs. They are interoperable with source data as of Jun. 26, 2010.
bidsync_parse.py—iterates over HTML of RFPs published on the website http://www.bidsync.com. BidSync is a comprehensive system that public agencies use to organize, automate, and manage their entire eProcurement processes. By using the BidSync system to process and receive bids, an agency can recognize an immediate increase in productivity and efficiency. According to BidSync, agencies nationwide save upwards of 90 percent of the time that they spend on the bidding process and recognize monetary savings of up to 70 percent. BidSync's bidding system dramatically reduces bid management time and administrative requirements, and improves efficiency for all who participate in the bidding processes.
fbo_parse.py—iterates over HTML of RFPs published on the website http://fbo.gov. This script accesses the page at https://www.fbo.gov and extracts all the links to RFPs in the “bidding” phase. Effective Jun. 25, 2001, the Federal government implemented Section 508 of the Rehabilitation Act of 1973, Amendments of 1998 (29 U.S.C. S 794(d)). Section 508 requires that the federal government only acquire electronic and information technology goods and services that provide for access by persons with disabilities. For more information, see www.section508.gov. Under “Buy Accessible,” a partnership between government and industry, the Information Technology Industry Council (ITI) is hosting a Voluntary Product Accessibility Template on their site. It allows vendors who choose to participate the ability to copy the template and complete it to describe how a particular product or service they offer conforms to Section 508 Access Board standards. This template should be placed on the vendor's accessible web site and the link to the template provided to the Buy Accessible database. Government procurement staff will be able to search the site by specific product or service type and see all vendors who have provided links. They can then use the links to reach the template information and product or service descriptions necessary to complete their market research.
grants_parse.py—The grants.gov website publishes an XML dump of all their RFPs at http://www07.grants.gov/search/XMLExtract.do, which this script accesses to download the zip file and extract the XML file, which is then parsed in order to add/update the RFPs in our database. Grants.gov simplifies the grants management process and creates a centralized, online process to find and apply for over 900 grant programs from the 26 federal grant-making agencies. Grants.gov streamlines the process of awarding over $360 billion annually to state and local governments, academia, not-for-profits and other organizations. This program is one of the 24 federal cross-agency E-Government initiatives focused on improving access to services via the Internet. The vision for Grants.gov is to be a simple, unified source to electronically find, apply for, and manage grant opportunities.
labavn_parse.py—This script accesses the page at http://www.labavn.org/index.cfm?fuseaction=contract.contract_list and extracts all the links to RFPs. The LABAVN site usually does not offer an estimated funding amount, but listings may have additional documents that contain more information on their webpage. The Business Assistance Virtual Network (BAVN) is a free service provided by the City of Los Angeles Office of Small Business Services and Minority Business Opportunity Committee. BAVN allows you to view and download information about all bid opportunities offered by the City of Los Angeles in one convenient location, as well as find up-to-date certified sub-contractors to complement your project bid.
metro_parse.py—This script accesses the page at http://www.metro.net/EBB/bids1.asp and extracts all the links to listings that have an "RFP" type. The RFPs on the Metro site don't offer an estimated funding amount. Metro.net is the website for the Los Angeles County public transportation system. Some of Metro's procurements are for complex, specialized transportation equipment, but like any large company we also need office supplies, consulting services, paint, uniforms, practically anything you can think of. We buy from small vendors and multinational corporations.
pnd_parse.py—This program extracts the links at http://foundationcenter.org/pnd/rfp/. These RFPs are sent in to Philanthropy News Digest, which posts them, along with a link for more info. The award amounts are not given.
rfpdb_parse.py—This script accesses the page at http://www.rfpdb.com/ and extracts all the links to RFPs. Since this site requires registration, this script does not extract much data. If all the RFPs on the page are new, then the next page of RFPs is parsed after a 60-second delay. Since all the data on the individual RFP pages are available from the list view, the separate pages are not accessed as in other scripts, but the data is extracted from the list of RFPs.
scag_parse.py—This script accesses the page at http://www.planetbids.com/SCAG/QuickSearch.cfm and extracts all the links to RFPs in the "bidding" phase. The RFPs on SCAG do not offer an estimated funding amount, but they may have additional documents that contain more information on their webpage.
trb_parse.py—This parser extracts the links at http://144.171.11.40/cmsfeed/trbnet.asp?s=3&r=5. The RFP pages have a table of information at the top, from which some of the data are extracted. A body of text follows, which varies in HTML formatting, so textual markers are used instead to extract the description. There are additional notes on the web pages that are not specific to any one RFP.
Embodiments of the present invention relate to U.S. Provisional Patent Application 61/373,781 filed on Aug. 13, 2010, and entitled “REQUESTS FOR PROPOSALS MANAGEMENT SYSTEMS AND METHODS,” which is incorporated herein in its entirety and forms a basis for a claim of priority.