The present disclosure relates generally to machine learning, and more particularly to methods and systems for training neural language models such as ranking models for information retrieval.
For neural information retrieval (IR), it would be useful to improve first-stage retrievers in ranking pipelines. For instance, while bag-of-words (BOW) models remain strong baselines for first-stage retrieval, they suffer from the longstanding vocabulary mismatch problem, in which relevant documents might not contain terms that appear in the query. Thus, there have been efforts to substitute standard BOW approaches by learned (neural) rankers.
Pretrained language models (LMs) such as those based on Bidirectional Encoder Representations from Transformers (BERT) models are increasingly popular for natural language processing (NLP) and for re-ranking tasks in information retrieval. LM-based neural models have shown a strong ability to adapt to various tasks by simple fine-tuning. LM-based ranking models have provided improved results for passage re-ranking tasks. However, LM-based models introduce challenges of efficiency and scalability. Because of strict efficiency requirements, LM-based models conventionally have been used only as re-rankers in a two-stage ranking pipeline, while a first stage retrieval (or candidate generation) is conducted with BOW models that rely on inverted indexes.
There is a desire for retrieval methods in which most of the involved computation can be done offline and where online inference is fast. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors (ANN) methods has shown good results, but such methods have still been combined with BOW models (e.g., combining both types of signals) due to their inability to explicitly model term matching.
There has been a growing interest in learning sparse representations for queries and documents. Using sparse representations, models can inherit desirable properties from BOW models such as exact-match of (possibly latent) terms, efficiency of inverted indexes, and interpretability. Additionally, by modeling implicit or explicit (latent, contextualized) expansion mechanisms, similarly to standard expansion models in IR, models can reduce vocabulary mismatch.
Dense retrieval based on BERT Siamese models is a standard approach for candidate generation in question answering and information retrieval tasks. An alternative to dense indexes is term-based ones. For instance, building on standard BOW models, Zamani et al. disclosed SNRM, in which a model embeds documents and queries in a sparse high-dimensional latent space using L1 regularization on representations. However, SNRM's effectiveness has remained limited.
More recently, there have been attempts to transfer knowledge from pretrained LMs to sparse approaches. For example, based on BERT, DeepCT (Dai and Callan, 2019, Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval, arXiv:1910.10687 [cs.IR]) focuses on learning contextualized term weights in the full vocabulary space, akin to BOW term weights. However, as the vocabulary associated with a document remains the same, this type of approach does not address vocabulary mismatch, as acknowledged by the use of query expansion for retrieval.
Another approach is to expand documents using generative methods to predict expansion words for documents. Document expansion adds new terms to documents, thus fighting the vocabulary mismatch, and repeats existing terms, implicitly performing reweighting by boosting important terms. Current methods, though, are limited by the way in which they are trained (predicting queries), which is indirect in nature and limits their progress.
Still another approach is to estimate the importance of each term of the vocabulary implied by each term of the document; that is, to compute an interaction matrix between the document or query tokens and all the tokens from the vocabulary. This can be followed by an aggregation mechanism that allows for the computation of an importance weight for each term of the vocabulary, for the full document or query. However, current methods either provide representations that are not sparse enough to provide fast retrieval, and/or they exhibit suboptimal performance.
Provided herein, among other things, are methods implemented by a computer having a processor and memory for providing a representation of an input sequence over a vocabulary in a ranker of a neural information retrieval model. The input sequence may be, for instance, a query or a document sequence. The input sequence is tokenized using the vocabulary, and each token of the tokenized input sequence is embedded based at least on the vocabulary to provide an embedded input sequence of tokens. An importance (e.g., weight) of each token over the vocabulary is predicted with respect to each token of the embedded input sequence. A predicted term importance of the input sequence is determined as a representation of the input sequence over the vocabulary by performing an activation over the embedded input sequence. The embedding and the determining of a prediction are performed by a pretrained language model. The term importance is output as the representation of the input sequence over the vocabulary in the ranker of the neural information retrieval model.
Other embodiments provide, among other things, a neural model implemented by a computer having a processor and memory for providing a representation of an input sequence over a vocabulary in a ranker of a neural information retrieval model. The input sequence may be, for instance, a query or a document sequence. A pretrained language model layer is configured to embed each token in a tokenized input sequence based on the vocabulary and contextual features to provide context embedded tokens, and to predict an importance (e.g., weight) with respect to each token of the embedded input sequence over the vocabulary by transforming the context embedded tokens using one or more linear layers. The tokenized input sequence is tokenized using the vocabulary. A representation layer is configured to receive the predicted importance with respect to each token over the vocabulary and obtain a representation of importance (e.g., weight) of the input sequence over the vocabulary. The representation layer can comprise a concave activation layer configured to perform a concave activation of the predicted importance over the embedded input sequence. The representation layer may output the predicted term importance of the input sequence over the vocabulary in the ranker of the neural information retrieval model. The predicted term importance of the input sequence can be used to retrieve a document.
Other embodiments provide, among other things, a computer implemented method for training of a neural model for providing a representation of an input sequence over a vocabulary in a ranker of an information retrieval model. The training may be part of an end-to-end training of the ranker or the IR model. The neural model is provided with: i) a tokenizer layer configured to tokenize the input sequence using the vocabulary; ii) an input embedding layer configured to embed each token of the tokenized input sequence based at least on the vocabulary; iii) a predictor layer configured to predict an importance (e.g., weight) for each token of the input sequence over the vocabulary; and iv) a representation layer configured to receive the predicted importance with respect to each token over the vocabulary and obtain predicted importance (e.g., weight) of the input sequence over the vocabulary. The input embedding layer and the predictor layer may be embodied in a pretrained language model. The representation layer may comprise a concave activation layer configured to perform a concave activation of the predicted importance over the input sequence. In an example training method, parameters of the neural model are initialized, and the neural model is trained using a dataset comprising a plurality of documents. Training the neural model jointly optimizes a loss comprising a ranking loss and at least one sparse regularization loss. The ranking loss and/or the at least one sparse regularization loss can be weighted by a weighting parameter.
According to a complementary aspect, the present disclosure provides a computer program product, comprising code instructions to execute a method according to the previously described aspects; and a computer-readable medium, on which is stored a computer program product comprising code instructions for executing a method according to the previously described embodiments and aspects. The present disclosure further provides a processor configured using code instructions for executing a method according to the previously described embodiments and aspects.
Other features and advantages of the invention will be apparent from the following specification taken in conjunction with the following drawings.
The accompanying drawings are incorporated into the specification for the purpose of explaining the principles of the embodiments. The drawings are not to be construed as limiting the invention to only the illustrated and described embodiments or to how they can be made and used. Further features and advantages will become apparent from the following and, more particularly, from the description of the embodiments as illustrated in the accompanying drawings, wherein:
In the drawings, reference numbers may be reused to identify similar and/or identical elements.
It is desirable to provide neural ranker models for ranking (e.g., document ranking) in information retrieval (IR) that can generate (vector) representations sparse enough to allow the use of inverted indexes for retrieval (which is faster and more reliable than methods such as approximate nearest neighbor (ANN) methods, and enables exact matching), while performing comparably to neural IR representations using dense embedding (e.g., in terms of performance metrics such as MRR (Mean Reciprocal Rank) and NDCG (Normalized Discounted Cumulative Gain)).
Example neural ranker models can combine rich term embeddings such as can be provided by trained language models (LMs) such as Bidirectional Encoder Representations from Transformers (BERT)-based LMs, with sparsity that allows efficient matching algorithms for IR based on inverted indexes. BERT-based language models are commonly used in natural language processing (NLP) tasks, and are exploited in example embodiments herein for ranking.
Example systems and methods can provide sparse representations (sparse vector representations or sparse lexical expansions) of an input sequence (e.g., a document or query) in the context of IR by predicting a term importance of the input sequence over a vocabulary. Such systems and methods can provide, among other things, expansion-aware representations of documents and queries.
An example pretrained LM, that is trained using a self-supervised pretraining objective, such as via masked language modeling (MLM) methods, can be used to determine a prediction of an importance (or weight) for an input sequence over the vocabulary (term importance) with respect to tokens of the input sequence. A final representation providing the predicted importance of the input sequence over the vocabulary can be obtained by performing an activation that includes a concave function to prevent some terms from dominating. Example concave activation functions can provide a log-saturation effect, while others can use functions such as radical functions (e.g., sqrt (1+x)).
Example neural ranker models can be further trained based in part on sparsity regularization to ensure sparsity of the produced representations and improve both the efficiency (computational speed) and the effectiveness (quality of lexical expansions) of first-stage ranking models. A trade-off between efficiency and effectiveness can be tailored using weights.
The concave activation and/or sparsity regularization can provide improvements over models such as those based on BERT architectures that require learned binary gating. Among other features, sparsity regularization may allow for end-to-end, single-stage training, without relying on handcrafted sparsification strategies such as BOW masking.
Neural ranking models may also be trained using in-batch negative sampling, in which some negative documents are included from other queries to provide a ranking loss that can be combined with sparsity regularization in an overall loss. By contrast, ranking models such as SparTerm (e.g., as disclosed in Bai et al., 2020. SparTerm: Learning Term based Sparse Representation for Fast Text Retrieval. arXiv:2010.00768 [cs.IR]), are trained using only hard negatives, e.g., generated by BM25. Training using in-batch negative sampling can further improve the performance of example models.
Experiments disclosed herein demonstrate that example neural ranking models, e.g., used for a first-stage ranker for information retrieval, can outperform other sparse retrieval methods on test datasets, yet can provide comparable results to state-of-the-art dense retrieval methods. Unlike dense retrieval approaches, example neural ranking models can learn sparse lexical expansions and thus can benefit from inverted index retrieval methods, avoiding the need for methods such as approximate nearest neighbor (ANN) search.
Example methods and systems herein can further provide training for a neural ranker model based on explicit sparsity regularization, which can be used in combination with a concave activation function for term weights. This can provide highly sparse representations and comparable results to existing dense and sparse methods. Example models can be implemented in a straightforward manner, and may be trained end-to-end in a single stage. The contribution of the sparsity regularization can be controlled in example methods to influence the trade-off between effectiveness and efficiency.
Referring now to the drawings,
Example neural ranker models according to embodiments herein may be used for providing rankings for the first-stage retriever or ranker 104, as shown in
Example neural ranker models, whether used in the first-stage 104, the second stage 108, or as a standalone model, may provide representations, e.g., vector representations, of an input sequence over a vocabulary. The vocabulary may be predetermined. The input sequence can be embodied in, for instance, a query sequence such as the query 102, a document sequence to be ranked and/or retrieved based on a query, or any other input sequence. “Document” as used herein broadly refers to any sequence of tokens that can be represented in vector space and ranked using example methods and/or can be retrieved. A query broadly refers to any sequence of tokens that can be represented in vector space for use in ranking and retrieving one or more documents.
Example neural ranker models herein can infer sparse representations for input sequences, e.g., queries or documents, directly by providing supervised query and/or document expansion. Example models can perform expansion using a pretrained language model (LM) such as but not limited to an LM trained using unsupervised methods such as Masked Language Model (MLM) training methods. For instance, a neural ranker model can perform expansion based on the logits (i.e., unnormalized outputs) 302 of a Masked Language Model (MLM)-trained LM 320. Regularization may be used to train example retrievers to ensure or encourage sparsity.
An example pretrained LM may be based on BERT. BERT, e.g., as disclosed in Devlin et al., 2019, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR abs/1810.04805, incorporated herein by reference, is a family of transformer-based training methods and associated models, which may be pre-trained on two tasks: masked-token prediction, referred to as a "masked language model" (MLM) task; and next-sentence prediction. These models are bidirectional in that each token attends to both its left and right neighbors, not only to its predecessors. Example neural ranker models herein can exploit pretrained language models such as those provided by BERT-based models to project token-level importance over a vocabulary (such as over a BERT vocabulary space, or other vocabulary space) for an input sequence, and then obtain predicted importance of the input sequence over the vocabulary to provide a representation of the input sequence.
The input sequence 301 received by the neural ranker model 300 is tokenized at 202 by a tokenizer layer 304 using the predetermined vocabulary (in this example, a BERT vocabulary) to provide a tokenized input sequence t1 . . . tN 306. The tokenized input sequence 306 may also include one or more special tokens, such as but not limited to <CLS> (a symbol added in front of an input sequence, which may be used in some BERT methods for classification) and/or <SEP> (used in some BERT methods for a separator), as can be used in BERT embeddings.
Token-level importance is predicted at 206. Token-level importance refers to an importance (or weight, or representation) of each token in the vocabulary, with respect to each token of the input sequence (e.g., a “local” importance). For example, each token of the tokenized input sequence 306 may be embedded at 208 to provide a sequence of context-embedded tokens h1 . . . hN 312. The embedding of each token of the tokenized input sequence 306 may be based on, for instance, the vocabulary and the token's position within the input sequence. The context embedded tokens h1 . . . hN 312 may represent contextual features of the tokens within the embedded input sequence. An example context embedding 208 may use one or more embedding layers embodied in transformer-based layers such as BERT layers 308 of the pretrained LM 320.
Token-level importance of the input sequence is predicted over the vocabulary (e.g., BERT vocabulary space) at 210 from the context-embedded tokens 312. A token-level importance distribution layer, e.g., embodied in a head (logits) 302 of the pretrained LM 320 (e.g., trained using MLM methods) may be used to predict an importance (or weight) of each token of the vocabulary with respect to each token of the input sequence of tokens; that is, a (input sequence) token-level or local representation 310 in the vocabulary space. For instance, the MLM head 302 may transform the context embedded tokens 312 using one or more linear layers, each including at least one logit function, to predict an importance (e.g., weight, or other representation) of each token in the vocabulary with respect to each token of the embedded input sequence and provide the token-level representation 310 in the vocabulary space.
For example, consider an input query or document sequence after tokenization 202 (e.g., WordPiece tokenization) t=(t1, t2, . . . tN), and its corresponding BERT embeddings (or BERT-like model embeddings) after embedding 208 (h1, h2, . . . hN). The importance wij of the token j (vocabulary) for a token i (of the input sequence) can be provided at step 210 by:
w_ij = transform(h_i)^T E_j + b_j, j ∈ {1, . . . , |V|}  (1)
where E_j denotes the BERT (or BERT-like model) input embedding for token j, resulting from the tokenizer and the model parameters (i.e., a vector representing token j without taking into account the context), b_j is a token-level bias, and transform(·) is a linear layer with Gaussian error linear unit (GeLU) activation, e.g., as disclosed in Hendrycks and Gimpel, arXiv:1606.08415, 2016, and a normalization layer LayerNorm. GeLU can be provided, for instance, by xΦ(x), where Φ(·) is the standard Gaussian cumulative distribution function, or can be approximated in terms of the tanh(·) function (as the variance of the Gaussian goes to zero one arrives at a rectified linear unit (ReLU), but for unit variance one gets GeLU). T corresponds to the transpose operation in linear algebra, indicating that the operation is ultimately a dot product, and may be included in the transform function.
Equation (1) can be equivalent to the MLM prediction. Thus, it can also be initialized, for instance, from a pretrained MLM model (or other pretrained LM).
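As an illustration of Equation (1), the following minimal sketch in plain Python computes a token-level importance matrix from toy context embeddings. The dimensions, values, and names (e.g., `token_level_importance`) are hypothetical; in practice the transform weights, input embeddings E_j, and biases b_j would come from the MLM head of a pretrained LM, with a hidden size of, e.g., 768 and a vocabulary of roughly 30k tokens.

```python
import math

# Toy dimensions for illustration (real models use, e.g., D=768, V~30k).
D, V, N = 4, 6, 3  # hidden size, vocabulary size, sequence length


def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def transform(h, W, c):
    # One linear layer followed by GeLU; the LayerNorm of the MLM head is
    # omitted here for brevity.
    return [gelu(sum(W[o][k] * h[k] for k in range(D)) + c[o]) for o in range(D)]


def token_level_importance(H, W, c, E, b):
    """Equation (1): w_ij = transform(h_i)^T E_j + b_j for each input token i
    and each vocabulary token j."""
    scores = []
    for h in H:  # one context embedding h_i per input token
        t = transform(h, W, c)
        scores.append([sum(t[k] * E[j][k] for k in range(D)) + b[j] for j in range(V)])
    return scores  # N rows, each of length V
```

The output is an N×|V| interaction matrix: one importance distribution over the full vocabulary per input token, as described for the token-level (local) representation 310.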
Term importance of the input sequence 318 (e.g., a global term importance for the input sequence) is predicted at 220 as a representation of importance (e.g., weight) of the input sequence over the vocabulary by performing an activation using a representation layer 322 that performs a concave activation function over the embedded input sequence. The predicted term importance of the input sequence predicted at 220 may be independent of the length of the input sequence. The concave activation function can be, as nonlimiting examples, a logarithmic activation function or a radical function (e.g., a sqrt(1+x) function; a mapping w → (√(1+ReLU(w)) − 1)k for an appropriate scaling k, etc.).
For instance, the final representation of importance of the input sequence 318 can be obtained by combining (or maximizing, for example) importance predictors over the input sequence tokens, and applying a concave function such as a logarithmic function after applying an activation function such as ReLU to ensure the positivity of term weights:

w_j = Σ_{i∈t} log(1 + ReLU(w_ij))  (2)
The above example model provides a log-saturation effect that prevents some terms from dominating and (naturally) ensures sparsity in representations. Logarithmic activation has been used, for instance, in computer vision, e.g., as disclosed in Yang Liu et al., Natural-Logarithm-Rectified Activation Function in Convolutional Neural Networks, arXiv, 2019, 1908.03682. While using a log-saturation or other concave function prevents some terms from dominating, surprisingly the implied sparsity obtains improved results and allows sparse solutions to be obtained even without regularization.
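The log-saturated pooling can be sketched as follows in plain Python, assuming Equation (2) takes the form w_j = Σ_{i∈t} log(1 + ReLU(w_ij)); the function name is illustrative:

```python
import math


def relu(x):
    return max(0.0, x)


def splade_sum_pool(w):
    """Sum pooling with log-saturation: for each vocabulary token j,
    w_j = sum over input tokens i of log(1 + ReLU(w_ij)).
    `w` is the N x V token-level importance matrix of Equation (1)."""
    V = len(w[0])
    return [sum(math.log(1.0 + relu(w_i[j])) for w_i in w) for j in range(V)]
```

The ReLU zeroes out negative logits (so many vocabulary dimensions stay exactly zero), and the log compresses large positive weights, giving the saturation effect described above.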
The final representation (i.e., the predicted term importance of the input sequence), output at 212, may be compared to representations from other sequences, including queries or documents, or, since the representations are in the vocabulary space, simply to tokenizations of sequences (e.g., a tokenization of a query over the vocabulary can provide a representation).
An example training method for the neural ranker model 300 will now be described. Generally, training begins by initializing parameters of the model, e.g., weights and biases, which are then iteratively adjusted after evaluating an output result produced by the model for a given input against the expected output. To train the neural ranker model 300, parameters of the neural model can be initialized. Some parameters may be pretrained, such as but not limited to parameters of a pretrained LM such as an MLM. Initial parameters may additionally or alternatively be, for example, randomized, or initialized in any other suitable manner. The neural ranker model 300 may be trained using a dataset including a plurality of documents. The dataset may be used in batches to train the neural ranker model 300. The dataset may include a plurality of documents including a plurality of queries. For each of the queries the dataset may further include at least one positive document (a document associated with the query) and at least one negative document (a document not associated with the query). Negative documents can include hard negative documents, which are not associated with any of the queries in the dataset (or in the respective batch), and/or negative documents that are not associated with the particular query but are associated with other queries in the dataset (or batch). Hard negative documents may be generated, for instance, by sampling a model such as but not limited to a ranking model.
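A ranking loss with in-batch negatives can be sketched as a softmax cross-entropy over each query's positive document score versus its negative document scores (hard negatives plus documents drawn from other queries in the batch). This is a hypothetical sketch of one common formulation; the disclosure does not prescribe this exact form, and the function name is illustrative:

```python
import math


def rank_ibn_loss(scores_pos, scores_negs):
    """In-batch-negatives ranking loss: for each query, negative log-likelihood
    of the positive document under a softmax over (positive + negatives).
    scores_pos: one positive-document score per query in the batch.
    scores_negs: per query, the list of negative-document scores (including
    in-batch negatives taken from other queries)."""
    losses = []
    for s_pos, s_negs in zip(scores_pos, scores_negs):
        denom = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
        losses.append(-math.log(math.exp(s_pos) / denom))
    return sum(losses) / len(losses)
```

The score s(q, d) here would be, e.g., the dot product of the sparse query and document representations produced by the model.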
The example neural ranker model 500 can be trained by minimizing the loss in Equation (3).
Additionally, the ranking loss may be supplemented to provide for sparsity regularization. Learning sparse representations has been employed in methods such as SNRM (e.g., Zamani et al., 2018, From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation of Inverted Indexing, In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (Torino, Italy) (CIKM '18). Association for Computing Machinery, New York, N.Y., USA, 497-506) via ℓ1 regularization. However, minimizing the ℓ1 norm of representations does not result in the most efficient index, as nothing ensures that posting lists are evenly distributed. This is even truer for standard indexes due to the Zipfian nature of the term frequency distribution.
To obtain a well-balanced index, Paria et al., 2020, Minimizing FLOPs to Learn Efficient Sparse Representations, arXiv:2004.05665, discloses the FLOPS regularizer, a smooth relaxation of the average number of floating-point operations necessary to compute the score of a document, and hence directly related to the retrieval time. It is defined using ā_j as a continuous relaxation of the activation (i.e., the term has a non-zero weight) probability p_j for token j, estimated for documents d in a batch of size N by

ā_j = (1/N) Σ_{i=1}^{N} w_j^{(d_i)}

This provides the following regularization loss:

ℓ_FLOPS = Σ_{j∈V} ā_j²
This differs from the ℓ1 regularization used in SNRM in that the ā_j are squared: using ℓ_FLOPS thus pushes down high average term weight values, giving rise to a more balanced index.
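Assuming ā_j is the mean weight of vocabulary token j over the documents of a batch, and the loss sums the squared means over the vocabulary, the FLOPS regularizer can be sketched as follows (function name illustrative):

```python
def flops_reg(batch_reps):
    """FLOPS regularizer sketch: loss = sum over vocabulary tokens j of
    (mean over the batch of w_j)^2.
    batch_reps: N document representations, each a length-V list of
    (non-negative) term weights."""
    N = len(batch_reps)
    V = len(batch_reps[0])
    loss = 0.0
    for j in range(V):
        a_bar = sum(doc[j] for doc in batch_reps) / N  # average activation of token j
        loss += a_bar ** 2
    return loss
```

Because each term is the squared batch average for a single vocabulary token, a token that is heavily activated across many documents contributes disproportionately, so minimizing the loss spreads activations across tokens and balances posting-list lengths.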
Example models may combine one or more of the above features to provide training, e.g., end-to-end training, of sparse, expansion-aware representations of documents and queries. For instance, example models can learn the log-saturation model provided by Equation (2) by jointly optimizing ranking and regularization losses:
L = L_rank-IBN + λ_q L_reg^q + λ_d L_reg^d  (4)

In Equation (4), L_reg is a sparse regularization loss (e.g., ℓ1 or ℓ_FLOPS). Two distinct regularization weights (λ_q and λ_d) for queries and documents, respectively, can be provided in the example loss function, allowing additional pressure to be put on the sparsity for queries, which is highly useful for fast retrieval.
Neural ranker models may also employ pooling methods to further enhance effectiveness and/or efficiency. For instance, by straightforwardly modifying the pooling mechanism disclosed above, example models may increase effectiveness by a significant margin.
An example max pooling method may replace the sum in Equation (2) above with a max pooling operation:

w_j = max_{i∈t} log(1 + ReLU(w_ij))  (5)
This modification can provide improved performance, as demonstrated in experiments.
Example models can also be operated without query expansion, providing a document-only method. Such models can be inherently more efficient, as everything can then be pre-computed and indexed offline, while providing results that remain competitive. Such methods can be provided in combination with the max pooling operation or separately. In such methods, there is no query expansion or term weighting on the query side, and thus the ranking score can be provided simply by comparing a tokenization of the query in the vocabulary to (e.g., pre-computed) representations of documents that can be generated by the neural ranker model:
s(q, d) = Σ_{j∈q} w_j^(d)  (6)
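Equation (6) can be sketched as a simple lookup-and-sum over the pre-computed document representation. Here the query is treated as a set of token ids (an assumption; repeated query terms could alternatively be counted with multiplicity), and the function name is illustrative:

```python
def doc_only_score(query_token_ids, doc_rep):
    """Equation (6): s(q, d) = sum over query tokens j of w_j^(d).
    query_token_ids: token ids from tokenizing the query over the vocabulary
    (no expansion, no weighting).
    doc_rep: pre-computed sparse document representation as a dict mapping
    token id -> learned term weight."""
    return sum(doc_rep.get(j, 0.0) for j in set(query_token_ids))
```

Since only the document side is learned, all representations can be computed and stored in an inverted index offline, and query-time work reduces to posting-list lookups.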
Another example modification may incorporate distillation into training methods. Distillation can be provided in combination with any of the above example models or training methods or provided separately. An example distillation may be based on methods disclosed in Hofstatter et al., Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation, arXiv:2010.02666, 2020. Distillation techniques can be used to further boost example model performance, as demonstrated by experiments showing near state-of-the-art performance on MS MARCO passage ranking tasks as well as the BEIR zero-shot benchmark.
Example distillation training can include at least two steps. In a first step, both a first stage retriever, e.g., as disclosed herein, and a reranker, such as those disclosed herein (as a nonlimiting example, HuggingFace, as provided by https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) are trained using triplets (e.g., a query q, a relevant passage p+, and a non-relevant passage p−), e.g., as disclosed in Hofstatter et al., 2020, Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666. In a second step, triplets are generated with harder negatives using an example model trained with distillation, and the reranker is used to generate the desired scores.
A model, an example of which is referred to in experiments herein as SPLADEmax, may then be trained from scratch using these triplets and scores. The result of this second step provides a distilled model, an example of which is referred to in experiments herein as DistilSPLADEmax.
In a first set of experiments, example models were trained and evaluated on the MS MARCO passage ranking dataset (https://github.com/microsoft/MSMARCO-Passage-Ranking) in the full ranking setting. This dataset contains approximately 8.8M passages, and hundreds of thousands of training queries with shallow annotations (1.1 relevant passages per query on average). The development set contained 6980 queries with similar labels, while the TREC DL 2019 evaluation set provides fine-grained annotations from human assessors for a set of 43 queries.
Training, indexing, and retrieval: The models were initialized with a BERT-base checkpoint. Models were trained with the Adam optimizer, using a learning rate of 2e−5 with linear scheduling and a warmup of 6000 steps, and a batch size of 124. The best checkpoint was kept using MRR@10 on a validation set of 500 queries, after training for 150k iterations. Though experiments were validated on a re-ranking task, other validation may be used in example methods. A maximum length of 256 tokens was considered for input sequences.
To mitigate the contribution of the regularizer at the early stages of training, the method disclosed in Paria et al., 2020, was followed, using a scheduler for λ, quadratically increasing λ at each training iteration until a given step (in experiments, 50k), from which it remained constant. Typical values for λ fall between 1e−1 and 1e−4. For storing the index, a custom implementation was used based on Python arrays. Numba was relied on for parallelizing retrieval. Models were trained using PyTorch and HuggingFace transformers, using 4 Tesla V100 GPUs with 32 GB memory.
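The quadratic ramp-up of the regularization weight can be sketched as follows (a minimal sketch; the function name and default warmup length are illustrative, with 50k matching the step used in the experiments):

```python
def lambda_schedule(step, lam_max, warmup_steps=50_000):
    """Quadratic scheduler for the regularization weight: lambda grows as
    (step / warmup_steps)^2 up to lam_max, then stays constant, so the
    sparsity regularizer contributes little early in training."""
    if step >= warmup_steps:
        return lam_max
    return lam_max * (step / warmup_steps) ** 2
```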
Evaluation: Recall@1000 was evaluated for both datasets, as well as the official metrics MRR@10 and NDCG@10 for the MS MARCO dev set and TREC DL 2019, respectively. Since the focus of the evaluation was on the first retrieval step, re-rankers based on BERT were not considered, and example methods were compared to first-stage rankers only. Example methods were compared to the following sparse approaches: 1) BM25; 2) DeepCT; 3) doc2query-T5 (Nogueira and Lin, 2019, From doc2query to docTTTTTquery); and 4) SparTerm, as well as known dense approaches ANCE (Xiong et al., 2020, Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, arXiv:2007.00808 [cs.IR]) and TCT-ColBERT (Lin et al., 2020, Distilling Dense Representations for Ranking using Tightly-Coupled Teachers, arXiv:2010.11386 [cs.IR]). Results were taken from the original disclosures for each approach. A pure lexical SparTerm trained with an example ranking pipeline (ST lexical-only) was included. To illustrate the benefits of log-saturation, results were added for models trained using binary gating (w_j = g_j × Σ_{i∈t} ReLU(w_ij), where g_j is a binary mask) instead of using Equation (2) above (ST exp-ℓ1 and ST exp-ℓFLOPS). For sparse models, an estimate of the average number of floating-point operations between a query and a document is indicated in Table 1, when available, defined as the expectation E_{q,d}[Σ_{j∈V} p_j(q) p_j(d)], where p_j is the activation probability for token j in a document d or a query q. It was empirically estimated from a set of approximately 100k development queries on the MS MARCO collection.
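The empirical FLOPS estimate E_{q,d}[Σ_{j∈V} p_j(q) p_j(d)] can be sketched as follows, assuming p_j is estimated as the fraction of queries (respectively, documents) in which token j receives a non-zero weight (function name illustrative):

```python
def estimate_flops(query_reps, doc_reps):
    """Empirical FLOPS estimate: sum over vocabulary tokens j of
    p_j(q) * p_j(d), where each activation probability is estimated as the
    fraction of representations with a non-zero weight for token j.
    query_reps / doc_reps: lists of length-V term-weight vectors."""
    V = len(query_reps[0])
    nq, nd = len(query_reps), len(doc_reps)
    p_q = [sum(1 for r in query_reps if r[j] != 0) / nq for j in range(V)]
    p_d = [sum(1 for r in doc_reps if r[j] != 0) / nd for j in range(V)]
    return sum(p_q[j] * p_d[j] for j in range(V))
```

The result approximates the expected number of multiplications in a query-document dot product over the sparse vocabulary space, i.e., the quantity reported as FLOPS in Table 1.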
Results are shown in Table 1, below. Overall, it was observed that example models outperformed the other sparse retrieval methods by a large margin (except for recall@1000 on TREC DL), and that the results were competitive with current dense retrieval methods.
For instance, example methods for ST lexical-only outperformed the results of DeepCT as well as previously-reported results for SparTerm—including the model using expansion. Because of the additional sparse expansion mechanism, results could be obtained that were comparable to current state-of-the-art dense approaches on MS MARCO dev set (e.g., Recall@1000 close to 0.96 for ST exp-1), but with a much larger average number of FLOPS.
By adding a log-saturation effect to the expansion model, example methods greatly increased sparsity, reducing the FLOPS to levels similar to those of BOW approaches, at no cost in performance when compared to the best first-stage rankers. In addition, an advantage was observed for the FLOPS regularization over ℓ1 in decreasing the computing cost. In contrast to SparTerm, example methods were trained end-to-end in a single step. Example methods were also more straightforward than dense baselines such as ANCE, and they avoided resorting to approximate nearest neighbor search.
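The two sparsity regularizers compared above can be sketched as follows (a hedged illustration following the common formulation of the FLOPS regularizer as the sum, over the vocabulary, of squared mean activations in a batch; function names are assumptions):

```python
import numpy as np

def l1_reg(batch_reps: np.ndarray) -> float:
    """Plain l1 regularizer: mean total absolute activation per example."""
    return float(np.abs(batch_reps).sum(axis=1).mean())

def flops_reg(batch_reps: np.ndarray) -> float:
    """FLOPS regularizer: sum over vocabulary dimensions of the squared mean
    activation, penalizing tokens that are active across many examples."""
    mean_act = np.abs(batch_reps).mean(axis=0)
    return float((mean_act ** 2).sum())
```

Because the FLOPS regularizer squares the per-token mean activation, it pushes down tokens that fire for many examples, which tends to flatten the posting-list distribution and thus directly reduce the expected retrieval cost, whereas ℓ1 penalizes all activations uniformly.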
The ST exp models fell far below BOW models and example methods in terms of efficiency. In the meantime, example methods (SPLADE exp-ℓ1, SPLADE exp-ℓFLOPS) reached efficiency levels equivalent to sparse BOW models, while outperforming doc2query-T5. Strongly regularized models had competitive performance (e.g., FLOPS=0.05, MRR@10=0.296). Further, the regularization effect brought by ℓFLOPS compared to ℓ1 was apparent: for the same level of efficiency, performance of the latter was always lower.
The experiments demonstrated that the expansion provides improvements with respect to the purely lexical approach by increasing recall. Additionally, representations obtained from expansion-regularized models were sparser: the models learned how to balance expansion and compression, by both turning off irrelevant dimensions and activating useful ones. On a set of 10k documents, the SPLADE-FLOPS model from Table 1 dropped on average 20 terms per document, while adding 32 expansion terms. For one of the most efficient models (FLOPS=0.05), 34 terms were dropped on average, with only 5 new expansion terms. In this case, representations were extremely sparse: documents and queries contained on average 18 and 6 non-zero values respectively, and less than 1.4 GB was required to store the index on disk.
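Per-document statistics like those above can be computed by comparing a document's original BOW term set with the non-zero dimensions of its learned sparse representation (a sketch with hypothetical inputs; names are assumptions):

```python
def expansion_stats(bow_terms: set, learned_weights: dict):
    """Compare original BOW term ids with the non-zero dimensions of a learned
    sparse representation; return (#dropped terms, #expansion terms, #non-zeros)."""
    active = {j for j, w in learned_weights.items() if w > 0}
    dropped = len(bow_terms - active)  # original terms zeroed out by the model
    added = len(active - bow_terms)    # latent expansion terms the model activated
    return dropped, added, len(active)
```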
Additional experiments were performed using the example max pooling, document encoding, and distillation features described above, and using the MS MARCO dataset. Table 2 below shows example results for MS-MARCO and TREC-2019 as in Table 1 above, as further compared to results of further experiments using modified models.
The zero-shot performance of example models was verified using a subset of datasets from the BEIR benchmark (e.g., as disclosed in Thakur et al., BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, CoRR abs/2104.08663 (2021), arXiv:2104.08663), which encompasses various IR datasets for zero shot comparison. A subset was used due to the fact that some of the datasets were not readily available.
Comparison was made to the best performing model from Thakur et al., 2021 (ColBERT (Khattab and Zaharia, 2020, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR '20), Association for Computing Machinery, New York, N.Y., USA, 39-48)) and the two best performing models from the rolling benchmark (tuned BM25 and TAS-B). Table 3, below, shows additional results from example models against several baselines on the BEIR benchmark. Generally, it was observed that example models outperformed the other sparse retrieval methods by a large margin (except for recall@1000 on TREC DL), and that results were competitive with state-of-the-art dense retrieval methods.
Impact of Max Pooling: On MS MARCO and TREC, models including max pooling (SPLADEmax) provided gains of almost 2 points in MRR and NDCG compared to example models without max pooling (SPLADE). Such models are competitive with models such as COIL and DeepImpact.
The example document encoder with max pooling (SPLADEmax) was able to reach the same performance as the above model (SPLADE), outperforming doc2query-T5 on MS MARCO. As this model had no query encoder, it had better latency. Further, this example document encoder is straightforward to train and to apply to a new document collection: a single forward pass is required, as opposed to multiple inferences with beam search for methods such as doc2query-T5.
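The max pooling variant replaces the summation over input token positions with a maximum: each vocabulary dimension takes the largest log-saturated activation produced by any token of the input. A minimal sketch (the name and the `token_logits` input, a matrix of shape [seq_len, vocab_size] of token-level logits, are assumptions):

```python
import numpy as np

def splade_max_pool(token_logits: np.ndarray) -> np.ndarray:
    """Max-pool log-saturated activations over the sequence dimension:
    w_j = max_i log(1 + ReLU(w_ij)), yielding one sparse vector over the vocabulary."""
    relu = np.maximum(token_logits, 0.0)  # keep only positive logits
    return np.log1p(relu).max(axis=0)     # log-saturation, then max over positions
```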
Impact of Distillation: Adding distillation significantly improved the performance of the example SPLADE model, as shown by example model in Table 2 (DistilSPLADEmax).
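One widely used way to distill a cross-encoder teacher into a retrieval model of this kind is a margin-MSE objective over triplets, which matches the student's score margin between a positive and a negative passage to the teacher's margin. The following is a sketch of that generic objective, not necessarily the exact loss used in the experiments above:

```python
import numpy as np

def margin_mse(s_pos, s_neg, t_pos, t_neg) -> float:
    """Margin-MSE distillation loss: mean squared difference between the
    student margin (s_pos - s_neg) and the teacher margin (t_pos - t_neg)."""
    s_margin = np.asarray(s_pos) - np.asarray(s_neg)
    t_margin = np.asarray(t_pos) - np.asarray(t_neg)
    return float(np.mean((s_margin - t_margin) ** 2))
```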
Network Architecture
Example systems, methods, and embodiments may be implemented within a network architecture 900 such as illustrated in FIG. 9.
The system 100 (shown in FIG. 1) may be implemented, for example, in a server 902 in communication with one or more client devices 904 over a network 906.
Client devices 904 may be any processor-based device, terminal, etc., and/or may be embodied in a client application executable by a processor-based device, etc. Client devices may be disposed within the server 902 and/or external to the server (local or remote, or any combination) and in communication with the server. Example client devices 904 include, but are not limited to, autonomous computers 904a, mobile communication devices (e.g., smartphones, tablet computers, etc.) 904b, robots 904c, autonomous vehicles 904d, wearable devices, virtual reality, augmented reality, or mixed reality devices (not shown), or others. Client devices 904 may be configured for sending data to and/or receiving data from the server 902, and may include, but need not include, one or more output devices, such as but not limited to displays, printers, etc. for displaying or printing results of certain methods that are provided for display by the server. Client devices may include combinations of client devices.
In an example training method the server 902 or client devices 904 may receive a dataset from any suitable source, e.g., from memory 910 (as nonlimiting examples, internal storage, an internal database, etc.), from external (e.g., remote) storage 912 connected locally or over the network 906. The example training method can generate a trained model that can be likewise stored in the server (e.g., memory 910), client devices 904, external storage 912, or combination. In some example embodiments provided herein, training and/or inference may be performed offline or online (e.g., at run time), in any combination. Results can be output (e.g., displayed, transmitted, provided for display, printed, etc.) and/or stored for retrieving and providing on request.
In an example document processing method the server 902 or client devices 904 may receive one or more documents from any suitable source, e.g., by local or remote input from a suitable interface, or from another of the server or client devices connected locally or over the network 906. Trained models such as the example neural ranking model can be likewise stored in the server (e.g., memory 910), client devices 904, external storage 912, or combination. In some example embodiments provided herein, training and/or inference may be performed offline or online (e.g., at run time), in any combination. Results can be output (e.g., displayed, transmitted, provided for display, printed, etc.) and/or stored for retrieving and providing on request.
In an example retrieval method the server 902 or client devices 904 may receive a query from any suitable source, e.g., by local or remote input from a suitable interface, or from another of the server or client devices connected locally or over the network 906, and process the query using example neural models (or by a more straightforward tokenization, in some example methods). Trained models such as the example neural ranking model can be likewise stored in the server (e.g., memory 910), client devices 904, external storage 912, or combination. Results can be output (e.g., displayed, transmitted, provided for display, printed, etc.) and/or stored for retrieving and providing on request.
Generally, embodiments can be implemented as computer program products with a program code or computer-executable instructions, the program code or computer-executable instructions being operative for performing one of the methods when the computer program product runs on a computer. The program code or the computer-executable instructions may, for example, be stored on a computer-readable storage medium.
In an embodiment, a storage medium (or a data carrier, or a computer-readable medium) comprises, stored thereon, the computer program or the computer-executable instructions for performing one of the methods described herein when it is performed by a processor.
Embodiments described herein may be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a computer-readable storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM, or a FLASH memory. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system.
General
The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure may be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims. It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure. Further, although each of the embodiments is described above as having certain features, any one or more of those features described with respect to any embodiment of the disclosure may be implemented in and/or combined with features of any of the other embodiments, even if that combination is not explicitly described. In other words, the described embodiments are not mutually exclusive, and permutations of one or more embodiments with one another remain within the scope of this disclosure. All documents cited herein are hereby incorporated by reference in their entirety, without an admission that any of these documents constitute prior art.
Each module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module. Each module may be implemented using code. The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects.
The term memory circuit is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are nonvolatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The systems and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks, flowchart components, and other elements described above serve as software specifications, which may be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory, tangible computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
It will be appreciated that variations of the above-disclosed embodiments and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the description above and the following claims.
This application claims priority to and benefit from U.S. Provisional Patent Application Ser. No. 63/266,194, filed Dec. 30, 2021, which application is incorporated in its entirety by reference herein.