MASK PATTERN FOR PROTEIN LANGUAGE MODELS

Information

  • Patent Application Publication Number: 20230207060
  • Date Filed: October 27, 2022
  • Date Published: June 29, 2023
Abstract
The technology disclosed relates to accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences, applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, and cropping a portion of the multiple sequence alignment that includes the set of periodically-spaced masks at the first set of positions, and a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied. The first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence.
Description
FIELD OF THE TECHNOLOGY DISCLOSED

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using neural networks to analyze ordered data.


INCORPORATIONS

The following are incorporated by reference for all purposes as if fully set forth herein:


Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);


Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);


US patent application titled, “PATHOGENICITY LANGUAGE MODEL,” filed contemporaneously (Attorney Docket No. ILLM 1063-3/IP-2296-US2);


U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);


U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);


U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);


U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);


U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);


U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);


U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);


U.S. patent application Ser. No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1680-US);


U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);


U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1037-2/IP-2051-US);


U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);


U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021, (Atty. Docket No. ILLM 1048-1/IP-2143-PRV);


U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021, (Atty. Docket No. ILLM 1037-3/IP-2051A-US);


U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV);


U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV); and


U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).


BACKGROUND

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.


The explosion of available biological sequence data has led to multiple computational approaches that infer the proteins' three-dimensional structure, biological function, fitness, and evolutionary history from sequence data. So-called protein language models, like the ones based on the Transformer architecture, have been trained on large ensembles of protein sequences by using the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones.


Protein language models capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. For example, protein language models can predict structural contacts from single sequences in an unsupervised way.


Protein sequences can be classified into families of homologous proteins that descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins provides important information about functional and structural constraints. The statistics of MSA columns, representing amino-acid sites, identify functional residues that are conserved during evolution. Correlations of amino acid usage between the MSA columns contain important information about functional sectors and structural contacts.


Language models were initially developed for natural language processing and operate on a simple but powerful principle: they acquire linguistic understanding by learning to fill in missing words in a sentence, akin to a sentence completion task in standardized tests. Language models develop powerful reasoning capabilities by applying this principle across large text corpora. The Bidirectional Encoder Representations from Transformers (BERT) model instantiated this principle using Transformers, a class of neural networks in which attention is the primary component of the learning system. In a Transformer, each token in the input sentence can “attend” to all other tokens by exchanging activation patterns corresponding to the intermediate outputs of neurons in a neural network.


Protein language models like the MSA Transformer have been trained to perform inference from MSAs of evolutionarily related sequences. The MSA Transformer interleaves per-sequence (“row”) attention with per-site (“column”) attention to incorporate epistasis. Epistasis leads to co-evolution of certain protein positions: the effect of a mutation at one site depends on the presence or absence of mutations at other sites. Combinations of row attention heads in the MSA Transformer have led to state-of-the-art unsupervised structural contact predictions.


End-to-end deep learning approaches for variant effect prediction are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (See Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. In particular, PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach that utilizes the protein sequences for pathogenicity prediction is promising because it can avoid the circularity problem and overfitting to previous knowledge. Compared to the amount of data needed to train deep neural networks effectively, the amount of clinical data available in ClinVar is relatively small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while mutation rate-matched samples of unlabelled data, based on trinucleotide context, are used as unknown data.


An opportunity arises to use protein language models and MSAs for variant pathogenicity prediction. More accurate variant pathogenicity prediction may result.





BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:



FIG. 1 is a high-level diagram that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA and processing the masked MSA through the disclosed PrimateAI language model to produce a phenotype prediction.



FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid to an MSA and generating the disclosed partially-masked MSA.



FIG. 3 shows one implementation of one-hot tokens that are defined for the twenty residue one-hot vectors, the gap residue one-hot vector, and the mask one-hot vector.



FIG. 4 illustrates one implementation of channel embeddings that are defined for the twenty residue channel embedding sets, the gap channel embedding set, and the mask channel embedding set.



FIG. 5 shows cropping, padding, and masking of MSAs in accordance with various implementations of the technology disclosed.



FIG. 6 depicts one implementation of generating the disclosed MSA representation.



FIG. 7 illustrates an example architecture of the disclosed PrimateAI language model.



FIG. 8 shows details of the disclosed mask revelation.



FIG. 9 shows various components of the PrimateAI language model.



FIG. 10 shows one implementation of the disclosed revelation output head used by the disclosed PrimateAI language model.



FIG. 11 is a computer-implemented method of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.



FIG. 12 is a system that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.



FIG. 13 shows the performance evaluation of the language modelling part of the disclosed PrimateAI language model with other language models.



FIG. 14 depicts the Top-1 training accuracy of the disclosed PrimateAI language model.



FIG. 15 is a computer system that can be used for compilation and runtime execution of the disclosed PrimateAI language model.



FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs.



FIG. 17 illustrates the training of the PrimateAI language model using the LAMB optimizer with gradient pre-normalization.





DETAILED DESCRIPTION

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.


The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.


The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.


INTRODUCTION

The disclosed PrimateAI language model uses a masked language modeling objective for training on sequences. During training, residues at different positions in a sequence are replaced with a mask token and the PrimateAI language model is trained to predict the original residues at those positions.


Masked language modeling allows training on a large amount of unlabelled data. Fill-in-the-blank multiple sequence alignment (MSA) Transformers simultaneously classify multiple masked locations in MSAs during training. Higher numbers of mask locations can add more masked language modelling (MLM) gradients that inform optimization, thereby enabling a higher learning rate and faster training.


However, fill-in-the-blank pathogenicity prediction is fundamentally different from traditional MLM as classification at a mask location depends on predicted values of residues at other mask locations. The classification scores may often be the averages of conditional predictions over all possible combinations of residues at other mask locations.


The PrimateAI language model avoids this averaging by revealing masked tokens at other mask locations before making predictions. The PrimateAI language model achieves state-of-the-art clinical performance and denoising accuracy whilst requiring 50× less computation for training than previous MSA Transformers. Various aspects of the technology disclosed, discussed later, contribute to the 50× reduction in training compute. Examples of such aspects include periodically-spaced mask grid, mask revelation, and the architecture of PrimateAI language model.


The PrimateAI language model can be considered an MSA Transformer for fill-in-the-blank residue classification. In one implementation, the PrimateAI language model is trained end-to-end on MSAs of UniRef50 proteins to minimize an unsupervised MLM objective. The PrimateAI language model outputs classification scores for alternative and reference residues, which serve as inputs to the PrimateAI three-dimensional (3D) rank loss.


Phenotype Prediction



FIG. 1 is a high-level diagram 100 that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA 140 and processing the masked MSA 140 through the disclosed PrimateAI language model (i.e., a phenotype predictor 150 or pathogenicity language model) to produce a phenotype prediction 160.


In one implementation, an MSA dataset 110 includes a multiple sequence alignment (MSA) 120 for each sequence in a UniRef50 database that is retrieved by searching a UniClust30 database. The MSA 120 is an alignment of multiple homologous protein sequences to a target protein. From the MSA 120, the degree of homology can be inferred and the evolutionary relationships among the sequences studied. Since real protein sequences are likely to have insertions, deletions, and substitutions, the sequences are aligned by minimizing a Levenshtein distance-like metric over all the sequences. In some implementations, heuristic alignment schemes are used. For example, tools like JackHMMER and HHblits can increase the number and diversity of sequences returned by iteratively performing the search and alignment steps.


It is difficult to incorporate nearby evolution because mutational differences among creatures with a recent common ancestor are significantly influenced by the electromechanical susceptibilities of proteins to mutations. To avoid this, the MSAs used by the technology disclosed contain diverse proteins that align with the query sequence. Using diverse sequences from many species reduces the influence of electromechanical susceptibility on predictions, as the differences are more strongly determined by natural selection.


In some implementations, the MSA dataset 110 can contain twenty-six million MSAs that are created by using the protein homology detection software HHblits. In other implementations, an additional set of MSAs can be generated for 19,071 human proteins using HHblits. A person skilled in the art will appreciate that the technology disclosed can search, generate, and otherwise leverage (or use) any number of MSAs.


In some implementations, UniRef50 MSAs whose query sequences carry rare amino acids can be excluded from the MSA dataset 110, thereby retaining only those MSAs in the MSA dataset 110 that contain the twenty most abundant residues. In other implementations, only those non-query sequences can be included in the MSAs that contain the twenty most common residues and gaps, which in turn represent deletions relative to the query sequence.


In some implementations, the MSAs that are provided as inputs to the PrimateAI language model can have a fixed size of 1024 sequences. Of the 1024 sequences, up to 1023 non-query sequences can be randomly sampled from the filtered sequences if the MSA depth is larger than 1024. If the MSA depth is less than 1024, the MSA can be padded with zeros to fill the input. The MSA depth refers to the number of protein sequences in the MSA. For example, the MSA Transformer can be trained with a fixed input MSA depth of 1024 sequences. This simplifies processing because the tensors input to the model have a fixed shape. If the full MSA depth is less than 1024, padding can be added to increase its size to 1024. If the full MSA depth is more than 1024, 1023 sequences can be randomly sampled from the full MSA. The query sequence is always kept, such that the resulting MSA has a depth of 1024 (1023 randomly sampled sequences and 1 query sequence).
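For illustration, the following is a minimal sketch of this fixed-depth preparation, assuming the MSA is an integer-encoded NumPy array with the query sequence in the first row; the function name and array layout are hypothetical:

```python
import numpy as np

def fix_msa_depth(msa, depth=1024, rng=None):
    """Pad or subsample an MSA (query first) to a fixed depth (sketch).

    msa: int array of shape (num_sequences, num_positions); row 0 is the query.
    Returns an array of shape (depth, num_positions).
    """
    rng = rng or np.random.default_rng()
    num_seqs, num_pos = msa.shape
    if num_seqs >= depth:
        # Keep the query and randomly sample depth - 1 non-query sequences.
        picked = rng.choice(np.arange(1, num_seqs), size=depth - 1, replace=False)
        return np.concatenate([msa[:1], msa[picked]], axis=0)
    # Otherwise pad with zeros up to the fixed depth.
    pad = np.zeros((depth - num_seqs, num_pos), dtype=msa.dtype)
    return np.concatenate([msa, pad], axis=0)
```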


A masking logic 130 can apply one or more masks to the MSA 120 and generate a masked MSA 140. The masks can be arranged in a periodic manner, non-periodic manner, regular manner, or irregular manner. The masks are not limited to periodically-spaced masks or a regular grid or array of masks. The masks can be irregular in shape, can be straight or curved, and can be arranged in irregular, non-evenly spaced patterns. The masks are regularly spaced when the distance between adjacent masks is fixed or the same, and irregularly spaced when the distance between adjacent masks varies.


The phenotype predictor 150 (e.g., the PrimateAI language model) can process the masked MSA 140 and generate the phenotype prediction 160. In one implementation, the phenotype prediction 160 outputs the identity of the masked residues in the masked MSA 140. In other implementations, the phenotype prediction 160 can be used for variant pathogenicity prediction, protein contact map generation, protein functionality prediction, and so on.


Note that portions of this Application refer to a protein as a “sequence,” “residue sequence,” “amino acid sequence,” and “chain of amino acids” interchangeably. Also, note that portions of this Application use “amino acids” and “residues” interchangeably. Further note that portions of this Application use “a set of periodically-spaced masks,” “periodically-spaced masks,” “mask grid,” “periodically-spaced mask grid,” “periodic mask pattern,” and “fixed mask pattern” interchangeably.


The sequences shown in the figures are protein sequences comprising amino acid residues. In other implementations, the sequences can instead comprise DNA, RNA, carbohydrates, lipids or any other straight or branched biopolymer.


Having described the technology disclosed at a high level using FIG. 1, the discussion now turns to the disclosed periodically-spaced mask grid—a particular implementation of the masking logic 130.


Periodically-Spaced Mask Grid


FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid 210 to an MSA 220 and generating the disclosed partially-masked MSA 230.


The columns of the periodically-spaced mask grid 210 correspond to residue positions. The residue positions are also referred to herein as ordinal positions. For example, in FIG. 2, the periodically-spaced mask grid 210 has nine columns corresponding to nine residue positions (i.e., r=9).


The periodically-spaced mask grid 210 has elements (or units or tokens) that are masks. In FIG. 2, such mask elements are depicted by boxes with black fill and a “?” symbol. The periodically-spaced mask grid 210 also has elements (or units or tokens) that are not masks. In FIG. 2, such non-mask elements are depicted by boxes with yellow fill.


The rows of the periodically-spaced mask grid 210 include elements that are masks and elements that are not masks. The rows of the periodically-spaced mask grid 210 are referred to herein as mask distributions. For example, in FIG. 2, there are five mask distributions 1-5 (i.e., m mask distributions, where m=5).


Each mask distribution has k periodically-spaced masks. For example, in FIG. 2, mask distributions 1-4 each have three masks (i.e., k=3), and mask distribution 5 has two masks (i.e., k=2).


The k periodically-spaced masks in a mask distribution are at k ordinal positions that begin at varying offsets from a first residue position in the periodically-spaced mask grid 210. For example, in FIG. 2, the k periodically-spaced masks of the first mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the second mask distribution are located at the first, the fourth, and the seventh ordinal positions, and begin at an offset of zero from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the third mask distribution are located at the second, the fifth, and the eighth ordinal positions, and begin at an offset of one from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fourth mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fifth mask distribution are located at the fourth and the seventh ordinal positions, and begin at an offset of three from the first residue position in the periodically-spaced mask grid 210.


Masks in the periodically-spaced mask grid 210 are periodic because the masks have regular spacing between them and repeat at regular intervals, i.e., the masks are regularly-spaced repeats. The masks in the periodically-spaced mask grid 210 are also periodic because the masks have an ordered pattern.


The masks in the periodically-spaced mask grid 210 can have a lattice pattern, a diagonal pattern, a hexagonal pattern, a diamond pattern, a rectangle pattern, a square pattern, a triangle pattern, a convex pattern, a concave pattern, and/or a polygonal pattern.


In one implementation, the k periodically-spaced masks of each of the mask distributions in the periodically-spaced mask grid 210 have a same stride (e.g., stride=3 in FIG. 2). In another implementation, the k periodically-spaced masks across the mask distributions in the periodically-spaced mask grid 210 have a diagonal pattern. In other implementations, the stride can be any number, such as 16, or any number in a range of 8 to 64 or in any subrange of that range. As used herein, the term “stride” refers to the distance between adjacent masks.
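As an illustration, the following is a minimal sketch of constructing such a periodically-spaced mask grid, assuming a Boolean array representation in which True marks a masked position; the function name and the choice of random per-row offsets are hypothetical:

```python
import numpy as np

def periodic_mask_grid(num_rows, num_positions, stride=3, offsets=None, rng=None):
    """Build a Boolean grid with periodically-spaced masks per row (sketch).

    Each row i places masks every `stride` positions, starting at offsets[i].
    True marks a masked position; False leaves the residue unchanged.
    """
    rng = rng or np.random.default_rng()
    if offsets is None:
        # Varying per-row offsets, e.g., drawn at random from [0, stride).
        offsets = rng.integers(0, stride, size=num_rows)
    grid = np.zeros((num_rows, num_positions), dtype=bool)
    for row, offset in enumerate(offsets):
        grid[row, offset::stride] = True
    return grid

# Example: five mask distributions over nine positions with stride 3,
# reproducing the offsets described for FIG. 2 (2, 0, 1, 2, 3).
grid = periodic_mask_grid(5, 9, stride=3, offsets=[2, 0, 1, 2, 3])
```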


In other implementations, the masks in the periodically-spaced mask grid 210 are quasi-periodic, such that the masks have an ordered pattern, but the masks do not recur at precisely regular intervals.


The discussion now turns to FIGS. 3 and 4 to discuss the details of how the masks are encoded for processing by the PrimateAI language model. After having described FIGS. 3 and 4, the discussion will return to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.


Masks

A mask token defines the masks. The mask token is configured to conceal or replace the original residue in an MSA onto which the mask token is applied. The mask token is a special or auxiliary token in the sense that the mask token is different from the twenty residue tokens that are used to define the twenty naturally-occurring residues. The mask token is also different from the gap residue token that is used to define the gap residue. The gap residues are those residues whose identities are unresolved (or unknown), and therefore the gap residues are not reliably classified to any of the twenty-one known residues. The gap residues are encoded by the gap residue token.


The mask token can be defined by the same encoding logic that defines the twenty residue tokens and the gap residue token in a way that encodes the mask token as the twenty-second residue.



FIG. 3 shows one implementation of one-hot tokens 300 that are defined for the twenty residue one-hot vectors 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, and 320; the gap residue one-hot vector 321; and the mask one-hot vector 322. The one-hot tokens 300 are encoded with a binary vector of twenty-two bits, with one of the bits being hot (i.e., 1) while the others are 0. In some implementations, a one-hot encoder (not depicted) generates the one-hot tokens 300.
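For illustration, a minimal sketch of this twenty-two-token one-hot encoding follows; the token ordering and the gap and “<mask>” symbols are assumptions made for the example:

```python
import numpy as np

# Twenty-two-token vocabulary: twenty residues, one gap token, one mask token,
# each encoded as a 22-bit binary vector with a single hot bit.
RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")      # twenty standard amino acids
VOCAB = RESIDUES + ["-", "<mask>"]           # gap token and mask token
TOKEN_TO_INDEX = {token: i for i, token in enumerate(VOCAB)}

def one_hot(token):
    """Return the 22-dimensional one-hot vector for a residue, gap, or mask token."""
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    vec[TOKEN_TO_INDEX[token]] = 1.0
    return vec

assert one_hot("<mask>")[21] == 1.0  # the mask token is encoded as the 22nd entry
```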



FIG. 4 illustrates one implementation of channel embeddings 400 (or learned embeddings) that are defined for the twenty residue channel embedding sets 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, and 420; the gap channel embedding set 421; and the mask channel embedding set 422. The channel embeddings 400 span the twenty-one known residues. The gap channel embedding set 421 spans the gap residues. The mask channel embedding set 422 spans the mask residues. The channel embeddings 400 are tensors that have a height dimension, a width dimension, and a depth dimension, and each set of channel embeddings can include N channel embeddings, where N is an integer like ninety-four. In some implementations, an embeddings generator (not depicted), e.g., a multi-layer perceptron, generates the channel embeddings 400.


In some implementations, the embeddings generator can be trained in conjunction with the PrimateAI language model to learn and generate the channel embeddings 400. During inference, a lookup table can store a mapping between the one-hot tokens 300 and the channel embeddings 400. The lookup table can be accessed during the inference to replace the residue tokens, the gap token, and the mask token with the corresponding channel embeddings.


In other implementations, the encoding of the mask token (e.g., one-hot or channel embeddings) can vary depending on a variety of factors. Examples include the location (i.e., residue position) of the mask, the residue-type on which the mask is applied, the sequence-type on which the mask is applied, the sequence number on which the mask is applied, and the species-type of the sequence on which the mask is applied.


In other implementations, the mask token can be encoded using other schemes. Examples include quantitative or numerical data types, qualitative data types, discrete data types, continuous data types (with lower and upper bounds), integer data types (with lower and upper bounds), nominal data types, ordinal or ranked data types, categorical data types, interval data types, and ratio data types. For example, the encoding can be based on any of the following, or any combination thereof: multiple bits, real values between 0 and 1, continuous values such as floating point numbers, Red, Green, Blue (RGB) values between 0 and 256, hexadecimal values of CSS colors (e.g., #F0F8FF), categorical color values of CSS colors, respective values of other CSS property groups and properties, size of a particular dimension (e.g., height and width), a set of different values and data types, and others.


The discussion now returns to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.


Partially-Masked MSA


The MSA 220 has p rows and r columns. The p rows correspond to p protein sequences. The r columns correspond to r residue positions (e.g., r=16 in FIG. 2). The periodically-spaced mask grid 210 can have a different number of rows and columns (i.e., a different shape) than the MSA 220. In some implementations, the periodically-spaced mask grid 210 can have the same number of rows and columns (i.e., the same shape) as the MSA 220.


The periodically-spaced mask grid 210 can be applied 212 (or overlaid) anywhere on the MSA 220. For example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is centered at a particular column of the MSA 220 that contains a residue-of-interest 214 (in red) at a position-of-interest 216 (in red). In another example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is placed at a particular row (e.g., the query sequence like sequence one in FIG. 2) of the MSA 220 that contains the residue-of-interest 214 at the position-of-interest 216.


In one implementation, the periodically-spaced mask grid 210 is applied to a subset of sequences in the MSA 220, spanning a window of sequences 222 (e.g., five sequences in FIG. 2). In some implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 in a left-flanking manner or a right-flanking manner. In other implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 on a portion-by-portion basis, traversing portions (e.g., quadrants) of the MSA 220 simultaneously or sequentially.


Those residues of the MSA 220 onto which the non-mask elements of the periodically-spaced mask grid 210 are overlaid remain unchanged and are referred to herein as the unmasked residues. Conversely, those residues of the MSA 220 onto which the mask elements of the periodically-spaced mask grid 210 are overlaid change to the mask token and are referred to herein as the masked residues.


A combination or aggregation of the unmasked residues and the masked residues forms the partially-masked MSA 230. The partially-masked MSA 230 can be defined as an MSA that includes some residues that are not masked (unmasked) and some residues that are masked. The partially-masked MSA 230 can also be defined as an MSA that includes some sequences that contain masked residues and some sequences that do not contain any masked residues.


A portion (or patch) of the partially-masked MSA 230 can be cropped (or selected or extracted) to generate a cropped portion 232 (in blue, dashed outline in FIG. 2). In some implementations, the cropped portion 232 can include: (i) the masked residues in the window of sequences 222, (ii) some unmasked residues that are contiguously adjacent to the masked residues within a neighborhood that coincides with (or defines) a boundary of the cropped portion 232, and (iii) portions of some additional sequences that extend beyond the window of sequences 222 and do not contain any masked residues.


MSA Cropping, Padding, and Masking


FIG. 5 shows cropping, padding, and masking of MSAs 500 in accordance with various implementations of the technology disclosed. In FIG. 5, a residue-of-interest at a position-of-interest in the query sequence is indicated by an X, mask locations are indicated by black fill, padding is indicated by grey fill, and crop regions are indicated by red, dashed lines. In these examples, mask stride is three and cropping window width is six residues.


In panel A, away from the MSA edges, the position-of-interest is at the right side of the center of a crop region. In panel B, a crop region is shifted to the right of the position-of-interest to avoid going over an MSA edge. In panel C, an MSA for a short protein is padded to fill a crop region. In panel D, a crop region is shifted to the right of the position-of-interest to minimize padding and the MSA is padded to fill the crop region.


In some implementations, the position-of-interest is randomly sampled from positions in the query sequence during training or chosen by a user during inference. To maximize information about the position-of-interest, in some implementations, a cropping window is selected with a size of 256 residues such that the position-of-interest is at the center. However, the cropping window can be shifted if the position-of-interest is near the edge of an MSA to avoid padding zeros and to increase information about the position-of-interest. If the query sequence is shorter than the cropping window, zeros can be padded to fill the window size.
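A minimal sketch of this cropping-window selection follows, assuming a window of 256 residues and zero-based positions; the function name and return convention are hypothetical:

```python
def crop_window(position_of_interest, sequence_length, window=256):
    """Pick a crop window around a position-of-interest (sketch).

    Centers the window on the position when possible, shifts it near the MSA
    edges to avoid padding, and reports how much zero-padding is still needed
    for proteins shorter than the window.
    Returns (start, end, pad) with end exclusive.
    """
    if sequence_length <= window:
        # Short protein: take everything and pad the remainder with zeros.
        return 0, sequence_length, window - sequence_length
    start = position_of_interest - window // 2
    start = max(0, min(start, sequence_length - window))  # shift away from edges
    return start, start + window, 0

# Example: a position near the left edge of a long protein.
start, end, pad = crop_window(position_of_interest=10, sequence_length=1000)
```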


In some implementations, a smaller probability, p_sample, is assigned to an MSA being sampled during training if the protein length, L, of the query sequence is short, for example,

p_sample = max(min(L, 512), 64) / 512.





This assignment rebalances the distribution of lengths of the UniRef50 proteins used for training relative to that of human proteins, and also prevents computation from being wasted on padding.


The UniRef50 proteins used for training often have short sequences, whereas a majority of human proteins have long sequences. FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs. Many of the proteins in the UniRef50 HHblits MSAs have a short sequence, while only a few of the human proteins in the MSAs are short. Accordingly, the sampling of longer UniRef50 proteins during training can be increased, such that the sampled distribution of short and long proteins is closer to the distribution of human proteins. Increasing the sampling of long-sequence UniRef50 proteins also increases computational efficiency. When only short-sequence UniRef50 proteins are used as input, the input is padded up to a fixed input shape, which means that computation during the training process is wasted on padding rather than contributing gradients to the model optimization.


The probability of sampling non-query sequences to be included in the first f sequences of an MSA can also be adjusted (e.g., f=32). In one implementation, the periodically-spaced mask grid 210 is applied in a way that penalizes the occurrence of gaps in the first f sequences. The probability, p_mask, of a non-query sequence being masked decreases with an increasing number of gap tokens, N_gap, for example,

p_mask = (L - N_gap)^2 / L^2.






Downsampling of sequences with a considerable number of gaps reduces the fraction of missing data in the MSAs.
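For illustration, a minimal sketch of the two sampling weights defined above follows; the function names are hypothetical:

```python
def sample_probability(length):
    """Sampling weight for a UniRef50 MSA during training, per the formula above."""
    return max(min(length, 512), 64) / 512

def mask_probability(length, num_gaps):
    """Probability of masking a non-query sequence, decreasing with its gap count."""
    return (length - num_gaps) ** 2 / length ** 2
```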


MSA Representation


FIG. 6 depicts one implementation of generating 600 the disclosed MSA representation. Panel A shows the MSA 220. Panel B shows the partially-masked MSA 230. In this example, the periodically-spaced mask grid 210 is applied to the first four sequences of the MSA 220 and has a stride of three. The partially-masked MSA 230 is generated as a result of applying the periodically-spaced mask grid 210 to the MSA 220. In panel C, the unmasked residues and the masked residues in the partially-masked MSA 230 are replaced with corresponding ones of the channel embeddings 400. In one implementation, the corresponding ones of the channel embeddings 400 are summed with position embeddings for residue columns. The position embeddings can be learned and generated during the training of the PrimateAI language model. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is divided into chunks 640. In panel D, the chunks 640 are concatenated in the channel dimension into a stack 660 and then linearly projected 670 to form an MSA representation 680. In some implementations, the linear projection 670 uses a plurality of one-dimensional (1D) convolution filters.


The channel embeddings 400 are also referred to herein as learned embeddings. In one implementation, the masked residues and the unmasked residues in the partially-masked MSA 230 are translated into the learned embeddings by using a look-up table that stores learned embeddings corresponding to the masked residues and the unmasked residues.


The position embeddings are also referred to herein as residue position embeddings. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is also referred to herein as an embedded representation of the partially-masked MSA 230. The learned embeddings are concatenated with the residue position embeddings to generate the embedded representation.


The embedded representation is chunked into the series of chunks 640. The chunks in the series of chunks are concatenated into the stack 660.


The MSA representation 680 is also referred to herein as a projected (or compressed) representation of the embedded representation. The projected representation has m rows and r columns. The stack 660 is translated into the projected representation by using convolution operations, in accordance with one implementation. Note that the projected representation is not compressed at this stage in the sense of making the data smaller. The projected representation is “compressed” or “smaller” only in comparison to what the embedded representation would be if the rows were not stacked, which is why row stacking lowers computational requirements. However, the projected representation is not smaller than the model input in terms of feature dimensionality.


In one implementation, the fixed mask pattern is applied to the first thirty-two sequences of MSAs. The MSA tokens are encoded by learned 96-channel embeddings, which are summed with learned 96-channel position embeddings for residue columns before layer normalization. To reduce computational requirements, embeddings for the 1024 sequences in MSAs are split into thirty-two chunks, each containing thirty-two sequences, at periodic intervals along the sequence axis. These chunks are then concatenated in the channel dimension and mixed by linear projection. In the context of this application, chunks refer to different non-overlapping groups of rows of the MSA. In other implementations, the MSA can be “chunked” in other ways, such as column-wise, or in some other irregular pattern.
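For illustration, the following is a minimal PyTorch sketch of this chunk-and-project step, assuming contiguous chunks of thirty-two sequences and a plain linear projection in place of the one-dimensional convolution filters; the class name and channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class ChunkAndProject(nn.Module):
    """Sketch of the MSA chunking step: 1024 embedded sequences are split into
    32 chunks of 32 sequences, concatenated along the channel axis, and mixed
    by a linear projection."""

    def __init__(self, num_chunks=32, embed_channels=96, out_channels=768):
        super().__init__()
        self.num_chunks = num_chunks
        self.project = nn.Linear(num_chunks * embed_channels, out_channels)

    def forward(self, embeddings):
        # embeddings: (sequences, residues, channels), e.g., (1024, 256, 96).
        s, r, c = embeddings.shape
        chunk_size = s // self.num_chunks
        # Split along the sequence axis into num_chunks groups of chunk_size rows ...
        chunks = embeddings.reshape(self.num_chunks, chunk_size, r, c)
        # ... concatenate the chunks in the channel dimension ...
        stacked = chunks.permute(1, 2, 0, 3).reshape(chunk_size, r, self.num_chunks * c)
        # ... and mix with a linear projection to the working representation.
        return self.project(stacked)   # (32, 256, 768)
```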


PrimateAI Language Model


FIG. 7 illustrates an example architecture 700 of the PrimateAI language model. The PrimateAI language model comprises a cascade of axial-attention blocks 710 (e.g., twelve axial-attention blocks). The cascade of axial-attention blocks 710 takes the MSA representation 680 as input and generates an updated MSA representation 720 as output. Each axial-attention block comprises residuals that add a tied row-wise gated self-attention layer 712, a tied column-wise gated self-attention layer 714, and a transition layer 716.


In one implementation, there are twelve heads in the tied row-wise gated self-attention layer 712. In one implementation, there are twelve heads in the tied column-wise gated self-attention layer 714. Each head generates sixty-four channels, totaling 768 channels across twelve heads. In one implementation, the transition layer 716 projects up to 3072 channels for GELU activation.
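For illustration, the following is a minimal PyTorch sketch of one axial-attention block with the residual row-wise attention, column-wise attention, and transition layers described above; standard multi-head attention is used here in place of the tied, gated attention, and the class name is hypothetical:

```python
import torch
import torch.nn as nn

class AxialAttentionBlock(nn.Module):
    """Sketch of one axial-attention block: residual row-wise attention,
    residual column-wise attention, and a residual transition (GELU MLP)."""

    def __init__(self, channels=768, heads=12, transition_channels=3072):
        super().__init__()
        self.row_norm = nn.LayerNorm(channels)
        self.row_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.col_norm = nn.LayerNorm(channels)
        self.col_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.transition = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, transition_channels),
            nn.GELU(),
            nn.Linear(transition_channels, channels),
        )

    def forward(self, msa):                      # msa: (rows, cols, channels)
        x = self.row_norm(msa)
        msa = msa + self.row_attn(x, x, x)[0]    # attend along each row
        x = self.col_norm(msa).transpose(0, 1)
        msa = msa + self.col_attn(x, x, x)[0].transpose(0, 1)  # attend along columns
        return msa + self.transition(msa)
```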


The technology disclosed modified axial-gated self-attention to include tied attention, instead of triangle attention. Triangle attention has a high computation cost. Tied attention is the sum of dot-product affinities, between queries and keys, across non-padding rows, followed by division by the square root of the number of non-padding rows, which reduces the computational burden substantially.


The discussion now turns to the disclosed mask revelation.


Mask Revelation

The mask revelation reveals unknown values at other mask locations after the cascade of axial-attention blocks 710. The mask revelation gathers features aligned with mask sites. For each masked residue in a row, the mask revelation reveals embedded target tokens at other masked locations in that row.


The mask revelation combines the updated 768-channel MSA representation 720 with 96-channel target token embeddings 690 at locations indicated by a Boolean mask 770 which labels positions of mask tokens. The Boolean mask 770, which is a fixed mask pattern with stride 16, is applied row-wise to gather features from the MSA representation and target token embedding at mask token locations.


Feature gathering reduces row length from 256 to 16, which drastically decreases the computational cost of attention blocks that follow mask revelation. For each location in each row of the gathered MSA representation, the row is concatenated with a corresponding row from the gathered target token embedding where that location is also masked in the target token embedding. The MSA representation and partially revealed target embedding are concatenated in the channel dimension and mixed by a linear projection.


After mask revelation 730, the now-informed MSA representation 740 is propagated through residual row-wise gated self-attention layers 750, 756 and a transition layer 754. The attention is only applied to features at mask locations, as residues are known for other positions from the MSA representation 680 provided as input to the PrimateAI language model. Thus, attention only needs to be applied at mask locations where there is new information from mask revelation.


After interpretation of the mask revelations by self-attention, a masked gather operation 760 collects features from the resulting MSA representation at positions where target token embeddings remained masked. The gathered MSA representation 772 is translated to predictions 790 for 21 candidates in the amino acid and gap token vocabulary by an output head 780. The output head 780 comprises a transition layer and a perceptron.



FIG. 8 shows details 800 of the disclosed mask revelation. Mask revelation allows more information to be used during subsequent training, improving the accuracy of predicting each residue of interest.


The first step is to gather 804, 830, 862 all the tokens at the mask locations 802, 860 marked by the dots. The term gather is used here interchangeably with the term aggregate. This is done for tokens in the updated MSA representation 720, the periodically-spaced mask grid 210, and the embedded representation (embedding tokens) 690.


In FIG. 8, the dashed lines and colors show how an MSA tile 806 and an embedding tile 844 are selected. Feature gathering reduces row length from 256 to 16 (6 to 2 in FIG. 8), which drastically decreases the computational cost of attention blocks that follow mask revelation. Each of the gathered representations is tiled or replicated/cloned 808, 830, 866 by the number of masks in the rows. In the example shown in FIG. 8, there are two masks per row. Therefore, there are two tiles that are concatenated as clones 810 and 870 as a result of cloning 808 and 866, respectively.


Mask revelation 830 is the removal of all the masks in a tile except for those at a single position. The top tile of the gathered masks is masked at the first position-of-interest 834 and unmasked at all the other positions-of-interest 836. The second tile is masked at the second position-of-interest 838 and unmasked at all the other positions-of-interest 832. Mask revelation reveals other tokens in a row for each masked position in the row. In some implementations, positions are masked in the same way in both training and inference. This results in higher performance than changing to only masking the position-of-interest during inference. The position of the location-of-interest in the input is chosen to maximize input information because, for example, when the location-of-interest is centered in the input, more of the flanking columns of the MSA are included in the input that is processed by the PrimateAI language model.


Next, the remaining masks after mask revelation 830 are applied 868 to the embedding tile 844 to produce cloned and masked embedding tiles 870. The cloned and masked embedding tiles 870 are concatenated 872 with the cloned MSA tiles 810 to generate concatenated tiles 873. The concatenated tiles 873 are linearly projected 874 to produce the informed MSA representation 740.
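For illustration, the following is a simplified PyTorch sketch of the tile-reveal-concatenate-project sequence described above, starting from representations already gathered at the mask locations; the function name, tensor shapes, and the use of a plain linear projection are assumptions:

```python
import torch
import torch.nn as nn

def reveal_masks(msa_repr, target_embed, mask_embed, project):
    """Simplified sketch of mask revelation for one MSA.

    msa_repr:     (rows, k, c_msa)  features gathered at the k mask positions
    target_embed: (rows, k, c_emb)  embeddings of the true (target) tokens there
    mask_embed:   (c_emb,)          embedding of the mask token itself
    project:      nn.Linear mixing the concatenated channels

    Returns k tiles; in tile j, the target tokens at every mask position except
    j are revealed, while position j stays concealed.
    """
    rows, k, c_emb = target_embed.shape
    tiles = []
    for j in range(k):
        revealed = target_embed.clone()
        revealed[:, j, :] = mask_embed          # keep position j concealed
        tiles.append(project(torch.cat([msa_repr, revealed], dim=-1)))
    return torch.stack(tiles)                    # (k, rows, k, c_out)

# Illustrative shapes: 32 rows, 16 mask positions per row, 768 + 96 channels.
project = nn.Linear(768 + 96, 768)
out = reveal_masks(torch.randn(32, 16, 768), torch.randn(32, 16, 96),
                   torch.zeros(96), project)
```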


PrimateAI Language Model Components & Training


FIG. 9 shows various components 900 of the PrimateAI language model, in accordance with one implementation. The components can include tied row-wise gated self-attention, row-wise gated self-attention, and column-wise gated self-attention. The PrimateAI language model can also use tied attention. Axial-attention creates independent attention maps for each row and column of the input. Sequences in an MSA usually have similar three-dimensional structures. Direct coupling analysis exploits this fact to learn structural contact information. To leverage this shared structure, it is beneficial to tie the row attention maps between the sequences in the MSA. As an additional benefit, tied attention reduces the memory footprint of the row attentions.


In implementations involving recomputation, tied attention reduces the memory footprint of the row attentions from O(ML²) to O(L²). Let M be the number of rows, d be the hidden dimension, and Qm, Km be the matrices of queries and keys for the m-th row of input. Tied row attention is defined, before softmax is applied, to be:

Σ_{m=1}^{M} Q_m K_m^T / λ(M, d)

The final model uses square root normalization. In other implementations, the model can also use mean normalization. The denominator λ(M, d) plays the role of the normalization constant √d in standard scaled dot-product attention. For tied row attention, two normalization functions are used to prevent the attention weights from scaling linearly with the number of input sequences: λ(M, d) = M√d (mean normalization) and λ(M, d) = √(Md) (square root normalization).
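For illustration, a minimal PyTorch sketch of the tied row-attention logits with the two normalization options above follows; the function name is hypothetical:

```python
import torch

def tied_row_attention_logits(queries, keys, normalization="sqrt"):
    """Tied row-attention logits (pre-softmax), following the formula above.

    queries, keys: (M, L, d) for M rows of length L with hidden dimension d.
    The per-row query-key affinities are summed over rows and divided by
    lambda(M, d): M * sqrt(d) for mean normalization, sqrt(M * d) for
    square root normalization (the one used by the final model).
    """
    M, L, d = queries.shape
    affinities = torch.einsum("mid,mjd->ij", queries, keys)   # sum_m Q_m K_m^T
    lam = (M * d) ** 0.5 if normalization == "sqrt" else M * d ** 0.5
    return affinities / lam                                    # (L, L)
```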


In FIG. 9, dimensions are shown for sequences, s=32, residues, r=256, attention heads, h=12, and channels, c=64 and cMSA=768.


In one implementation, the PrimateAI language model can be trained on four A100 graphics processing units (GPUs). Optimizer steps are for a batch size of 80 MSAs, which is split over four gradient aggregations to fit batches into 40 GB of A100 memory. The PrimateAI language model is trained with the LAMB optimizer using the following parameters: β_1=0.9, β_2=0.999, ε=10⁻⁶, and weight decay of 0.01. Gradients are pre-normalized by division by their global L2 norm before applying the LAMB optimizer. Training is regularized by dropout with probability 0.1, which is applied after activation and before residual connections.
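For illustration, a minimal sketch of the gradient pre-normalization step described above follows; the LAMB optimizer itself is assumed to come from a third-party implementation, and the function name is hypothetical:

```python
import torch

def prenormalize_gradients(parameters, eps=1e-12):
    """Divide all gradients by their global L2 norm before the optimizer step,
    as described above for training with the LAMB optimizer (sketch)."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # Global L2 norm over all parameter gradients (assumes at least one gradient).
    global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    for g in grads:
        g.div_(global_norm + eps)

# Typical use inside a training loop (the optimizer is assumed to be a LAMB
# implementation from a third-party package, configured as above):
#   loss.backward()
#   prenormalize_gradients(model.parameters())
#   optimizer.step(); optimizer.zero_grad()
```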



FIG. 17 illustrates the training of the PrimateAI language model using the LAMB optimizer with gradient pre-normalization. Residual blocks are started as identity operations, which speeds up convergence and enables training of the PrimateAI language model. “AdamW” refers to the ADAM optimizer with weight decay, “ReZeRO” refers to the Zero Redundancy Optimizer, and “LR” refers to the LAMB optimizer with gradient pre-normalization. See, Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, Yang You, Jing Li, Sashank Reddi, et al., International Conference on Learning Representations (ICLR) 2020. As illustrated, the LAMB optimizer with gradient pre-normalization shows better performance (e.g., a higher accuracy rate over fewer training iterations) and is more effective for a range of learning rates compared to the use of the AdamW optimizer and the Zero Redundancy Optimizer.


Axial dropout can be applied in self-attention blocks before residual connections. Post-softmax spatial gating in column-wise attention is followed by column-wise dropout, while post-softmax spatial gating in row-wise attention is followed by row-wise dropout. The post-softmax spatial gating allows for modulation of the exponentially normalized scores or probabilities produced by the softmax.


In one implementation, the PrimateAI language model can be trained for 100,000 parameter updates. The learning rate is linearly increased over the first 5,000 steps from η=5×10⁻⁶ to a peak value of η=5×10⁻⁴, and then linearly decayed to η=10⁻⁴. Automatic mixed precision (AMP) can be applied to cast suitable operations from 32-bit to 16-bit precision during training and inference. This increases throughput and reduces memory consumption without affecting performance. In addition, a Zero Redundancy Optimizer reduces memory usage by sharding optimizer states across multiple GPUs.
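For illustration, a minimal sketch of this learning-rate schedule follows; the function name is hypothetical:

```python
def learning_rate(step, warmup=5_000, total=100_000,
                  lr_start=5e-6, lr_peak=5e-4, lr_end=1e-4):
    """Linear warm-up to the peak learning rate, then linear decay (sketch of
    the schedule described above)."""
    if step < warmup:
        return lr_start + (lr_peak - lr_start) * step / warmup
    return lr_peak + (lr_end - lr_peak) * (step - warmup) / (total - warmup)
```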


Revelation Output Head


FIG. 10 shows one implementation of the revelation output head 780 that can be used by the disclosed PrimateAI language model. The gathered MSA representation 772 can be translated by the output head 780 to predictions 790 for 21 candidates in an amino acid vocabulary including a gap token. In one implementation, an amino acid vocabulary can be enumerated and the amino acid enumerations are used to index a dictionary of learned embeddings. In other implementations, one-hot embeddings of amino acids can be used and combined with linear projections. In some implementations, the revelation output head 780 can comprise a transition layer 1002, a gate 1004, a layer normalization block 1006, a linear block 1008, a GELU block, and another linear block 1012. Dimensions are shown for channels, cMSA=768, and vocabulary size, v=21.
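For illustration, the following is a minimal PyTorch sketch of an output head with the transition, gate, layer-normalization, linear, and GELU components listed above; the hidden size and exact wiring are assumptions:

```python
import torch
import torch.nn as nn

class RevelationOutputHead(nn.Module):
    """Sketch of a revelation output head: a gated transition layer, layer
    normalization, and a two-layer perceptron mapping gathered MSA features
    to logits over the 21-token vocabulary (twenty residues plus the gap)."""

    def __init__(self, channels=768, hidden=768, vocab=21):
        super().__init__()
        self.transition = nn.Linear(channels, hidden)
        self.gate = nn.Linear(channels, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, vocab))

    def forward(self, features):                      # (..., channels)
        gated = self.transition(features) * torch.sigmoid(self.gate(features))
        return self.mlp(self.norm(gated))             # (..., vocab) logits
```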


Method


FIG. 11 is a computer-implemented method 1100 of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.


At action 1102, a multiple sequence alignment (MSA) 220 can be accessed. The MSA can have p rows and r columns. The p rows can correspond to p protein sequences. The r columns can correspond to r residue positions.


At action 1104, a mask grid 210 can be accessed. The mask grid 210 can have m mask distributions. Each of the m mask distributions can have k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid.


At action 1106, the m mask distributions can be applied to m protein sequences in the p protein sequences to generate a partially-masked MSA 230 that contains masked residues and unmasked residues, where p>m. In various implementations, p>=m.


At action 1108, the masked residues and the unmasked residues can be translated into learned embeddings 400, the learned embeddings 400 can be concatenated with residue position embeddings to generate an embedded representation (embedding token) 690 of the partially-masked MSA 230.


At action 1110, the embedded representation 690 can be chunked (or split) into a series of chunks 640, chunks in the series of chunks 640 can be concatenated into a stack 650, and the stack 650 can be translated into a compressed representation 680 of the embedded representation 690. The compressed representation 680 can have m rows and r columns.


At action 1112, axial-attention 710 can be iteratively (or sequentially) applied across the m rows and the r columns of the compressed representation, and the applied attention can be interleaved (with transition layers) to generate an updated representation 720 of (or from) the compressed representation 680. The updated representation 720 can have m rows and r columns.


At action 1114, k updated representation tiles 810 can be aggregated from the updated representation 720. Each of the k updated representation tiles 810 can contain those updated representation features of the updated representation 720 that correspond to the masked residues. Each of the k updated representation tiles can have m rows and k columns. A given column in the k columns of a given updated representation tile 806 can contain a respective subset of the updated representation features. The respective subset can be located at a given ordinal position in the k ordinal positions. The given ordinal position can be represented by the given column.


At action 1116, k embedding tiles 870 corresponding to the k updated representation tiles 810 can be aggregated from the embedded representation 690. Each of the k embedding tiles 844 can contain those embedding features in a first chunk of the series of chunks that are translations of the masked residues. Each of the k embedding tiles can have m rows and k columns. A given column in the k columns of a given embedding tile can contain a respective subset of the embedding features. The respective subset can be located at a given ordinal position in the k ordinal positions. The given ordinal position can be represented by the given column.


At action 1118, k Boolean tiles 834, 838 can be applied to the k embedding tiles to generate k Booleaned (partially revealed) embedding tiles. Each of the k Boolean tiles can have m rows and k columns. Each of the k Boolean tiles can cause concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and can cause revelation of other ones of the k columns in the corresponding one of the k embedding tiles. Each of the k Booleaned embedding tiles can have m rows and k columns.


At action 1120, the k Booleaned (partially revealed) embedding tiles 870 can be concatenated with the k updated representation tiles 810 to generate k concatenated tiles 873, and the k concatenated tiles 873 can be translated into k compressed tile representations (informed MSA representation 740) of the k concatenated tiles 873. Each of the k compressed tile representations can have m rows and k columns.


At action 1122, self-attention 750, 754, 756 can be iteratively applied to the k compressed tile representations 740 to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles.


At action 1124, those interpreted features can be aggregated from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations (gathered MSA representation 772). The aggregated representation can have m rows and k columns.


At action 1126, the aggregated representation 772 can be translated into identities 790 of the masked residues.
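
A minimal sketch of an output head for action 1126 follows, mapping the aggregated (gathered) representation to residue-identity logits. The vocabulary size (twenty amino acids plus gap and mask tokens) and the exact transition-plus-perceptron composition are assumptions for illustration.

```python
# Minimal sketch (assumed vocabulary of 20 residues + gap + mask): the aggregated
# (m x k x d) representation is mapped to per-position residue logits.
import torch
import torch.nn as nn

m, k, d, vocab = 16, 8, 64, 22
aggregated = torch.randn(m, k, d)                     # gathered MSA representation

output_head = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.GELU(),
                            nn.Linear(d, vocab))      # transition + perceptron logic
logits = output_head(aggregated)                      # (m, k, vocab)
identities = logits.argmax(dim=-1)                    # predicted identities of the masked residues
```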


System


FIG. 12 is a system 1200 that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.


A memory 1202 can store a multiple sequence alignment (MSA) with a plurality of masked residues.


A chunking logic 1204 can be configured to chunk the MSA into a series of chunks.


A first attention logic 1206 can be configured to attend to a representation of the series of chunks and produce a first attention output.


A first aggregation logic 1208 can be configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues. In one implementation, the features include elements of an MSA, such as one-hot encodings of amino acids in the MSA.


A mask revelation logic 1210 can be configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues.


A second attention logic 1212 can be configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask.


A second aggregation logic 1214 can be configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask.


An output logic 1216 can be configured to produce identifications of the masked residues based on the second aggregated output.


Objective Indicia of Inventiveness and Non-Obviousness


FIG. 13 shows the performance evaluation 1300 of the language modelling part of the PrimateAI language model (LM) compared to the replicated variational autoencoder (VAE) part of the EVE (Evolutionary model of Variant Effect) model (J. Frazer et al., Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91-95 (2021)), labelled "EVE*", and to their combined score (labelled "PrimateAI LM+EVE*-only"). The performance is further compared to a selection of competitive unsupervised methods (ESM1v, SIFT, LIST-S2). In a clockwise direction starting from the top left, the individual panels correspond to evaluation on DDD vs UKBB, Assays, ClinVar, ASD, CHD, DDD, and UKBB. For Assays and UKBB, the summary statistics are given in terms of the absolute value (|corr|) of the correlation between the score and an experimental measure of pathogenicity, i.e., mean phenotype (UKBB) or assay score (Assays). For DDD, we calculate the P-value of the Wilcoxon rank-sum test for the control and case distributions over all datasets. For ClinVar, we measure the AUC averaged over all genes.


Evaluation Datasets

Saturation Mutagenesis Assays


Performance of the PrimateAI language model is compared using deep mutational scanning assays for the following 9 genes: Amyloid-beta, YAP1, MSH2, SYUA, VKOR1, PTEN, BRCA1, TP53, and ADRB2. A few assays of genes for which the prediction scores of some classifiers are unavailable are excluded from the evaluation analysis, including TPMT, RASH, CALM1, UBE2I, SUMO1, TPK1, and MAPK1. Also excluded are assays of KRAS (due to different transcript sequence), SLCO1B1 (only 137 variants), and Amyloid-beta. Performance of the PrimateAI language model is evaluated by computing the absolute Spearman rank correlation between model prediction scores and assay scores individually for each assay and then taking the mean across all assays.
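
A minimal sketch of this evaluation metric follows, assuming a hypothetical table with 'assay', 'model_score', and 'assay_score' columns; the column names and data layout are assumptions, not part of the disclosure.

```python
# Minimal sketch: absolute Spearman correlation per assay, then the mean across assays.
import pandas as pd
from scipy.stats import spearmanr

def mean_abs_spearman(df: pd.DataFrame) -> float:
    """df columns (assumed): 'assay', 'model_score', 'assay_score'."""
    corrs = []
    for _, group in df.groupby("assay"):
        rho, _ = spearmanr(group["model_score"], group["assay_score"])
        corrs.append(abs(rho))
    return sum(corrs) / len(corrs)
```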


UK Biobank


The UK Biobank (UKBB) dataset contains 61 phenotypes across 100 genes. Evaluating only on variants common to all methods reduces this to 41 phenotypes across 42 genes. The absolute Spearman rank correlation is calculated between the predicted pathogenicity scores and the quantitative phenotype scores for each gene/phenotype pair. Only gene/phenotype pairs with at least 10 variants are included in the evaluation (14 phenotypes across 16 genes). The evaluation was confirmed to be robust to this choice of threshold.


ClinVar


Performance of the PrimateAI language model in classifying clinical labels of ClinVar missense variants as benign or pathogenic is benchmarked. Variants labelled "benign" and "likely benign" are both considered benign; likewise, variants labelled "pathogenic" and "likely pathogenic" are both considered pathogenic. To ensure high-quality labels, only ClinVar variants with 1-star review status or above (including "criteria provided, single submitter", "criteria provided, multiple submitters, no conflicts", "reviewed by expert panel", and "practice guideline") are included. This reduces the number of variants from 36,705 to 22,165 for the pathogenic class and from 41,986 to 39,560 for the benign class. The area under the receiver operating characteristic curve (AUC) is calculated for each gene, and the mean AUC across all genes is reported.
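
A minimal sketch of the per-gene AUC computation follows, assuming a hypothetical table with 'gene', 'score', and binary 'label' columns; the column names are assumptions for illustration.

```python
# Minimal sketch: AUC per gene on ClinVar labels, then the mean AUC over genes.
import pandas as pd
from sklearn.metrics import roc_auc_score

def mean_gene_auc(df: pd.DataFrame) -> float:
    """df columns (assumed): 'gene', 'score', 'label' (1 = pathogenic, 0 = benign)."""
    aucs = []
    for _, group in df.groupby("gene"):
        if group["label"].nunique() == 2:           # AUC needs both classes present
            aucs.append(roc_auc_score(group["label"], group["score"]))
    return sum(aucs) / len(aucs)
```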


DDD/ASD/CHD De Novo Missense Variants


To evaluate the performance of the deep learning network in clinical settings, de novo mutations from published studies of intellectual disorders, including autism spectrum disorder (ASD) and developmental disorders (DDD), are obtained. The ASD cohort contains 2,127 patients with at least one de novo missense (DNM) mutation, for a total of 3,135 DNM mutations. This reduces to 517 patients with at least one DNM variant and a total of 558 DNM variants after requiring that all methods have predictions for those variants. In DDD, 17,952 patients had at least one de novo missense variant (26,880 variants in total), reducing to 5,872 patients (6,398 variants) after requiring availability of predictions from all methods. A set of DNM variants from patients with congenital heart disorders (CHD) is also obtained, consisting of 1,839 de novo missense variants from 1,342 patients (reducing to 314 variants from 299 patients after requiring availability of predictions from all methods). For all three datasets of de novo variants from affected patients, a shared set of DNM variants from healthy controls is used, which contains 1,823 DNM variants from 1,215 healthy controls with at least one DNM variant, collected from multiple studies. This set is reduced to 250 variants (235 controls) after requiring availability of variant prediction scores from all methods. For each disease set of DNMs, the Mann-Whitney U test is applied to evaluate how well each classifier distinguishes the DNM set of patients from that of controls.
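
A minimal sketch of the case-versus-control test follows, assuming the per-variant scores have already been collected into two lists; the function name and argument layout are assumptions for illustration.

```python
# Minimal sketch: Mann-Whitney U test comparing scores of case DNMs vs. control DNMs.
from scipy.stats import mannwhitneyu

def case_control_pvalue(case_scores, control_scores) -> float:
    """case_scores / control_scores (assumed): iterables of per-variant pathogenicity scores."""
    _, p_value = mannwhitneyu(case_scores, control_scores, alternative="two-sided")
    return p_value
```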


Methods for Comparison


Predictions from other methods were evaluated using rank scores downloaded from the database for functional prediction dbNSFP4.2a. To avoid dramatic reductions in the number of common variants, methods with incomplete sets of scores (methods covering fewer than 67 of the 71 million possible missense variants in hg38) are removed, except Polyphen2 due to its widespread adoption. We included the following methods (method abbreviation) for comparison: BayesDel_noAF (BayesDel), CADD_raw (CADD), DANN, DEOGEN2, LIST-S2, M-CAP, MutationTaster_converted (MutationTaster), PROVEAN_converted (PROVEAN), Polyphen2_HVAR (Polyphen2; due to better performance than Polyphen2_HDIV), PrimateAI, Revel (REVEL), SIFT_converted (SIFT), VEST4, and fathmm-MKL_coding (fathmm-MKL; highest performance among the fathmm models for the given benchmarks).


Applying EVE to More Proteins


In the original publication, EVE is only applied to a small set of disease-associated genes in ClinVar. To generate the disclosed language model-based training data set, it is essential to expand the predictions of EVE to as many proteins as possible. Due to the unavailability of EVE source code, a similar method, DeepSequence, is applied, and DeepSequence scores are converted into EVE scores by fitting Gaussian mixture models. An up-to-date version of UniRef100 is used, but otherwise the alignment depth and sequence coverage filtering steps described in EVE are followed. At least one prediction is achieved in 18,920 proteins, for a total of 50.2M predicted variants out of 71.2M possible missense variants. To validate the disclosed replication, the replicated EVE models are evaluated using published variants from EVE. Scores from the replicated EVE model result in comparable performance to the published EVE software on all benchmarking datasets, e.g., both methods achieve 0.41 mean absolute correlation on Assays and 0.22 mean absolute correlation on UKBB.
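
The conversion of DeepSequence scores into EVE-style scores by fitting Gaussian mixture models can be sketched as below. The two-component mixture, per-gene fitting, and the choice of the lower-mean component as the pathogenic component are assumptions made for illustration, not the disclosed procedure.

```python
# Minimal sketch (assumed conversion): fit a two-component Gaussian mixture to per-gene
# DeepSequence scores and use the posterior probability of the more pathogenic component.
import numpy as np
from sklearn.mixture import GaussianMixture

def eve_style_scores(deepsequence_scores: np.ndarray) -> np.ndarray:
    x = deepsequence_scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    posteriors = gmm.predict_proba(x)                    # (n_variants, 2)
    pathogenic = int(np.argmin(gmm.means_.ravel()))      # assumed: lower mean = more pathogenic
    return posteriors[:, pathogenic]
```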


Benchmarking PrimateAI Language Model Against Other Sequence-Only Models for Pathogenicity Predictions


The PrimateAI language model falls into a class of methods trained only to model protein sequences yet performing surprisingly well as pathogenicity predictors. Despite not achieving the overall best performance by themselves, such models provide crucial features or components for classifiers that incorporate more diverse data. FIG. 13 summarizes the evaluation performance of the PrimateAI language model against other such sequence-only methods for pathogenicity prediction: ESM1v, EVE, LIST-S2, and SIFT. Our language model outperforms another language model, ESM1v, on all the testing datasets except assays, using only 1/50th of the training time. This is particularly striking as the PrimateAI LM does not rely on any fine-tuning on assays.


Combining PrimateAI Language Model with EVE


Language models are trained to model the entire universe of proteins, whereas EVE trains a separate model for each human protein and all similar sequences. This, together with the differences in model architecture and training algorithms, suggests that the models extract distinct features from their input. Therefore, we expected the scores from EVE and our language model to be complementary and that combining scores may result in improved performance. We found that simply taking the mean of their pathogenicity scores already performs better than either of the two methods alone. More elaborate combinations, e.g., using ridge regression, did not lead to any further improvements. The resulting performance is shown in FIG. 13, where the combined score leads to a performance gain of 6.6% (or 6.8%) in mean correlation on assays compared to the PrimateAI LM (or compared to replicated EVE), a 1.4% (or 1.7%) improvement in mean AUC on ClinVar, and increases in P-value by 11% (29%) for DDD, 3% (26%) for ASD, and 17% (23%) for CHD.
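
A minimal sketch of the score combination follows. Only the simple per-variant averaging is described above; the assumption that the two scores are already on comparable scales (e.g., both rank-normalized) is ours, not the disclosure's.

```python
# Minimal sketch: per-variant mean of the two pathogenicity scores.
import pandas as pd

def combined_score(lm_scores: pd.Series, eve_scores: pd.Series) -> pd.Series:
    # Assumes both Series are index-aligned by variant and on comparable scales.
    return (lm_scores + eve_scores) / 2.0
```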


Top-1 Training Accuracy



FIG. 14 depicts the Top-1 training accuracy 1400 of the PrimateAI language model. An ensemble of six PrimateAI language model networks was trained with different random seeds for training data sampling and model parameter initialization. Their top-1 accuracies during training are shown in FIG. 14 for mask locations in the query sequence and all sequences in UniRef50 MSAs. Top-1 accuracy for the query sequence is much lower than for all sequences as the query sequence does not contain gap tokens, which are easier to predict than residues because gap tokens often form long and contiguous segments in MSAs. The PrimateAI language model accuracy on query sequences continues to improve with training. In some implementations, convergence can be accelerated by adding auxiliary losses to each layer of the PrimateAI language model.


Entropy and Pathogenicity Score


Scores of the PrimateAI language model can be tabulated for future reference, rather than re-running the model every time its scores are needed. For example, the PrimateAI language model's fill-in-the-blank predictions can be provided for locations of interest at every site in 19,071 human proteins, totaling predictions for 2,057,437,040 variants at 108,286,160 positions. A person skilled in the art will appreciate that these numbers would change, for example, if the small number of human proteins that were not included here were included. In some implementations, the PrimateAI language model can be ensembled to produce averaged scores that have higher performance than individual model scores. For example, each prediction can be made by an ensemble of six models, with each model contributing at least four inferences with different random seeds for sampling and ordering of sequences in human MSAs. Inference logits can be averaged by taking means of predictions grouped by random seed, and then taking the mean of the means.
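
A minimal sketch of the two-stage logit averaging (mean per random seed, then mean of the per-seed means) follows, assuming a hypothetical array layout in which each inference is tagged with its seed.

```python
# Minimal sketch (assumed layout): average logits within each random seed, then average
# the per-seed means.
import numpy as np

def ensemble_logits(logits: np.ndarray, seeds: np.ndarray) -> np.ndarray:
    """logits: (n_inferences, vocab); seeds: (n_inferences,) seed id per inference."""
    per_seed_means = [logits[seeds == s].mean(axis=0) for s in np.unique(seeds)]
    return np.mean(per_seed_means, axis=0)               # mean of the per-seed means
```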


Pathogenicity prediction of a variant can be evaluated using the relative values of the logits for the reference and alternative amino acids, for example by subtracting the logit value for the reference amino acid from the logit value for the alternative amino acid. The probabilities are normalized over all possible residues, disregarding the gap token, such that Σ_r p_r = 1, with the probability p_r of the r-th residue obtained from the ensembled logits. The log difference captures how unlikely the variant amino acid is compared to the reference amino acid. However, this score does not consider the predictions for the other 18 possible amino acids, which contain information about the language model's internal estimate of protein site conservation as well as the convergence of the language model. The entropy evaluated over the amino acid predictions, S = −Σ_r p_r log(p_r), with probability p_r of the r-th residue, is used to capture a variant-agnostic, site-dependent contribution to the pathogenicity score. Specifically, the score s_alt for the alternative residue at a given site is given by the usual log difference of the alternative and reference logits at that site minus the entropy over amino acids at that site, i.e., s_alt = log(p_alt) − log(p_ref) − S.
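
A minimal sketch of the entropy-adjusted score at a single site follows, assuming the gap token has already been excluded from the logits; the function name and the numerical-stability details are assumptions for illustration.

```python
# Minimal sketch: softmax over the 20 amino acids (gap excluded), then
# s_alt = log(p_alt) - log(p_ref) - S, where S is the entropy over amino acids.
import numpy as np

def entropy_adjusted_score(aa_logits: np.ndarray, ref_idx: int, alt_idx: int) -> float:
    """aa_logits (assumed): ensembled logits over the 20 amino acids at one site."""
    p = np.exp(aa_logits - aa_logits.max())
    p /= p.sum()                                          # sum_r p_r = 1, gap disregarded
    entropy = -np.sum(p * np.log(p + 1e-12))              # S = -sum_r p_r log(p_r)
    return float(np.log(p[alt_idx]) - np.log(p[ref_idx]) - entropy)
```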


The entropy term is small whenever the probability mass is dominated by a single amino acid and large whenever the model is uncertain and assigns high probability to multiple residues. Physically, in the latter case the site is weakly conserved and likely to mutate, which should correspond to a weaker pathogenic signal. Adjusting the scores by entropy therefore incorporates a model-internal estimate of amino acid conservation: a given log difference between the alternative and reference residues is considered more pathogenic when it is associated with a highly conserved site. The score adjustment additionally compensates for the lack of convergence associated with a heavily undertrained model.


“Logic” (e.g., masking logic), as used herein, can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The “logic” can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.


Computer System


FIG. 15 shows a computer system 1500 that can be used for compilation and runtime execution of the PrimateAI language model. Computer system 1500 includes at least one central processing unit (CPU) 1572 that communicates with a number of peripheral devices via bus subsystem 1555. These peripheral devices can include a storage subsystem 1510 including, for example, memory devices and a file storage subsystem 1536, user interface input devices 1538, user interface output devices 1576, and a network interface subsystem 1574. The input and output devices allow user interaction with computer system 1500. Network interface subsystem 1574 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.


In one implementation, the pathogenicity predictor 150 (e.g., the PrimateAI language model) is communicably linked to the storage subsystem 1510 and the user interface input devices 1538.


User interface input devices 1538 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1500.


User interface output devices 1576 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1500 to the user or to another machine or computer system.


Storage subsystem 1510 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 1578.


Processors 1578 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 1578 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 1578 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX15 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.


Memory subsystem 1522 used in the storage subsystem 1510 can include a number of memories including a main random access memory (RAM) 1532 for storage of instructions and data during program execution and a read only memory (ROM) 1534 in which fixed instructions are stored. A file storage subsystem 1536 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1536 in the storage subsystem 1510, or in other machines accessible by the processor.


Bus subsystem 1555 provides a mechanism for letting the various components and subsystems of computer system 1500 communicate with each other as intended. Although bus subsystem 1555 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.


Computer system 1500 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1500 depicted in FIG. 15 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 1500 are possible having more or fewer components than the computer system depicted in FIG. 15.


Clauses

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.


One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer-readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer-readable storage medium (or multiple such media).


The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These and other features, aspects, and advantages of the technology disclosed will become apparent from the following detailed description of illustrative implementations thereof, which is to be read in connection with the accompanying drawings. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.


Other implementations of the clauses described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.


We disclose the following clauses:


Clause Set 1


1. A computer-implemented method of variant pathogenicity prediction, including:

  • accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences;
  • applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence;
  • cropping a portion of the multiple sequence alignment that includes
    • (i) the set of periodically-spaced masks at the first set of positions, and
    • (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and
  • generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.


    2. The computer-implemented method of clause 1, wherein the multiple sequence alignment aligns the query residue sequence to the plurality of non-query residue sequences along a per-position dimension and along a per-sequence dimension.


    3. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is applied along the per-sequence dimension within a window of sequences in the multiple sequence alignment.


    4. The computer-implemented method of clause 3, wherein the set of periodically-spaced masks is applied along the per-position dimension within a window of positions across the window of sequences in the multiple sequence alignment.


    5. The computer-implemented method of clause 4, wherein the portion spans the window of positions across the multiple sequence alignment.


    6. The computer-implemented method of clause 4, wherein the portion spans the window of positions across a subset of sequences in the multiple sequence alignment.


    7. The computer-implemented method of clause 1, wherein the portion has a predetermined width and a predetermined height.


    8. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have widths smaller than the predetermined width of the portion.


    9. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have heights smaller than the predetermined height of the portion.


    10. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is distributed along the per-sequence dimension into subsets of periodically-spaced masks.


    11. The computer-implemented method of clause 10, wherein the subsets of periodically-spaced masks correspond to sequences in the window of sequences.


    12. The computer-implemented method of clause 11, wherein successive masks in a subset of periodically-spaced masks corresponding to a given sequence in the window of sequences are spaced apart by unmasked residues in the given sequence.


    13. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart is same across the sequences in the window of sequences.


    14. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart varies across the sequences in the window of sequences.


    15. The computer-implemented method of clause 12, wherein a starting position in a given sequence at which a corresponding subset of periodically-spaced masks begins varies between the sequences in the window of sequences.


    16. The computer-implemented method of clause 12, wherein the starting position follows a diagonal pattern across the sequences in the window of sequences.


    17. The computer-implemented method of clause 14, wherein the starting position follows a diagonal pattern that begins to repeat at least once across the sequences in the window of sequences.


    18. The computer-implemented method of clause 17, wherein the starting position follows a diagonal pattern that repeats at least once across the sequences in the window of sequences.


    19. The computer-implemented method of clause 1, wherein the set of periodically-spaced masks has a pattern.


    20. The computer-implemented method of clause 19, wherein the pattern is a diagonal pattern.


    21. The computer-implemented method of clause 19, wherein the pattern is a hexagonal pattern.


    22. The computer-implemented method of clause 19, wherein the pattern is a diamond pattern.


    23. The computer-implemented method of clause 19, wherein the pattern is a rectangle pattern.


    24. The computer-implemented method of clause 19, wherein the pattern is a square pattern.


    25. The computer-implemented method of clause 19, wherein the pattern is a triangle pattern.


    26. The computer-implemented method of clause 19, wherein the pattern is a convex pattern.


    27. The computer-implemented method of clause 19, wherein the pattern is a concave pattern.


    28. The computer-implemented method of clause 19, wherein the pattern is a polygonal pattern.


    29. The computer-implemented method of clause 19, further including right-shifting a cropping window used for the cropping to minimize padding of the portion.


    30. The computer-implemented method of clause 29, further including left-shifting the cropping window to minimize the padding of the portion.


    31. The computer-implemented method of clause 1, further including configuring the cropping window to position the position-of-interest in a center column of the portion.


    32. The computer-implemented method of clause 31, further including configuring the cropping window to position the position-of-interest adjacent to the center column.


    33. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions with learned mask embeddings, and substituting, in the portion, the second set of residues at the second set of positions with learned residue embeddings.


    34. The computer-implemented method of clause 33, wherein a one-hot encoding generator generates the learned mask embeddings and the learned residue embeddings.


    35. The computer-implemented method of clause 34, wherein the learned mask embeddings and the learned residue embeddings are selected from a look-up table.


    36. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions and the second set of residues at the second set of positions with learned position embeddings.


    37. The computer-implemented method of clause 36, further including chunking the portion with the learned mask embeddings, the learned residue embeddings, and the learned position embeddings into a plurality of chunks.


    38. The computer-implemented method of clause 37, further including processing the plurality of chunks as an aggregate and generating an alternative representation of the portion.


    39. The computer-implemented method of clause 38, wherein a linear projection layer uses a filter bank of 1×1 convolutions to process the plurality of chunks as the aggregate and generate the alternative representation of the portion.


    40. The computer-implemented method of clause 39, further including processing the alternative representation of the portion through a cascade of attention blocks to generate an updated alternative representation of the portion.


    41. The computer-implemented method of clause 40, wherein attention blocks in the cascade of attention blocks use self-attention.


    42. The computer-implemented method of clause 41, wherein each of the attention blocks includes a tied row-wise gated self-attention, followed by a column-wise gated self-attention, followed by a transition logic.


    43. The computer-implemented method of clause 40, wherein the attention blocks use cross-attention.


    44. The computer-implemented method of clause 40, wherein a mask revelation block processes the updated alternative representation of the portion and generates an informed alternative representation of the portion.


    45. The computer-implemented method of clause 44, wherein the mask revelation block gathers features aligned with masked locations in a row and, for each mask in the row, reveals embedded target tokens at other masked locations in the row.


    46. The computer-implemented method of clause 44, wherein a mask gather block processes the informed alternative representation of the portion and generates a gathered alternative representation of the portion.


    47. The computer-implemented method of clause 46, wherein the mask gather block processes the informed alternative representation through a cascade of transition logic and row-wise gated self-attention blocks that gather features where target embeddings remained masked.


    48. The computer-implemented method of clause 47, wherein an output block processes the gathered alternative representation of the portion and predicts identities of residues masked by the set of periodically-spaced masks.


    49. The computer-implemented method of clause 48, wherein the output block includes a transition logic and a perceptron logic.


    50. The computer-implemented method of clause 48, wherein a probability of applying a subset of periodically-spaced masks to a non-sequence in the window of sequences is proportional to (1−a number of gap tokens in the non-sequence)−2.


    51. The computer-implemented method of clause 1, further including generating the pathogenicity prediction for the variant based on a difference between a log probability of the variant and a log probability of a corresponding reference amino acid less an entropy evaluated over amino acid-wise predictions.


Clause Set 2


1. A computer-implemented method, including:


accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;


accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid;


applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p>m;


translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;


chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation, wherein the compressed representation has m rows and r columns;


iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation, wherein the updated representation has m rows and r columns;


aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;


aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;


applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles has m rows and k columns, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles, and wherein each of the k Booleaned embedding tiles has m rows and k columns;


concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles, wherein each of the k compressed tile representations has m rows and k columns;


iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;


aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations, wherein the aggregated representation has m rows and k columns; and


translating the aggregated representation into identities of the masked residues.


2. The computer-implemented method of clause 1, further including using a one-hot encoding scheme to translate twenty naturally-occurring residues, a gap residue, and a mask into respective one-hot encoded vectors.


3. The computer-implemented method of clause 2, further including training a neural network to generate respective learned embeddings for the respective one-hot encoded vectors.


4. The computer-implemented method of clause 3, wherein the masked residues and the unmasked residues are translated into the learned embeddings based on a lookup table that maps the respective one-hot encoded vectors to the respective learned embeddings.


5. The computer-implemented method of clause 4, wherein the residue position embeddings specify an order in which residues are arranged in the p protein sequences.


6. The computer-implemented method of clause 1, wherein the chunks are concatenated into the stack along a channel dimension.


7. The computer-implemented method of clause 1, wherein the stack is translated into the compressed representation by processing the stack through a linear projection.


8. The computer-implemented method of clause 7, wherein the linear projection uses a plurality of one-dimensional (1D) convolution filters.


9. The computer-implemented method of clause 8, wherein the k concatenated tiles are translated into the k compressed tile representations by processing the k concatenated tiles through the linear projection.


10. The computer-implemented method of clause 1, wherein the aggregated representation is translated into the identities of the masked residues by processing the aggregated representation through a revelation output head.


11. The computer-implemented method of clause 1, wherein p=m.


12. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of the corresponding one of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.


13. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of a corresponding subset of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.


14. The computer-implemented method of clause 1, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.


15. A system, comprising:


memory storing a multiple sequence alignment (MSA) with a plurality of masked residues;


chunking logic configured to chunk the MSA into a series of chunks;


first attention logic configured to attend to a representation of the series of chunks and produce a first attention output;


first aggregation logic configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues;


mask revelation logic configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues;


second attention logic configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask;


second aggregation logic configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask; and


output logic configured to produce identifications of the masked residues based on the second aggregated output.


16. The system of clause 15, wherein the first attention logic uses axial-attention.


17. The system of clause 15, wherein the second attention logic uses self-attention.


18. A computer-implemented method, including:


accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;


accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions;


applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p>m;


translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;


chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation;


iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation;


aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues;


aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues;


applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles;


concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles;


iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;


aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations; and


translating the aggregated representation into identities of the masked residues.


19. The computer-implemented method of clause 18, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at varying offsets from a first residue position in the mask grid.


20. The computer-implemented method of clause 19, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.


21. The computer-implemented method of clause 18, wherein the compressed representation has m rows and r columns.


22. The computer-implemented method of clause 18, wherein the updated representation has m rows and r columns.


23. The computer-implemented method of clause 18, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.


24. The computer-implemented method of clause 18, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.


25. The computer-implemented method of clause 18, wherein each of the k Boolean tiles has m rows and k columns.


26. The computer-implemented method of clause 18, wherein each of the k Booleaned embedding tiles has m rows and k columns.


27. The computer-implemented method of clause 18, wherein each of the k compressed tile representations has m rows and k columns.


28. The computer-implemented method of clause 18, wherein the aggregated representation has m rows and k columns.

Claims
  • 1. A computer-implemented method of variant pathogenicity prediction, including: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
  • 2. The computer-implemented method of claim 1, wherein the multiple sequence alignment aligns the query residue sequence to the plurality of non-query residue sequences along a per-position dimension and along a per-sequence dimension.
  • 3. The computer-implemented method of claim 2, wherein the set of periodically-spaced masks is applied along the per-sequence dimension within a window of sequences in the multiple sequence alignment.
  • 4. The computer-implemented method of claim 3, wherein the set of periodically-spaced masks is applied along the per-position dimension within a window of positions across the window of sequences in the multiple sequence alignment.
  • 5. The computer-implemented method of claim 1, wherein the portion has a predetermined width and a predetermined height.
  • 6. The computer-implemented method of claim 5, wherein the portion is padded to compensate for multiple sequence alignments that have widths smaller than the predetermined width of the portion.
  • 7. The computer-implemented method of claim 2, wherein the set of periodically-spaced masks is distributed along the per-sequence dimension into subsets of periodically-spaced masks.
  • 8. The computer-implemented method of claim 7, wherein the subsets of periodically-spaced masks correspond to sequences in a window of sequences.
  • 9. The computer-implemented method of claim 1, wherein the set of periodically-spaced masks has a pattern.
  • 10. The computer-implemented method of claim 9, further including right-shifting a cropping window used for the cropping to minimize padding of the portion.
  • 11. The computer-implemented method of claim 10, further including left-shifting the cropping window to minimize the padding of the portion.
  • 12. The computer-implemented method of claim 1, further including configuring a cropping window to position the position-of-interest in a center column of the portion.
  • 13. The computer-implemented method of claim 12, further including configuring the cropping window to position the position-of-interest adjacent to the center column.
  • 14. The computer-implemented method of claim 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions with learned mask embeddings, and substituting, in the portion, the second set of residues at the second set of positions with learned residue embeddings.
  • 15. The computer-implemented method of claim 14, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions and the second set of residues at the second set of positions with learned position embeddings.
  • 16. The computer-implemented method of claim 15, further including chunking the portion with learned mask embeddings, the learned residue embeddings, and the learned position embeddings into a plurality of chunks.
  • 17. The computer-implemented method of claim 16, further including processing the plurality of chunks as an aggregate and generating an alternative representation of the portion.
  • 18. The computer-implemented method of claim 1, further including generating the pathogenicity prediction for the variant based on a difference between a log probability of the variant and a log probability of a corresponding reference amino acid less an entropy evaluated over amino acid-wise predictions.
  • 19. A system including one or more processors coupled to memory, the memory loaded with computer instructions to predict variant pathogenicity, the instructions, when executed on the one or more processors, implement actions comprising: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
  • 20. A non-transitory computer readable storage medium impressed with computer program instructions to predict variant pathogenicity, the instructions, when executed on a processor, implement actions comprising: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
PRIORITY APPLICATIONS

This application claims the benefit of and priority to the following: U.S. Provisional Patent Application No. 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV); U.S. Provisional Patent Application No. 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV); U.S. Provisional Patent Application No. 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NON-HUMAN PRIMATES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV); U.S. Provisional Patent Application No. 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV); U.S. Provisional Patent Application No. 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); and U.S. Provisional Patent Application No. 63/294,830, titled “SPECIES-DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV). The priority applications are incorporated by reference as if fully set forth herein.

Provisional Applications (6)
Number Date Country
63294830 Dec 2021 US
63294828 Dec 2021 US
63294827 Dec 2021 US
63294820 Dec 2021 US
63294816 Dec 2021 US
63294813 Dec 2021 US