This specification relates to sparse-code utilization for mFISH imaging.
Multiplexed fluorescence in-situ hybridization (mFISH) imaging is a powerful technique to determine gene expression in spatial transcriptomics. In brief, a sample is exposed to multiple oligonucleotide probes that target RNA of interest. Then sequential rounds of fluorescence images are acquired with exposure to excitation light of different wavelengths and/or photobleaching followed by exposure to further rounds of oligonucleotide probes. For each given pixel, the fluorescence intensities from the different images form a signal sequence. This sequence is then compared to a library of reference codes from a codebook that associates each code with a gene. The best matching reference code is used to identify an associated gene that is expressed at that pixel in the image.
The codebook used to identify genes can include a number of negative control code words. These code words are generated by randomly assigning an on- or off-value to each bit of a code word, creating signal sequences that do not correspond to any gene in the sample. The negative control code words are used to differentiate true positive, false positive, and blank matches found in image sequences generated during imaging. The signal corresponding to the most commonly matched negative control code word determines the lowest signal that needs additional identification information to be confidently matched to a gene.
In one aspect, a method of spatial transcriptomics includes receiving a plurality of images of a sample from an mFISH imaging system and, for each pixel of a plurality of pixels registered across the plurality of images, generating a pixel word from the intensity values of that pixel in the plurality of images, with each pixel word represented by a sequence of N intensity values. For each pixel of the plurality of pixels, the pixel word for the pixel is compared to a codebook including a plurality of code words, and a closest matching code word of the plurality of code words to the pixel word is identified. Each code word is represented by a sequence of N bits. The plurality of code words includes a plurality of gene-identifying code words and a plurality of negative control code words, and the plurality of negative control code words have an equal number of on-values. On-values of the plurality of negative control code words are evenly distributed across the N bits such that each ordinal position in the sequence of N bits has a same total number of on-bits from the plurality of negative control code words. A gene or error associated with the closest matching code word is determined, and for at least one pixel of the plurality of pixels an association of the pixel with the gene or error is stored.
In another aspect, a method of generating a codebook includes obtaining a first plurality of gene-identifying code words for the codebook. Each gene-identifying code word of the first plurality of gene-identifying code words is represented by a sequence of N bits. Each code word of the first plurality of gene-identifying code words includes a sequence of bits, and the sequence of bits corresponds to a best match to a pixel data value identifying a gene. A plurality of negative control code words is generated, each negative control code word of the plurality of negative control code words represented by a sequence of N bits. The plurality of negative control code words have an equal number of on-values. On-values of the plurality of negative control code words are evenly distributed across the N bits such that each ordinal position in the sequence of N bits has a same total number of on-bits from the plurality of negative control code words, and a Hamming distance between each negative control code word and each gene-identifying code word is at least a distance threshold.
Advantages of implementations can include, but are not limited to, one or more of the following.
Disclosed herein is a method for generating a codebook for identifying gene targets during mFISH imaging where the negative control code words are generated with uniform numbers of on-values in each code word, and in each position across all code words. This method reduces possible degeneracy between negative control code word positions and ensures the set of code words achieves more uniform Hamming distance separation between codebook gene code words. The uniform distribution of on-off values in the set of negative control code words decreases the occurrence of false-positive matches thereby increasing the signal confidence of true-positive gene identifications and allowing more gene targets to be correctly identified without increasing the size of the codebook.
Increased positive gene identification per sequence of collected images leads to higher overall assay throughput as well as higher confidence in the results, reducing the collection of inconclusive data and increasing assay reproducibility. By filtering fewer false-positives and increasing confidence in collected signals, reagent usage is also reduced resulting in monetary benefit. Downstream analysis is also improved with regard to a lower false-positive rate and better filtration of sequence matches to negative controls.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims.
Like reference numbers and designations in the various drawings indicate like elements.
Current approaches to generating the set of negative control code words in a codebook utilize random sequences of sparse binary code containing randomly-placed on-values, with a relatively low Hamming weight (e.g., number of on-values) compared to the length of each code word in the codebook (e.g., 25% or less of total code word length). A Hamming distance to the nearest gene-identifying code word of the codebook is calculated for each negative control code word, and negative control code words having a Hamming distance less than a distance threshold value are discarded and randomly generated again. Once all generated negative control code words exceed the Hamming distance threshold to the rest of the gene-identifying code words, the codebook is complete. This codebook is then used to deconvolute the multiplexed signals at each pixel location in a sequence of collected mFISH images and match the signal sequences to code words corresponding to gene targets for gene identification. The randomly-generated negative control code words serve as a filter for false-positives and matches to known negative control code words.
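The random generate-and-reject procedure described above can be sketched as follows. This is a minimal illustration only: the function names, the fixed seed, the retry limit, and the rejection of duplicate control words are assumptions made for the sketch and are not taken from the procedure itself.

```python
import random


def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def random_sparse_word(n_bits, weight, rng):
    """Random binary word with exactly `weight` on-values at random positions."""
    on_positions = set(rng.sample(range(n_bits), weight))
    return tuple(1 if j in on_positions else 0 for j in range(n_bits))


def random_negative_controls(gene_words, count, n_bits, weight,
                             min_distance, seed=0, max_tries=100_000):
    """Draw random sparse words, discarding any word that is closer than
    `min_distance` to a gene-identifying word, until `count` words remain."""
    rng = random.Random(seed)          # fixed seed: assumption, for repeatability
    controls = []
    for _ in range(max_tries):         # retry limit: assumption, avoids endless loops
        if len(controls) == count:
            return controls
        candidate = random_sparse_word(n_bits, weight, rng)
        if candidate in controls:      # duplicate rejection: assumption
            continue
        if all(hamming_distance(candidate, g) >= min_distance for g in gene_words):
            controls.append(candidate)
    if len(controls) == count:
        return controls
    raise RuntimeError("could not generate enough negative control words")
```

Nothing in this procedure constrains how the on-values are spread across bit positions, which is the limitation addressed by the balanced approach.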
However, this approach of using randomly generated code words can lead to inconsistent bit position degeneracy, where every bit at a given column position within the set of negative control code words has the same value across all negative control code words (e.g., all on-values or all off-values in a bit position). This leads to inconsistent signal normalization and to additional assay iterations needed to increase data confidence. These issues are key causes of decreased analysis throughput. Inefficiencies in identifying false-positives and high negative control signal result in duplication of work product, leading to inconsistent data output, increased reagent use, and reduced assay throughput.
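Bit position degeneracy of this kind can be detected by counting on-values per column. The following is a small sketch with hypothetical helper names; it only reports the degenerate columns and does not repair them.

```python
def column_on_counts(words):
    """Total number of on-values at each bit position across all words."""
    return [sum(column) for column in zip(*words)]


def degenerate_positions(words):
    """Bit positions where every word has the same value (all on or all off)."""
    n_words = len(words)
    return [j for j, count in enumerate(column_on_counts(words))
            if count == 0 or count == n_words]


# Illustrative 4-bit control words: position 0 is always on, position 2 always off.
example_words = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 0, 0, 1)]
example_degenerate = degenerate_positions(example_words)
```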
An advantageous approach to creating the set of negative control code words includes a two-step process: creating code words in which each code word contains a known number of on-value bits (e.g., 1s); and creating code words with a uniform distribution of on-bits across all column positions. This method maintains the same Hamming distance threshold and increases overall quality of collected data leading to increases in assay throughput, reduction in reagents used, and reduced project times.
Referring to
The fluorescence microscope 120 includes an excitation light source 122 that can generate excitation light 130 of multiple different wavelengths. In particular, the excitation light source 122 can generate narrow-bandwidth light beams having different wavelengths at different times. For example, the excitation light source 122 can be provided by a multi-wavelength continuous wave laser system, e.g., multiple laser modules 122a that can be independently activated to generate laser beams of different wavelengths. Output from the laser modules 122a can be multiplexed into a common light beam path.
The fluorescence microscope 120 includes a microscope body 124 that includes the various optical components to direct the excitation light from the light source 122 to the flow cell 110. For example, excitation light from the light source 122 can be coupled into a multimode fiber, refocused and expanded by a set of lenses, then directed into the sample by a core imaging component, such as a high numerical aperture (NA) objective lens 136. When the excitation channel needs to be switched, one of the multiple laser modules 122a can be deactivated and another laser module 122a can be activated, with synchronization among the devices accomplished by one or more microcontrollers 144, 146.
The objective lens 136, or the entire microscope body 124, can be installed on a vertically movable mount coupled to a Z-drive actuator. Adjustment of the Z-position, e.g., by a microcontroller 146 controlling the Z-drive actuator, can enable fine tuning of focal position. Alternatively, or in addition, the flow cell 110 (or a stage 118 supporting the sample in the flow cell 110) could be vertically movable by a Z-drive actuator 118b, e.g., an axial piezo stage. Such a piezo stage can permit precise and swift multi-plane image acquisition.
The sample 10 to be imaged is positioned in the flow cell 110. The flow cell 110 can be a chamber with a cross-section (parallel to the object or image plane of the microscope) of about 2 cm by 2 cm. The sample 10 can be supported on a stage 118 within the flow cell, and the stage 118 (or the entire flow cell 110) can be laterally movable, e.g., by a pair of linear actuators 118a to permit XY motion. This permits acquisition of images of the sample 10 in different laterally offset fields of view (FOVs). Alternatively, the microscope body 124 could be carried on a laterally movable stage.
An entrance to the flow cell 110 is connected to a set of hybridization reagent sources 112. A multi-valve positioner 114 can be controlled by the controller 140 to switch between sources to select which reagent 112a is supplied to the flow cell 110. Each reagent includes a different set of one or more oligonucleotide probes. Each probe targets a different RNA sequence of interest, and has a different set of one or more fluorescent materials, e.g., phosphors, that are excited by different combinations of wavelengths. In addition to the reagents 112a, there can be a source of a purge fluid 112b, e.g., deionized (DI) water.
An exit to the flow cell 110 is connected to a pump 116, e.g., a peristaltic pump, which is also controlled by the controller 140 to control flow of liquid, e.g., the reagent or purge fluid, through the flow cell 110. Used solution from the flow cell 110 can be passed by the pump 116 to a chemical waste management subsystem 119.
In operation, the controller 140 causes the light source 122 to emit the excitation light 130, which causes fluorescence of fluorescent material in the sample 10, e.g., fluorescence of the probes that are bound to RNA in the sample and that are excited by the wavelength of the excitation light. The emitted fluorescent light 132, as well as back propagating excitation light, e.g., excitation light scattered from the sample, stage, etc., are collected by an objective lens 136 of the microscope body 124.
The collected light can be filtered by a multi-band dichroic mirror 138 in the microscope body 124 to separate the emitted fluorescent light from the back propagating illumination light, and the emitted fluorescent light is passed to a camera 134. The camera 134 can be a high resolution (e.g., 2048×2048 pixel) CMOS (e.g., a scientific CMOS) camera, and can be installed at the immediate image plane of the objective. When triggered by a signal, e.g., from a microcontroller, image data from the camera can be captured, e.g., sent to an image processing system 150. Thus, the camera 134 can collect a sequence of images from the sample.
To further remove residual excitation light and minimize cross talk between excitation channels, each laser emission wavelength can be paired with a corresponding band-pass emission filter 128a. Each filter 128a can have a bandwidth of 10-50 nm, e.g., 14-32 nm. The filters are installed on a high-speed filter wheel 128 that is rotatable by an actuator 128b. The filter wheel 128 can be installed, e.g., at the infinity space, to minimize optical aberration in the imaging path. After passing the emission filter of the filter wheel 128, the cleaned fluorescence signals can be refocused by a tube lens and captured by the camera 134. The dichroic mirror 138 can be positioned in the light path between the objective lens 136 and the filter wheel 128.
The control software coordinates communication between the computer 142 and the device components of the apparatus 100. This control software can integrate drivers of all the device components into a single framework, and thus can allow a user to operate the imaging system as a single instrument (instead of having to separately control many devices).
Fluorescence images are acquired for each combination of possible values for the z-axis, color channel (excitation wavelength), lateral FOV, and reagent. A data processing system 150 is used to process the images and determine gene expression to generate the spatial transcriptomic data. At a minimum, the data processing system 150 includes a data processing device 152, e.g., one or more processors controlled by software stored on a computer readable medium, and a local storage device 154, e.g., non-volatile computer readable media, that receives the images acquired by the camera 134.
In some implementations, the data processing system 150 performs on-the-fly image processing as the images are received. In particular, while data acquisition is in progress, the data processing device 152 can perform image pre-processing steps, such as filtering and deconvolution, that can be performed on the image data in the storage device 154 but which do not require the entire data set.
The image files received from the camera can optionally include metadata, e.g., the hardware parameter values (such as stage positions, pixel sizes, excitation channels, etc.) at which the image was taken. The data schema provides a rule for ordering the images based on the hardware parameters so that the images are placed into one or more image stacks in the appropriate order. If metadata is not included, the data schema can associate an order of the images with the values for the z-axis, color channel, lateral FOV and reagent used to generate that image.
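One way such an ordering rule could look is sketched below. The record type, its field names, and the sort order (FOV, then reagent round, then color channel, then z position) are hypothetical choices made for illustration; the data schema itself is not specified at this level of detail.

```python
from collections import defaultdict
from typing import NamedTuple


class ImageMeta(NamedTuple):
    """Hypothetical per-image metadata record."""
    fov: int            # lateral field-of-view index
    reagent_round: int  # hybridization round
    channel: int        # excitation channel index
    z: int              # z-plane index
    path: str           # image file location


def build_stacks(metas):
    """Group images by FOV and order each stack by round, channel, then z."""
    stacks = defaultdict(list)
    ordered = sorted(metas, key=lambda m: (m.fov, m.reagent_round, m.channel, m.z))
    for meta in ordered:
        stacks[meta.fov].append(meta.path)
    return dict(stacks)
```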
The collected images can be subjected to one or more quality metrics (step 203) before more intensive processing in order to screen out images of insufficient quality. Only images that meet the quality metric(s) are passed on for further processing.
In order to detect regions of interest, a brightness quality value can be determined for each collected image. The brightness quality can be used to determine whether any cells are present in the image. For example, the intensity values of all the pixels in the image can be summed and compared to a threshold. If the total is less than the threshold, then this can indicate that there is essentially nothing in the image, i.e., no cells are in the image, and there is no information of interest and the image need not be processed.
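The summed-intensity check can be sketched in a few lines. The image is assumed here to be a list of rows of non-negative intensity values, and the threshold is a user-supplied value; both are assumptions of this sketch.

```python
def has_content(image, threshold):
    """Return True when the summed pixel intensity reaches the threshold,
    i.e., when the image is worth passing on for further processing."""
    total = sum(sum(row) for row in image)
    return total >= threshold


# Illustrative 2x3 images: one nearly empty, one containing signal.
dark_image = [[0, 1, 0], [1, 0, 0]]
bright_image = [[40, 90, 10], [5, 120, 60]]
```

With a threshold of 100, `has_content(dark_image, 100)` is false and `has_content(bright_image, 100)` is true, so only the second image would be processed.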
Next, each image is processed to remove experimental artifacts (step 204). Since each RNA molecule will be hybridized multiple times with probes at different excitation channels, strict alignment across the multi-channel, multi-round image stack is beneficial for revealing RNA identities over the whole FOV. Removing the experimental artifacts can include field flattening and/or chromatic aberration correction.
Each image is processed to provide RNA image spot sharpening (step 206). RNA image spot sharpening can include applying filters to remove cellular background and/or deconvolution with point spread function to sharpen RNA spots. In order to distinguish RNA spots from a relatively bright background, a low-pass filter is applied to the image, e.g., to the field-flattened and chromatically corrected images to remove cellular background around RNA spots. The filtered images are further de-convolved with a 2-D point spread function (PSF) to sharpen the RNA spots, and convolved with a 2-D Gaussian kernel with half pixel width to slightly smooth the spots.
The images having the same FOV are registered to align the features, e.g., the cells or cell organelles, therein (step 208). To accurately identify RNA species in the image sequences, features in different rounds of images are aligned, e.g., to sub-pixel precision. In particular, high intensity regions should generally be located at the same position across multiple images of the same FOV. Techniques that can be used for registration between images include phase-correlation algorithms and mutual-information (MI) algorithms.
After registration of the images in a FOV, spatial transcriptomic analysis can be performed (step 210). First, intensity values in the image are normalized relative to the maximum intensity value in the image. For example, the maximum intensity value is determined, and all intensity values are divided by the maximum so that intensity values vary between 0 and IMAX, e.g., 1.
Next the intensity values in the image are analyzed to determine an upper quantile that includes the highest intensity values, for example, the 99% and higher quantile (i.e., upper 1%). The intensity value at this quantile limit can be determined and stored. All pixels having intensity values within the upper quantile are reset to have the maximum intensity value, e.g., 1. Then the intensity values of the remaining pixels are binned and scaled to run to the same maximum (e.g., 1). To accomplish this, intensity values for the pixels that were not in the upper quantile are divided by the stored intensity value for the quantile limit. Decoding an image is explained with reference to
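The two normalization steps above (scaling by the maximum, then clipping and rescaling at an upper quantile) can be sketched as follows. The image is assumed to be a list of rows of non-negative values, the quantile limit is located with a simple nearest-rank index, and the fallback for a zero quantile limit is an assumption of this sketch.

```python
def normalize_image(image, quantile=0.99, i_max=1.0):
    """Scale intensities to [0, i_max], saturating the upper quantile."""
    flat = sorted(value for row in image for value in row)
    peak = flat[-1]
    if peak <= 0:
        return [[0.0 for _ in row] for row in image]
    # Nearest-rank position of the quantile limit (assumption of this sketch).
    rank = min(len(flat) - 1, int(quantile * len(flat)))
    limit = flat[rank] / peak          # quantile limit after max-normalization
    if limit <= 0:
        limit = 1.0                    # fallback: plain max-normalization
    result = []
    for row in image:
        new_row = []
        for value in row:
            scaled = value / peak      # step 1: divide by the maximum
            if scaled >= limit:
                new_row.append(i_max)  # upper quantile is reset to the maximum
            else:
                new_row.append(i_max * scaled / limit)  # rescale the remainder
        result.append(new_row)
    return result
```

Saturating the brightest pixels keeps a few outliers from compressing the dynamic range of the remaining pixels.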
After normalization, this image stack is evaluated as a 2-D matrix 302 of pixel words. The matrix 302 has P rows 304, where P=X*Y, and B columns 306, where B is the number of images in the stack for a given FOV. Each row 304 corresponds to one of the pixels (the same pixel across the multiple images in the stack), and the intensity values from the row 304 represent a pixel word 310. Each column 306 provides one of the values in the word 310, i.e., the intensity value from the image layer for that pixel. As noted above, the values can be normalized, e.g., vary between 0 and IMAX. Different intensity values are represented in
The data processing system 150 stores a codebook 322 that is used to decode the image data to identify the gene expressed at the particular pixel. The codebook 322 includes multiple reference code words, and each reference code word is associated with either a particular gene or a negative control code word. As shown in
Each row 324 contains a sequence of B values (e.g., bits) and corresponds to one of the code words 330, either a gene-identifying code word or a negative control code word, and each column 326 provides one of the values in the reference code word 330. For each column 326, the values in the reference code 330 can be binary, i.e., “on” or “off”. For example, each value can be either 0 or IMAX, e.g., 1. The on and off values are represented in
Each code word of B values has 2^B assignable combinations of values. However, utilizing a portion of these total assignable values for gene- or negative control words and leaving the remaining portion unassigned allows for a negative control design of the codebook 322. The codebook 322 maintains two parameters across all rows 324: each row 324 shares the same Hamming weight (HW) and minimum Hamming distance (HD) from other rows 324.
The HW of a code word is the number of on-values per row 324 and a uniform HW between rows 324 reduces disproportionate pixel value misidentification bias. Additionally, maintaining a low HW (e.g., four on-values per row) in the rows 324 compared to the total code word length of the codebook 322 further reduces misidentification frequency, thereby increasing accuracy.
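To give a sense of how sparse such a codebook is, the size of the code space can be computed directly. The sketch below uses B=16, a Hamming weight of four, and a 140-row codebook as example values; the function name is hypothetical.

```python
import math


def code_space(n_bits, weight):
    """Return (all binary words of n_bits, words with exactly `weight` on-values)."""
    total_words = 2 ** n_bits
    fixed_weight_words = math.comb(n_bits, weight)
    return total_words, fixed_weight_words


total_words, fixed_weight_words = code_space(16, 4)   # 65536 and 1820
fraction_used = 140 / fixed_weight_words               # share used by a 140-row codebook
```

Under these example values, only a small fraction of the weight-four words is assigned, which leaves room to keep the assigned words well separated in Hamming distance.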
The HD between each row 324 is the number of positions at which two numerical strings of equal length, e.g., a reference string and a code string, are different and is calculated as a sum of absolute differences between each value position in a code string and corresponding reference string, a means of measuring the information-distance between two binary strings. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could transform one string into the other. For example, given two six digit strings,
where HD, using the codebook 322 as a reference, has inclusive limits between 0 (e.g., identical strings) and B, the total number of columns (e.g., orthogonal strings). The information-distance criteria used to design the negative control code words can be a minimum, maximum, or exact value Hamming distance between the words of the codebook. If code words separated by a Hamming distance of 2 or more are used, then no single value-error (e.g., a “0” misidentified as a “1”) can transform one code word into another, reducing the misidentification rate. Increasing the Hamming distance separation requirement further decreases the misidentification rate. In some implementations, the Hamming distance can be at least four (e.g., ≥4).
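Hamming weight and Hamming distance are straightforward to compute for binary words. The following sketch uses hypothetical function names, and the two six-digit strings are illustrative values chosen for this example.

```python
from itertools import combinations


def hamming_weight(word):
    """Number of on-values in a binary word."""
    return sum(1 for bit in word if bit)


def hamming_distance(a, b):
    """Sum of absolute differences between corresponding binary values."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(abs(x - y) for x, y in zip(a, b))


def min_pairwise_distance(words):
    """Smallest Hamming distance between any two distinct rows."""
    return min(hamming_distance(a, b) for a, b in combinations(words, 2))


word_a = (1, 0, 1, 1, 0, 0)
word_b = (1, 1, 0, 1, 0, 0)
distance_ab = hamming_distance(word_a, word_b)   # differs at positions 1 and 2 -> 2
```

Because these two words are separated by a distance of 2, flipping any single value of `word_a` cannot produce `word_b`.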
The codebook 322 includes a number of code words corresponding to negative control words that when matched identify false-positive or known negative pixel words 310, non-sense words that do not correspond to any gene in the codebook 322. The negative control words are a number of rows 324 (E) that constitute a portion of the codebook 322 which includes between 5% and 25% of the total rows, R, of the codebook 322. For example, a codebook 322 including 140 rows 324 can reserve 132 rows corresponding to gene-identifying code words 330 (G) and 8 rows (˜6% of R) corresponding to negative control words (E). Using the example of
The codebook 322 can be generated algorithmically through the use of a coding language. The following example provides a method to generate a 140 word codebook 322 (e.g., codebook 322 in which R=140 and B=16, e.g., Xij where i=1, 2, . . . , 140, and j=1, 2, . . . , 16) including a set of M negative control code words (e.g., M=E<R and B=16, e.g., Yij where i=1, 2, . . . , M, and j=1, 2, . . . , 16).
The HW, HD, and the number of on-values (N) per bit position (e.g., column) of the negative control code words are defined for the codebook 322. In one example, the HW=4, HD=4, and the number of on-values is 2 (N=2, where $N=\sum_i Y_{ij}^{(2)}$). To determine a set of the negative control code words (e.g., $M_{ij}$) that satisfies the above conditions, define a target array $L_j=N$ for $j=1, 2, \ldots, 16$. Subtract the first code word (i=1) from the target array ($L_j-X_{1j}$) and calculate an updated residue (S) such that $S=L_j-X_{1j}$.
Add the first negative control code word ($X_{1j}$) to the set of M negative control code words ($M_{1j}^{(2)}=X_{1j}$). Determine the next negative control code word in the remaining codebook ($X_{ij}$ where $i\neq 1$) by subtracting each remaining negative control code word from the residue calculated above such that $S'=L_j-X_{ij}$ where $i\neq 1$ and $\{x\in\mathbb{Z}\mid x=-1, 0, 1, \ldots\}$ ($\mathbb{Z}$ represents the integer set).
Determine a second negative control code word ($X_i$) that returns the lowest residue value and add the second negative control code word to the set of negative control code words (e.g., $M_{2j}^{(1)}=X_{2j}$).
Repeat the above steps until $\sum_i M_{ij}^{(2)}=1$. Update the codebook 322: $X'=\{x\mid x\in X \text{ and } x\notin y_{ij}^{(1)}\}$. Find the n+1 subset $M_{ij}^{(n+1)}$ by using the updated codebook 322 X′ and iterating the above steps until n+1=N. The final set of negative control code words (M) will be $M_{ij}=\sum_{n} M_{ij}^{(n+1)}$.
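A set of negative control code words with uniform Hamming weight and a uniform number of on-values per column can also be found programmatically. The sketch below is not the residue-minimizing selection described above; it swaps in a small backtracking search that tracks the same per-column residue (target count minus on-values already placed). The function names, the candidate pool of all fixed-weight words, and the requirement that control words also keep the minimum distance from each other are assumptions of this sketch.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def balanced_negative_controls(gene_words, n_bits=16, weight=4,
                               per_column=2, min_distance=4):
    """Find words of the given weight so that every column has exactly
    `per_column` on-values and all distance constraints hold."""
    if (n_bits * per_column) % weight:
        raise ValueError("n_bits * per_column must be divisible by weight")
    # Candidate pool: fixed-weight words far enough from every gene word.
    pool = []
    for on in combinations(range(n_bits), weight):
        word = tuple(1 if j in on else 0 for j in range(n_bits))
        if all(hamming_distance(word, g) >= min_distance for g in gene_words):
            pool.append((on, word))
    residue = [per_column] * n_bits    # on-values still needed per column
    chosen = []

    def search():
        open_columns = [j for j, r in enumerate(residue) if r > 0]
        if not open_columns:
            return True                # every column has reached its target
        first = open_columns[0]
        for on, word in pool:
            # The word must fill the first open column and only open columns.
            if on[0] != first or any(residue[j] == 0 for j in on):
                continue
            if any(hamming_distance(word, c) < min_distance for c in chosen):
                continue
            for j in on:
                residue[j] -= 1
            chosen.append(word)
            if search():
                return True
            chosen.pop()               # undo and try the next candidate
            for j in on:
                residue[j] += 1
        return False

    if not search():
        raise RuntimeError("no balanced set exists under these constraints")
    return chosen
```

With 16 bits, a Hamming weight of four, and two on-values per column, the search returns eight words (16 × 2 / 4), which matches the eight negative control rows in the 140-row example.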
Bit-switching errors occur at a low rate (<10%) and negative control words in a codebook 322 allow for increased confidence in identification of gene words through identification of sense- or non-sense pixel words including one or more errors. For example, if a value in a pixel word is incorrectly identified, e.g., a “0” identified as a “1”, or vice versa, the pixel word may no longer be within an information-distance of the correct gene word and thus be misidentified. This can lead to missed gene counts and, if the corresponding gene word is too close in information-distance to a neighboring gene word, the pixel word may be misidentified as a second, incorrect gene word. Negative control words are designed with a number of criteria to create a minimum information-distance between each negative control word and distribute the values within each negative control word 330 uniformly across the columns 326 of the negative control rows E.
The technique described below for generating the negative control code words of the codebook 322 can provide additional layers of data integrity protection by creating symmetric information-distance between negative control code words and gene-identifying code words. The technique can also uniformly (e.g., evenly) distribute the number and arrangement of on-values across all columns of the codebook 322.
Referring again to
where Ip,x are the values from the matrix 302 of pixel words and Ci,x are the values from the matrix 322 of reference code words. Other metrics, e.g., sum of absolute value of differences, cosine angle, correlation, etc., can be used instead of a Euclidean distance.
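For reference, assuming the standard Euclidean form and the symbols defined above, the distance between the pixel word of pixel p and reference code word i could be written as shown below. The symbol d and the index range x=1, ..., B are assumptions introduced here for illustration.

```latex
d(p, i) = \sqrt{\sum_{x=1}^{B} \left( I_{p,x} - C_{i,x} \right)^{2}}
```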
Once the distance values for each code word are calculated for a given pixel, the smallest distance value is determined, and the code word that provides that smallest distance value is selected as the closest matching code word. The gene corresponding to that closest matching code word is determined, e.g., from a lookup table that associates code words with genes, and the pixel is tagged as expressing the gene.
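The matching step can be sketched end to end: build pixel words from a registered image stack, compute a Euclidean distance to every code word, and look up the label of the closest one. The data layout (a list of B images, each a list of rows), the function names, and the tiny 4-value codebook with its labels are all hypothetical and chosen only to keep the example short.

```python
import math


def pixel_words(stack):
    """Turn a stack of B registered images into P = X*Y words of B values."""
    height, width = len(stack[0]), len(stack[0][0])
    return [tuple(image[y][x] for image in stack)
            for y in range(height) for x in range(width)]


def euclidean(a, b):
    """Euclidean distance between a pixel word and a code word."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def decode(words, codebook, labels):
    """Tag each pixel word with the label of its closest code word."""
    tags = []
    for word in words:
        best = min(range(len(codebook)),
                   key=lambda i: euclidean(word, codebook[i]))
        tags.append(labels[best])
    return tags


# Illustrative codebook: two gene words and one negative control word.
codebook = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
labels = ["GENE_A", "GENE_B", "NEG_CTRL_1"]
# Four normalized 1x2 images, i.e., two pixels with four values each.
stack = [[[0.9, 0.1]], [[0.8, 0.2]], [[0.1, 0.9]], [[0.0, 1.0]]]
tags = decode(pixel_words(stack), codebook, labels)
```

Here the first pixel is closest to the word labeled `GENE_A` and the second to the word labeled `GENE_B`. A pixel whose closest word is a negative control would be tagged with that control label instead of a gene.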
Returning to
When the image stacking and gene word identification is complete, the maximum intensity values (e.g., counts) associated with a blank code word in the negative control words establish a certainty threshold for filtering positive from uncertain gene identifications. Gene code words below the certainty threshold can be raised above the certainty threshold with additional identification information. For example,
At the top of
At the top of
Although the description above focuses on code words having 16 bits, the technique described is adaptable to code words of other bit lengths.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
This application claims the benefit of priority to U.S. Application No. 63/166,204, filed on Mar. 25, 2021, the contents of which are hereby incorporated by reference.
Number | Date | Country
---|---|---
63166204 | Mar 2021 | US