The present invention relates to the field of bioinformatics. In particular, the invention relates to a method of generating or selecting a more suitable compression parameter, based on transmission conditions.
The genome of any species can be sequenced and saved as a file to be transmitted through a computer network to relevant parties. However, large genomes contain so much information that the files are often too big to be transmitted fully within acceptable time frames. Typically, a large data file would be compressed into a smaller file before transmission, because smaller files transmit more quickly. However, compressing a large genome file is itself so time-consuming that it can cancel out any gain in transmission speed obtained from the smaller file.
The problem is compounded by the rapid development of genome sequencing technology, which has led to a reduction in the cost of genome sequencing and an abundance of genome sequences. However, these developments have not been accompanied by improvements in data transmission technology. Therefore, the demand for genomic data is not met in a timely manner, and the inaccessibility of genome sequences has become a bottleneck that holds back the biotechnology and molecular biology industries.
Bandwidth is always changing. For any given piece of big genome sequence at any point in time, it is difficult to decide whether that piece of data would be transmitted faster after sacrificing more time to compress the data to a greater extent, or whether it would be faster to transmit the minimally compressed data.
Thus, it is desirable to propose a method of determining how to divide resources between the processes of compression and transmission of biodata, such that transmission of and access to such data is optimized.
In a first aspect, the invention proposes a method of transmitting a genome sequence among a series of genome sequences, comprising the steps of:
Preferably, the neural network comprises an Actor-Critic algorithm to train the neural network to select or to modify the compression algorithm to improve transmission efficiency.
In a second aspect, the invention proposes a framework for reinforcement-learning-based network transmission model for a series of compressed genomes, comprising
Preferably, the Agent is capable of selecting a compression algorithm to compress the original genome according to network conditions, thereby achieving a balance between the efficient compression and transmission of genome sequence.
Therefore, the present invention uses machine learning to build and learn a model relating different network conditions (including but not limited to bandwidths), the number of parts into which a genome sequence is divided, and the time taken for successful transmission. For example, the model is constantly being updated with data of the latest genome sequence that has just been compressed and transmitted, and the data for machine learning includes:
It is also possible to include the extent of parallel processing as a parameter.
Therefore, for every next genome sequence, the model is able to choose how many parts the genome sequence should be divided into, depending on the length of the entire genome and the network's current bandwidth, among other variables included in the machine learning. The genome sequence is thus compressed neither too much nor too little, so that the compression extent (or ratio) is optimal for the bandwidth.
Typically, the Agent is capable of selecting a compression parameter for a compression algorithm, to compress the original genome according to the current (i.e. latest known) network conditions, thereby achieving a balance between the efficient compression and transmission of genome sequence.
In other words, the invention proposes a reinforcement-learning-based network transmission model for compressed genomes, which generates an adaptive compression (stride) parameter for future genomes. More specifically, the method trains a neural network model that selects a compression (stride) parameter for a future genome, based on observations provided by the process of transmitting the last compressed genome.
Accordingly, the invention applies reinforcement learning to optimize efficiency of both compression and transmission of genome sequence. Specifically, the Agent proactively and adaptively generates compression parameters or stride parameters to adjust encoding speed and compression ratio to suit different genome sizes. This provides the possibility of achieving a balance between the efficient compression and transmission of genomes.
Experimental results show that the proposed model can be used to select a compression (stride) parameter that compresses the original genome to an extent appropriate for optimized transmission under the present, i.e. latest, network conditions. Therefore, the invention provides a possibility of optimizing the compression and the transmission of a genome sequence jointly, i.e. without over-optimizing either process at the expense of the other's efficiency.
Embodiments of the invention may comprise the following features:
(1) Reinforcement Learning based on the transmission of compressed genomes, which generates an adaptive compression (stride) parameter.
This includes training a neural network model for selecting a compression (stride) parameter for a future genome, based on observations made on the process of transmitting the latest compressed genome.
(2) A specific Environment, in which the latest compressed genome is transmitted through computer networks (i.e., in a process P1) and the next genome is compressed by a learning-based genome codec (i.e., in a process P2).
(3) Using the Asynchronous Advantage Actor-Critic (A3C) approach in the training algorithm, which is a state-of-the-art actor-critic RL algorithm.
An agent state $S_a$ is defined by: data on past genome throughput, the size of the next genome, the number of genomes left to be compressed and transmitted, and the last genome compression algorithm (denoted Gob). Based on this definition of the agent state $S_a$, A3C can be applied to train the selection of the compression algorithm so as to improve the network transmission of compressed genomes.
(4) A variety of reward goals are designed, such as maximizing the encoding speed and compression ratio for genomes (i.e., the maximization of hyper-parameters of the LEC, such as the compression parameter or stride parameter) and minimizing the latency in the transmission of the compressed genome sequence over computer networks, while maintaining compression speed consistency (i.e., avoiding constant Gob fluctuations or stride fluctuations).
The proposed method provides a possibility of the following advantages (for details please refer to the description of embodiments and the experimental results):
(1) It is possible to test a trained model in a simulated Environment, using network broadband datasets (i.e., network traces), RTT and noise. In addition, experiments can be run over a Mahimahi-emulated network and in the real world.
(2) For each species' genome sequence $g_n$, if the LEC compresses the genome by using different Gobs $(GoB_1, \ldots, GoB_i)$, compressed genome files of different sizes $(x_1, \ldots, x_i)$ will be generated. In other words, there is a one-to-one correspondence between $GoB_i$ and $x_i$. This means that compression is bespoke and optimised for the transmission of each genome sequence, instead of a sweeping, one-size-fits-all approach.
(3) The invention uses data from the compression of the genome of a species, particularly data on the network condition during the transmission of the compressed genome, as feed forward information to select or generate the compression algorithm for the genome of the next species.
Initially, in a process P1 of an Environment, where n−1≥1, when n−1=1, a default compression algorithm having pre-determined quality, $\overrightarrow{GoB}_1$, is selected to be the compression algorithm $GoB_{n-1}$. The number n denotes the place of a species in a queue or series of many species, and n−1 denotes the species before.
Thus, the 1st species' genome is compressed by using the LEC with $\overrightarrow{GoB}_1$, and this first compressed genome is then transmitted through computer networks. Subsequently, the compression parameter for every next species is selected by referring to network conditions during the successful transmission of the species that is just one place ahead in the queue or series.
(4) When applying the invention to a simulation based on a transmission dataset, round-trip time (RTT) may be used in place of propagation delay, processing delay and queuing delay. The number of hidden layers, the number of filters of each convolutional layer and the RTT affect the rewards for training the neural network. These parameters can be set as follows, for example,
(5) The design and principle of adaptive stride algorithms are similar to those of the adaptive Gob algorithms.
It will be convenient to further describe the present invention with respect to the accompanying drawings that illustrate possible arrangements of the invention, in which like integers refer to like parts. Other arrangements of the invention are possible, and consequently the particularity of the accompanying drawings is not to be understood as superseding the generality of the preceding description of the invention.
The invention uses, but is not limited to using, a method for compressing and encoding genome sequences which is described in U.S. Pat. No. 11,769,570 B2. This patent is owned by the applicant and has the same lead inventor. Therefore, a short description of the method is given here for completeness.
The method comprises the following steps. In a compression phase, the whole genome sequence of a species is partitioned into parts called groups of bases. The reader should note that the acronym Gobs in U.S. Pat. No. 11,769,570 B2 is shorthand for “groups of bases”, while in this specification the same acronym is shorthand for the compression algorithm used to compress a genome sequence. The groups of bases are then processed in parallel but individually by an LEC codec, which converts each group of bases into a bit stream.
Subsequently, in a transmission phase, the bit stream of each group of bases is transmitted. Because the whole genome file is composed of many sub-files (or sub-genomes), each of which is compressed and transmitted individually, the method is more flexible in practice.
At the receiver side, the individual bit streams are decoded back into normal uncompressed parts of the genome and concatenated to form the original genome sequence.
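The partition, per-group encoding and receiver-side concatenation described above can be sketched as follows. This is only an illustration under stated assumptions: `zlib` is a stand-in for the LEC codec, which is not reproduced here, and the function names and the toy sequence are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(sequence: str, num_groups: int) -> list[str]:
    """Partition a sequence into at most num_groups contiguous parts."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    size = max(1, -(-len(sequence) // num_groups))  # ceiling division
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def compress_group(group: str) -> bytes:
    """Encode one group into a byte stream (zlib is a placeholder codec)."""
    return zlib.compress(group.encode("ascii"))


def compress_groups(groups: list[str]) -> list[bytes]:
    """Compress every group independently, in parallel, preserving order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(compress_group, groups))


def decompress_and_join(streams: list[bytes]) -> str:
    """Decode each stream and concatenate the parts in their original order."""
    return "".join(zlib.decompress(s).decode("ascii") for s in streams)


genome = "ACGTTGCA" * 1000  # hypothetical toy sequence
streams = compress_groups(split_into_groups(genome, 4))
restored = decompress_and_join(streams)
```

Because each group is encoded on its own, a group can in principle be sent as soon as it is ready, which is the flexibility the description refers to.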
The number of partitions or parts into which the original genome sequence is divided is a parameter of the codec, which the user decides on and inputs, and this affects the compression ratio and the encoding speed.
In this way, a very-fast mode can be pre-set to slice the sequence data into the maximum number of groups of bases. A very-slow mode provides the highest compression ratio but with the slowest processing speed.
However, even with the method described above, compressing a set of genome sequences, which is often very large, requires a lot of processing time. If such data is over-compressed, the period from the start of compression to the successful arrival of the compressed data at the destination could be significantly longer than the period required to compress the data to a lesser extent and transmit the less compressed data earlier.
On the other hand, under-compressed data may take less time to compress but so much time to transmit that the total time spent is significantly more than would have been required to compress the data further and transmit the final, smaller file to the receiver.
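The trade-off in the two preceding paragraphs can be made concrete with a small calculation. The sketch below assumes that total time is simply compression time plus compressed size divided by bandwidth; the two settings and all figures are hypothetical and are not taken from the experiments.

```python
def total_time(original_mb: float, ratio: float, compress_s: float,
               bandwidth_mb_per_s: float) -> float:
    """Seconds spent compressing plus seconds spent sending the compressed file."""
    compressed_mb = original_mb / ratio
    return compress_s + compressed_mb / bandwidth_mb_per_s


# Hypothetical settings: (label, compression ratio, compression time in seconds)
SETTINGS = [("light", 2.0, 10.0), ("heavy", 8.0, 120.0)]


def best_setting(original_mb: float, bandwidth_mb_per_s: float) -> str:
    """Return the label of the setting with the smallest total time."""
    return min(
        SETTINGS,
        key=lambda s: total_time(original_mb, s[1], s[2], bandwidth_mb_per_s),
    )[0]


fast_link_choice = best_setting(3000.0, 100.0)  # light: 25 s, heavy: 123.75 s
slow_link_choice = best_setting(3000.0, 1.0)    # light: 1510 s, heavy: 495 s
```

On the fast link the lighter setting wins, while on the slow link the heavier setting wins, which is why a fixed setting cannot suit every network condition.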
The present invention relates to a reinforcement-learning-based network transmission for compressed genomes, which provides the possibility of selecting or generating adaptive compression parameters for the compression algorithms, or adaptive stride parameters, using reinforcement learning (RL).
Stride parameters are the parameters of an alternative compression method, called Stride.
The present invention obtains these parameters of compression to determine the extent of the compression in response to transmission network conditions, so that by machine-learning, the next compression is performed only to the extent that it is optimised for the network condition.
Subsequently, at 105, a neural network, which is called the Agent in this description, selects or generates a compression parameter, denoted the Gob parameter, for compressing the genome sequence of the species next in the queue or series, in position n, based on the information just obtained by the network condition detector.
Finally, at 107, a Learning-based gEnome Codec (LEC), i.e. a compressor-decompressor module, compresses the genome sequence of that next species in position n based on the chosen or selected Gob parameter, to produce the compressed genome sequence as a file, i.e., n.lec, completing the process, at 109.
The above steps are repeated in as many cycles as needed for all the species to be compressed and transmitted, that is, starting again at step 101 for species in the queue or series position n+1, the network condition detector obtains real-time network condition during the successful transmission of the compressed genome sequence of the species in the queue or series position n.
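The repeated cycle can be sketched as a simple loop. This is a sequential simplification under stated assumptions: the four callables are hypothetical stand-ins for the Agent, the LEC and the network, and the observed throughput values are invented for illustration.

```python
from typing import Callable


def transmit_series(
    genomes: list[str],
    default_gob: int,
    choose_gob: Callable[[float], int],
    compress: Callable[[str, int], bytes],
    send: Callable[[bytes], float],
) -> list[int]:
    """Compress and send genomes in order; return the Gob used for each one."""
    used = []
    gob = default_gob  # the first genome uses the pre-determined default
    for genome in genomes:
        payload = compress(genome, gob)
        throughput = send(payload)    # network condition observed while sending
        used.append(gob)
        gob = choose_gob(throughput)  # parameter for the next genome in the series
    return used


# Hypothetical stand-ins, for illustration only.
observed = iter([10.0, 2.0, 10.0])
used_gobs = transmit_series(
    genomes=["ACGT", "TTGA", "CCAG"],
    default_gob=2,
    choose_gob=lambda throughput: 1 if throughput > 5.0 else 3,
    compress=lambda genome, gob: genome.encode("ascii"),
    send=lambda payload: next(observed),
)
```

Each genome after the first is compressed with a parameter chosen from the conditions observed during the transmission immediately before it.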
The cycles are applied by a framework comprising an Environment, and the above-mentioned Agent which is a reinforcement-learning (RL) neural network, and which has a reward function for the Agent.
An illustration of the Environment 201 is shown in
An illustration of the Agent 301 is shown in
As mentioned, the Environment 201 comprises two processes, one process 203, P1 and the other process 205, P2. Specifically, P1 is the process of transmitting a compressed genome of a species through a network, while P2 is the process of compressing an original genome of a species using a learning-based genome codec (LEC).
P1 denotes the process of transmitting the file, n−1.lec, containing the compressed genome sequence 505 of the species in the queue or series position n−1, i.e. the (n−1)th species.
For each cycle, information and variables during process P1 include:
These variables, including the genome sequence of the next species, i.e. the genome of the nth species $g_n$, are collectively called the Environment state 207, denoted $S_e$.
In brief, therefore, $S_a$ includes at least the following information: $GoB_{n-1}$, $c_{n-1}$, and the throughput $u_{n-1}$ of the transmission of the (n−1)th species.
Specifically, the Agent state is $S_a = (\overrightarrow{u_{n-1}}, \overrightarrow{x_n}, c_{n-1}, GoB_{n-1})$.
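One way to hold the Agent state in code is a small immutable record. The sketch below is an assumption about the layout: it reads the two vector components as sequences of floats, with the second holding one candidate compressed size per available Gob, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    past_throughputs: tuple[float, ...]  # u_{n-1}: throughput of the previous transmission
    next_sizes: tuple[float, ...]        # x_n: sizes of the next genome (assumed one per Gob)
    genomes_left: int                    # c_{n-1}: genomes still to be compressed and sent
    last_gob: int                        # GoB_{n-1}: parameter used for the previous genome

    def as_vector(self) -> list[float]:
        """Flatten the state into a single list suitable as a network input."""
        return [
            *self.past_throughputs,
            *self.next_sizes,
            float(self.genomes_left),
            float(self.last_gob),
        ]


state = AgentState(
    past_throughputs=(1.0, 2.0),
    next_sizes=(3.0,),
    genomes_left=4,
    last_gob=5,
)
state_vector = state.as_vector()
```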
In other words, the last-used compression parameter $GoB_{n-1}$, applied to the genome sequence of the species ahead in queue or serial position n−1 which has just been transmitted, is not presumed to be the appropriate compression parameter for the present genome sequence of the nth species. Instead, the proposed method continually observes the delays in the transmission of the compressed genome sequence 505 of the species one place ahead in the queue or series, and either selects afresh the most suitable compression parameter or modifies a compression parameter into the most suitable one, based on the delays and the other information in the Agent state $S_a$.
Therefore, it can be said that the action taken by the Agent 301 is modified whenever the Environment 201 changes, and the proposed method provides a possibility of optimizing each compression of genome data so as to keep transmission delay low.
The efficiency of a transmission is reduced by delays in the transmission. It should be noted that transmission or computer network delay refers to latency in the travel of a single data bit across a network from one communication endpoint 701 to another 703. The different types of delays in the transmission are illustrated in
The overall delay in a transmission is defined as the combined effect of the following four specific types of delays, i.e.
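Reading "combined effect" as a simple sum, and taking the four delay types to be the transmission, propagation, processing and queuing delays named in the evaluation part of this description, the overall delay can be sketched as below. The symbols are illustrative assumptions and are not the notation of equation (1).

```latex
d_{\text{total}} = d_{\text{trans}} + d_{\text{prop}} + d_{\text{proc}} + d_{\text{queue}}
```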
Regarding transmission delay, it may be expressed as a function, i.e.
Therefore, compression reduces the size of the genome sequence, and this can reduce transmission time.
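As a minimal illustration of why a smaller file shortens transmission, transmission delay can be taken as file size divided by link bandwidth. That textbook form is assumed here; the function name and the example figures are hypothetical.

```python
def transmission_delay(size_bits: float, bandwidth_bits_per_s: float) -> float:
    """Seconds needed to push size_bits onto a link of the given bandwidth."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bits / bandwidth_bits_per_s


uncompressed_s = transmission_delay(8_000_000, 1_000_000)  # 8 s
compressed_s = transmission_delay(2_000_000, 1_000_000)    # 2 s after 4x compression
```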
Regarding propagation delay, it may be calculated as follows,
Equations (3) and (4) show how it has been taken into consideration that propagation delay varies with propagation medium and link length.
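The dependence on medium and link length can be sketched with the textbook relation of link length divided by propagation speed. This form is an assumption, and the speed table holds rough, commonly quoted values rather than figures from this description.

```python
# Approximate signal propagation speeds in metres per second (rough values).
PROPAGATION_SPEED_M_PER_S = {
    "fiber": 2.0e8,     # roughly two thirds of the speed of light
    "copper": 2.0e8,    # typical order of magnitude for cable
    "wireless": 3.0e8,  # close to the speed of light
}


def propagation_delay(link_length_m: float, medium: str) -> float:
    """Seconds for a signal to travel link_length_m through the given medium."""
    return link_length_m / PROPAGATION_SPEED_M_PER_S[medium]


fiber_delay_s = propagation_delay(2_000_000, "fiber")        # 2000 km of fiber
wireless_delay_s = propagation_delay(2_000_000, "wireless")  # same length, faster medium
```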
When the genome sequence of the preceding species, the (n−1)th species, has been compressed by an LEC 503, the Agent 301 inputs the Agent state $S_a$ to the neural network 305.
Preferably, the neural network comprises an Asynchronous Advantage Actor-Critic algorithm (A3C). A3C is a state-of-the-art Actor-Critic method, which consists of an Actor neural network 801 and a Critic neural network 803. Given an array of available Gobs, i.e. $\overrightarrow{GoB} = [GoB_1, \ldots, GoB_i]$, or an array of available strides, i.e. $\overrightarrow{str} = [str_1, \ldots, str_j]$, A3C can be used to generate adaptive Gob policies or adaptive stride policies.
On receiving the Agent state $S_a = (\overrightarrow{u_{n-1}}, \overrightarrow{x_n}, c_{n-1}, GoB_{n-1})$, the Actor neural network 801 begins to learn a policy, and takes an action A that corresponds to the Gob for compressing the genome of the present species, i.e. the nth species. Specifically, the following steps are executed.
The index $l$ of the largest probability $p_l$ in the array $\vec{p}$ is computed. Subsequently, $l$ and $\overrightarrow{GoB}$ are used to obtain the adaptive Gob for the present nth genome (i.e., $GoB_n$). This may be expressed as,
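The selection step amounts to an argmax over the Actor's output followed by a lookup in the array of available Gobs. A minimal sketch, with a hypothetical function name and invented example values:

```python
def select_gob(probabilities: list[float], available_gobs: list[int]) -> int:
    """Return the Gob at the index of the largest probability."""
    if not available_gobs or len(probabilities) != len(available_gobs):
        raise ValueError("need one probability per available Gob")
    # l is the index of the largest probability p_l
    l = max(range(len(probabilities)), key=probabilities.__getitem__)
    return available_gobs[l]


gob_n = select_gob([0.1, 0.7, 0.2], [4, 8, 16])
```

The same lookup would apply to the array of available strides when the stride variant is used.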
In short, the policy can be described with a probability distribution over actions and states as follows,
Step 1) and step 2) in the Critic neural network 803 are similar to step 1) and step 2) in the Actor neural network, and do not need to be described again.
In step 3) of the Critic neural network 803, the 128-dimensional vector (i.e., the output of step 2)) is fed into a linear neuron (without activation function) to generate a value $V(S_a)$. If faced with two states, the reinforcement-learning (RL) neural network, i.e. the Agent 301, compares the values of the two states and then takes the better policy.
A value function is then designed according to the policy, which is defined as $v^{\pi_\theta}(s_t)$, where $\pi_\theta(\cdot)$ denotes the policy function.
After applying each action A, the Environment 201 provides the Agent 301 with a reward, R, for that training data. Moreover, the Agent 301 aims to maximize the expected and cumulative rewards as follows:
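In reinforcement learning, expected cumulative reward is commonly computed as a discounted sum of per-step rewards. The sketch below assumes that form; the discount factor `gamma` and its default value are assumptions and are not stated in this description.

```python
def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        # Horner-style accumulation: total = r_t + gamma * (return from t+1)
        total = reward + gamma * total
    return total


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```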
The reward is designed to reflect the performance of the network transmission of each species' genome according to the following factors:
Based on this, the reward for the nth species' genome is given as follows,
Generally, there are two kinds of quality evaluation functions, a linear function and a log function. The reward function using linear quality evaluation is expressed as follows,
The reward function using log quality evaluation is expressed as follows,
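The general shape of such a reward can be sketched as follows. This is not the reward function of the invention: it is an assumed form that rewards the quality of the chosen Gob, penalizes transmission delay and penalizes changes in quality between consecutive genomes, in line with the reward goals stated earlier. The weights, the numeric Gob levels and the function names are all hypothetical.

```python
import math


def linear_quality(gob_level: float) -> float:
    """Linear quality evaluation: quality grows in proportion to the level."""
    return float(gob_level)


def log_quality(gob_level: float, lowest_level: float = 1.0) -> float:
    """Log quality evaluation: diminishing returns at higher levels."""
    return math.log(gob_level / lowest_level)


def reward(gob_level: float, previous_gob_level: float, delay_seconds: float,
           quality=linear_quality, delay_weight: float = 1.0,
           smooth_weight: float = 1.0) -> float:
    """Quality minus a delay penalty minus a penalty for quality fluctuation."""
    q_now = quality(gob_level)
    q_prev = quality(previous_gob_level)
    return q_now - delay_weight * delay_seconds - smooth_weight * abs(q_now - q_prev)


linear_example = reward(8, 4, 2.0)                     # 8 - 2 - |8 - 4|
log_example = reward(4, 4, 0.0, quality=log_quality)   # log(4), no penalties
```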
While there has been described in the foregoing description preferred embodiments of the present invention, it will be understood by those skilled in the technology concerned that many variations or modifications in details of design, construction or operation may be made without departing from the scope of the present invention as claimed.
The proposed method has been evaluated in simulated but realistic network conditions, using a broadband dataset that the Federal Communications Commission (FCC) collected in 2018. The records of network traces in the FCC broadband dataset consist of timestamps and bandwidths, where the former are in seconds and the latter are in MB/sec. In addition, random noise is added to each transmission of compressed genome sequence 505 to emulate real-world computer networks.
However, processing delays and queuing delays in a computer network are not easily available data. Therefore, the evaluation uses round-trip time (RTT) in place of the propagation delay, processing delay and queuing delay of the delay equation (1). RTT does not include transmission delay but includes the other three kinds of delays. Here, RTT is defined as follows,
The RTT may be given a fixed value, e.g., 80 milliseconds. Therefore, based on the previous analysis for network delay and RTT, the combined propagation delay, processing delay and queuing delay can be replaced with RTT.
Finally, network delay is described with transmission delay and RTT as follows,
The variables throughput (in bit/sec) and duration required for calculating transmission delays can be obtained from the broadband dataset.
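Under the simplification described above, the network delay of one compressed genome reduces to its transmission delay plus a fixed RTT. A minimal sketch, assuming transmission delay is compressed size divided by throughput and using the 80 millisecond RTT mentioned as an example; the function name and the sizes are hypothetical.

```python
def network_delay(compressed_size_bits: float, throughput_bits_per_s: float,
                  rtt_seconds: float = 0.080) -> float:
    """Transmission delay plus a fixed round-trip time, in seconds."""
    if throughput_bits_per_s <= 0:
        raise ValueError("throughput must be positive")
    return compressed_size_bits / throughput_bits_per_s + rtt_seconds


delay_s = network_delay(8_000_000, 4_000_000)  # 2 s of transmission plus 0.08 s RTT
```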
It can be seen that the actual bandwidth of the network changes over time, but the proposed method is able to select an appropriate Gob in response to the current bandwidth.
The compressed genome sizes are shown in the middle subgraph. The fluctuations in the size of the compressed genomes are basically consistent with the fluctuations in the actual network bandwidth. This shows that the choice of Gob made by the proposed method reasonably reflects the state of the bandwidth, and is therefore effective.
This application relates to and claims the benefit of U.S. Provisional Application No. U.S. 63/429,796 filed 2 Dec. 2022; the content of the application is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CN2023/134440 | 11/27/2023 | WO |

Number | Date | Country
---|---|---
63429796 | Dec 2022 | US