The present invention belongs to the technical field of computer-aided design of continuous microfluidic biochips, and in particular relates to a deep reinforcement learning (DRL) based control logic design method for continuous microfluidic biochips.
Continuous microfluidic biochips, also known as lab-on-a-chip devices, have received considerable attention in the last decade due to their advantages of high efficiency, high precision and low cost. With the development of such chips, traditional biological and biochemical experiments have been fundamentally changed. Compared with traditional experimental procedures that require manual operation, the execution efficiency and reliability of bioassays are greatly improved because the biochemical operations in biochips are automatically controlled by internal microcontrollers. In addition, this automated process avoids false detection results caused by human intervention. As a result, such lab-on-a-chip devices are increasingly used in areas of biochemistry and biomedicine such as drug discovery and cancer detection.
With advances in manufacturing technology, thousands of valves can now be integrated into a single chip. These valves are arranged in a compact, regular layout to form a flexible, reconfigurable and universal platform known as a Fully Programmable Valve Array (FPVA), which can be used to control the execution of bioassays. However, because an FPVA contains a large number of microvalves, it is impractical to assign a separate pressure source to each valve. To reduce the number of pressure sources, a control logic with multiplexing capability is used to control valve states in the FPVA. In summary, the control logic plays a crucial role in such biochips.
In recent years, several methods have been proposed to optimize the control logic in biochips. For example, control logic synthesis has been investigated to reduce the number of control ports used in biochips; the relationship between switching patterns in the control logic has been studied, and the switching time of the valves has been optimized by adjusting the pattern sequence required by the control valves; and the structure of the control logic has been examined, introducing a multi-channel switching mechanism to reduce the switching time of the control valves. In addition, an independent backup path has been introduced to realize fault tolerance of the control logic. However, none of the above methods sufficiently considers the allocation order between control patterns and multi-channel combinations, resulting in the use of redundant resources in the control logic.
Based on the above analysis, we propose PatternActor, a deep reinforcement learning based control logic design method for continuous microfluidic biochips. With the proposed method, the number of time slices and control valves used in the control logic can be greatly reduced and better control logic synthesis performance is achieved, further reducing the total cost of the control logic and improving the execution efficiency of biochemical applications. To the best of our knowledge, the present invention is the first to optimize the control logic using deep reinforcement learning.
The purpose of the present invention is to provide a Deep Reinforcement Learning (DRL) based control logic design method for continuous microfluidic biochips. With the proposed method, the number of time slices and control valves used in the control logic can be greatly reduced and better control logic synthesis performance is achieved, further reducing the total cost of the control logic and improving the execution efficiency of biochemical applications.
To realize the above purpose, the technical solution of the present invention is as follows: a DRL-based control logic design method for continuous microfluidic biochips, wherein the method comprises the following steps:
Compared with the prior art, the present invention has the following beneficial effects: with the proposed method, the number of time slices and control valves used in the control logic can be greatly reduced and better control logic synthesis performance is achieved, further reducing the total cost of the control logic and improving the execution efficiency of biochemical applications.
The technical solution of the present invention is described in detail in combination with the accompanying drawings.
Proposed in the present invention is a DRL-based control logic design method for continuous microfluidic biochips. Overall steps are as shown in
The method specifically comprises the following design process:
The specific technical solution of the present invention is realized as follows:
Normally, the transition of a control channel from its state at time t to its state at time t+1 is called a time interval. In this time interval, the control logic may need to change the states of the control channels multiple times, so a time interval may consist of one or more time slices, each of which involves changing the state of a relevant control channel. For an original control logic with a multiplexing function, each time slice only involves switching the state of one control channel.
As shown in
In
In order to realize multi-channel switching of the control logic and reduce the number of time slices in the state switching process, the key is to determine which control channels need to switch states simultaneously. Herein we consider the case where the state transitions of the biochemical application have been given, and the control channel states known at each moment are used to reduce the number of time slices in the control logic. A state matrix {tilde over (P)} is constructed to contain the whole state transition process of the application, wherein each row in the {tilde over (P)} matrix represents the state of every control channel at one moment. For example, for the state transition sequence 101→010→100→011, the state matrix {tilde over (P)} can be written as:
In the above state transition sequence, for the transition 101→010, the first and third control channels first need to be connected to the core input, with the pressure value of the core input set to 0, which is then transmitted to the corresponding flow valves through these two channels. Next, the second control channel is connected to the core input; at this time, the pressure value of the core input needs to be set to 1, which is likewise transmitted to the corresponding flow valve through this channel. The switching matrix {tilde over (Y)} is used to represent the above two operations that need to be performed in the control logic. In the switching matrix {tilde over (Y)}, element 1 indicates that a control channel is connected to the core input and that the state value in that channel has been updated to the same pressure value as the core input. Element 0 indicates that a control channel is not connected to the core input and that the state value in that channel is not updated. Therefore, according to the state matrix in the example, the corresponding switching matrix {tilde over (Y)} can be obtained as:
Each row of the {tilde over (Y)} matrix is called a switching pattern. Note that the matrix {tilde over (Y)} contains an element with value X because, in some state transitions such as 010→100, the state value of the third control channel is unchanged at two adjacent moments. Therefore, the third control channel can either choose to update its state value at the same time as the first control channel, or perform no operation and keep its own state value unchanged. For a switching pattern (a row of {tilde over (Y)}) with more than one 1-element, the states of the corresponding control channels may not be updatable at the same time. In that case, the switching pattern must be divided into a plurality of time slices and completed by a plurality of corresponding multi-channel combinations. Therefore, in order to reduce the total number of time slices required for the overall state switching, the multi-channel combination corresponding to each switching pattern should be carefully selected. For the switching matrix {tilde over (Y)}, the number of rows is the total number of switching patterns required to complete all state transitions, and the number of columns is the total number of control channels in the control logic.
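For illustration only, the following Python sketch derives the state matrix and per-transition switching rows from a given state transition sequence. It is a simplified reading of the construction described above: the don't-care (X) handling is not reproduced (unchanged channels are simply left at 0), and the function names are assumptions, not part of the claimed method.

```python
# Illustrative sketch: derive the state matrix and per-transition switching
# rows from a state transition sequence. The don't-care (X) elements of the
# switching matrix described in the text are not modeled here.

def build_state_matrix(sequence):
    """Each row is the state of every control channel at one moment."""
    return [[int(bit) for bit in state] for state in sequence]

def build_switching_rows(sequence):
    """For each transition, emit one row per required core-input value:
    first the channels to be driven to 0, then those to be driven to 1.
    A '1' marks a channel connected to the core input in that operation."""
    rows = []
    for prev, curr in zip(sequence, sequence[1:]):
        for target in ("0", "1"):
            row = [1 if (p != c and c == target) else 0
                   for p, c in zip(prev, curr)]
            if any(row):                     # skip empty patterns
                rows.append(row)
    return rows

sequence = ["101", "010", "100", "011"]      # example sequence from the text
print(build_state_matrix(sequence))          # [[1,0,1],[0,1,0],[1,0,0],[0,1,1]]
print(build_switching_rows(sequence))
# -> [[1,0,1],[0,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,1]] (six switching patterns)
```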
In this example, the goal is to select efficient multi-channel combinations to implement all switching patterns in the switching matrix {tilde over (Y)} while ensuring that the total number of time slices used to complete the process is minimal.
For N control channels, the 2^N−1 possible multi-channel combinations can be represented by a multiplexing matrix {tilde over (X)} with N columns, where one or more combinations need to be selected from the rows of the {tilde over (X)} matrix to achieve the switching pattern represented by each row of the {tilde over (Y)} matrix. In fact, for each switching pattern in the switching matrix {tilde over (Y)}, the number of feasible multi-channel combinations that can realize that pattern is far less than the total number of multi-channel combinations in the multiplexing matrix {tilde over (X)}. A closer look reveals that the multi-channel combinations that enable a switching pattern are determined by the positions and number of 1-elements in the pattern. For example, for the switching pattern 011, the number of 1-elements is 2 and they are located in the second and third positions of the pattern, which means that the multi-channel combinations realizing this pattern only involve the second and third control channels of the control logic. Therefore, the optional multi-channel combinations that can realize the switching pattern 011 are 011, 010 and 001, and only these three multi-channel combinations are needed. Using this property, we can infer that the number of optional multi-channel combinations realizing a certain switching pattern is 2^n−1, wherein n represents the number of 1-elements in the switching pattern.
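For illustration, a short Python sketch (an assumed helper, not part of the claimed method) that enumerates the 2^n−1 candidate multi-channel combinations of a switching pattern follows; it reproduces the 011 example above.

```python
from itertools import combinations

def candidate_combinations(pattern):
    """Enumerate the 2**n - 1 multi-channel combinations able to realize a
    switching pattern, where n is the number of 1-elements in the pattern."""
    ones = [k for k, bit in enumerate(pattern) if bit == 1]
    candidates = []
    for size in range(1, len(ones) + 1):
        for subset in combinations(ones, size):
            combo = [0] * len(pattern)
            for k in subset:
                combo[k] = 1
            candidates.append(combo)
    return candidates

# Pattern 011 -> [[0,1,0], [0,0,1], [0,1,1]], i.e. 2**2 - 1 = 3 candidates.
print(candidate_combinations([0, 1, 1]))
```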
As described above, for the switching pattern of each row in the switching matrix, a joint vector group {right arrow over (M)} can be constructed to contain the alternative multi-channel combinations that can realize each switching pattern. For example, for the switching matrix {tilde over (Y)} in the above example, the corresponding joint vector group {right arrow over (M)} is defined as:
The number of vector groups in the joint vector group {right arrow over (M)} is the same as the number of rows X in the switching matrix, and each vector group contains 2^n−1 sub-vectors of dimension N, which are the optional multi-channel combinations to achieve the corresponding switching pattern. When the element mi,j,k in the joint vector group {right arrow over (M)} is 1, it means that the control channel corresponding to that element is involved in realizing the i-th switching pattern.
Since the ultimate goal of the multi-channel switching scheme is to realize the switching matrix {tilde over (Y)} by selecting multi-channel combinations represented by the sub-vectors of each vector group in the joint vector group {right arrow over (M)}, a method array {circumflex over (T)} is constructed to record, for the switching pattern of each row in the switching matrix {tilde over (Y)}, the positions in {right arrow over (M)} of the corresponding multi-channel combinations used. This also makes it convenient to obtain the specific multi-channel combinations required. The method array {circumflex over (T)} contains X sub-arrays (consistent with the number of rows in the switching matrix {tilde over (Y)}), and the number of elements in each sub-array is determined by the number of 1-elements in the corresponding switching pattern, that is, the number of elements in the sub-array is 2^n−1. For the above example, the method array {circumflex over (T)} is defined as follows:
{circumflex over (T)}=[[0,0,1],[1],[1],[1],[1],[0,0,1]]  (4)
wherein, the i-th sub-array in {circumflex over (T)} represents that some combinations of the i-th vector group in {right arrow over (M)} are selected to realize the switching pattern of the i-th row of the switching matrix. For example,
For an element yi,k in the matrix {tilde over (Y)}, when its value is 1, it indicates that the i-th switching pattern involves the k-th control channel in the state switching, so it is necessary to select, from the i-th vector group in the joint vector group {right arrow over (M)}, a sub-vector whose k-th column is also 1 to realize the switching pattern. This constraint may be expressed as follows:
The maximum number of control patterns allowed in the control logic is usually determined by the number of external pressure sources and is expressed as a constant Qcw with a value of 2^┌log2N┐, which is usually much less than 2^N−1. In addition, for the sub-vectors selected from the joint vector group {right arrow over (M)}, a binary row vector {right arrow over (G)} with values 0 or 1 is constructed to record the non-repeating sub-vectors (multi-channel combinations) finally selected. The total number of non-repeating sub-vectors finally selected cannot be greater than Qcw, so the constraint is as follows:
If the j-th element of the i-th sub-array in the method array {circumflex over (T)} is not 1, the multi-channel combination represented by the j-th sub-vector of the i-th vector group in the joint vector group {right arrow over (M)} is not selected. However, sub-vectors with the same element values may exist in other vector groups of {right arrow over (M)}, so a multi-channel combination with the same element values may still be selected. Only when a certain multi-channel combination is not selected anywhere in the whole process is the corresponding column element of {right arrow over (G)} set to 0, and its constraint is:
ti,j≤G[mi,j]  (7)
∀i=0, . . . , X−1, j=0, . . . , H(j)
Each sub-array in {circumflex over (T)} indicates which multi-channel combinations (represented by sub-vectors) are selected from the corresponding vector group of {right arrow over (M)} to implement the corresponding switching pattern in {tilde over (Y)}. The number of 1-elements in each sub-array of {circumflex over (T)} is the number of time slices required to implement the corresponding switching pattern in {tilde over (Y)}. Therefore, in order to minimize the total number of time slices for realizing all switching patterns in {tilde over (Y)}, the optimization problem to be solved is as follows:
By solving the optimization problem shown above, the multi-channel combinations required to realize the whole switching scheme are obtained from the value of {right arrow over (G)}. Also, the multi-channel combination used for the switching pattern of each row in {tilde over (Y)} is determined by the value of ti,j; that is, when the value of ti,j is 1, the multi-channel combination is the value of the sub-vector represented by mi,j.
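To make the optimization concrete, the following is a minimal sketch of the integer linear programming model written with the open-source PuLP package. It is an illustrative reading of the model, not the exact formulation of the invention: don't-care (X) elements are ignored, and the example switching matrix and Qcw value are assumptions taken from the earlier example.

```python
from itertools import combinations
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

def candidates(pattern):
    """All 2**n - 1 multi-channel combinations that can realize a pattern."""
    ones = [k for k, bit in enumerate(pattern) if bit == 1]
    out = []
    for size in range(1, len(ones) + 1):
        for subset in combinations(ones, size):
            out.append(tuple(1 if k in subset else 0 for k in range(len(pattern))))
    return out

def minimize_time_slices(Y, Q_cw):
    M = [candidates(row) for row in Y]                  # joint vector group
    combos = sorted({c for group in M for c in group})  # distinct combinations
    prob = LpProblem("time_slice_minimization", LpMinimize)
    t = {(i, j): LpVariable(f"t_{i}_{j}", cat=LpBinary)
         for i, group in enumerate(M) for j in range(len(group))}
    G = {c: LpVariable(f"G_{idx}", cat=LpBinary) for idx, c in enumerate(combos)}
    prob += lpSum(t.values())                   # objective: total time slices
    for i, row in enumerate(Y):                 # every 1-element must be covered
        for k, bit in enumerate(row):
            if bit == 1:
                prob += lpSum(t[i, j] for j, c in enumerate(M[i]) if c[k] == 1) >= 1
    for i, group in enumerate(M):               # linking to selected combos, cf. (7)
        for j, c in enumerate(group):
            prob += t[i, j] <= G[c]
    prob += lpSum(G.values()) <= Q_cw           # at most Q_cw distinct combinations
    prob.solve()
    chosen = [c for c in combos if value(G[c]) > 0.5]
    return value(prob.objective), chosen

# Example switching matrix from the 101->010->100->011 sequence, Q_cw = 4 for N = 3.
Y = [[1, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 1]]
print(minimize_time_slices(Y, Q_cw=4))          # total slices and chosen combinations
```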
3. An allocation process of control pattern:
By solving the integer linear programming model constructed above, the control channels to be switched independently or simultaneously can be obtained. These channels are collectively referred to as the multi-channel switching scheme. The scheme is represented by a multi-path matrix, as shown in (9). In this matrix, there are nine flow valves (i.e., f1-f9) connected to the core input, and there are five multi-channel combinations in total to achieve the multi-channel switching. In this case, each of these five combinations needs to be allocated a control pattern. Herein, we first assign five different control patterns to the multi-channel combinations in the rows of the matrix; these control patterns are listed on the right side of the matrix. This allocation process is the basis for building a complete control logic.
4. An optimization process for PatternActor:
For control channels that require state switching, an appropriate control pattern must be carefully selected. In the present invention, we propose PatternActor, a method based on deep reinforcement learning, to seek a more effective pattern allocation scheme for control logic synthesis. Specifically, it builds DDQN models as reinforcement learning agents, which use effective pattern information to learn how to allocate control patterns and thereby determine which pattern is more effective for a given multi-channel combination.
The basic idea of deep reinforcement learning is that the agent constantly adjusts the decisions made at each time t to obtain the overall optimal policy. This policy adjustment is based on the reward returned by the interaction between the agent and the environment. The flow chart of the interaction is as shown in
For the optimization process of PatternActor, the present invention mainly uses deep neural networks (DNNs) to record data, since they can effectively approximate the state value function used to find the optimal policy. In addition to determining the model for recording data, three elements (state, action and reward) need to be designed next to build a deep reinforcement learning framework for control logic synthesis.
Before designing these three elements, we first initialize the number of control ports available in the control logic as 2×┌log2N┐; these ports can accordingly form 2^┌log2N┐ control patterns. In the present invention, the main objective of this process is to select an appropriate control pattern for each multi-channel combination, thus ensuring that the total cost of the control logic is minimized.
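As a quick numeric illustration of this initialization (the value N=9 is only an assumed example), the port and pattern counts can be computed as follows.

```python
import math

def control_logic_budget(N):
    """Ports initialized for N control channels (2*ceil(log2 N)) and the
    number of control patterns they can form (2**ceil(log2 N))."""
    ports = 2 * math.ceil(math.log2(N))
    patterns = 2 ** math.ceil(math.log2(N))
    return ports, patterns

print(control_logic_budget(9))   # e.g. N = 9 -> 8 control ports, 16 control patterns
```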
4.1. State Design of PatternActor
Before selecting an appropriate control pattern for a multi-channel combination, the agent state first needs to be designed. The state represents the current situation, which affects the agent's selection of a control pattern, and is usually expressed as s. We design the state by concatenating the multi-channel combination at time t with a coded sequence of the actions selected at all previous times. The purpose of this state design is to ensure that the agent takes into account both the current multi-channel combination and the existing pattern allocation scheme, so that it can make better decisions. Note that the length of the encoding sequence is equal to the number of rows in the multi-path matrix, that is, each multi-channel combination corresponds to one action code in the sequence.
Taking the multi-path matrix in (10) as an example, the initial state s0 is designed according to the combination represented by the first row of the multi-path matrix, and the time t increases with the row index of the matrix. Therefore, the current state at time t+2 is represented as st+2. Accordingly, the multi-channel combination “001001010” in the third row of the multi-path matrix needs to be assigned a control pattern. If the combinations in the first two rows of the multi-path matrix have been allocated the second and third control patterns, respectively, then the state st+2 is designed to be (00100101023000). Since the combinations at the current and subsequent moments have not been allocated any control pattern, the action codes corresponding to these combinations are represented by zeros in the sequence. All such states form a state space S.
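A minimal Python sketch of this state encoding is given below for illustration. Only the third row “001001010” is taken from the text; the other rows of the assumed five-row matrix are placeholders, since matrix (10) itself is not reproduced here, and the helper name is an assumption.

```python
def encode_state(multipath_matrix, t, action_codes):
    """Concatenate the multi-channel combination at time t with the code of
    the control pattern already chosen for every row; rows not yet allocated
    keep an action code of 0."""
    combination = multipath_matrix[t]
    return tuple(combination) + tuple(action_codes)

# Placeholder matrix: only row 3 ("001001010") follows the text; the other
# rows are illustrative assumptions standing in for matrix (10).
multipath = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
action_codes = [2, 3, 0, 0, 0]   # first two rows got patterns "2" and "3"
print(encode_state(multipath, t=2, action_codes=action_codes))
# -> (0,0,1,0,0,1,0,1,0, 2,3,0,0,0), matching the state (00100101023000)
```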
4.2. Action Design of PatternActor
An action represents what the agent decides to do in the current state and is usually represented as a. Since each multi-channel combination needs to be allocated a control pattern, the action is naturally one of the unselected control patterns. Each control pattern can be selected only once, and all control patterns generated by the control ports constitute an action space A. In addition, the control patterns in A are encoded in ascending order of serial number as “1”, “2”, “3”, etc. When the agent takes an action in a certain state, the action code indicates which control pattern has been allocated.
4.3. Reward Function Design of PatternActor
The reward represents the revenue that the agent obtains by taking an action, and is usually expressed as r. By designing the reward function over states, the agent can obtain effective signals and learn in the right way. For a multi-path matrix, assuming that the number of rows in the matrix is h, we represent the initial state as si and the termination state as si+h−1 accordingly. In order to guide the agent toward a more efficient pattern allocation scheme, the design of the reward function involves two Boolean logic simplification methods: logic tree simplification and logic forest simplification. The implementation of these two techniques in the reward function is described below.
(1) Simplification of the Logic Tree:
Logic tree simplification is performed on the Boolean logic of an individual flow valve. It mainly uses the Quine-McCluskey method to simplify the internal logic of the flow valve; in other words, it merges and cancels the control valves used in the internal logic. For example, control patterns, such as
For the design of the reward function, the following variables are considered. Firstly, we consider the situation in which a control pattern has been allocated to the multi-channel combination of the current state; the number of control valves that can be simplified by allocating this pattern is denoted svc. Secondly, on the basis of the above situation, we randomly assign another feasible pattern to the next combination, and the number of control valves that can be simplified in this way is denoted svn. In addition, we consider the case where the next multi-channel combination is successively allocated each of the remaining control patterns in the current state; in this case, we take the maximum number of control valves required by the control logic, denoted Vm. Based on the above three variables, the reward function from state si to si+h−3 is expressed as rt=svc+λ×svn−β×Vm, wherein λ and β are two weight factors whose values are set to 0.16 and 0.84, respectively. These two factors indicate the extent to which the two situations involving the next combination influence pattern selection in the current state.
(2) Simplification of the Logic Forest:
Logic forest simplification is achieved by merging the simplified logic trees of different flow valves to further optimize the control logic in a global manner. The same multi-path matrix example in (10) is used to illustrate this optimization, which is primarily achieved by sequentially merging the logic trees of f1-f3 so that more valve resources are shared; the simplification procedure is shown in
For state si+h−2, when the current multi-channel combination has already been allocated a control pattern, we consider the case where the last combination selects each of the remaining available patterns, and the minimum number of control valves required by the control logic in this case is denoted Vu. On the other hand, for the termination state si+h−1, the sum of the number of control valves and the path length is considered and denoted spv. For these last two states, the case involving the variable svc mentioned above is also considered. Therefore, for the termination state si+h−1, the reward function is rt=svc−spv, and for the state si+h−2, the reward function is rt=svc−Vu.
To sum up, the overall reward function can be expressed as follows:
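The piecewise form combines the cases described above. As a hedged illustration (assuming svc, svn, Vm, Vu and spv are computed elsewhere as described), a minimal Python sketch of this overall reward is:

```python
LAMBDA, BETA = 0.16, 0.84   # weight factors given in the text

def reward(t, i, h, svc, svn=0.0, Vm=0.0, Vu=0.0, spv=0.0):
    """Piecewise reward for a multi-path matrix with h rows whose episode
    runs from state s_i to the termination state s_{i+h-1}."""
    if t <= i + h - 3:                      # states s_i ... s_{i+h-3}
        return svc + LAMBDA * svn - BETA * Vm
    if t == i + h - 2:                      # penultimate state s_{i+h-2}
        return svc - Vu
    return svc - spv                        # termination state s_{i+h-1}
```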
After designing the above three elements, the agent can construct the control logic by reinforcement learning. In general, reinforcement learning problems are mainly solved through a Q-learning approach, which focuses on estimating the value function of each state-action pair, i.e., Q(s,a), and then selecting the action with the maximum Q-value in the current state. The value of Q(s,a) is calculated based on the reward received for performing action a in state s. In essence, reinforcement learning learns the mapping between state-action pairs and rewards.
For the state st ∈S and action at ∈A at time t, the Q value of the state-action pair, that is, Q(st,at), is predicted by iterative updating of the formula shown below.
where α∈(0,1] represents the learning rate and γ∈[0,1] represents the discount factor. The discount factor reflects the relative importance of the current reward and future rewards, and the learning rate reflects the learning speed of the agent. Q′(st,at) represents the original Q value of this state-action pair, rt is the current reward received from the environment after performing the action at, and st+1 represents the state at the next moment. Essentially, Q-learning estimates the value of Q(st,at) by approximating the long-term cumulative reward, which is the sum of the current reward rt and the discounted maximum Q value over all actions available in the next state st+1.
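To make the symbols concrete, a minimal tabular sketch of this standard Q-learning update follows; the Q-table representation and the default α and γ values are assumptions for illustration.

```python
from collections import defaultdict

Q = defaultdict(float)          # Q-table over (state, action) pairs

def q_update(s_t, a_t, r_t, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s_t,a_t) <- Q'(s_t,a_t)
                     + alpha * (r_t + gamma * max_a Q(s_next,a) - Q'(s_t,a_t))"""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_t, a_t)] += alpha * (r_t + gamma * best_next - Q[(s_t, a_t)])
```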
Because the evaluation value of the max operator in Q-learning, namely maxa∈A Q(st+1, a), is overestimated, a sub-optimal action may exceed the optimal action in Q value, resulting in failure to find the optimal action. Based on existing work, the DDQN can effectively solve this problem; therefore, in our proposed approach, we use this model to design the control logic. The DDQN consists of two DNNs, called the policy network and the target network, wherein the policy network selects the action for a state and the target network evaluates the quality of the action taken. The two networks work alternately.
In the training process of the DDQN, in order to evaluate the quality of the action taken in the current state st, the policy network firstly finds the action amax, which maximizes the Q value in the next state st+1, as follows:
The next state st+1 is then transmitted to the target network to calculate the Q value of the action amax (i.e., Q(st+1,amax,θt−)). Finally, this Q value is used to calculate a target value Yt, which is used to evaluate the quality of the action taken in the current state st, as follows:
Yt=rt+γQ(st+1,amax,θt−)  (14)
Through the policy network, the Q values of all possible actions in state st can be obtained, and then an appropriate action can be selected for the state through the action selection policy. Taking action a2 selected in state st as an example, as shown in
In the present invention, both neural networks in the DDQN consist of two fully connected layers and are initialized with random weights and bias.
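For illustration, a minimal PyTorch sketch of such a two-layer fully connected network is given below; the hidden-layer width is an assumed value, and PyTorch's default layer initialization supplies the random weights and biases mentioned above.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Two fully connected layers, usable for both policy and target networks."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)   # Q values of all actions for the given state
```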
Firstly, the parameters related to the policy network, the target network and the experience replay buffer must be initialized separately. Specifically, the experience replay buffer is a circular buffer that records the information of previous control pattern allocations in each round. Each piece of such information is referred to as a transition. A transition consists of five elements, i.e., (st,at,rt,st+1,done). In addition to the first four elements described above, the fifth element done indicates whether the termination state has been reached, and is a variable with value 0 or 1. Once the value of done is 1, all multi-channel combinations have been allocated their corresponding control patterns; otherwise, there are still combinations in the multi-path matrix to which control patterns need to be allocated. A storage capacity is set for the experience replay buffer: if the number of stored transitions exceeds the maximum capacity of the buffer, the oldest transition is replaced by the newest one.
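A minimal sketch of such a circular buffer storing (st, at, rt, st+1, done) transitions follows; the capacity value is an assumption.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "done"])

class ReplayBuffer:
    """Circular buffer: once full, the oldest transition is overwritten."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```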
The number of training episodes is then initialized as a constant E, and the agent is ready to interact with the environment. Before the interaction process begins, we need to reset the parameters of the training environment. In addition, before each round of interaction begins, it is checked whether the current round has reached the termination state. Within a round, if the current state has not reached the termination state, a feasible control pattern is selected for the multi-channel combination corresponding to the current state.
The calculation of Q values in the policy network involves action selection. The ε-greedy policy is mainly used to select the control pattern from the action space, in which ε is a value distributed in the interval [0.1, 0.9]. Specifically, the control pattern with the maximum Q value is selected with probability ε; otherwise, the control pattern is randomly selected from the action space A. This policy enables the agent to choose a control pattern with a trade-off between exploitation and exploration. During training, the value of ε is increased under the influence of an increment coefficient.
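A hedged sketch of this ε-greedy selection with an increasing ε follows; the increment value and the upper bound are assumptions consistent with the interval described above.

```python
import random

def epsilon_greedy(q_values, available_actions, eps):
    """With probability eps exploit (pattern with maximum Q value),
    otherwise explore by picking a random unselected pattern."""
    if random.random() < eps:
        return max(available_actions, key=lambda a: q_values[a])
    return random.choice(available_actions)

def anneal(eps, increment=0.001, eps_max=0.9):
    """Increase eps during training so the agent exploits more as it learns."""
    return min(eps_max, eps + increment)
```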
After that, the transition composed of these five elements is stored in sequence in the experience replay buffer. After a certain number of iterations, the agent is ready to learn from previous experience. During the learning process, a small batch of transitions is randomly sampled from the experience replay buffer as learning samples, which enables the network to be updated more efficiently. The loss function in (15) is used to update the parameters of the policy network through gradient descent and back propagation.
L(θ)=E[(rt+γQ(st+1,amax;θt−)−Q(st,at;θt))^2]  (15)
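A minimal PyTorch training-step sketch of this update is shown below. It combines the double-DQN target of (14) with the loss in (15); the (1−done) masking of terminal targets, the use of mean-squared error over a batch, and the optimizer handling are standard assumptions rather than details given in the text, and the batch is assumed to hold transitions whose states are tensors.

```python
import torch
import torch.nn.functional as F

def train_step(policy_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient-descent update of the policy network on a sampled batch
    of (s, a, r, s_next, done) transitions."""
    s = torch.stack([b.s for b in batch])
    a = torch.tensor([b.a for b in batch]).unsqueeze(-1)
    r = torch.tensor([b.r for b in batch], dtype=torch.float32)
    s_next = torch.stack([b.s_next for b in batch])
    done = torch.tensor([b.done for b in batch], dtype=torch.float32)

    q_sa = policy_net(s).gather(-1, a).squeeze(-1)        # Q(s_t, a_t; theta)
    with torch.no_grad():                                 # double-DQN target
        a_max = policy_net(s_next).argmax(dim=-1, keepdim=True)
        q_next = target_net(s_next).gather(-1, a_max).squeeze(-1)
        y = r + gamma * q_next * (1.0 - done)             # Y_t, cf. (14)
    loss = F.mse_loss(q_sa, y)                            # loss, cf. (15)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```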
After several cycles of learning, the old parameters of the target network are periodically replaced by the new parameters of the policy network. It should be noted that the current state transitions to the next state st+1 at the end of each round of interaction. Finally, the agent uses PatternActor to record the best solution found so far. The whole learning process ends when the preset number of training episodes is reached.
The above are preferred embodiments of the present invention, and any change made in accordance with the technical solution of the present invention shall fall within the protection scope of the present invention if its function and role do not exceed the scope of the technical solution of the present invention.
This application is a continuation application of International Application No. PCT/CN2023/089652, filed on Apr. 21, 2023, which is based upon and claims priority to Chinese Patent Application No. 202210585659.2, filed on May 27, 2022, the entire contents of which are incorporated herein by reference.