The present techniques relate to a method for a parallel implementation of a sequential Monte Carlo method for modelling an industrial process on a distributed memory architecture, and to a system for implementing the same.
Sequential Monte Carlo (SMC) methods are a well-established family of algorithms for solving state estimation problems related to dynamic and static models, given a stream of data (i.e. measurements). The key idea is to employ a user-defined proposal distribution to draw N samples (i.e. particles) which approximate the probability density function of the state of the model. A correction step, called resampling, is then used to counteract the particle degeneracy that the sampling technique introduces. The application domains are vast and diverse and range, for example, from positioning [1] to medical research [2], [3], risk analysis [4], weather forecasting [5], financial econometrics [6] and, broadly speaking, any research or industrial field in which it is important to collect data and then make predictions.
Modern applications of SMC methods have increasingly demanding accuracy and run-time constraints, which can be met in several ways. For example, the number of particles could be increased [9], [10], more sophisticated proposal distributions could be used as described in [7], or more measurements could be used if possible [8]. However, applying any of these solutions is likely to slow down the run-time significantly, which becomes even more problematic if the constraint on the measurement rate is strict. An alternative way to meet the run-time constraints without losing accuracy is to employ parallel computing.
SMC methods are parallelisable, but it is not trivial to achieve an efficient parallelisation. The resampling step, which is necessary to respond to particle degeneracy [24], is a challenging task to parallelise. This is due to the problems encountered in parallelising the constituent redistribute step, whose textbook implementations achieve O(N) time complexity on a single core.
On Shared Memory Architectures (SMAs), it has been proved that redistribute can achieve O(log2 N) time complexity. Examples can be found in [9] to [12] for GPUs. However, High Performance Computing (HPC) applications need to use Distributed Memory Architectures (DMAs) to overcome the limitations of modern SMAs, namely low memory capacity and a low degree of parallelism (DOP).
On DMAs, parallelisation is more complicated because the cores cannot directly access each other's memory without exchanging messages. Three DMA solutions (along with mixed versions of them) are presented in [13], [14]: Centralized Resampling (C-R), Resampling with Proportional Allocation (RPA) and Resampling with Non Proportional Allocation (RNA). These approaches have been re-interpreted and improved several times, such as in [15]-[18]. In C-R, a central unit gathers the particles from all cores, performs redistribution and scatters the result back to the cores, making the communication increase linearly with the degree of parallelism (DOP). RPA is randomised: redistribute is partly, or potentially entirely, centralised to a single master core or a subset of master cores, leading to strongly non-deterministic data movement. In RNA, although the central unit is simpler than in RPA, after performing local resampling, particle exchange between neighbouring cores is performed cyclically until redistribution is achieved, risking heavy redundant communication. In [18] to [20], it has been proved that such strategies have accuracy or stability issues, especially for unbalanced workloads, large N or large DOP.
According to the present invention there is provided an apparatus and method as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows. We also describe a method for estimating a true state of a physical system, the method comprising receiving, from at least one sensor, a measurement of at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and implementing, on a server comprising a distributed memory architecture, a sequential Monte Carlo (SMC) process using a plurality of statistically independent particles and the at least one measured parameter to estimate the true state of the physical system, wherein the distributed memory architecture has a plurality of cores, each of which is ranked. Implementing the SMC method may comprise determining the number of copies required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture. Redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process. After redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated, and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.
We also describe an architecture for estimating a true state of a physical system, the architecture comprising at least one sensor for measuring at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and a server comprising a distributed memory architecture having a plurality of cores, wherein the cores are uniquely identified by a rank. The cores are configured to implement a sequential Monte Carlo (SMC) process on the distributed memory architecture using a plurality of statistically independent particles to estimate the true state of the physical system using the at least one measured parameter. Implementing the SMC process by the cores may comprise determining the number of copies of particles required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture. Redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process. After redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.
The above method and architecture provide an efficient parallel implementation by effectively parallelising the redistribute step, which may be considered to be a constituent part of a resampling step. The SMC method may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise. The plurality of particles may be termed hypotheses or samples and may be generated at every step. The plurality of particles may be sampled from a user-defined proposal distribution such that each particle represents the probability density function (pdf) of the true state of the physical system. Each particle may be assigned a weight such that the weights provide information on which particle best describes the real state of the physical system.
The following features may apply to the method and the architecture.
The number of particles (N) may be a power of two. Similarly, the number of cores (P) may be a power of two. Each core may own n=N/P elements.
The rotational nearly sort process may be described as a process which ensures that the particles which are to be duplicated are separated from those which are not to be duplicated. The particles which are not to be duplicated are considered empty spaces (or gaps, the terms may be used interchangeably) which can be substituted with copies of the particles which are to be duplicated. Moving the particles using the rotational nearly sort process may comprise determining, for each particle to be duplicated, the number of particles in lower ranked cores which are not to be duplicated; and obtaining, for each particle to be duplicated, an associated binary expression of the number of particles which are not to be duplicated. Determining the number of particles not to be duplicated may be considered as counting the zeros that each of the particles can see in the lower ranked cores. The number of zeros may be considered as the number of positions to shift by down into the lower ranked cores. Shifting may be shifting to a lower ranked core and also where the core contains more than one particle, i.e. more than one partition, shifting to a lower partition within a core.
Moving the particles which are to be duplicated may be an iterative process. The iterative process may comprise scanning, for each of the particles to be duplicated, a bit of the associated binary expression; rotating based on the scanned bit; and repeating the scanning and rotating steps in sequence for each bit. When the scanned bit is equal to one, the associated particle may be rotated to a lower ranked core. When the scanned bit is equal to zero, the associated particle may not be rotated (or shifted; the terms, together with the related nouns shift and rotation, are used interchangeably). A least significant bit of each binary expression may be scanned first. Then each of the adjacent bits may be scanned in sequence. Finally, a most significant bit of each binary expression may be scanned.
There may be an optional leaf shift stage if the number of cores (P) is fewer than the number of particles (N). The leaf shift stage may be performed before the scanning and rotating steps. In the leaf shift stage, a core may transfer all the particles to a neighbouring core.
The goal of the rotational split phase is to fill the gaps in a way that balances the workload across the cores. This can be achieved by making enough room between the particles that have to be copied and creating the copies in these new empty spaces. While these two steps could be performed one after another, here it is shown that it is possible to perform both at the same time.
Filling the gaps using the rotational split process may comprise computing a minimum shift value for each of the particles to be duplicated and computing a maximum shift value for each of the particles to be duplicated. The minimum shift value may be a value representing the minimum number of shifts each of the particles must take to securely distance itself from the particles in the lower ranked cores; and the maximum shift value may be a value representing the maximum number of shifts each of the particles may take to end up in a gap.
Computing the minimum shift value associated with a particle to be duplicated may comprise summing for each lower ranked core one less than the number of copies which are required of each particle in the lower ranked cores. Computing the maximum shift value associated with a particle to be duplicated may comprise summing the minimum shift value with one less than the number of copies which are required of the associated particle. The calculations may be:
min_shifts_i = Σ_{j=0}^{i−1} (ncopies_j − 1) = csum_i − ncopies_i − i

max_shifts_i = min_shifts_i + ncopies_i − 1 = csum_i − i − 1
where ncopies_i is the number of copies required of the ith particle, and csum_i is the inclusive cumulative sum over ncopies up to and including the ith element. More details are provided below.
For each particle to be duplicated, an associated binary expression of the minimum shift value and the maximum shift value may be obtained. Creating the gaps may be an iterative process comprising scanning and rotating steps which are repeated. For each of the particles to be duplicated, a bit of the associated binary expression of the minimum shift value and the maximum shift value may be scanned. When both scanned bits are equal to one, the associated particle may be rotated to a higher ranked core by a number of positions corresponding to the scanned bits. When only one scanned bit is equal to zero, the number of copies to be duplicated may be split to determine excess duplicates and the excess duplicates may be rotated to a higher ranked core by a number of positions corresponding to the scanned bits. When no scanned bit is equal to one, there is no rotation. The rotation may be determined by the bit which is scanned, for example particles are rotated by decreasing power of two numbers of bit position.
The scanning may be in the opposite direction to the scanning in the rotational nearly sort process. A most significant bit of each binary expression may be scanned first. Then each of the adjacent bits may be scanned in sequence. Finally, a least significant bit of each binary expression may be scanned.
There may be an optional leaf shift stage if the number of cores (P) is fewer than the number of particles (N). The leaf shift stage may be performed after a certain number of scanning and rotating steps.
The goal of the sequential redistribute phase is to fill the remaining gaps in each core which are not filled by the rotational split phase. Filling the remaining gaps in each core using a sequential redistribute process may comprise duplicating particles within the memory of each core.
The SMC method may further comprise obtaining values for each of the particles; computing weights for each of the particles, wherein the weight indicates the resemblance of the obtained values to the true state; normalising the computed weights; determining the number of copies which are required based on the normalised weights; and, after redistributing the particles, resetting the weights. The obtaining, computing, normalising, determining, redistributing and resetting steps may be repeated at each iteration. The values for each of the particles may be obtained from the proposal distribution. This step, along with computing weights for each of the particles, may be termed importance sampling (IS). Determining the number of copies ncopies_i required for each particle i may be computed from:
ncopies_i = ⌈cdf_{i+1} − u⌉ − ⌈cdf_i − u⌉

where cdf_i is the cumulative density function of the weights up to and including the ith particle and u is drawn from a uniform distribution, u ~ U[0, 1). The bracket operator ⌈·⌉ is the ceiling function, which rounds the input number up to the next integer unless the input number is already an integer, in which case the output of the ceiling function is identical to the input.
There are various embodiments of the SMC process. For example, the SMC process may be selected from a particle filter, a fixed-lag SMC sampler and an SMC sampler. For particle filters, there may be one measurement per iteration and resampling is typically only used if needed. For SMC samplers, there may be only one measurement which is collected, and the iterations are run given the same measurement. The method finalises estimates with recycling, and resampling is used only if needed. For fixed-lag SMC, a history of l+1 measurements may be used at each iteration (where l is the lag). The method samples and resamples trajectories of l+1 particles.
For a particle filter, the weight may be calculated from:

w_t^i = w_{t−1}^i · p(x_t^i, Y_t | x_{t−1}^i) / q(x_t^i | x_{t−1}^i, Y_t)

where w_t^i and w_{t−1}^i are the weights for time steps t and t−1, x_t^i and x_{t−1}^i are the values of the particles for time steps t and t−1, p(x_t^i, Y_t | x_{t−1}^i) is the incremental posterior, commonly called the target or posterior distribution, q() is the proposal distribution defined by the user, and Y_t is a measurement of the system collected at time t. For an SMC sampler, the weight may be calculated from:
w_t^i = w_{t−1}^i · p(x_t^i | Y_t) L(x_{t−1}^i | x_t^i) / (p(x_{t−1}^i | Y_t) q(x_t^i | x_{t−1}^i, Y_t))

where w_t^i and w_{t−1}^i are the weights for iterations t and t−1, x_t^i and x_{t−1}^i are the values of the particles for iterations t and t−1, p(x_t^i | Y_t) is the pdf of the state, q() is the proposal distribution defined by the user, Y_t is the measurement of the system used in every iteration and L() is a user-defined backward kernel.
For a fixed-lag SMC sampler, a new measurement may be collected from the at least one sensor but the previous l measurements may also be used. The weight may be calculated from:
where wti and wt−1i are the weights for time steps t and t−1,
We also describe a (non-transitory) computer readable medium carrying processor control code which when implemented in a system causes the system to carry out the method described above.
Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims.
For a better understanding of the present techniques, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which:
The client layer 300 may comprise any suitable client device, e.g. a laptop or desktop PC, which may monitor the estimate f_t of the state. The client layer 300 is connected both to the sensors, e.g. via an antenna or other wireless connection, and to the server layer 400. The server layer 400 may be remote from the client layer 300 (i.e. in a different location). As shown, the client layer may request the new measurements from the sensor layer 200 and receive the new measurements from the sensor layer 200. The client layer 300 may then send the received measurements to the server layer 400 together with a model of the physical system 100. In other words, the client layer has two roles: the first is to request and collect the measurements and pass them to the server; the second is to receive a new estimate of the state f_t from the server such that it can be presented to the user. The user may be an engineer who designs the model, i.e. a source code describing X_t, Y_t, the proposal, the target, the initial distributions and the backward kernel L().
The server layer 400 may be a supercomputer on which parallel methods may be installed and run. The server layer 400 may be a high performance computing machine comprising hardware cores (or processors; the terms may be used interchangeably) connected via Ethernet cables or InfiniBand. In particular, the server layer 400 may run an SMC method as described in more detail below. At each iteration, the server waits for the client to send a new measurement and then runs a full SMC iteration (i.e. IS, normalise, ESS, MVR, redistribute, reset and estimate in sequence) to produce a new estimate f_t and send it back to the client. All parallelisation methods (including the one for redistribute which is the focus of the method described below) are installed in the server, since they parallelise tasks which are fully compatible with any model.
The client layer must also send the model to the server and compile it with the rest of the SMC method components to generate the executable file. A job request may then be submitted to the server such that the process can start. For example, a job request may be done by submitting a bash file, containing the path to the executable file, the number of cores and the memory to be allocated and other required inputs such as N, N*, TSMC, l and variant which are described in more detail below.
Further details of the server layer are shown in
The main advantages of a DMA relative to an SMA include larger memory and number of cores, scalable memory and computation capability with the number of cores and a guarantee of there being no interference when a core accesses its own memory. The main disadvantages are the cost of communication and the consequent data movement. This may affect the speed-up relative to a single core.
In order to implement the methods (or algorithms, the terms are used interchangeably) discussed below, a Message Passing Interface (MPI) may be used. MPI is one of the most common application programming interfaces (APIs) for DMAs. In this model, the total number of cores is represented by P and each core is uniquely identified by a rank p=0, 1, . . . , P−1. The cores are connected via communicators and use send or receive communication routines to exchange messages.
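By way of illustration only, this model might be sketched with the mpi4py bindings (an assumption; the equivalent C calls are MPI_Comm_size, MPI_Comm_rank and MPI_Sendrecv):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P = comm.Get_size()        # total number of cores
    p = comm.Get_rank()        # this core's unique rank, 0 <= p <= P-1

    # Cores exchange data only via explicit messages, e.g. with a partner:
    partner = p ^ 1            # pair cores (0,1), (2,3), ... for illustration
    msg = comm.sendrecv(('hello from', p), dest=partner, source=partner)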
In this arrangement, the SMC module 40 may be distributed across multiple components. The SMC module implements a sequential Monte Carlo method and may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise. There are a range of different implementations of SMC methods, including sequential importance resampling (SIR) particle filters. SMC methods apply the importance sampling principle to make Bayesian inferences.
The general idea of SMC methods consists of generating N statistically independent hypotheses called particles or samples at every step t. The population of particles (or samples) x_t is sampled from a user-defined proposal distribution q(x_t^i | x_{t−1}^i, Y_t) such that x_t represents the probability density function (pdf) of the state of a model. Each particle x_t^i is then assigned an unnormalised weight w_t^i such that the array of weights provides information on which particle best describes the real state of interest. Thus, we have:
x_t ∈ ℝ^{N×M}

w_t^i = f(w_{t−1}^i, x_t^i, x_{t−1}^i, Y_t)

w_t ∈ ℝ^N
As explained below, the methods implemented on this system use the divide-and-conquer paradigm and therefore the number of cores is always a power of two, to balance the communication between them. For the same reason, the number of particles N will also always be a power of two, such that the particles x_t and every array related to them (e.g. the weights w_t) can be spread equally amongst the cores. This means that every core will always own exactly n = N/P elements of all the listed arrays, whose indexes are spread over the cores in increasing order. More precisely, given a certain N, P pair, the ith particle (where 0 ≤ i ≤ N−1) will always be given to the same core with MPI rank p = int(i/n). The space complexity is then O(1) when P = N.
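For example, this static mapping of particle indexes to ranks can be computed directly; the short sketch below (illustrative only, under the stated power-of-two assumptions) shows the layout for N = 16 particles and P = 4 cores:

    # Sketch: static mapping of particle indexes to MPI ranks (N, P powers of two).
    N, P = 16, 4
    n = N // P                           # each core owns n = N/P elements
    owner = [i // n for i in range(N)]   # rank p = int(i/n) for the ith particle
    # owner == [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]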
At the initial iteration t = 0, no measurement has been collected yet, so the weights may be initialised by setting each weight to the same value. The prior data Y_0 may be given, and the initial values for the particles may be drawn from the initial distribution because this is the best assumption without feedback; thus we have
where Xt is the current true state of the system, M is the dimension of the state, Yt is a measurement of the system sent to the server at iteration t, D is the dimension of the measurement, w0i is the weight at iteration t=0 to be applied to each particle x0i, N is the number of particles, q0( ) is the initial proposal distribution defined by the user.
Once the initial values are obtained and for any iteration t>0 measurements may be collected and the values for the particles may be drawn from the proposal distribution (step S106), for example as defined in equation (1). This phase along with the weighting procedure in equation (2) may be termed importance sampling (IS).
x_t^i ~ q(x_t^i | x_{t−1}^i, Y_t)  (1)
The weights may then be computed as a function of x_t^i, x_{t−1}^i, w_{t−1}^i and Y_t, and normalised (steps S108 and S110), for example as defined in equations (2) and (3).
where x_t^i is a particle at iteration t, q is the proposal distribution defined by the user, Y_t is a measurement of the system, w_t^i is the weight of the ith particle at iteration t, w̃_t^i is the normalised weight of the ith particle at iteration t and N is the number of particles. Equation (3) thus represents the normalise step and is used because the weights need to sum to 1.0 (i.e. 100% of all probabilities), just like any pdf.
The particles may be subject to a phenomenon called degeneracy, which (within a few iterations) makes all weights but one decrease towards 0. This is because the variance of the weights is proven to increase at every iteration, as described in [24]. This is addressed by a resampling phase, which may be considered to repopulate the particles by eliminating the most negligible ones and duplicating the most important ones. This resampling phase may not always be triggered; for example, it may only be triggered when needed, e.g. when the (approximate) effective sample size N_eff is below a threshold N* (e.g. N/2). There is thus a decision at step S111 as to whether resampling is required, and this requires a computation of the effective sample size (ESS), which is defined by

N_eff = 1 / Σ_{i=0}^{N−1} (w̃_t^i)²  (4)
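A minimal sketch of this decision (assuming numpy; the function name is illustrative):

    import numpy as np

    # Sketch: ESS-based resampling decision, using Neff = 1 / sum(w_i^2)
    # over the normalised weights w_norm (which sum to 1.0).
    def needs_resampling(w_norm, N_star):
        n_eff = 1.0 / np.sum(w_norm ** 2)
        return n_eff < N_star              # resample only when ESS < N*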
In some cases, the user may purposely decide to always perform resampling. This can be done by setting N* = N + 1. In this case, SMC is typically found in the literature under the name of Bootstrap Filter. Several (biased or unbiased) resampling schemes exist [25], [26] but they all repopulate the particles in three steps.
When it is determined that resampling is required, the first step (step S112) of the resampling may be to define the number of copies of the ith particle which are required (ncopiesi). Various techniques may be used to generate ncopiesi but regardless of the method of generation, the following statements are true:
Σ_{i=0}^{N−1} ncopies_i = N  (5)

ncopies ∈ ℕ^N

E[ncopies_i] = N·w̃_t^i

0 ≤ ncopies_i ≤ N, ∀i
One method of performing this step is Systematic Resampling, as described in [25], which is also termed Minimum Variance Resampling (MVR) in several referenced works, e.g. [12], [19] and [21]-[23]. The key idea of MVR is to first compute the cumulative density function (CDF) of w̃_t, then draw u ~ U[0, 1) from a uniform distribution and finally compute each ncopies_i as follows:
ncopies_i = ⌈cdf_{i+1} − u⌉ − ⌈cdf_i − u⌉  (6)
The bracket operator is the ceiling function which rounds up the input number to the next integer, unless the input number is an integer already.
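A sequential sketch of MVR under these definitions is given below; note that the CDF is scaled so that its last entry equals N, which guarantees that the copies sum to N (this scaling is an assumption of the sketch, consistent with equation (5)):

    import numpy as np

    # Sketch: Minimum Variance (systematic) Resampling, equation (6).
    def mvr_ncopies(w_norm, rng=np.random.default_rng()):
        N = len(w_norm)
        cdf = np.zeros(N + 1)
        cdf[1:] = N * np.cumsum(w_norm)    # CDF scaled so that cdf[N] = N
        u = rng.uniform(0.0, 1.0)          # single draw u ~ U[0, 1)
        return (np.ceil(cdf[1:] - u) - np.ceil(cdf[:-1] - u)).astype(int)

For example, w_norm = [0.5, 0.25, 0.25] with u = 0.3 yields ncopies = [2, 0, 1], which sums to N = 3.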
Once the number of copies is determined, the second step of resampling is a redistribute step (S114) in which each particle x_t^i is duplicated as many times as ncopies_i. The redistribute step is described in more detail below. The final step of resampling is to reset all the weights to the initial value (i.e. 1/N).
After resampling, every iteration in the SMC method is completed by making an estimate f_t of the true state X_t by computing the (weighted) mean of the particles x_t as follows:

f_t = Σ_{i=0}^{N−1} w̃_t^i x_t^i  (7)
This procedure goes on for T_SMC SMC iterations.
There are various embodiments of SMC which may be used, and three variants are described in detail below. The main difference between each variant is the equation defining the weights—equation (2) above.
Particle filters (PFs) are SMC methods that work well on dynamic models, which are typically used in real-time contexts where the true state changes frequently. In the PF, the SMC iterations are commonly called time steps. A new measurement Y_t is available at each time step and thus the client and sensor layers of
In this SMC variant the weight equation is:
where the p() term, commonly called the target or posterior distribution, is computed as the product of p(x_t^i | x_{t−1}^i), the prior distribution, and p(Y_t | x_t^i), the likelihood distribution. Thus, the weight equation may be expressed as:
where x_t^i and x_{t−1}^i are particles at times t and t−1, q() is the proposal distribution defined by the user, Y_t is a measurement of the dynamic system collected at time t, and w_t^i and w_{t−1}^i are the weights of the ith particle at times t and t−1.
Particle filters typically work well when the sensor layer always provides low-latency measurements. However, that is not always possible due to time delays in the external environment. Fixed-lag (FL) SMC methods, such as described in [29], are usually used rather than particle filters in this scenario. In FL SMC, a new measurement Y_t is collected from the sensors and sent to the server. However, the server will also keep in memory the previous l measurements Y_{t−l:t−1}, where l is commonly called the lag. Hence l+1 measurements Y_{t−l:t} are considered every time, such that the filter can retrospectively reprocess the previous measurements in light of the most recently received one. Therefore, instead of sampling N particles x_t^i, FL SMC samples N trajectories of l+1 particles x_{t−l:t}^i from the old trajectories as follows:
x_{t−l:t}^i ~ q(x_{t−l:t}^i | …)

where
Particle filters may be intuitively viewed as FL SMC with l = 0. Therefore, the equation above can be substituted into the generic algorithm which is installed in the server. Depending on the chosen SMC variant, l is automatically set to 0 to work in a general-purpose way. The particle trajectories are weighted by the following equation, which defines the weights:
where L(
SMC samplers as described in [30] work well on static models, i.e. scenarios where the true state is assumed to be constant for a considerably long time, such that very frequent measurements (e.g. every few seconds, as in real-time applications) would not lead to a significantly different state estimate. SMC samplers therefore collect a single measurement at t = 1. During the SMC iterations, the same data is re-proposed to the server by the client and each new estimate f_t is simply a more accurate state estimate than that of the previous SMC step. In other words, one can think of SMC samplers as particle filters which are specialised in maximising the state estimation accuracy given a single measurement; this requires a large amount of computation and hence is only practical if the application is off-line. Here the weight equation is calculated as follows:
Once again, the p() terms can also be expressed as likelihood-prior products as for the PF, but this is omitted here for brevity. After the SMC iterations, SMC samplers optimise each state estimate f_t such that, at the tth iteration, the output takes the following value:
where the normalisation constants cτ may be computed as follows:
This optimisation technique is called recycling and was first proposed in the context of SMC samplers in [31]. To be efficient, this operation is computed recursively at each SMC iteration, i.e. by updating its numerator, nsum, and denominator, dsum, step by step; otherwise it would require the cores to hold a large amount of state information.
SMC samplers find application in several domains, such as parameter tuning for system design or medical applications for disease diagnosis [32], [33], where the state is assumed to be constant on the testing day. The description below in relation to
The following is an example of the pseudo code for an algorithm which may be used to implement the SMC method described in
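That listing is not reproduced here; purely by way of illustration, one SMC run might be sketched as follows (the helper names model.*, mvr_ncopies (defined above) and redistribute are hypothetical placeholders for the components described below, not the actual module's API):

    import numpy as np

    # Hypothetical sketch of one SMC run: importance sampling, normalise,
    # ESS test, resampling (MVR + redistribute + weight reset) and estimate.
    def smc(T_smc, N, N_star, model):
        x, w = model.initialise(N)              # draw x0, set w0 = 1/N
        for t in range(1, T_smc + 1):
            y = model.get_measurement(t)        # from the client/sensor layers
            x = model.propose(x, y)             # IS: sample from the proposal q
            w = model.weight(w, x, y)           # IS: unnormalised weights, eq. (2)
            w = w / w.sum()                     # normalise, eq. (3)
            if 1.0 / np.sum(w ** 2) < N_star:   # ESS below threshold N*?
                ncopies = mvr_ncopies(w)        # MVR, eq. (6)
                x = redistribute(x, ncopies)    # e.g. the RoSS method below
                w[:] = 1.0 / N                  # reset weights
            yield (w[:, None] * x).sum(axis=0)  # estimate f_t (weighted mean)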
After an initialise step, each iteration produces a new state estimate given one or more measurements. Each iteration requires the following steps (or components): importance sampling (IS), normalise, ESS, resampling and estimate (followed by recycling in the case of SMC samplers). As explained above, particle filters (PF), SMC samplers and fixed-lag SMC are examples of SMC methods. For particle filters, there is one measurement per iteration and resampling is typically only used if needed. For SMC samplers, only one measurement is collected, and the iterations are run given the same measurement. The method finalises estimates with recycling, and resampling is used only if needed. For fixed-lag SMC, a history of l+1 measurements is used at each iteration. The method samples and resamples trajectories of l+1 particles. For each of particle filters, SMC samplers and fixed-lag SMC there is a variant in which resampling is always used, and this is called the Bootstrap Filter.
Each of the components of the SMC module may be parallelised, as explained below, by one of four parallelisation methods: embarrassingly parallel, reduction, cumulative sum and fully-balanced redistribute (such as the new RoSS method). In the process described above, the steps reset and initialise, equations (6), (9) and all variants of (2) are parallelisable in an embarrassingly parallel fashion because the cores can work simultaneously and independently on each ith element of these operations. They are thus element-wise operations and achieve O(1) time complexity for P = N cores. The steps represented by equations (3), (4), (7) and (13) require a sum and can be easily parallelised by using reduction. The time complexity of any reduction operation scales logarithmically with the number of cores, more precisely as defined below:
O(N/P + log2 P)
On MPI, reduction can be computed by calling MPI_Reduce or MPI_Allreduce. An example of a reduction to compute sum for N=8 elements and P=4 MPI cores is shown in
Calculating the cumulative density function (CDF) of the weights requires cumulative sum which is also termed prefix sum or scan in the literature. Parallel cumulative sum scales as above. On MPI, parallel cumulative sum can be computed by calling MPI_Scan or MPI_Exscan as described for example in [34]. An example for N=8 elements and P=4 MPI cores is shown in
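For example, both collectives might be sketched with the mpi4py bindings (an illustrative assumption; the equivalent C calls are MPI_Allreduce and MPI_Exscan):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    local = np.full(4, comm.Get_rank() + 1.0)  # stand-in for n = N/P local weights

    # Reduction: global sum of all weights, delivered to every core.
    total = comm.allreduce(local.sum(), op=MPI.SUM)

    # Cumulative sum: an exclusive scan over the per-core partial sums gives
    # this core's offset; adding the local inclusive cumsum yields the entries
    # of the global CDF owned by this core.
    offset = comm.exscan(local.sum(), op=MPI.SUM)
    offset = 0.0 if offset is None else offset  # rank 0 receives no value
    local_cdf = offset + np.cumsum(local)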
The redistribute step used in the process above and described below in detail is a new fully-balanced and deterministic redistribute step. By contrast, some known techniques which are used in the redistribute step are impossible to parallelise in an element-wise fashion. For example, a textbook implementation of redistribute is described as sequential redistribute (S-R) and the algorithm is shown in the following pseudo-code:
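The listing is not reproduced here; a minimal sketch of S-R (assuming numpy arrays, with illustrative names) is:

    import numpy as np

    # Sketch: textbook sequential redistribute (S-R), O(N) on a single core.
    # Writes ncopies[i] duplicates of particle x[i] into the output, in order.
    def sequential_redistribute(x, ncopies):
        out = np.empty_like(x)
        j = 0
        for i in range(len(ncopies)):
            for _ in range(ncopies[i]):   # trip count varies randomly per i,
                out[j] = x[i]             # which is why an element-wise
                j += 1                    # parallelisation is unbalanced
        return out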
It is impossible simply to divide the iteration space equally across the cores because each ncopies_i changes randomly at every iteration t and may be equal to any integer between 0 and N. Therefore, an element-wise parallelisation would be extremely unbalanced. On DMAs, parallelisation is even more problematic because the cores can only directly access their own private memory.
The fully-balanced redistribute step of the present techniques has a perfectly balanced network: a fully-balanced redistribute step is one in which the cores deterministically perform the same computation and send/receive the same number of messages. Since the result is exactly the same as S-R, a resampling step using the fully-balanced redistribute process described below provides an exact result. An alternative way of parallelising the redistribute step is a central unit approach. In this approach, the central unit(s) may be overworked in comparison with the remaining cores because they perform extra duties, such as decision making, particle duplication, particle routing to the other cores or a combination of these. Some central unit approaches may sacrifice accuracy to achieve faster communication between the cores [20], or they might be non-deterministic [14], [15], which potentially translates to little or no scalability when, in the worst case, the workload is centralised to a single core [19].
As shown below, the redistribution in the new method happens in logarithmic time complexity by using a three-phase approach in which the first two phases take O(log2 N) steps. After that, S-R can be performed locally in constant time, as explained below.
Before the redistribute step, the values of each of the eight particles x0 to x7 are distributed across the cores in order, i.e. the lowest ranked core p=0 has the values of the two lowest numbered particles x0 and x1. The number of copies of each particle (ncopies_i) is determined as described above and is stored with the value of the particle. As shown, zero copies of particles x0, x1, x2, x5 and x7 are required, together with 1 copy of particle x3, 5 copies of particle x4, and 2 copies of particle x6. In the first phase of the redistribute step, all particles which must be duplicated are moved to the left (i.e. moved to cores having lower ranking) using a technique termed rotational nearly sort. This also automatically creates empty spaces, or gaps, on the higher ranked cores, that can be filled with particles to duplicate. In other words, the purpose of this phase is to ensure that the particles which are to be duplicated are separated from those which are not to be duplicated. As shown in
The next phase of the redistribute step is to fill the empty spaces on the higher ranked cores with the required number of copies. However, instead of directly filling the gaps, the next phase is termed rotational split and creates room to the right (i.e. in the next highest core) of each particle value to duplicate and fills the gaps at the same time.
The final phase, illustrated in
In other words, at every iteration, the bits will be scanned and rotations to the left will be performed if and only if the scanned bit is 1. For the example of particle x3, at the first iteration, the bit is 1 and thus there is rotation to the left by one position because this is the first iteration. In the second iteration, the relevant bit is also equal to 1 and thus again there is rotation but in this case by two positions because this is the second bit (and second iteration). In the third (and final) iteration, the relevant bit is 0 and thus there is no rotation. Since all particles shift to the left and follow an LSB to MSB order strategy, it is impossible that two consecutive particles for which ncopiesi>0 will collide, because during the rotations, they will at most catch up with each other. For example, particles x4 and x6 are separated by a zero and must rotate by three and four positions respectively. However, by the time x6 has to rotate by four, x4 will have already rotated by three positions.
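Purely as an illustration of these LSB-to-MSB rotations, the sketch below simulates them on a single Python list (the MPI implementation performs the same moves as power-of-two rotations between cores; the no-collision property discussed above is what makes the in-place swaps safe):

    # Sketch: single-array simulation of rotational nearly sort.
    def rotational_nearly_sort_sim(x, ncopies):
        N = len(x)
        # shifts[i] = number of zero-copy entries below i, i.e. how far the
        # particle at i (if ncopies[i] > 0) must move down.
        shifts, zeros = [0] * N, 0
        for i in range(N):
            shifts[i] = zeros
            if ncopies[i] == 0:
                zeros += 1
        step = 1
        while step < N:                    # scan bits from LSB to MSB
            for i in range(N):             # ascending order: no collisions
                if ncopies[i] > 0 and (shifts[i] & step):
                    j = i - step           # rotate left by 2^k positions
                    x[i], x[j] = x[j], x[i]
                    ncopies[i], ncopies[j] = ncopies[j], ncopies[i]
                    shifts[i], shifts[j] = shifts[j], shifts[i]
            step <<= 1
        return x, ncopies

    # Example from the description:
    # rotational_nearly_sort_sim(['x0','x1','x2','x3','x4','x5','x6','x7'],
    #                            [0, 0, 0, 1, 5, 0, 2, 0])
    # -> (['x3', 'x4', 'x6', ...], [1, 5, 2, 0, 0, 0, 0, 0])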
The following is an example of an algorithm which may be used to implement the rotational nearly sort process described in
Such a routine only takes O(N/P) iterations. As shown in
It is noted that, at this point, it is possible to prove that the particles within core p must shift to the left by as many positions as the number of zero elements in ncopies owned by the cores with a lower rank. If zeros is the array which counts the number of ncopies_i = 0 within each core, each element of shifts can be initialised as follows:
shifts_p = Σ_{p̃=0}^{p−1} zeros_{p̃}  (14)

zeros ∈ ℕ^P

shifts ∈ ℕ^P

0 ≤ zeros_i ≤ N, ∀i

0 ≤ shifts_i ≤ N, ∀i
where shifts is the array to keep track of the remaining shifts, zeros is the array which counts the zeros within each core and p is the number of the core.
The next stage shown in
csum_j ← csum_{j−1} + array_j, ∀j = 1, 2, …
As we can see, for core p=0 and j=0 within the core, the array has a value of 2 and thus the value for the first location in csum is also 2. All the other values in the array are 0 and thus the value for each entry in csum is 2. In the next step, the final value from csum is duplicated to form {coreSum, sum}. The cores are partnered and sum is exchanged with the partner and the received value is added (as in the reduction operation). For the higher ranked partner, the received value is also added to the value of coreSum. In the final stage:
Returning to
If P<N, as shown in
Examples of the bitwise & are shown below. In the first example, N = 64 and P = 8 such that log2(N/P) = 3. In this example, we wish to mask the log2(N/P) LSBs of a number. If the input number is 43, this is written in binary as (101011)₂. The mask is equal to N/P − 1, i.e. 7, which in binary is (000111)₂. The bitwise & result is thus (000011)₂. In the second example, we select the δth bit of a number for δ = 1, 2, …, log2 N. In this example, δ = 4 and N = 64. The input number is again 43, i.e. (101011)₂. The mask is equal to 2^{δ−1}, i.e. 8, which in binary is (001000)₂. The bitwise & result is thus (001000)₂. In both examples, the & operation of two bits equals 1 only if both bits are 1; otherwise it equals 0.
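The same two maskings in code (a trivial sketch):

    # The two bitwise & examples above.
    N, P = 64, 8
    v = 43                         # (101011) in binary

    low3 = v & (N // P - 1)        # keep the log2(N/P) = 3 LSBs -> 3, (000011)
    delta = 4
    bit4 = v & (1 << (delta - 1))  # select the 4th bit -> 8, (001000)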
After the optional leaf stage, the actual tree-structure routine can start. At every kth stage of the tree (k = 1, 2, …, log2 P), any core p will send to its partner p − 2^{k−1} all its particles (i.e. x and ncopies) and shifts_p − (N/P)·2^{k−1} (i.e. the number of remaining shifts after the current rotation), if and only if:

(shifts_p & ((N/P)·2^{k−1})) > 0

This corresponds to checking a new bit of shifts_p, precisely the one which is significant to this stage. At every stage, the particles will then shift by an increasing power-of-two number of positions and shifts is updated in O(1). Thus, as shown in
ncopies = [λ_0, λ_1, …, λ_{m−1}, 0, …, 0]  (15)
The minimum number of shifts may be calculated (step S300) by summing all the (ncopies_j − 1) terms to the left of the ith particle. Thus, for the example in
The next steps of the method are to express the minimum and maximum as binary numbers (steps S304, S306). Thus, the minimum and maximum for the particle in position i = 2 are expressed as (100)₂ and (101)₂, respectively. The minimum and maximum for the particle in position i = 1 are expressed as (000)₂ and (100)₂, respectively. The binary numbers are then scanned, and rotations are based on the scanned bits as in the rotational nearly sort process. However, in the rotational split process, the binary numbers are scanned in the opposite direction. In other words, progressively smaller power-of-two rotations to the right are carried out depending on the MSB of the binary notation for the minimum and maximum shifts. This is because the particles are to be scattered rather than gathered together. For each particle, three scenarios may occur: none of the copies must move; all of them must rotate; or only some must split and shift while the remaining ones keep still.
The scanning and moving may be implemented as shown in steps S310 onwards. For example, there is a determination as to whether one or both of the scanned bits in each of the binary numbers is 1 (step S310). The particle does not move if neither scanned bit is 1 (step S314). If at least one scanned bit is 1, there is a determination as to whether both bits are 1 (step S312). If both scanned bits are 1, all copies are moved to the right by the number of positions corresponding to the bit which has been scanned (step S316). For example, for the particle in position i = 2, the MSB of both the maximum and minimum shifts expressed in binary notation is 1 and therefore all copies of x2 are misplaced and must rotate by the number of positions corresponding to the bit which has been scanned, e.g. 2² = 4.
If only one of the scanned bits is 1, some copies must be split and rotated while the others must stay in place. The number of copies to be split is equal to how many duplicates of x_i are in excess and should stay at least as many positions ahead as the rotations related to the scanned bits. The number of excess duplicates may be obtained (step S318) by subtracting from the sum of all ncopies up to x_i the new position after this round of rotation (equal to i plus the rotations to perform). For example, for the particle in position i = 1, the MSB of the maximum and minimum shifts are different and the number of excess duplicates is ncopies_0 + ncopies_1 − (i + 4) = 1. The excess duplicates are then split and rotated (step S320). In this example, the single excess duplicate is copied and rotated by 4 positions and the other copies are maintained in place.
Regardless of whether all, some or no copies are moved, if there are more bits as determined in step S322, the next bit is scanned and steps S310 to S322 are repeated until there are no more bits. As explained in more detail below, it is necessary to use a final process, after rotational split, in order to redistribute within each MPI core.
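To make this concrete, the sketch below computes, for a nearly-sorted ncopies array, the slot that each copy must finally occupy; the rotational split reaches exactly these slots through its power-of-two rotations and splits (this is an illustrative single-array calculation, not the MPI implementation):

    # Sketch: final slots targeted by rotational split, computed directly.
    # After nearly sort, the copies of the particle at index i must occupy
    # positions i + min_shifts[i] .. i + max_shifts[i].
    def rotational_split_targets(ncopies):
        N = len(ncopies)
        out, acc = [None] * N, 0
        for i in range(N):
            csum_i = acc + ncopies[i]          # inclusive cumulative sum
            if ncopies[i] > 0:
                lo = csum_i - ncopies[i]       # = i + min_shifts[i]
                hi = csum_i - 1                # = i + max_shifts[i]
                for j in range(lo, hi + 1):
                    out[j] = i                 # slot j takes a copy of x_i
            acc = csum_i
        return out

    # Example from the description: ncopies = [1, 5, 2, 0, 0, 0, 0, 0]
    # -> [0, 1, 1, 1, 1, 1, 2, 2], i.e. the five copies of the particle at
    # index 1 fill slots 1..5 and the two copies of index 2 fill slots 6..7.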
The following is an example of an algorithm which may be used to implement the rotational split process described in
which is essentially equation (5) above applied locally. This can be achieved by making enough room between the particles that have to be copied and creating the copies in these new empty spaces. While these two steps could be performed one after another, in this section it is shown that it is possible to perform both at the same time in O(log2 N). For clarity, we believe it is more intuitive to illustrate how to achieve the first step alone and then how to extend that to accomplish the result of equation (18).
As set out above, after the rotational nearly sort phase, ncopies has the shape shown in equation (15). Therefore, making enough room between the particles that have to be copied easily translates to shifting the particles to the right until ncopies has the new shape:
ncopies = [λ_0, 0, …, 0, λ_1, 0, …, 0, λ_{m−1}, 0, …, 0]  (19)

where each ncopies_i > 0 is followed by ncopies_i − 1 zeros.
As shown in
The value of csum_i can be used to calculate the minimum and maximum number of shifts (steps S300 and S302 of
min_shifts_i = Σ_{j=0}^{i−1} (ncopies_j − 1) = csum_i − ncopies_i − i  (20)

max_shifts_i = min_shifts_i + ncopies_i − 1 = csum_i − i − 1  (21)

copies_to_send_i = csum_i − i − N·2^{−k}  (22)
To achieve the new shape for ncopies in equation (19), we consider each min_shifts_i in binary notation and scan the bits from the MSB to the LSB. The particles are rotated to the right by decreasing power-of-two numbers of bit positions, if and only if the MSB of min_shifts_i is 1. This corresponds to using a top-down binary tree structure of height O(log2 N). At each stage k of the tree (for k = 1, 2, …, log2 P), any core with rank p sends particles N·2^{−k} positions ahead to its partner with rank
By repeating these steps recursively, ncopies will have the shape in (19) but not (18). That is because, for each x_i that must be copied more than once, we are rotating all its copies by the same number of bit positions and are not considering that some of these copies could be split and rotated further to ensure that ncopies has the shape of (18). This can be achieved by filling some of the gaps that a strategy ensuring ncopies has the shape of (19) alone would create.
Therefore, as shown in
However, if only the MSB of max_shifts_i is equal to 1, copies_to_send_i < ncopies_i. In this case, the number of copies to split is equal to how many of them must be placed from position i + N·2^{−k} onwards to achieve perfect workload balance. This is equivalent to computing how many copies cause csum_i > i + N·2^{−k}. Therefore, the number of excess duplicates to send, also termed copies_to_send (step S318 of
Considering the example in
In case P < N, after log2 P stages, each core p will perform a leaf stage. Here, the log2(N/P) LSBs of max_shifts_i and min_shifts_i are checked at once. Inter-core shifts, or splits and shifts, are performed only if they send copies to the neighbour, i.e. if any csum_i > n(p+1), where n(p+1) is the first memory address of the neighbour's private memory. Internal shifts are also considered only if min_shifts_i > 0, which will make enough room to receive particles from the neighbour since min_shifts_i = max_shifts_{i−1} + 1 (see equations (20) and (21)). As shown in
In order to ensure logarithmic time complexity, one needs to update csum_i, max_shifts_i and min_shifts_i in O(N/P). This can be done if the cores send starter = csum_j − copies_to_send_j, where j is the index of the first particle to send having ncopies_j > 0. Because the particles are rotated by scanning the bits of max_shifts_i and min_shifts_i and by using an MSB-to-LSB strategy, it is impossible for a particle to overwrite or get past another one. Hence each core can safely treat the received starter as the sum of ncopies for the cores with lower rank and use it to re-initialise and update csum_i in O(N/P) as set out below. This strategy guarantees that csum_i is always correct only for any index i such that ncopies_i > 0, but those are the only indexes we are interested in. Once csum_i is computed, equations (20) and (21) are trivially parallelisable. The pseudocode for rotational split is found in Algorithm 6. The update for csum_i is:
csum_i = csum_{i−1} + ncopies_i
It is easy to infer that both min_shifts_i < N−1 and max_shifts_i ≤ N−1 because a particle copy could at most be shifted, or split and shifted, from the first to the last position. Therefore, because the shifts (or the splits and shifts) to perform decrease stage by stage by up to a factor of two, the achieved time complexity of rotational split is O(log2 N). Additionally, no collision can occur because each particle copy is shifted by at least min_shifts_i and by at most max_shifts_i.
The final phase, which is illustrated in
Thus, in summary, the overall redistribute step (step S114 of
The first term in (23) represents S-R which is always performed and all the steps which are called only once for any P>1, such as S-NS. The second term describes the log2P stages of the rotational nearly sort and rotational split phases during which we update, send and receive up to N/P particles. The overall algorithm may be termed rotational nearly sort and split (RoSS) redistribute and the pseudo code may be written as:
Thus, all tasks in the SMC take either O(1) or O(log2 N) time complexity, which brings the overall time complexity of SMC to O(log2 N). Even if it were possible to perform a fully-balanced redistribute in O(1), the time complexity of SMC would still be O(log2 N) due to normalise, ESS, MVR, estimate and recycling, which all require reduction or cumulative sum as set out in the table below.
The RoSS redistribute process described above could be improved in different ways. For example, a hybrid shared DMA may be used rather than a pure DMA by simply substituting the sequential parts of RoSS with their shared memory equivalents, using e.g. GPUs or mainstream CPUs. For example, S-R could be substituted with its OpenMP version described in [12] (which on SMAs also only achieves O(log2 N)). Since hybrid shared DMAs have fewer MPI nodes than pure DMAs, the number of messages will be trivially lower.
Another idea that can save messages is to check the maximum MSB across all shifts_i and all max_shifts_i at the beginning of rotational nearly sort and rotational split respectively. This allows the execution of either of the two steps to finish early and avoids sending unnecessary messages. For example, if ncopies happens to be nearly sorted already, all bits in shifts will be 0, meaning that no messages should be sent and rotational nearly sort could be skipped entirely. However, this practice is only potentially faster, as it is highly non-deterministic; we therefore strongly advise against it for real-time applications.
We also note that an m-ary tree structure with m > 2 cores per node would theoretically save MPI messages versus a binary tree structure, because the tree height would decrease from O(log2 N) to O(log_m N). However, the operations to compute per node would also increase and would be much more complicated to code, meaning that the overall run-time could be the same or worse. This is why, when it comes to arrays, binary trees are widely considered more efficient. The table below illustrates the time complexity of each task of SMC on DMAs using RoSS:
We also note that this document describes a redistribute algorithm for particles having a fixed size M. In some applications, the particle size can change at run-time, specifically across different particles and/or SMC iterations. To handle this case, one only needs first to identify the size of the biggest particle (which can be computed in O(log2 N) by using reduction) and then extend the other particles by adding extra dimensions with uninformative values, such as NaNs (i.e. not-a-numbers). After that, the particles all have the same size and can be redistributed by using any redistribute algorithm, such as the prior art variants or the novel approach described here.
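A minimal sketch of that padding step (assuming mpi4py and numpy; the function name is illustrative):

    from mpi4py import MPI
    import numpy as np

    # Sketch: pad variable-size particles to a common size M with NaNs so
    # that any fixed-size redistribute algorithm can then be applied.
    def pad_to_common_size(particles, comm=MPI.COMM_WORLD):
        local_max = max(len(p) for p in particles)
        M = comm.allreduce(local_max, op=MPI.MAX)   # O(log2 N) reduction
        return np.array([np.pad(np.asarray(p, dtype=float),
                                (0, M - len(p)), constant_values=np.nan)
                         for p in particles])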
There exist other Monte Carlo methods that require resampling (and hence redistribute). Notable examples are Bootstrap methods and Jackknife methods, where IS, normalise, ESS and recycling are not used, and resampling and estimate are always performed to respectively correct the degeneracy of the initial distribution and produce new state estimates.
The following Figures show the numerical results of redistribute first (
The table below summarises the time complexity of each task of SMC on DMAs in the single core and parallel case for B-R or N-R:
Bitonic sort, which is used in B-R, is a fast, deterministic, comparison-based parallel sorting algorithm which has recently been implemented on a cluster of graphics cards [35]. In [21], it has been shown that, by using a top-down divide-and-conquer approach, redistribute can be parallelised in a fully-balanced fashion. Starting from the root node, the key idea consists of sorting ncopies and moving the particles at every stage of a binary tree. This can be achieved by searching for a particular index, called the pivot, which perfectly divides the node into two balanced leaves. In order to find the pivot, parallel inclusive cumulative sum is performed over ncopies and then the pivot is the first index where the cumulative sum is equal to or greater than
The first term in equation (16) describes the number of steps to perform Bitonic sort locally and the second term represents the data movement to merge the keys between the cores. Since (16) converges to O((log2 N)²) for P = N cores and Bitonic sort is performed log2 P times, we can infer that the redistribute in [21] achieves O((log2 N)³) time complexity. In [20], the idea of using sort recursively is also applied to a dynamic scheduler for RPA and RNA on MPI.
As described in [22], B-R improves the algorithm in [21] by making a subtle observation: the workload can still be divided deterministically if Bitonic sort is performed only once. After this single sort, the algorithm moves on to another top-down routine in which rotational shifts move all particles on the left side of the pivot up to the left side of the node. This way the parent node is split into two balanced leaves. This routine is performed recursively log2 P times until the workload is equally distributed across the cores; then S-R is called. Rotational shifts are faster than Bitonic sort because the achieved time complexity is equal to O(log2 N) and, since Bitonic sort is performed only once, the overall time complexity is improved to O((log2 N)²) for P = N, or equal to (16) for any P ≤ N. In [22], B-R was implemented on MapReduce and, although it was significantly better than the algorithm in [21], its run-time for 512 cores was at best 20 times worse than S-R. In [19], B-R was ported to DMAs by using MPI. The results showed substantial speed-up for up to 128 cores, proving that MPI is a more suitable framework than MapReduce for performing B-R.
B-R can be further improved as described in [23] to give the algorithm termed nearly sort based redistribute (N-R). This improvement substitutes Bitonic sort with an alternative algorithm called nearly sort. It has been proven that one does not actually need to sort the particles perfectly to divide the workload deterministically; it is only necessary to guarantee that ncopies has the shape of (15), which is reproduced below:
ncopies = [λ_0, λ_1, …, λ_{m−1}, 0, …, 0]
In order to achieve this property, the particles are first nearly sorted within each core by calling sequential nearly sort (S-NS) in Algorithm 3. Then they are recursively merged by using the same sorting network as Bitonic sort. Since S-NS can be performed in linear time and the number of sent/received messages is equal to (log2 P)², we can infer that nearly sort costs the number of iterations defined below:
The cluster, Barkla, which is used is also very small in comparison with large existing systems such as Summit which can provide up to 2.4 million cores. Therefore, the gap between RoSS and N-R/B-R is likely to increase to much larger than eight and this will be realised in settings involving more than P=256 cores.
Particle filters are used to solve filtering problems of a dynamic model.
X_t is nine-dimensional and contains information about the electrode thermal boundary layer Δ, the electrode gap G, the ram position X_ram, the electrode mass M_e, the melting efficiency μ, the centreline pool depth S_c, the mid-radius pool depth S_m, the helium pressure p_he, and the current I.
where the time interval dt, the melting current I_c and the ram speed V_ram are controlled by the user. Here we use dt = 5 s, I_c = 6000 A, and V_ram is inferred by setting to 0 the expression between square brackets in (24b). The process noise terms dI, dV_ram, dμ and dhe are Gaussian with 0 mean and covariances (σ_I·dt)², (σ_V
Y_t = [G_t, X_ram,t, …] + 𝒩(0, R)  (25)

where

R = diag(σ_G², σ_X², σ_Me², …)
The particles are initialized as follows:
X0 ~ N([Δ0, Xram,0, …], Q) (26)
where
Q = diag(σΔ², σG², σX², σLC², σμ²dt, σhe²dt, σC², σM², σI²)
Tables IV, V and VI below provide all the constants and terms in (24), (25) and (26). The dynamics are chosen as the proposal such that (2) becomes:
wtⁱ = wt−1ⁱ · p(Yt | xtⁱ)
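As a hedged illustration of this update, the sketch below multiplies each weight by a Gaussian likelihood N(Yt; HXtⁱ, R) consistent with (25); the measured components (G, Xram and Me at positions 1, 2 and 3 of the state ordering given above) and the function name are assumptions of the sketch, not values from Tables IV to VI:

    import numpy as np

    def update_weights(weights, particles, y, R):
        # Bootstrap update: w_t^i = w_{t-1}^i * N(Y_t; H X_t^i, R).
        # particles: (N, 9) array ordered [Δ, G, Xram, Me, μ, Sc, Sm,
        # phe, I]; H selects the assumed measured components.
        H = np.zeros((3, 9))
        H[0, 1] = H[1, 2] = H[2, 3] = 1.0
        innov = y - particles @ H.T                 # (N, 3) innovations
        inv_R = np.linalg.inv(R)
        norm = 1.0 / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(R))
        lik = norm * np.exp(
            -0.5 * np.einsum('ni,ij,nj->n', innov, inv_R, innov))
        w = weights * lik
        return w / w.sum()                          # normalise weights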
Hydrological models are commonly used in the context of urban planning, policy making and water management. The goal is typically to estimate the daily or hourly water runoff levels in a specific and relatively small geographic area, given on-site measurements of rainfall, evapotranspiration and water flow. Historically, the significant runoff parameters were tuned by the modellers based on their experience. This approach is, however, highly prone to human error. Today, SMC methods can be used instead. More precisely, SMC samplers can be employed to provide daily or hourly estimates of the runoff parameters because the application is not real-time.
Part of the baseflow recharge gets run off too, according to another index K, called the daily recession constant. The state has the following shape:
Xt = [K, BFI, C1, …, Cv, A1, …, Av] (27)
and, therefore, has M = 2v+2 dimensions. Here, measurements of water flow at catchments Yt are provided daily or hourly and related to the state Xt by a non-linear objective function g(·) as follows:
Yt = g(Xt, ϕt) + εt (28)
where ϕt are rainfall and evapotranspiration input data, and εt is a homoscedastic Gaussian measurement error with zero mean. In this experiment, the specific application of the model is for the Bass River, a 52 km² catchment located at Loch in the South Gippsland Basin in Victoria, on the western slope of the Strzelecki Ranges. Here v=3 such that we estimate M=8 parameters.
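To make the weighting in (28) concrete, a minimal sketch follows; the stand-in g below is a placeholder (the actual rainfall-runoff model is not reproduced here), and the noise level SIGMA_EPS is an assumed value:

    import numpy as np

    SIGMA_EPS = 1.0  # assumed noise standard deviation (placeholder)

    def g(x, phi):
        # Stand-in for the actual non-linear rainfall-runoff model: it
        # should map the state x = [K, BFI, C1..Cv, A1..Av] and the
        # rainfall/evapotranspiration inputs phi to a simulated flow.
        return np.atleast_1d(np.sum(x) * phi[0])

    def log_weight(y, x, phi, sigma=SIGMA_EPS):
        # Log-likelihood of (28): Y_t = g(X_t, phi_t) + eps_t, with
        # eps_t ~ N(0, sigma^2) homoscedastic.
        r = np.atleast_1d(y) - g(x, phi)
        return (-0.5 * np.sum(r ** 2) / sigma ** 2
                - 0.5 * r.size * np.log(2 * np.pi * sigma ** 2))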
Every new run of the SMC sampler is initialised as follows:
K ~ 0.271 × Beta(51.9, 4.17) + (1 − 0.271) × Beta(255, 9.6)
A1 ~ Beta(1.4, 2.6)
A2, A3 ~ Beta(2.0, 2.5)
Cd ~ Weibull(2.16, 136cd)
where cd = {0.5, 0.75, 1.5}. In the experiment, we focus on single calls of the SMC sampler, under the same testing conditions as the VAR model. At every SMC iteration, the same measurement Yt = Y1 is resent to the server, and the particles are sampled from a Gaussian distribution and weighted using (28) as the target, with backward kernel L(·) = q(·).
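A sketch of this initialisation with NumPy follows; reading Weibull(2.16, 136cd) as shape 2.16 and scale 136cd is an assumption about the parameterisation, and BFI is omitted because its prior is not given above:

    import numpy as np

    def init_particles(n, rng=None):
        # Draw n samples from the priors above; BFI intentionally omitted.
        if rng is None:
            rng = np.random.default_rng(0)

        # K: two-component Beta mixture, weight 0.271 on the first part.
        first = rng.random(n) < 0.271
        K = np.where(first, rng.beta(51.9, 4.17, n), rng.beta(255, 9.6, n))

        A = np.column_stack([rng.beta(1.4, 2.6, n),
                             rng.beta(2.0, 2.5, n),
                             rng.beta(2.0, 2.5, n)])

        # NumPy's weibull(a) draws the unit-scale variate, so the assumed
        # scale 136*cd multiplies the draw.
        C = np.column_stack([136 * cd * rng.weibull(2.16, n)
                             for cd in (0.5, 0.75, 1.5)])
        return K, C, A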
As can be observed in
These results underline the importance of having a fast, scalable redistribution. Since modern models may be very detailed and complex (e.g. requiring a sophisticated numerical integrator), the IS step also becomes highly computationally intensive. A fast redistribute therefore allows SMC methods to maintain a near-linear speed-up for higher P, which is what we expect in theory but is rarely achievable in practice.
The findings are encouraging because, on the one hand, applications of SMC are becoming increasingly demanding in terms of accuracy and run-time and, on the other hand, modern computers and supercomputers are progressively being built with higher numbers of cores. Therefore, as time goes on, it is becoming more important to have a fast parallelisable redistribute, such that in real-world problems the SMC module can maintain a near-linear speed-up for a higher level of parallelism.
The novel fully-balanced redistribute algorithm described above applies to SMC methods running on distributed memory environments. The algorithm has been implemented on MPI. The baselines for comparison are B-R and N-R, two similar state-of-the-art fully-balanced redistribute algorithms which both achieve O((log2N)²) time complexity and whose implementations are already available on MPI. In contrast, we have shown that RoSS achieves exactly O(log2N) time complexity. The results show that RoSS is up to eight times faster than B-R and N-R for up to P=256 cores. Similar results are observed on two real-world problems: for the same level of parallelism, an SMC method using RoSS is up to three times faster than the same method using B-R/N-R and provides a maximum speed-up of 160. Under the same conditions, redistribution with RoSS is no longer the bottleneck.
At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated special-purpose hardware. Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more cores or processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements. Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others.
Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
The list of references includes:
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
Number | Date | Country | Kind
2101274.5 | Jan 2021 | GB | national

Filing Document | Filing Date | Country | Kind
PCT/GB2022/050238 | 1/28/2022 | WO