This disclosure relates to systems and methods for accelerating evolutionary algorithms using a field programmable gate array (FPGA) and graphics processing unit (GPU).
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Evolutionary computation is a subfield of artificial intelligence that has emerged as a powerful approach to tackling complex optimization problems. Inspired by natural selection, evolutionary algorithms iteratively improve candidate solutions. Each candidate solution may be considered one of a population of individuals, each representing a potential solution. Evolutionary algorithms evaluate each individual using a fitness function, which determines how well that individual solves the problem. Individuals with superior fitness have a higher chance of being selected for reproduction. This process introduces variations through mechanisms mimicking biological processes, such as recombination (combining traits from two parents) and mutation (random alterations to individual characteristics). These variations create new offspring, forming the next generation of individuals of the population. Over time, the population evolves towards increasingly optimal solutions. However, the effectiveness of evolutionary algorithms is limited by computational constraints. Evaluating large populations, especially for intricate problems, can be extremely time-consuming.
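The iterative loop described above (evaluation, selection, recombination, mutation) can be illustrated with a minimal software sketch. The bit-string encoding, fitness-proportionate selection, one-point crossover, and OneMax-style fitness function below are illustrative assumptions for the sketch, not features specified by this disclosure:

```python
import random

def evolve(pop_size, generations, genome_len, fitness, mutation_rate=0.01):
    """Generic evolutionary loop over bit-string individuals (illustrative sketch)."""
    # Population stage: random bit strings, each a candidate solution.
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]              # evaluation stage

        def pick():                                          # selection stage
            # Fitness-proportionate selection; the small offset avoids
            # a zero total weight in a degenerate population.
            return random.choices(pop, weights=[s + 1e-9 for s in scores], k=1)[0]

        next_pop = []
        while len(next_pop) < pop_size:                      # operation stage
            a, b = pick(), pick()
            cut = random.randrange(1, genome_len)            # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ (random.random() < mutation_rate) for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

Over successive generations, the returned individual's fitness tends toward the optimum, illustrating the convergence behavior described above.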
One technique that has been used is hardware acceleration using a field-programmable gate array (FPGA). One FPGA design that has been developed performs all stages (e.g., population, evaluation, selection, and operation) of the algorithm in the FPGA, reporting an average speedup of 4,902× compared to a 28-core CPU, and an average speedup of 43× compared to a GPU. While this is much better than many existing techniques, even with this design, evaluating large populations may be extremely time-consuming. Moreover, the main drawback of this design is that it may only support one particular form of evolutionary algorithm that solves a particular problem.
Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” and “the” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “one embodiment” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. Furthermore, the phrase A “based on” B is intended to mean that A is at least partially based on B. Moreover, the term “or” is intended to be inclusive (e.g., logical OR) and not exclusive (e.g., logical XOR). In other words, the phrase A “or” B is intended to mean A, B, or both A and B.
The complexity of evolutionary algorithms is primarily influenced by three factors: population size, number of generations, and the cost of evaluating each individual's fitness. Larger populations allow for a more thorough exploration of the solution space but come at the cost of increased computation per generation. Similarly, a higher number of generations allows for better convergence to an optimal solution but involves a greater number of fitness evaluations. The cost of each evaluation itself significantly impacts the overall runtime of the algorithm. Consequently, the worst-case time complexity of an evolutionary algorithm is essentially a product of these three factors. This inherent complexity raises the value of hardware acceleration. This disclosure leverages specialized hardware—the programmability of field programmable gate arrays (FPGAs) and the parallel processing of graphics processing units (GPUs)—to achieve faster exploration with convergence towards optimal solutions. Additionally, this hardware acceleration can alleviate the computational burden associated with handling larger problems.
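The product relationship among the three factors can be sketched as a back-of-envelope runtime model. The specific numbers below (1,000 generations, 10,000 individuals, 1 ms per evaluation) are illustrative assumptions, not figures from this disclosure:

```python
# Back-of-envelope runtime model for an evolutionary algorithm:
#   total_evaluations = generations * population_size
#   total_time ~= total_evaluations * cost_per_evaluation
def ea_runtime_seconds(generations, population_size, eval_cost_s):
    return generations * population_size * eval_cost_s

# e.g., 1,000 generations, 10,000 individuals, 1 ms per evaluation:
serial = ea_runtime_seconds(1_000, 10_000, 1e-3)   # ~10,000 s if evaluated serially

# Evaluating every individual of a generation in parallel divides the
# evaluation term by the number of parallel evaluators:
parallel = serial / 10_000                          # ~1 s with 10,000 parallel evaluations
```

The model makes the motivation concrete: because runtime is a product of the three factors, parallelizing the dominant factor (evaluation) yields a proportional reduction in overall runtime.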
As mentioned above, evolutionary algorithms can be split into four stages: population, evaluation, selection, and operation. Analysis suggests performing evaluation in software and the other stages in hardware for a general-purpose engine, as it enables users to write an evaluation function for their particular problem and utilize hardware acceleration for the other stages to evolve solutions to that problem. Accordingly, this disclosure involves using a GPU for the evaluation stage (e.g., to evaluate individuals of the population in parallel using separate threads) alongside an FPGA for the other stages to design a general-purpose evolutionary computation engine with maximal scope for acceleration.
In a configuration mode of the integrated circuit device 12, a designer may use an electronic device 13 (e.g., a computer including a data processing system having a processor and memory or storage) to implement high-level designs (e.g., a system user design) using design software 14 (e.g., executable instructions stored in a tangible, non-transitory, computer-readable medium such as the memory or storage of the electronic device 13), such as a version of INTEL® QUARTUS® by INTEL CORPORATION. The electronic device 13 may use the design software 14 and a compiler 16 to convert the high-level program into a lower-level description (e.g., a configuration program, a bitstream). The compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit device 12. The host 18 may receive a host program 22 that may control or be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the integrated circuit device 12 via a communications link 24 that may include, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. In some embodiments, the kernel programs 20 and the host 18 may configure programmable logic blocks (e.g., LABs 110) on the integrated circuit device 12. The programmable logic blocks (e.g., LABs 110) may include circuitry and/or other logic elements and may be configurable to implement a variety of functions in combination with digital signal processing (DSP) blocks 120.
The designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above. Further, in some embodiments, the system 10 may be implemented without a separate host program 22. Thus, embodiments described herein are intended to be illustrative and not limiting.
An illustrative embodiment of a programmable integrated circuit device 12 such as a programmable logic device (PLD) (e.g., a field programmable gate array (FPGA) device) that may be configured to implement a circuit design (also sometimes referred to as a system design) is shown in
Programmable logic circuitry of the integrated circuit device 12 may be controlled by programmable memory elements sometimes referred to as configuration random access memory (CRAM). Memory elements may be loaded with configuration data (also called programming data or a configuration bitstream) using input-output elements (IOEs) 102. Once loaded, the memory elements each provide a corresponding static control signal that controls the operation of an associated functional block (e.g., LABs 110, DSP blocks 120, RAM 130, or IOEs 102).
In one scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors in a functional block to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths. Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc.
The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only-memory (ROM) memory cells, mask-programmed and laser-programmed structures, combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration random-access memory (CRAM), or programmable memory elements. The integrated circuit device 12 (e.g., as a programmable logic device (PLD)) may be configured to implement a custom circuit design. For example, the configuration RAM may be programmed such that LABs 110, DSP blocks 120, and RAM 130, programmable interconnect circuitry (e.g., vertical channels 140 and horizontal channels 150), and the input-output elements 102 form the circuit design implementation.
In addition, the programmable logic device may have input-output elements (IOEs) 102 for driving signals off the integrated circuit device 12 and for receiving signals from other devices. Input-output elements 102 may include parallel input-output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit.
The integrated circuit device 12 may also include programmable interconnect circuitry in the form of vertical routing channels 140 (i.e., interconnects formed along a vertical axis of the integrated circuit device 12) and horizontal routing channels 150 (i.e., interconnects formed along a horizontal axis of the integrated circuit device 12), each routing channel including at least one track to route at least one wire. If desired, the interconnect circuitry may include pipeline elements or network-on-chip (NOC) features to transfer data in a packetized format.
The integrated circuit device 12 may include hardware blocks to enable the various stages of evolutionary algorithms other than evaluation, which is more efficiently performed by the GPU 186. These blocks may include a selection module 190, population storage 192, control circuitry 194, and an operation module 196. These blocks may be implemented in hardened logic or soft logic of programmable logic circuitry of the integrated circuit device 12. The selection module 190 may perform selection as in block 166 of the flowchart 160, selecting individuals of the evaluated population to be used to generate a next generation population. The population storage 192 may include memory or storage elements that contain the individuals (or references to memory outside of the integrated circuit device 12 where the individuals may be stored) that have been selected and/or operated on by the operation module 196. The control circuitry 194 may include any suitable control logic circuitry, such as a finite state machine (FSM) or a processor executing firmware (e.g., instructions stored on a tangible, non-transitory, machine readable medium executed by the processor) to control the blocks 190, 192, and 196. The operation module 196 may use the selected individuals from the previous population to generate a new population through recombination and/or mutation, thereby evolving a new set of individuals for the next population.
Note that the selection module 190, population storage 192, control circuitry 194, and operation module 196 may be programmed into the integrated circuit device 12. Indeed, even if the evaluation method of the GPU 186 changes, the selection method of the selection module 190 may remain the same since it may produce a fitness ranking of the evaluated individuals for any problem that is evaluated by the GPU 186. Similarly, the operation module 196 may simply support a certain population representation and problems can be adapted. Indeed, there may be a few popular representations that users could select to use from among a library of operation modules supporting popular representations. To support different representations and/or problems, the operations module 196 may be configured at the outset of operation or partially reconfigured during runtime.
For a broad-based evolutionary algorithm computation accelerator, the evaluation stage (e.g., block 164 of
The CPU 184 may also load the evaluation algorithm on the GPU 186 and start its processing when enough data is present. For example, all individuals 198 in the population may be loaded into memory 200 of the GPU 186 and be evaluated, which means the same evaluation function 202 may be applied to potentially thousands of individuals 198, ideally in parallel. Evaluations 204 are the result of the evaluation function 202 performed by software running on the GPU 186. The evaluations 204 indicate the fitness of each individual 198 that was evaluated.
Note that data, such as the evaluations 204, from the GPU 186 may be read by the integrated circuit device 12 through the PCIe connection 188, potentially using a same direct memory access (DMA) engine used to write data to the memory. Overall, this is a feasible (and previously demonstrated) approach for a peer-to-peer setup using writes and reads from the integrated circuit device 12 to the GPU 186 memory 200 using DMA engines over the PCIe connection 188, with the CPU 184 sharing the PCI address of the memory 200 of the GPU 186.
The evaluation function 202 for the user's problem may be run thousands of times across thousands of threads operating on different individuals 198 to produce their fitness values (e.g., evaluations 204), all in parallel; this builds on the acceleration offered by an FPGA-only solution by utilizing the flexibility and parallelism of the GPU 186 to reach previously unattainable evolution rates for a general-purpose evolutionary algorithm accelerator.
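The thread-per-individual evaluation model described above can be sketched in software. A CPU thread pool stands in for the GPU's parallel threads here; the OneMax-style placeholder fitness function and the population parameters are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def evaluation_function(individual):
    # Placeholder for the user's problem-specific fitness function
    # (here: OneMax, the count of set bits).
    return sum(individual)

def evaluate_population(population, workers=8):
    # One logical thread per individual, mirroring the GPU model in which
    # the same evaluation kernel runs over thousands of individuals
    # concurrently, each thread producing one fitness value.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluation_function, population))

population = [[random.randint(0, 1) for _ in range(64)] for _ in range(1024)]
evaluations = evaluate_population(population)   # one fitness value per individual
```

On an actual GPU, the same per-individual function would be compiled as a kernel and launched across thousands of hardware threads rather than a small pool of software threads.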
In contrast to deterministic selection algorithms that risk premature convergence in local optima, tournament selection may inject randomness into the process. Unlike other selection methods that order individuals first and then add a randomization step, tournament selection achieves this through seeding, which may be stochastically determined. As a result, even lower-performing individuals can rank higher due to their placement within the population, promoting exploration beyond readily apparent good solutions.
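Tournament selection can be sketched as follows; the tournament size and bit-string population are illustrative assumptions:

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick one winner from a randomly seeded tournament of k individuals.

    Because entrants are drawn at random, a lower-fitness individual can
    still win a weak bracket, preserving diversity in the next generation.
    """
    entrants = random.sample(range(len(population)), k)
    winner = max(entrants, key=lambda i: fitness[i])
    return population[winner]

pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(100)]
fit = [sum(ind) for ind in pop]
parents = [tournament_select(pop, fit, k=3) for _ in range(100)]
```

Raising the tournament size k increases selection pressure toward the fittest individuals; lowering it preserves more randomness and exploration.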
While the potential applications of evolutionary algorithms are broad, there are a few representations that solutions to problems are often encoded into, such as bit strings or trees. Besides the evaluation stage, which is performed in the GPU 186, the particular encoding may only affect the operation module 196, since it may support mutation and recombination operations for that particular encoding. Note that the evaluation function 202, running in software on the GPU 186, may be easily modified to support any suitable encoding for a representation regardless of how it is defined in hardware.
One example of an encoding that may undergo an evolutionary operation, such as a mutation 260, by the operations module 196 is an expression tree, as shown in
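A point mutation on an expression tree can be sketched in software as follows. The nested-tuple encoding (e.g., ("+", ("*", "x", 2), 3) representing x*2 + 3) and the constant-leaf mutation policy are illustrative assumptions, not the hardware encoding of this disclosure:

```python
import random

def mutate_tree(node, rate=0.3):
    # Operator nodes are tuples of (operator, child, child, ...);
    # leaves are variable names (strings) or integer constants.
    if isinstance(node, tuple):
        op, *children = node
        # Preserve the operator and recurse into each child subtree.
        return (op, *(mutate_tree(c, rate) for c in children))
    if isinstance(node, int) and random.random() < rate:
        return random.randint(-10, 10)   # point-mutate a constant leaf
    return node                          # variables and surviving leaves unchanged

tree = ("+", ("*", "x", 2), 3)           # encodes x*2 + 3
mutant = mutate_tree(tree)
```

The mutation preserves the tree's structure while perturbing leaf values, so the mutant remains a syntactically valid expression that can be evaluated by the same fitness function.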
Another example of an encoding that may undergo an evolutionary operation, such as recombination or mutation, by the operations module 196 is a bit string, as shown in
A random value generator 286 may randomly or pseudorandomly output selection signals. The random value generator 286 may be formed using any suitable logic circuitry (e.g., a linear feedback shift register (LFSR)). The selection signals output by the random value generator 286 may be used by select logic 288 to select bits or bit groups from either the parent bit string 280 or 282, and likewise to select bits or bit groups from either the parent bit string 282 or 284. The selection signals may also randomly select between the bits or bit groups entering a first multiplexer 294. For even more randomness, a second multiplexer 296 may select between the output of the first multiplexer 294 or the output of the random value generator 286 depending on an operation selection signal (op), which defines whether the operations module 196 is performing a recombination operation or a mutation operation for a particular bit or bit region.
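The per-bit multiplexer behavior described above can be modeled in software with bitwise masks. The 16-bit width and the specific mask semantics (mask bit 1 selects parent A; for mutation, masked positions take bits directly from the random generator) are illustrative assumptions:

```python
import random

WIDTH = 16
FULL = (1 << WIDTH) - 1   # all-ones mask for the chosen bit width

def lfsr_style_random(bits=WIDTH):
    # Software stand-in for the hardware random value generator (e.g., an LFSR).
    return random.getrandbits(bits)

def recombine(parent_a, parent_b, mask):
    # First-multiplexer behavior: each mask bit selects that bit position
    # from parent A (mask bit 1) or parent B (mask bit 0).
    return (parent_a & mask) | (parent_b & ~mask & FULL)

def mutate(individual, mask):
    # Second-multiplexer behavior: where the mask selects mutation, the bit
    # is taken directly from the random value generator instead of the parent.
    return (individual & ~mask & FULL) | (lfsr_style_random() & mask)

child = recombine(0b1111000011110000, 0b0000111100001111, lfsr_style_random())
```

In hardware, all bit positions are resolved in the same clock cycle; the bitwise operations above model that per-bit parallelism sequentially.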
In any case, the operations module 196 may perform any mutation or recombination operation that takes in the chosen individuals and performs operations on them to produce the next generation of individuals, ideally at a rate that does not allow the GPU 186 to go idle.
There may be several data transfers that happen between the integrated circuit device 12 and the GPU 186. First, the GPU 186 may have multiple threads of the evaluation function 202 evaluating multiple individuals 198, and the resultant fitness values (e.g., evaluations 204) may be stored in the memory 200 of the GPU 186. All of these values may be read and stored in registers of the integrated circuit device 12, with high throughput to make good use of the processing of the GPU 186. The next generation of individuals to be evaluated is created in the integrated circuit device 12 and stored in the memory 200 of the GPU 186, ideally at a rate such that the GPU 186 never goes idle (e.g., the GPU 186 may have the next generation of individuals 198 in its memory 200 before it finishes evaluating the current generation of individuals 198).
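The handoff described above, in which new generations are produced fast enough that the evaluator never idles, can be modeled as a two-queue producer/consumer pipeline. The queues stand in for GPU memory and the PCIe read path, the plain functions stand in for the FPGA-side modules, and the truncation-style selection and population parameters are illustrative assumptions:

```python
import queue
import random
import threading

GENERATIONS = 5
POP = 256
GENOME = 32

to_gpu = queue.Queue(maxsize=2)     # next-generation buffer (GPU memory stand-in)
from_gpu = queue.Queue(maxsize=2)   # fitness results read back (PCIe stand-in)

def gpu_evaluator():
    # Models the GPU: evaluate each queued generation (fitness = bit count here).
    for _ in range(GENERATIONS):
        generation = to_gpu.get()
        from_gpu.put([sum(ind) for ind in generation])

def fpga_pipeline():
    # Models the FPGA side: seed the first generation, then keep producing the
    # next generation from the latest results so the evaluator never idles.
    generation = [[random.randint(0, 1) for _ in range(GENOME)] for _ in range(POP)]
    to_gpu.put(generation)
    for _ in range(GENERATIONS - 1):
        fitness = from_gpu.get()
        ranked = sorted(zip(fitness, generation), key=lambda p: p[0])
        survivors = [ind for _, ind in ranked[POP // 2:]]   # keep the fitter half
        generation = survivors * 2                          # crude operation stage
        to_gpu.put(generation)
    return from_gpu.get()                                   # final generation's fitness

worker = threading.Thread(target=gpu_evaluator)
worker.start()
final_fitness = fpga_pipeline()
worker.join()
```

The bounded queues capture the double-buffering goal: as long as the selection and operation side keeps the outbound queue non-empty, the evaluator proceeds from one generation to the next without stalling.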
The system 180 discussed above may be a component included in a data processing system, such as a data processing system 500, shown in
The data processing system 500 may be part of a data center that processes a variety of different requests. For instance, the data processing system 500 may receive a data processing request via the network interface 506 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or other specialized tasks.
The techniques and methods described herein may be applied with other types of integrated circuit systems. To provide only a few examples, these may be used with central processing units (CPUs), graphics cards, hard drives, or other components.
While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims.
The techniques presented and claimed herein are referenced and applied to material objects and concrete examples of a practical nature that demonstrably improve the present technical field and, as such, are not abstract, intangible or purely theoretical. Further, if any claims appended to the end of this specification contain one or more elements designated as “means for [perform]ing [a function] . . . ” or “step for [perform]ing [a function] . . . ”, it is intended that such elements are to be interpreted under 35 U.S.C. 112(f). However, for any claims containing elements designated in any other manner, it is intended that such elements are not to be interpreted under 35 U.S.C. 112(f).
EXAMPLE EMBODIMENT 1. A system for performing an evolutionary algorithm comprising:
EXAMPLE EMBODIMENT 2. The system of example embodiment 1, wherein the integrated circuit device comprises a field programmable gate array and the circuitry configured to perform population selection and evolution operations comprises programmable logic circuitry of the field programmable gate array.
EXAMPLE EMBODIMENT 3. The system of example embodiment 1, wherein the circuitry configured to perform population selection and evolution operations comprises a selection module configured to select individuals for evolution operation based on the population evaluation of the parallel processor.
EXAMPLE EMBODIMENT 4. The system of example embodiment 3, wherein the selection module is configured to perform tournament selection to select the individuals.
EXAMPLE EMBODIMENT 5. The system of example embodiment 3, wherein the selection module is configured to select the individuals during a single clock cycle.
EXAMPLE EMBODIMENT 6. The system of example embodiment 1, wherein the circuitry configured to perform evolution operations comprises circuitry to perform a mutation of a selected individual.
EXAMPLE EMBODIMENT 7. The system of example embodiment 1, wherein the circuitry configured to perform evolution operations comprises circuitry to perform recombination based on two or more selected individuals.
EXAMPLE EMBODIMENT 8. The system of example embodiment 1, comprising a central processing unit to coordinate operations between the integrated circuit device and the parallel processing system.
EXAMPLE EMBODIMENT 9. The system of example embodiment 8, wherein the central processing unit is to program the parallel processing system with instructions to perform the population evaluation in parallel.
EXAMPLE EMBODIMENT 10. The system of example embodiment 1, wherein the integrated circuit device and the parallel processing system are disposed on separate dies.
EXAMPLE EMBODIMENT 11. A method comprising:
EXAMPLE EMBODIMENT 12. The method of example embodiment 11, comprising repeating the method using the second population of individuals.
EXAMPLE EMBODIMENT 13. The method of example embodiment 11, comprising programming the field programmable gate array with a selection module to perform the selection and an operation module to perform the evolution operation.
EXAMPLE EMBODIMENT 14. The method of example embodiment 11, comprising storing the selected individuals in a local storage of the field programmable gate array.
EXAMPLE EMBODIMENT 15. The method of example embodiment 11, wherein the performing the selection comprises performing a tournament selection.
EXAMPLE EMBODIMENT 16. The method of example embodiment 11, comprising loading the first population of individuals into memory of the graphics processing unit before the evaluation of the individuals of the first population begins.
EXAMPLE EMBODIMENT 17. An article of manufacture comprising tangible, non-transitory, machine-readable instructions that, when executed by a central processing unit, cause the central processing unit to perform operations comprising:
EXAMPLE EMBODIMENT 18. The article of manufacture of example embodiment 17, wherein the operations comprise repeating the operations using the second population of individuals.
EXAMPLE EMBODIMENT 19. The article of manufacture of example embodiment 17, wherein the operations comprise configuring the field programmable gate array with a system design with a selection module to perform selection and an operation module to perform evolution operations.
EXAMPLE EMBODIMENT 20. The article of manufacture of example embodiment 19, wherein the operations comprise partially reconfiguring the field programmable gate array with a different operation module that supports a different encoding.