This application claims priority to Chinese Patent Application No. 201910875064.9 with a filing date of Sep. 17, 2019. The content of the aforementioned application, including any intervening amendments thereto, is incorporated herein by reference.
The exemplary embodiment(s) of the present invention relates to the field of artificial intelligence (AI) technology. More specifically, the exemplary embodiment(s) of the present invention relates to a system architecture based on system on chip (SoC) field programmable gate array (FPGA) for edge artificial intelligence computing (EAIC).
With the development and wide application of AI technology, AI computing in different scenarios faces more and more challenges. The application of AI computing has gradually expanded from the cloud to the edge, such as the Internet of Things (IoT). The edge is the side close to the object or data source; for example, the edge of the IoT consists of a large number of sensors and cameras.
In the prior art, one solution uses a pure microcontroller unit (MCU) to provide a fixed hardware structure. This scheme has the advantages of small area and low power consumption and is easy to use, but its computational performance is low. Another scheme uses fully customized hardware. This scheme can meet computing-performance requirements, but it has high design cost, a long design cycle, high risk, and poor usability. A high-level synthesis platform based on FPGA can quickly implement specific algorithms on an FPGA and is easy to use, but it requires a large-scale FPGA and cannot meet the area and power requirements of edge AI applications.
Therefore, how to customize an efficient hardware structure for a specific AI algorithm and implement the whole algorithm remains a huge challenge.
The technical problem to be solved by the embodiments of the present invention is to provide a system architecture based on SoC FPGA for EAIC in view of at least one defect of the EAIC systems in the prior art. The system architecture is flexible and efficient, enabling developers of edge AI algorithms to quickly and easily implement low-cost, high-performance computing on an SoC FPGA.
In order to solve the above technical problems, this application provides a system architecture based on SoC FPGA for EAIC, including an MCU subsystem and an FPGA subsystem. The FPGA subsystem includes: an accelerator for accelerating an AI algorithm; and a shared memory used as an interface between the accelerator and the MCU subsystem, wherein the MCU subsystem is configured to upload the data to be calculated to the shared memory and to retrieve an operation result; and the accelerator is configured to read the data from the shared memory independently and to write back the operation result.
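For illustration only, the following C sketch shows the MCU-side half of this shared-memory hand-off: uploading the data to be calculated, starting the accelerator, and retrieving the operation result. All names and addresses (SHARED_MEM_BASE, ACC_CTRL, ACC_STATUS, the buffer offsets) are hypothetical placeholders; the actual register map would be fixed by the generated FPGA design.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory map of one accelerator's shared memory region.
 * The actual addresses and layout are fixed by the generated FPGA design. */
#define SHARED_MEM_BASE 0x40000000u
#define ACC_CTRL   (*(volatile uint32_t *)(SHARED_MEM_BASE + 0x00u)) /* write 1 to start  */
#define ACC_STATUS (*(volatile uint32_t *)(SHARED_MEM_BASE + 0x04u)) /* reads 1 when done */
#define ACC_INPUT  ((volatile int8_t  *)(SHARED_MEM_BASE + 0x100u))  /* input buffer      */
#define ACC_OUTPUT ((volatile int32_t *)(SHARED_MEM_BASE + 0x200u))  /* result buffer     */

/* Upload data, start the accelerator, poll for completion, retrieve the result. */
static void run_accelerator(const int8_t *in, size_t in_len,
                            int32_t *out, size_t out_len)
{
    for (size_t i = 0; i < in_len; i++)   /* upload data to shared memory    */
        ACC_INPUT[i] = in[i];

    ACC_CTRL = 1u;                        /* kick off the accelerator        */
    while (ACC_STATUS == 0u)              /* accelerator reads the data and  */
        ;                                 /* writes back the result itself   */

    for (size_t i = 0; i < out_len; i++)  /* retrieve the operation result   */
        out[i] = ACC_OUTPUT[i];
}
```

The accelerator side of the protocol is symmetric: it waits for the control bit, reads the input region independently of the MCU, and writes the result back before raising the status flag.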
In another aspect, the application also provides a compilation method for the system architecture based on SoC FPGA for EAIC, which includes: acquiring an AI model; optimizing an algorithm of the AI model to obtain an optimized algorithm; generating a customized accelerator; according to a function of the accelerator, mapping the optimized algorithm to the MCU instruction set and to operation instructions for the accelerator, and generating a software binary code; and, according to the function of the accelerator, compiling the IP core of the accelerator and the MCU by the FPGA to generate a hardware system.
In yet another aspect, the application also provides a computer-readable storage medium that stores computer instructions that enable the computer to execute the compilation method described above.
Implementing embodiments of the present invention has the following beneficial effects:
1. By making the MCU subsystem cooperate with the FPGA subsystem, the invention provides a customizable accelerator in the FPGA to accelerate the AI algorithm. On the one hand, this reduces the power consumption and area of the system; on the other hand, it ensures that the system has sufficiently high computing performance.
2. The present application realizes the interface between the accelerator and the MCU subsystem through the shared memory, guarantees that the accelerator provides a compatible and unified data path to the MCU, reduces data movement, and increases the data access speed of the accelerator.
3. In the present application, the accelerator function is added to the MCU-based AI compilation tool chain so that it can be matched and invoked during compilation, which greatly facilitates the use of the system architecture.
The exemplary embodiment(s) of the present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.
The purpose of the following detailed description is to provide an understanding of one or more embodiments of the present invention. Those of ordinary skill in the art will realize that the following detailed description is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such skilled persons having the benefit of this disclosure and/or description.
In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It will, of course, be understood that in the development of any such actual implementation, numerous implementation-specific decisions may be made in order to achieve the developer's specific goals, such as compliance with application- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be understood that such a development effort might be complex and time-consuming but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the embodiment(s) of this disclosure.
Various embodiments of the present invention illustrated in the drawings may not be drawn to scale. Rather, the dimensions of the various features may be expanded or reduced for clarity. In addition, some of the drawings may be simplified for clarity. Thus, the drawings may not depict all of the components of a given apparatus (e.g., device) or method. The same reference indicators will be used throughout the drawings and the following detailed description to refer to the same or like parts.
As used in the description herein and throughout the claims that follow, the meaning of “a”, “an”, and “the” includes plural reference unless the context clearly dictates otherwise. Moreover, titles or subtitles may be used in the specification for convenience of a reader, which shall have no influence on the scope of the present disclosure. For convenience, certain terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. Additionally, some terms used in this specification are more specifically defined below.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. It will be appreciated that the same thing can be said in more than one way. Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein.
As used herein, the terms “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to.
As used herein, “plurality” means two or more.
System Architecture
The system architecture includes an MCU subsystem 10 and an FPGA subsystem 20. A part of the resources in the FPGA subsystem 20 is used to realize the acceleration function of the AI algorithm, and this part can be customized for each algorithm application. Such a specific algorithm is implemented by the MCU and the accelerator together.
The FPGA subsystem 20 includes: an accelerator 11 for accelerating AI algorithms; and a shared memory 12 used as an interface between the accelerator 11 and the MCU subsystem 10. The MCU subsystem 10 uploads the data to be calculated to the shared memory 12 and retrieves the operation results, while the accelerator 11 reads the data independently from the shared memory 12 and writes back the operation results.
Only by offloading the key calculations to the accelerator can the system architecture of the present application achieve the maximum acceleration with the lowest power consumption and minimum area. The system architecture of the present application also takes the requirement of ease of use into account: it not only has low cost, but also provides a seamless high-level user interface for algorithm/software developers without requiring professional software tools.
Specifically, there may be one or more accelerators 11. The accelerator 11 can have a variety of flexible functions and implementations, for example, as shown in the accompanying drawings.
The shared memories 12 correspond to the accelerators 11 one to one. The interface between each accelerator 11 and the MCU subsystem 10 is implemented through storage shared with the MCU subsystem, as shown in the accompanying drawings.
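To illustrate the one-to-one correspondence, the sketch below (again in C, with an assumed base address and block layout) models every accelerator's shared memory as an identical memory-mapped block, so the MCU sees a compatible, unified data path no matter which accelerator it addresses.

```c
#include <stdint.h>

/* Hypothetical layout of one shared-memory block; every accelerator
 * exposes the same structure, giving the MCU a unified interface. */
typedef struct {
    volatile uint32_t ctrl;        /* start / reset bits           */
    volatile uint32_t status;      /* done / error flags           */
    volatile uint32_t in_words;    /* number of valid input words  */
    volatile uint32_t out_words;   /* number of valid output words */
    volatile uint32_t data[1024];  /* dual-ported data region      */
} acc_shm_t;

/* Assumed base addresses of the shared memories of accelerators 0..N-1,
 * spaced one 8 KiB block apart. */
#define ACC_SHM(n) ((acc_shm_t *)(0x40000000u + (uint32_t)(n) * 0x2000u))
```

Under this assumption, the MCU addresses accelerator n simply as ACC_SHM(n), and adding an accelerator adds one more identical block at the next base address rather than a new kind of interface.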
Further, the shared memory 12 is a register or a multi-port memory module in the FPGA subsystem 20, for example, as shown in the accompanying drawings.
The FPGA subsystem 20 includes an FPGA chip. In addition to the accelerator 11 and the shared memory 12, there may be other functional modules in the FPGA subsystem, which are not limited here.
Compilation Method
Specifically, the compilation method includes the following steps:
S11, acquiring an AI model. Generally, the model can be obtained by reading the output of AI modeling software.
S12, optimizing the algorithm of the AI model to obtain the optimized algorithm. Algorithm-level optimization mainly targets the deep learning model itself. It uses methods such as hyper-parameter setting, network structure pruning, and quantization to reduce the size of the model and the amount of computation, thus speeding up the inference process. The hyper-parameter setting, network structure pruning, and quantization mentioned above serve as examples of algorithm-level optimization of the model; a concrete quantization sketch follows.
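As one concrete instance of the quantization step, the following C sketch performs simple symmetric post-training quantization of a layer's weights to 8-bit integers. It is a minimal illustration under the assumption of per-tensor symmetric scaling; real toolchains typically add calibration data and per-channel scales.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric per-tensor quantization: map float weights to int8 with a
 * single scale, so that w_float ≈ w_int8 * scale. Returns the scale,
 * which is kept for dequantizing results later. */
static float quantize_weights(const float *w, int8_t *q, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {        /* find the dynamic range  */
        float a = fabsf(w[i]);
        if (a > max_abs)
            max_abs = a;
    }

    /* int8 range is [-127, 127]; guard against an all-zero tensor. */
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    for (size_t i = 0; i < n; i++) {
        long v = lroundf(w[i] / scale);
        if (v > 127)  v = 127;              /* clamp to int8           */
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
    return scale;
}
```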
S13, generating a customized accelerator. In determining the function of the accelerator, the characteristics of the specific AI algorithm (such as data bit width and common operations) and the given hardware resource constraints can be considered to arrive at the best trade-off; one possible set of such parameters is sketched below. The IP core corresponding to the accelerator function is then selected/configured and added to the hardware design of the FPGA.
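The trade-off described in S13 can be pictured as choosing a point in a small configuration space. The C sketch below names some plausible knobs as a hypothetical configuration record consumed by the accelerator generator; the fields shown (data bit width, MAC count, buffer depth, optional units) are illustrative assumptions, not a normative parameter list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accelerator-generation parameters derived from the
 * algorithm's characteristics and the FPGA resource budget. */
typedef struct {
    uint8_t  data_bits;     /* data bit width, e.g. 8 after quantization */
    uint16_t num_macs;      /* parallel multiply-accumulate units        */
    uint32_t buffer_words;  /* shared-memory depth per accelerator       */
    bool     has_pooling;   /* include a pooling unit for CNN workloads  */
    bool     has_relu;      /* include an activation unit                */
} acc_config_t;

/* Example point in the space: an int8 CNN accelerator sized for a
 * small, low-power FPGA. */
static const acc_config_t small_cnn_cfg = {
    .data_bits = 8, .num_macs = 16, .buffer_words = 1024,
    .has_pooling = true, .has_relu = true,
};
```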
S14, according to the function of the accelerator, mapping the optimized algorithm to the MCU instruction set and to operation instructions for the accelerator, and generating the software binary code.
S15, according to the function of the accelerator, compiling the IP core of the accelerator and the MCU by the FPGA to generate the hardware system.
Further, in one embodiment, according to the accelerator function, mapping the optimized algorithm to the MCU instruction set and operation instructions for the accelerator includes:
Reading and analyzing the algorithm of the AI model by a compiler, and
Extracting an acceleratable part of the algorithm and implementing the acceleratable part by the accelerator, while implementing the remaining part by the MCU instruction set, as sketched below.
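A minimal, runnable C sketch of the resulting split is shown below: operations the customized accelerator supports are emitted as accelerator operation instructions, and everything else is lowered to ordinary MCU code. The operation kinds and the print-based "emission" are stand-ins for a real code generator.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { OP_CONV2D, OP_MATMUL, OP_SOFTMAX, OP_RESHAPE } op_kind_t;

/* Hypothetical query: does the generated accelerator implement this op?
 * Here convolutions and matrix multiplies form the acceleratable part. */
static bool accelerator_supports(op_kind_t op)
{
    return op == OP_CONV2D || op == OP_MATMUL;
}

/* Dispatch one operation: acceleratable ops become accelerator operation
 * instructions; the remaining ops are lowered to MCU instructions. */
static void emit_op(op_kind_t op)
{
    if (accelerator_supports(op))
        printf("op %d -> accelerator operation instruction\n", (int)op);
    else
        printf("op %d -> MCU instruction sequence\n", (int)op);
}

int main(void)
{
    /* A toy network: conv -> reshape -> matmul -> softmax. */
    op_kind_t net[] = { OP_CONV2D, OP_RESHAPE, OP_MATMUL, OP_SOFTMAX };
    for (unsigned i = 0; i < sizeof net / sizeof net[0]; i++)
        emit_op(net[i]);
    return 0;
}
```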
Further, the accelerator function can be expressed as an extended instruction set of the MCU or as a peripheral function of the MCU; both styles are sketched below.
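The two expressions differ mainly in how the MCU invokes the accelerator, as the following C sketch contrasts. The extended-instruction style is shown with a hypothetical RISC-V custom-0 opcode via inline assembly (assuming a RISC-V MCU core with such an extension); the peripheral style uses memory-mapped registers at assumed addresses. Both encodings are illustrative, not the claimed design.

```c
#include <stdint.h>

/* Style 1: accelerator as an extended instruction of the MCU.
 * Hypothetical RISC-V custom-0 instruction (opcode 0x0b) performing a
 * multiply-accumulate; the encoding is an assumption for illustration
 * and requires a RISC-V toolchain and core with this extension. */
static inline uint32_t acc_mac_insn(uint32_t a, uint32_t b)
{
    uint32_t r;
    __asm__ volatile (".insn r 0x0b, 0, 0, %0, %1, %2"
                      : "=r"(r) : "r"(a), "r"(b));
    return r;
}

/* Style 2: accelerator as a peripheral of the MCU, reached through
 * memory-mapped registers (addresses are again hypothetical). */
#define ACC_OPA (*(volatile uint32_t *)0x40001000u)
#define ACC_OPB (*(volatile uint32_t *)0x40001004u)
#define ACC_RES (*(volatile uint32_t *)0x40001008u)

static inline uint32_t acc_mac_mmio(uint32_t a, uint32_t b)
{
    ACC_OPA = a;
    ACC_OPB = b;      /* writing the second operand triggers the MAC */
    return ACC_RES;   /* read back the result                        */
}
```

The choice feeds directly into step S14: an extended instruction can be emitted inline by the compiler, whereas a peripheral is reached through small driver routines such as the shared-memory example given earlier.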
In another aspect, the present application also provides a computer-readable medium, which may be included in the electronic device described in the above-mentioned embodiment or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, enable the electronic device to realize the compilation method described in the above embodiment.
For example, the electronic device can achieve the steps shown in the accompanying drawings.
Network 602 includes multiple network nodes, which are not shown in the figure.
Switching network 604, which can be referred to as a packet core network, includes cell sites 622-626 capable of providing radio access communication, such as 3G (3rd generation), 4G, or 5G cellular networks. Switching network 604, in one example, includes an IP and/or Multiprotocol Label Switching ("MPLS") based network capable of operating at a layer of the Open Systems Interconnection Basic Reference Model ("OSI model") for information transfer between clients and network servers. In one embodiment, switching network 604 logically couples multiple users and/or mobiles 616-620 across a geographic area via cellular and/or wireless networks. It should be noted that the geographic area may refer to a campus, city, metropolitan area, country, continent, or the like.
Base station 612, also known as a cell site, node B, or eNodeB, includes a radio tower capable of coupling to various user equipments ("UEs") and/or electrical user equipments ("EUEs"). The terms UE and EUE refer to similar portable devices and can be used interchangeably. For example, UEs or EUEs can be a cellular phone 615, a laptop computer 617, an iPhone® 616, or a tablet and/or iPad® 619 communicating wirelessly. A handheld device can also be a smartphone, such as an iPhone®, BlackBerry®, Android device, and so on. Base station 612, in one example, facilitates network communication between mobile devices such as portable handheld devices 613-619 via wired and wireless communications networks. It should be noted that base station 612 may include additional radio towers as well as other land switching circuitry.
Internet 650 is a computing network using Transmission Control Protocol/Internet Protocol ("TCP/IP") to provide linkage between geographically separated devices for communication. Internet 650, in one example, couples to supplier server 638 and satellite network 630 via satellite receiver 632. Satellite network 630, in one example, can provide many functions, such as wireless communication as well as global positioning system ("GPS") service. In one aspect, partitioned PSD with DRPC can be used in all applicable devices, such as, but not limited to, smartphones 613-619, satellite network 630, automobiles 613, AI server 608, business 607, and homes 620.
The above description involves various modules. These modules usually include hardware and/or a combination of hardware and software (e.g., firmware). These modules may also include computer-readable media (e.g., non-transitory media) containing instructions (e.g., software instructions) which, when executed by a processor, perform various functional features of the present invention. Accordingly, unless explicitly required, the scope of the present invention is not limited by the specific hardware and/or software characteristics of the modules explicitly mentioned in the embodiments. As a non-limiting example, the present invention may, in an embodiment, execute software instructions (e.g., stored in volatile and/or non-volatile memory) by one or more processors (e.g., microprocessors, digital signal processors, baseband processors, and microcontrollers). In addition, the present invention can also be implemented with an application specific integrated circuit (ASIC) and/or other hardware components. It should be pointed out that the system/device is divided into various modules for clarity. However, in practice, the boundaries of the various modules can be blurred. For example, any or all of the functional modules herein can share various hardware and/or software components. For example, any and/or all of the functional modules herein can be implemented wholly or partially by executing software instructions on a common processor. In addition, various software sub-modules executed by one or more processors can be shared among various software modules. Accordingly, the scope of the present invention is not limited by mandatory boundaries between various hardware and/or software components unless explicitly required.
What has been disclosed above is only a preferred embodiment of the present invention, and the scope of the present invention is of course not limited thereby. One of ordinary skill in the art can understand all or part of the process for realizing the above-mentioned embodiment, and equivalent changes made according to the claims of the invention still fall within the scope of the invention.
Foreign Application Priority Data

Number | Date | Country | Kind
---|---|---|---
201910875064.9 | Sep 2019 | CN | national