Embodiments described herein generally relate to the field of data processing, and more particularly relates to a hardware acceleration pipeline with filtering engine for column-oriented database management systems with arbitrary scheduling functionality.
Conventionally, big data is a term for data sets that are so large or complex that traditional data processing applications are not sufficient. Challenges of large data sets include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy.
Most systems run on a common Database Management System (DBMS) using a standard database programming language, such as Structured Query Language (SQL). Most modern DBMS implementations (Oracle, IBM DB2, Microsoft SQL Server, Sybase, MySQL, Ingres, etc.) are implemented on relational databases. Typically, a DBMS has a client side where applications or users submit their queries and a server side that executes the queries. Unfortunately, general-purpose CPUs are not efficient for database applications. The on-chip cache of a general-purpose CPU is not effective because it is too small relative to real database workloads.
For one embodiment of the present invention, methods and systems are disclosed for arbitrary scheduling and in-place filtering of relevant data for accelerating operations of a column-oriented database management system. In one example, a hardware accelerator for data stored in columnar storage format comprises memory to store data and a controller coupled to the memory. The controller to process at least a subset of a page of columnar format in an execution unit with any arbitrary scheduling across columns of the columnar storage format.
Other features and advantages of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows below.
Methods, systems and apparatuses for accelerating big data operations with arbitrary scheduling and in-place filtering for a column-oriented database management system are described.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the present invention.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment. Likewise, the appearances of the phrase “in another embodiment,” or “in an alternate embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
The following glossary of terminology and acronyms serves to assist the reader by providing a simplified quick-reference definition. A person of ordinary skill in the art may understand the terms as used herein according to general usage and definitions that appear in widely available standards and reference books.
HW: Hardware.
SW: Software.
I/O: Input/Output.
DMA: Direct Memory Access.
CPU: Central Processing Unit.
FPGA: Field Programmable Gate Arrays.
CGRA: Coarse-Grain Reconfigurable Accelerators.
GPGPU: General-Purpose Graphical Processing Units.
MLWC: Many Light-weight Cores.
ASIC: Application Specific Integrated Circuit.
PCIe: Peripheral Component Interconnect express.
CDFG: Control and Data-Flow Graph.
FIFO: First In, First Out.
NIC: Network Interface Card.
HLS: High-Level Synthesis.
KPN: Kahn Process Networks (KPN) is a distributed model of computation (MoC) in which a group of deterministic sequential processes communicate through unbounded FIFO channels. The process network exhibits deterministic behavior that does not depend on various computation or communication delays. A KPN can be mapped onto any accelerator (e.g., FPGA based platform) for embodiments described herein.
Dataflow analysis: An analysis performed by a compiler on the CDFG of the program to determine dependencies between a write operation on a variable and the subsequent operations that might depend on the written value.
Accelerator: a specialized HW/SW component that is customized to run an application or a class of applications efficiently.
In-line accelerator: An accelerator for I/O-intensive applications that can send and receive data without CPU involvement. If an in-line accelerator cannot finish the processing of an input data, it passes the data to the CPU for further processing.
Bailout: The process of transitioning the computation associated with an input from an in-line accelerator to a general-purpose instruction-based processor (i.e. general purpose core).
Continuation: A kind of bailout in which the CPU continues the execution of an input data element right after the point at which the in-line accelerator bailed out.
Rollback: A kind of bailout in which the CPU restarts the execution of an input data element from the beginning, or from some other known location, using related recovery data such as a checkpoint.
Gorilla++: A programming model and language with both dataflow and shared-memory constructs as well as a toolset that generates HW/SW from a Gorilla++ description.
GDF: Gorilla dataflow (the execution model of Gorilla++).
GDF node: A building block of a GDF design that receives an input, may apply a computation kernel on the input, and generates corresponding outputs. A GDF design consists of multiple GDF nodes. A GDF node may be realized as a hardware module or a software thread or a hybrid component. Multiple nodes may be realized on the same virtualized hardware module or on a same virtualized software thread.
Engine: A special kind of component such as GDF that contains computation.
Infrastructure component: Memory, synchronization, and communication components.
Computation kernel: The computation that is applied to all input data elements in an engine.
Data state: A set of memory elements that contains the current state of computation in a Gorilla program.
Control State: A pointer to the current state in a state machine, stage in a pipeline, or instruction in a program associated to an engine.
Dataflow token: A component's input/output data elements.
Kernel operation: An atomic unit of computation in a kernel. There might not be a one-to-one mapping between kernel operations and the corresponding realizations as states in a state machine, stages in a pipeline, or instructions running on a general-purpose instruction-based processor.
Accelerators can be used for many big data systems that are built from a pipeline of subsystems including data collection and logging layers, a messaging layer, a data ingestion layer, a data enrichment layer, a data store layer, and an intelligent extraction layer. Usually the data collection and logging layers run on many distributed nodes. Messaging layers are also distributed. However, ingestion, enrichment, storing, and intelligent extraction happen at central or semi-central systems. In many cases, ingestion and enrichment need a significant amount of data processing, and large quantities of data need to be transferred from event producers, distributed data collection and logging layers, and messaging layers to the central systems for data processing.
Examples of data collection and logging layers are web servers that record website visits by a plurality of users. Other examples include sensors that record a measurement (e.g., temperature, pressure) or security devices that record special packet transfer events. Examples of a messaging layer include simple copying of the logs or the use of more sophisticated messaging systems (e.g., Kafka, NiFi). Examples of ingestion layers include extract, transform, load (ETL) tools, which refer to a process in database usage and particularly in data warehousing. These ETL tools extract data from data sources, transform the data for storing in a proper format or structure for the purposes of querying and analysis, and load the data into a final target (e.g., database, data store, data warehouse). An example of a data enrichment layer is adding geographical information or user data through databases or key value stores. A data store layer can be a simple file system or a database. An intelligent extraction layer usually uses machine learning algorithms to learn from past behavior to predict future behavior.
Columnar storage formats like Parquet or optimized row columnar (ORC) can achieve higher compression rates by combining dictionary encoding with Run Length Encoding (RLE) or Bit-Packed (BP) encoding. Apache Parquet is an example of a columnar storage format available to any project in a Hadoop ecosystem. Parquet is built to support efficient compression and encoding schemes. Apache optimized row columnar (ORC) is another example of a columnar storage format.
As one embodiment of this present design, the parquet columnar storage format is explored. However, the same concepts apply directly to other columnar formats for storing database tables such as ORC. Data in parquet format is organized in a hierarchical fashion, where each parquet file 200 is composed of Row Groups 210. Each row group (e.g., row groups 0, 1) is composed of a plurality of Columns 220 (e.g., columns a, b). Each column is further composed of a plurality of Pages 230 (e.g., pages 0, 1) or regions. Each page 230 includes a page header 240, repetition levels 241, definition levels 242, and values 243. The repetition levels 241, definition levels 242, and values 243 are compressed using multiple compression and encoding algorithms. The values 243, repetition levels 241, and definition levels 242 for each parquet page 230 may be encoded using Run Length Encoding (RLE), Bit-packed Encoding (BP), a combination of RLE+BP, etc. The encoded parquet page may be further compressed using compression algorithms like Gzip, Snappy, zlib, LZ4, etc.
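For illustration, the following Python sketch models this nesting of files, row groups, columns, and pages; the class and field names are assumptions made for this example and do not reflect the actual Parquet or ORC metadata schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:                      # corresponds to a page 230
    header: bytes                # page header 240
    repetition_levels: bytes     # encoded repetition levels 241
    definition_levels: bytes     # encoded definition levels 242
    values: bytes                # encoded and compressed values 243
    encoding: str = "RLE+BP"     # e.g., RLE, BP, RLE+BP
    compression: str = "snappy"  # e.g., Gzip, Snappy, zlib, LZ4

@dataclass
class ColumnChunk:               # corresponds to a column 220
    name: str
    pages: List[Page] = field(default_factory=list)

@dataclass
class RowGroup:                  # corresponds to a row group 210
    columns: List[ColumnChunk] = field(default_factory=list)

@dataclass
class ParquetFile:               # corresponds to a file 200
    row_groups: List[RowGroup] = field(default_factory=list)
```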
Operations with Parquet:
A typical operation on a database table using the parquet columnar storage format 200 (e.g., file 200) is a decompression and decoding step to extract the values 243, definition levels 242, and repetition levels 241 from the encoded (using RLE+BP or any other encoding) and compressed (e.g., using Gzip, Snappy, etc.) data. The extracted data is then filtered to extract relevant entries from individual parquet pages. Metadata-based filtering can be performed using definition levels 242 or repetition levels 241, and value-based filtering can be performed on the values 243 themselves.
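As a rough illustration of this decompress-decode-filter sequence, the sketch below uses zlib from the Python standard library for the decompression step and a deliberately simplified plain-int32 decode step; the helper names and the simplified encoding are assumptions for illustration, not the actual Parquet page layout.

```python
import zlib
from typing import Callable, List

def decode_plain_int32(raw: bytes) -> List[int]:
    """Placeholder decode step: interpret the payload as little-endian int32
    values.  A real page would be RLE/BP (or otherwise) encoded."""
    return [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]

def process_page(compressed_values: bytes,
                 predicate: Callable[[int], bool]) -> List[int]:
    raw = zlib.decompress(compressed_values)      # decompression step
    values = decode_plain_int32(raw)              # decoding step
    return [v for v in values if predicate(v)]    # value-based filtering step

# Usage: keep only entries with value > 5.
page = zlib.compress(b"".join(v.to_bytes(4, "little") for v in [1, 7, 3, 9]))
print(process_page(page, lambda v: v > 5))        # -> [7, 9]
```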
The present design 300 (programmable hardware accelerator architecture) focuses on hardware acceleration for columnar storage format that can perform decompression, decoding, and filtering. A single instance of the design 300 is referred to as a single Kernel. The kernel 300 includes multiple processing engines (e.g., 310, 320, 330, 340, 350, 360, 370) that are specialized for computations necessary for processing and filtering of parquet columnar format 200 with various different compression algorithms (e.g., Gzip, Snappy, LZ4, etc) and encoding algorithms (e.g., RLE, RLE-BP, Delta encoding, etc).
In one embodiment of the present design for kernel 300, engines 310, 320, 330, 340, 350, 360, and 370 consume and produce data in a streaming fashion, where data generated from one engine is directly fed to another engine. In another embodiment, the data consumed and produced by engines 310, 320, 330, 340, 350, 360, and 370 is read from and written to either on-chip memory, off-chip memory, or a storage device.
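A minimal sketch of the streaming arrangement, with each engine stood in for by a Python generator so that data produced by one stage is consumed directly by the next; the stage bodies are illustrative placeholders for engines 320, 340, and 350, not the hardware interfaces.

```python
import zlib
from typing import Iterable, Iterator

def decompress_engine(blocks: Iterable[bytes]) -> Iterator[bytes]:
    for block in blocks:                      # stands in for engine 320
        yield zlib.decompress(block)

def decoding_engine(blocks: Iterable[bytes]) -> Iterator[int]:
    for block in blocks:                      # stands in for engine 340
        for i in range(0, len(block), 4):
            yield int.from_bytes(block[i:i + 4], "little")

def filtering_engine(values: Iterable[int]) -> Iterator[int]:
    for v in values:                          # stands in for engine 350
        if v > 5:
            yield v

# Data produced by one engine is consumed directly by the next, one element
# at a time, without staging intermediate results in off-chip memory.
stream = [zlib.compress((7).to_bytes(4, "little") + (2).to_bytes(4, "little"))]
print(list(filtering_engine(decoding_engine(decompress_engine(stream)))))  # -> [7]
```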
The Configurable Parser engine 310 is responsible for reading the configuration or instructions that specify a parquet file size, compression algorithms used, filtering operation, and other metadata that is necessary for processing and filtering the parquet format file 200.
The Decompress engine 320 is responsible for decompression according to the compression algorithm used to compress the data (e.g., 241, 242, and 243). In some implementations, the Decompress engine 320 is preceded by the Configurable Parser engine 310 as shown in
The Page Splitter engine 330 is responsible for splitting the contents of the parquet file into the page header 240, repetition levels 241, definition levels 242, and values 243 so these can be individually processed by the subsequent engines.
The Decoding Engine 340 is responsible for further decompression or decoding of repetition levels 241, definition levels 242, and values 243. Based on the configuration accepted by the Config Parser engine 310, the decoding engine can perform decoding for RLE-BP, RLE, BP, Dictionary, Delta, and other algorithms supported for the parquet format 200 and other columnar formats like ORC.
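The following sketch conveys the flavor of this decoding with a simplified run-length decoder and a simplified bit-unpacker; it follows the general idea of RLE and BP rather than the exact Parquet hybrid run format, so the run layout shown here is an assumption for illustration.

```python
from typing import List, Tuple

def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Each run is (value, count); expand to the full value list."""
    out: List[int] = []
    for value, count in runs:
        out.extend([value] * count)
    return out

def bp_decode(packed: bytes, bit_width: int, count: int) -> List[int]:
    """Unpack `count` values of `bit_width` bits each, least significant first."""
    bits = int.from_bytes(packed, "little")
    mask = (1 << bit_width) - 1
    return [(bits >> (i * bit_width)) & mask for i in range(count)]

# Usage: definition levels encoded as RLE runs, values bit-packed at 3 bits.
print(rle_decode([(1, 4), (0, 2)]))          # -> [1, 1, 1, 1, 0, 0]
print(bp_decode(bytes([0b10001011]), 3, 2))  # -> [3, 1]
```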
The Filtering engine 350 (e.g., filtering single instruction multiple data (SIMD) engine 350, filtering very long instruction word (VLIW) engine 350, or a combined SIMD and VLIW execution filtering engine 350) is responsible for applying user-defined filtering conditions on the data 241, 242, 243.
Section size shim engine 360 is responsible for combining the filtered data generated by Filtering engine 350 into one contiguous stream of data.
Finally, in the Packetizer engine 370, the data generated by the previous engines is divided into fixed-size packets that can be written back either to a storage device or to off-chip memory.
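A minimal sketch of this packetizing step, assuming the final short packet is zero-padded (the real engine may instead emit a short final packet); the function name and packet size are illustrative.

```python
from typing import List

def packetize(stream: bytes, packet_size: int = 64, pad: bytes = b"\x00") -> List[bytes]:
    """Divide a contiguous byte stream into fixed-size packets."""
    packets = []
    for offset in range(0, len(stream), packet_size):
        packet = stream[offset:offset + packet_size]
        if len(packet) < packet_size:                 # pad the last packet
            packet += pad * (packet_size - len(packet))
        packets.append(packet)
    return packets

# Usage: 150 bytes of filtered output become three 64-byte packets.
print([len(p) for p in packetize(b"\xab" * 150)])     # -> [64, 64, 64]
```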
The operations of Decompression Engine 320 and Decoding engine 340 result in a significant increase in the size of the data, which may limit performance when bandwidth is limited.
To overcome this limitation, the proposed hardware accelerator design of kernel 300 further includes a filtering engine 350 that performs filtering prior to the data being sent to a host computer (e.g., CPU), significantly reducing the size of the data produced at the output of the pipeline. In one embodiment, the filtering can be in-place, where the filtering operation is performed directly on the incoming stream of data coming from the Decoding Engine 340 into the Filtering engine 350. In another embodiment, the data from the Decoding Engine 340 is first written to on-chip memory, off-chip memory, or a storage device before being consumed by the Filtering engine 350.
Operators Supported by Filtering Engine:
The filtering engine 350 can apply one or more value-based filters and one or more metadata-based filters to individual parquet pages. The value-based filters keep or discard entries in a page based on the value 243 (e.g., value >5). The metadata-based filters are independent of the values and depend instead on the definition levels 242, the repetition levels 241, or the index of a value 243.
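The two filter classes can be sketched as follows, where a value-based filter tests the decoded value itself and a metadata-based filter tests the definition level, repetition level, or index of an entry; the predicate forms and helper names are illustrative assumptions.

```python
from typing import List

def value_filter(values: List[int], keep) -> List[bool]:
    """Value-based filter, e.g. keep = lambda v: v > 5."""
    return [keep(v) for v in values]

def metadata_filter(def_levels: List[int], rep_levels: List[int],
                    max_def: int) -> List[bool]:
    """Metadata-based filter: keep only non-null, non-repeated entries
    (an illustrative condition on definition and repetition levels)."""
    return [d == max_def and r == 0
            for d, r in zip(def_levels, rep_levels)]

values     = [7, 2, 9, 4]
def_levels = [1, 0, 1, 1]
rep_levels = [0, 0, 1, 0]
keep = [v and m for v, m in zip(value_filter(values, lambda v: v > 5),
                                metadata_filter(def_levels, rep_levels, 1))]
print([v for v, k in zip(values, keep) if k])   # -> [7]
```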
Overcoming Limited On-Chip Memory:
When an entry in a page of a column chunk is discarded by the filtering engine, the corresponding entry in a different column chunk can be discarded as well. However, since the filtering engine processes a single page at a time, it keeps track of which entries have been discarded in each page in a local memory 440 (e.g., Column Batch BRAM (CBB)), as shown in
Filtering Engine Architecture Overview:
The stage 410 (or operation 410) accepts data from the incoming stream from a RLE/BP decoder 404 (or any other decoding engine that precedes the filtering engine) and reads data from the memory 440. The memory 440 keeps track of the data filtered out by the previous column chunk.
The stage 411 (or operation 411) performs value-based and metadata-based filtering with a value-based filter and a null filter/page filter. In one example, this stage performs SIMD-style execution to apply value-based and metadata-based filtering to the incoming stream of data. Using SIMD-style execution, the same filter (e.g., value >5) is applied to every value in the incoming value stream. Furthermore, multiple operations (such as value >5 and value <9) can be combined and executed as a single instruction, similar to VLIW (Very Long Instruction Word) execution. A stage 412 (or operation 412) discards data based on the filtering in the stage 411. A stage 413 (or operation 413) combines the filtered data and assembles the data to form the outgoing data stream 420.
A stage 414 (or operation 414) updates the memory 440 according to the filter applied for the current column chunk. This way the filtering engine becomes more effective as the number of column chunks and the number of filters applied increase.
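One way to picture stages 410-414 together with the memory 440 is as a per-entry keep-mask that is read for the current page, combined with the filter results, used to discard entries, and written back; the bitmask representation and function names below are illustrative assumptions, not the hardware interface.

```python
from typing import Callable, List

def filter_column_page(values: List[int],
                       cbb_mask: List[bool],
                       predicate: Callable[[int], bool]):
    """Apply a value-based filter to one page while honoring, and then
    updating, the keep-mask left behind by previously processed columns."""
    # Stage 410: read incoming values and the current mask from memory 440.
    # Stage 411: SIMD-style application of the same predicate to every value.
    passed = [predicate(v) for v in values]
    # Stage 412: discard entries rejected either now or by an earlier column.
    keep = [p and m for p, m in zip(passed, cbb_mask)]
    out_stream = [v for v, k in zip(values, keep) if k]   # stage 413
    # Stage 414: write the tightened mask back so later columns benefit.
    return out_stream, keep

# Column chunk A keeps entries 0 and 2; column chunk B then reuses that mask.
mask = [True] * 4
_, mask = filter_column_page([7, 2, 9, 4], mask, lambda v: v > 5)
out, mask = filter_column_page([10, 20, 30, 40], mask, lambda v: v < 35)
print(out, mask)   # -> [10, 30] [True, False, True, False]
```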
As discussed above, limited memory provides challenges for the hardware accelerator.
To be effective, the filtering engine 400 needs to keep track of which bits have been filtered out for a column chunk and discard the corresponding entries for other column chunks. As such, for large parquet pages, the amount of memory required to keep track of the filtered entries can exceed the limited on-chip or on-board memory available for FPGA/ASIC acceleration. The present design overcomes this challenge by supporting partial filtering of pages, called sub-page filtering, to best utilize the available memory capacity. To this end, the filtering engine exposes the following parameters for effective scheduling (an illustrative sketch follows the list):
1. Total number of entries in a page or region (e.g., parquet page)
2. Number of entries to be filtered in the page or region (e.g., parquet page). This specifies the number of entries in the parquet page to be filtered; the remaining entries are not filtered and are passed through.
3. Range of entries valid in CBB 440: The range of entries that are valid in the CBB for the previous parquet column chunk. This allows the filtering engine to apply filters successively as the different pages are being processed.
4. Offset address for CBB 440
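A rough sketch of how these four parameters might drive sub-page filtering, under the assumption that the unfiltered remainder of a page is passed through unchanged; the parameter names mirror the list above, but the function itself is illustrative.

```python
from typing import Callable, List, Tuple

def subpage_filter(values: List[int],
                   total_entries: int,                # parameter 1
                   entries_to_filter: int,            # parameter 2
                   cbb_valid_range: Tuple[int, int],  # parameter 3
                   cbb_offset: int,                   # parameter 4
                   cbb: List[bool],
                   predicate: Callable[[int], bool]) -> List[int]:
    out: List[int] = []
    lo, hi = cbb_valid_range
    for i in range(total_entries):
        if i >= entries_to_filter:
            out.append(values[i])                     # pass-through region
            continue
        slot = cbb_offset + i
        prior_keep = cbb[slot] if lo <= slot < hi else True
        keep = prior_keep and predicate(values[i])
        cbb[slot] = keep                              # update filter memory
        if keep:
            out.append(values[i])
    return out

cbb = [True] * 8
print(subpage_filter([1, 7, 9, 2, 5], 5, 3, (0, 8), 0, cbb, lambda v: v > 4))
# -> [7, 9, 2, 5]: only the first 3 entries are filtered, the rest pass through
```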
Target Hardware
The target pipeline can utilize various forms of decompression (e.g., Gzip, Snappy, zlib, LZ4, . . . ), along with the necessary type of decoder (e.g., RLE, BP, . . . ) and an engine to perform the filtering. In one embodiment, an internal filtering engine architecture utilizes a memory (e.g., CBB 440) and multiple SIMD (Single Instruction Multiple Data) lanes to store filtering results across columns and produce filtered results as part of a larger pipeline to perform parquet page-level filtering.
Execution Flow:
Normal
In a typical scheduler (e.g., Spark), each page of a column chunk is processed sequentially across columns. For example, if there are 3 columns in a row-group and each column chunk has 2 pages, then column 1 page 1 is processed, then column 2 page 1 is processed, after which column 3 page 1 is processed. The software scheduler then moves on to processing page 2 of column 1, page 2 of column 2, and page 2 of column 3. Typical software implementations, including open-source implementations, execute one page after the other in this sequential fashion.
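This conventional order can be expressed as a simple nested loop that visits page p of every column before moving on to page p+1; the generator below is an illustration of that schedule, not code from any particular scheduler.

```python
from typing import Iterator, Tuple

def normal_schedule(num_columns: int, pages_per_column: int) -> Iterator[Tuple[int, int]]:
    """Yield (column, page) pairs in the conventional sequential order."""
    for page in range(pages_per_column):
        for column in range(num_columns):
            yield (column, page)

# 3 columns, 2 pages per column chunk:
print(list(normal_schedule(3, 2)))
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```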
Batched Schedule
The present design provides a hardware implementation that can support any algorithm to schedule processing of a set of column-based pages by exposing the required parameters. Different schedules have varying impacts on the efficiency and parallelism of execution, and also impact the overhead and complexity of implementation. Simpler scheduling algorithms can be easier to support, at the potential cost of underutilization and inefficiency when one or more instances of the present hardware design of kernel 300 are used. A more elaborate scheduling algorithm can improve efficiency by maximizing the reuse of local memory information during filtering across multiple columns. It can also allow for the extraction of parallelism by scheduling the processing of multiple pages, specifically pages in the same column, to be dispatched across multiple executors concurrently. These improvements come at a heightened development and complexity cost. Local memory (e.g., a software-managed scratchpad or a hardware-managed cache) utilization and contention, kernel utilization, and the number of kernels are among the parameters to consider for an internal cost function for determining the efficiency of page scheduling algorithms.
Subpage Scheduling
The hardware of the present design allows for partial filtering during the processing of a page 230. When a page 230 is too large to fit its filtering information in the local memory, or it is desirable to maintain the state of the local memory instead of overwriting it, the hardware can still perform as much filtering as is requested before passing along the rest of the page to software. Software maintains information about how much filtering is expected in order to interpret the output results correctly.
Multiple Kernel/Execution Unit
The scheduling unit provides the necessary infrastructure to process pages of the same row group across multiple execution units or kernels. As an example, if there are two column chunks 220 in a row group 210 and each column chunk has 2 pages 230, then the first pages of each column can be executed on one kernel and the second pages of each column can be executed in parallel on another kernel. This doubles the throughput of processing row-groups.
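A sketch of splitting the pages of one row group across two execution units, assuming pages are assigned to kernels by page index so that the kernels can run concurrently; the assignment rule is an illustrative assumption.

```python
from typing import Dict, List, Tuple

def assign_to_kernels(num_columns: int, pages_per_column: int,
                      num_kernels: int) -> Dict[int, List[Tuple[int, int]]]:
    """Map (column, page) work items to kernels by page index."""
    plan: Dict[int, List[Tuple[int, int]]] = {k: [] for k in range(num_kernels)}
    for page in range(pages_per_column):
        kernel = page % num_kernels
        for column in range(num_columns):
            plan[kernel].append((column, page))
    return plan

# Two column chunks, two pages each, two kernels: first pages on kernel 0,
# second pages on kernel 1, processed concurrently.
print(assign_to_kernels(2, 2, 2))
# -> {0: [(0, 0), (1, 0)], 1: [(0, 1), (1, 1)]}
```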
Dynamic Scheduling
The scheduler can dynamically update the scheduling preferences in order to extract more parallelism and/or filter reuse. The scheduler has an internal profiler that monitors throughput to determine which pages are advantageous to prioritize in order to maximize the reuse of the filtering information stored in memory 440 and allow more data to be discarded. The profiler is capable of utilizing feedback from additional sources, such as reinforcement learning or history buffers and pattern matching, to improve its scheduling algorithm.
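A minimal sketch of such profiler-driven scheduling, assuming the profiler tracks how much data each column's filter discards and reorders pending pages so the most selective columns are processed first; the moving-average scoring rule is an illustrative assumption, not the profiler's actual feedback mechanism.

```python
from typing import Dict, List, Tuple

class DynamicScheduler:
    def __init__(self, columns: List[int]) -> None:
        self.selectivity: Dict[int, float] = {c: 0.0 for c in columns}

    def record(self, column: int, rows_in: int, rows_out: int) -> None:
        """Profiler feedback after a page finishes: fraction discarded."""
        discarded = 1.0 - (rows_out / rows_in if rows_in else 1.0)
        # Exponential moving average of how much this column's filter discards.
        self.selectivity[column] = 0.5 * self.selectivity[column] + 0.5 * discarded

    def next_pages(self, pending: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
        """Reorder pending (column, page) items, most selective column first,
        to maximize reuse of the filter information in memory 440."""
        return sorted(pending, key=lambda cp: -self.selectivity[cp[0]])

sched = DynamicScheduler(columns=[0, 1])
sched.record(0, rows_in=1000, rows_out=900)   # column 0 discards little
sched.record(1, rows_in=1000, rows_out=100)   # column 1 discards a lot
print(sched.next_pages([(0, 1), (1, 1)]))     # -> [(1, 1), (0, 1)]
```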
The present design provides increased throughput: with batched scheduling of pages and processing of pages on multiple kernels (e.g., execution units) at the same time, the throughput can be substantially increased. The present design provides filter reuse with subpage scheduling; the partial filters in the memory 440 associated with a Filtering Engine 350 can be reused effectively across multiple columns. The present design also has lower CPU utilization: with filtering happening in a hardware accelerator, the workload on the CPU is reduced, leading to lower CPU utilization and better efficiency. Also, a reduced number of API calls occurs due to the batched nature of scheduling. If the batch size is zero and there is a software scheduler, then each individual page needs to be communicated to the accelerator using some API. With batching of multiple pages, API calls to an accelerator are reduced.
Round Robin
A round robin schedule is an example of a simple page scheduling algorithm. The algorithm iterates across the columns 220 and selects one page 230 from each column to schedule. This gives fair treatment to each column 220 but may result in inefficiency due to disparities in page sizes and the existence of pages whose boundaries do not align at the same row as the boundaries of pages in other columns.
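A minimal sketch of this round robin page schedule, assuming each column's unscheduled pages are held in a simple per-column queue; the page labels are illustrative.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Tuple

def round_robin(pages_by_column: Dict[int, List[str]]) -> Iterator[Tuple[int, str]]:
    """Cycle over the columns, taking one unscheduled page from each in turn."""
    queues: Dict[int, Deque[str]] = {c: deque(p) for c, p in pages_by_column.items()}
    while any(queues.values()):
        for column, queue in queues.items():
            if queue:
                yield (column, queue.popleft())

pages = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1", "c2"]}
print(list(round_robin(pages)))
# -> [(0,'a0'), (1,'b0'), (2,'c0'), (0,'a1'), (2,'c1'), (2,'c2')]
```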
Round Robin (Largest Page First)
Largest-page-first round robin scheduling chooses from among the top unscheduled page of each column and schedules these pages in order of decreasing size. Once all pages of this subset have been scheduled, a new subset is made from the next page in each column, and the subschedule is again chosen in order of decreasing size. This algorithm attempts to extract filter reuse by making the sequentially smaller pages reuse filter bits for all of their elements, not just partially. This algorithm is still prone to offset pages that cause thrashing in the scratchpad, resulting in no filter reuse.
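A sketch of the largest-page-first variant, assuming page sizes are known in advance: in each round the head page of every column is collected, and that subset is dispatched in order of decreasing size. The structure is illustrative.

```python
from typing import Dict, Iterator, List, Tuple

def largest_page_first(page_sizes: Dict[int, List[int]]) -> Iterator[Tuple[int, int]]:
    """Yield (column, page_index) round by round, largest head page first."""
    rounds = max(len(sizes) for sizes in page_sizes.values())
    for r in range(rounds):
        heads = [(column, r, sizes[r])
                 for column, sizes in page_sizes.items() if r < len(sizes)]
        for column, page, _ in sorted(heads, key=lambda h: -h[2]):
            yield (column, page)

# Column 0 has pages of 100 and 40 rows, column 1 has pages of 70 and 90 rows.
print(list(largest_page_first({0: [100, 40], 1: [70, 90]})))
# -> [(0, 0), (1, 0), (1, 1), (0, 1)]
```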
Column Exhaustive
A column exhaustive scheduling algorithm schedules all pages 230 in a column 220 before moving on to the next column 220. This is the simplest algorithm suited towards extracting parallelism across multiple kernels, as pages in the same column have no dependencies with one another.
Even Pacing Across Columns (Using Number of Rows)
This algorithm schedules a first page 230, whose number of rows serves as a current max pointer into the memory 440. The algorithm then schedules pages in such a fashion that each page comes as close to the max pointer as possible without exceeding it, until there are no small enough pages remaining. The max pointer is then pushed forward by the next page and the process repeats. This algorithm tries for maximum reuse of the filter but, unlike largest-page-first round robin, is not limited in its choice, at the cost of more complexity.
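A sketch of this even-pacing heuristic, assuming page sizes are expressed in rows and that the max pointer advances to the end row of the page that is scheduled when nothing else fits; the data structures and tie-breaking are illustrative.

```python
from typing import Dict, Iterator, List, Tuple

def even_pacing(page_rows: Dict[int, List[int]]) -> Iterator[Tuple[int, int]]:
    """page_rows maps column -> page sizes (in rows), in page order."""
    progress = {c: 0 for c in page_rows}   # rows already scheduled per column
    head = {c: 0 for c in page_rows}       # index of the next unscheduled page

    def pending() -> List[int]:
        return [c for c in page_rows if head[c] < len(page_rows[c])]

    def schedule(column: int, end: int) -> Tuple[int, int]:
        item = (column, head[column])
        progress[column] = end
        head[column] += 1
        return item

    first = pending()[0]                   # seed with a first page ...
    max_ptr = page_rows[first][0]          # ... whose end row is the max pointer
    yield schedule(first, max_ptr)
    while pending():
        # Head pages that land as close to the max pointer as possible
        # without exceeding it.
        fits = [(c, progress[c] + page_rows[c][head[c]]) for c in pending()
                if progress[c] + page_rows[c][head[c]] <= max_ptr]
        if fits:
            column, end = max(fits, key=lambda f: f[1])
        else:
            # No page is small enough: the next page pushes the pointer forward.
            column = pending()[0]
            end = progress[column] + page_rows[column][head[column]]
            max_ptr = end
        yield schedule(column, end)

print(list(even_pacing({0: [100, 100], 1: [60, 50, 90]})))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (1, 2)]
```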
Even Pacing Across Columns (with Allowance of Buffer Size)
Same as the previous algorithm; however, once a page in a column has been scheduled, this algorithm will schedule as many pages from that column as possible without exceeding a buffer-size amount of rows from the base pointer. The base pointer moves to the lowest row number across uncommitted pages. By preferring scheduling within a single column, parallelism across multiple kernels is maximized.
Even Pacing Across Columns (with Selectivity)
Same as the previous algorithm, but the choice is also weighted by selectivity, prioritizing the set of pages with the highest selectivity first in order to maximize filter reuse across columns.
In order to minimize the number of interactions between the host CPU and the hardware accelerator kernel 300, this example dispatches multiple pages 230 of data at a time from parquet columnar storage 200, the ORC columnar format, row-based storage formats like JSON and CSV, and other big data processing operations such as sorting and shuffle, among others.
Without batching, the execution of steps 1410-1440 is serialized, and the interaction between software and the hardware kernel 300 can cause a reduction in performance.
In an embodiment, accelerator 911 is coupled to multiple I/O interfaces (not shown in the figure). In an embodiment, input data elements are received by I/O interface 912 and the corresponding output data elements generated as the result of the system computation are sent out by I/O interface 912. In an embodiment, I/O data elements are directly passed to/from accelerator 911. In processing the input data elements, in an embodiment, accelerator 911 may be required to transfer the control to general purpose instruction-based processor 920. In an alternative embodiment, accelerator 911 completes execution without transferring the control to general purpose instruction-based processor 920. In an embodiment, accelerator 911 has a master role and general-purpose instruction-based processor 920 has a slave role.
In an embodiment, accelerator 911 partially performs the computation associated with the input data elements and transfers the control to other accelerators or the main general-purpose instruction-based processor in the system to complete the processing. The term “computation” as used herein may refer to any computer task processing including, but not limited to, any of arithmetic/logic operations, memory operations, I/O operations, and offloading part of the computation to other elements of the system such as general purpose instruction-based processors and accelerators. Accelerator 911 may transfer the control to general purpose instruction-based processor 920 to complete the computation. In an alternative embodiment, accelerator 911 performs the computation completely and passes the output data elements to I/O interface 912. In another embodiment, accelerator 911 does not perform any computation on the input data elements and only passes the data to general purpose instruction-based processor 920 for computation. In another embodiment, general purpose instruction-based processor 920 may have accelerator 911 take control and complete the computation before sending the output data elements to the I/O interface 912.
In an embodiment, accelerator 911 may be implemented using any device known to be used as accelerator, including but not limited to field-programmable gate array (FPGA), Coarse-Grained Reconfigurable Architecture (CGRA), general-purpose computing on graphics processing unit (GPGPU), many light-weight cores (MLWC), network general purpose instruction-based processor, I/O general purpose instruction-based processor, and application-specific integrated circuit (ASIC). In an embodiment, I/O interface 912 may provide connectivity to other interfaces that may be used in networks, storages, cameras, or other user interface devices. I/O interface 912 may include receive first in first out (FIFO) storage 913 and transmit FIFO storage 914. FIFO storages 913 and 914 may be implemented using SRAM, flip-flops, latches or any other suitable form of storage. The input packets are fed to the accelerator through receive FIFO storage 913 and the generated packets are sent over the network by the accelerator and/or general purpose instruction-based processor through transmit FIFO storage 914.
In an embodiment, I/O processing unit 910 may be a Network Interface Card (NIC). In an embodiment of the invention, accelerator 911 is part of the NIC. In an embodiment, the NIC is on the same chip as general purpose instruction-based processor 920. In an alternative embodiment, the NIC 910 is on a separate chip coupled to general purpose instruction-based processor 920. In an embodiment, the NIC-based accelerator receives an incoming packet, as input data elements through I/O interface 912, processes the packet and generates the response packet(s) without involving general purpose instruction-based processor 920. Only when accelerator 911 cannot handle the input packet by itself is the packet transferred to general purpose instruction-based processor 920. In an embodiment, accelerator 911 communicates with other I/O interfaces, for example, storage elements through direct memory access (DMA) to retrieve data without involving general purpose instruction-based processor 920.
Accelerator 911 and the general-purpose instruction-based processor 920 are coupled to shared memory 943 through private cache memories 941 and 942, respectively. In an embodiment, shared memory 943 is a coherent memory system. The coherent memory system may be implemented as a shared cache. In an embodiment, the coherent memory system is implemented using multiple caches with a coherency protocol in front of a higher-capacity memory such as a DRAM.
In an embodiment, the transfer of data between different layers of accelerations may be done through dedicated channels directly between accelerator 911 and processor 920. In an embodiment, when the execution exits the last acceleration layer by accelerator 911, the control will be transferred to the general-purpose core 920.
Processing data by forming two paths of computations on accelerators and general purpose instruction-based processors (or multiple paths of computation when there are multiple acceleration layers) has many other applications apart from low-level network applications. For example, most emerging big-data applications in data centers have been moving toward scale-out architectures, a technology for scaling the processing power, memory capacity and bandwidth, as well as persistent storage capacity and bandwidth. These scale-out architectures are highly network-intensive. Therefore, they can benefit from acceleration. These applications, however, have a dynamic nature requiring frequent changes and modifications. Therefore, it is highly beneficial to automate the process of splitting an application into a fast-path that can be executed by an accelerator with subgraph templates and a slow-path that can be executed by a general-purpose instruction-based processor as disclosed herein.
While embodiments of the invention are shown as two accelerated and general-purpose layers throughout this document, it is appreciated by one skilled in the art that the invention can be implemented to include multiple layers of computation with different levels of acceleration and generality. For example, an FPGA accelerator can be backed by many-core hardware. In an embodiment, the many-core hardware can be backed by a general-purpose instruction-based processor.
Referring to
Data processing system 1202, as disclosed above, includes a general-purpose instruction-based processor 1227 and an accelerator 1226 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, or both). The general-purpose instruction-based processor may be one or more general purpose instruction-based processors or processing devices (e.g., microprocessor, central processing unit, or the like). More particularly, data processing system 1202 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, general purpose instruction-based processor implementing other instruction sets, or general purpose instruction-based processors implementing a combination of instruction sets. The accelerator may be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal general purpose instruction-based processor (DSP), network general purpose instruction-based processor, many light-weight cores (MLWC) or the like. Data processing system 1202 is configured to implement the data processing system for performing the operations and steps discussed herein.
The exemplary computer system 1200 includes a data processing system 1202, a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216 (e.g., a secondary memory unit in the form of a drive unit, which may include fixed or removable computer-readable storage medium), which communicate with each other via a bus 1208. The storage units disclosed in computer system 1200 may be configured to implement the data storing mechanisms for performing the operations and steps discussed herein. Memory 1206 can store code and/or data for use by processor 1227 or accelerator 1226. Memory 1206 includes a memory hierarchy that can be implemented using any combination of RAM (e.g., SRAM, DRAM, DDRAM), ROM, FLASH, magnetic and/or optical storage devices. Memory may also include a transmission medium for carrying information-bearing signals indicative of computer instructions or data (with or without a carrier wave upon which the signals are modulated).
Processor 1227 and accelerator 1226 execute various software components stored in memory 1204 to perform various functions for system 1200. Furthermore, memory 1206 may store additional modules and data structures not described above.
Operating system 1205a includes various procedures, sets of instructions, software components and/or drivers for controlling and managing general system tasks and facilitates communication between various hardware and software components. A compiler is a computer program (or set of programs) that transforms source code written in a programming language into another computer language (e.g., target language, object code). A communication module 1205c provides communication with other devices utilizing the network interface device 1222 or RF transceiver 1224.
The computer system 1200 may further include a network interface device 1222. In an alternative embodiment, the data processing system disclosed is integrated into the network interface device 1222 as disclosed herein. The computer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD), LED, or a cathode ray tube (CRT)) connected to the computer system through a graphics port and graphics chipset, an input device 1212 (e.g., a keyboard, a mouse), a camera 1214, and a Graphic User Interface (GUI) device 1220 (e.g., a touch-screen with input & output functionality).
The computer system 1200 may further include an RF transceiver 1224 that provides frequency shifting, converting received RF signals to baseband and converting baseband transmit signals to RF. In some descriptions a radio transceiver or RF transceiver may be understood to include other signal processing functionality such as modulation/demodulation, coding/decoding, interleaving/de-interleaving, spreading/despreading, inverse fast Fourier transforming (IFFT)/fast Fourier transforming (FFT), cyclic prefix appending/removal, and other signal processing functions.
The Data Storage Device 1216 may include a machine-readable storage medium (or more specifically a computer-readable storage medium) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein.
The disclosed data storing mechanism may be implemented, completely or at least partially, within the main memory 1204 and/or within the data processing system 1202 by the computer system 1200, with the main memory 1204 and the data processing system 1202 also constituting machine-readable storage media.
In one example, the computer system 1200 is an autonomous vehicle that may be connected (e.g., networked) to other machines or other autonomous vehicles in a LAN, WAN, or any network. The autonomous vehicle can be a distributed system that includes many computers networked within the vehicle. The autonomous vehicle can transmit communications (e.g., across the Internet, any wireless communication) to indicate current conditions (e.g., an alarm collision condition indicates close proximity to another vehicle or object, a collision condition indicates that a collision has occurred with another vehicle or object, etc.). The autonomous vehicle can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The storage units disclosed in computer system 1200 may be configured to implement data storing mechanisms for performing the operations of autonomous vehicles.
The computer system 1200 also includes sensor system 1214 and mechanical control systems 1207 (e.g., motors, driving wheel control, brake control, throttle control, etc.). The processing system 1202 executes software instructions to perform different features and functionality (e.g., driving decisions) and provide a graphical user interface 1220 for an occupant of the vehicle. The processing system 1202 performs the different features and functionality for autonomous operation of the vehicle based at least partially on receiving input from the sensor system 1214 that includes laser sensors, cameras, radar, GPS, and additional sensors. The processing system 1202 may be an electronic control unit for the vehicle.
The above description of illustrated implementations of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific implementations of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications may be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific implementations disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
This application claims the benefit of U.S. Provisional Application No. 62/885,150, filed on Aug. 9, 2019, the entire contents of which are hereby incorporated by reference.