Microprocessors have benefited from continuing gains in transistor count, integrated circuit cost, manufacturing capital, clock frequency, and energy efficiency due to continued transistor scaling predicted by Moore's law, with little change in associated processor Instruction Set Architectures (ISAs). However, the benefits realized from photolithographic scaling, which drove the semiconductor industry over the last 40 years, are slowing or even reversing. Reduced Instruction Set Computing (RISC) architectures have been the dominant paradigm in processor design for many years.
Methods, apparatus, and computer-readable storage media are disclosed for performing complex arithmetic operations using a single processor instruction. In certain examples of the disclosed technology, a processor is configured to execute a single processor instruction to produce two or more function values by performing table lookups based on an input operand of the instruction, generate an output value by interpolating a value based on the produced function values, and produce the interpolated value as an output operand of the single processor instruction. The disclosed techniques can be implemented in general-purpose central processing units (CPUs), graphics processing units (GPUs), vector processors, or other suitable processors. In some examples, the disclosed techniques allow for improved processing efficiency and/or energy savings. In some examples, the single instruction includes a single instruction multiple data (SIMD) operand.
In some examples of the disclosed technology, each “lane” or “slot” of a multi-operand SIMD register will be used for a table lookup. In some examples, the lookup table is preloaded to support various mathematical operations, for example, trigonometric operations, texture operations, or other mathematical functions. The results received from the table lookup can then be interpolated in order to determine a result. The resulting data can then be stored as the output of the single instruction, for example, in a processor register or in memory.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. The foregoing and other objects, features, and advantages of the disclosed subject matter will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.
This disclosure is set forth in the context of representative embodiments that are not intended to be limiting in any way.
As used in this application the singular forms “a,” “an,” and “the” include the plural forms unless the context clearly dictates otherwise. Additionally, the term “includes” means “comprises.” Further, the term “coupled” encompasses mechanical, electrical, magnetic, optical, as well as other practical ways of coupling or linking items together, and does not exclude the presence of intermediate elements between the coupled items. Furthermore, as used herein, the term “and/or” means any one item or combination of items in the phrase.
The systems, methods, and apparatus described herein should not be construed as being limiting in any way. Instead, this disclosure is directed toward all novel and non-obvious features and aspects of the various disclosed embodiments, alone and in various combinations and subcombinations with one another. The disclosed systems, methods, and apparatus are not limited to any specific aspect or feature or combinations thereof, nor do the disclosed things and methods require that any one or more specific advantages be present or problems be solved. Furthermore, any features or aspects of the disclosed embodiments can be used in various combinations and subcombinations with one another.
Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed things and methods can be used in conjunction with other things and methods. Additionally, the description sometimes uses terms like “produce,” “generate,” “display,” “receive,” “emit,” “verify,” “execute,” and “initiate” to describe the disclosed methods. These terms are high-level descriptions of the actual operations that are performed. The actual operations that correspond to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art.
Theories of operation, scientific principles, or other theoretical descriptions presented herein in reference to the apparatus or methods of this disclosure have been provided for the purposes of better understanding and are not intended to be limiting in scope. The apparatus and methods in the appended claims are not limited to those apparatus and methods that function in the manner described by such theories of operation.
Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable media (e.g., computer-readable media, such as one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques, as well as any data created and used during implementation of the disclosed embodiments, can be stored on one or more computer-readable media (e.g., computer-readable storage media). The computer-executable instructions can be part of, for example, a dedicated software application, or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., a thread executing on any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.
For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C, C++, Java, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well-known and need not be set forth in detail in this disclosure.
Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Novel operations performed with a processor are disclosed. In some examples, low-power processing is achieved based at least in part on performing mathematical operations using a single processor instruction.
Processors with vector or single instruction multiple data (SIMD) instruction sets can be used in hand, gesture, or depth processing. Such processors are typically designed to be very low power. It is often desirable to perform fairly complex math operations on such processors, and accuracy can be reduced in order to reduce the compute power required to perform them. In some examples, a lookup table and interpolation are used to support the processor functions in a low-power fashion. In some examples, a unique set of instructions is provided that is natively available in a processor Instruction Set Architecture (ISA) to increase performance and/or save energy.
In some examples, combining a SIMD instruction set with a table lookup and subsequent interpolation provides a lower power processor, which is desirable in, for example, mobile hardware applications, while simultaneously realizing higher performance due to a reduction in the number of operations performed, including associated overhead, thereby further increasing energy savings.
In some examples of the disclosed technology, each “lane” or “slot” of a SIMD register is used for a respective table lookup. A pre-loaded lookup table is accessed to support a number of operations, including mathematical operations. In other examples, the lookup table can be fixed (e.g., using a read-only memory (ROM)) to realize further energy savings. Results of table lookups are interpolated. The outputs can be stored in the same SIMD register as the source operands (e.g., an operation on a four-lane SIMD operand results in a four-operand output) or in a different register.
Furthermore, any of the processing cores 110 have access to a set of registers which are included within, for example, a register file. In some examples, the processor cores 110 share registers within a register file. In other examples, each of the processor cores includes its own dedicated register file. The register files store data for registers defined in the corresponding processor architecture, and can have one or more read ports and one or more write ports.
In the example of
The I/O interface 145 includes circuitry for receiving and sending input and output signals to other components, such as hardware interrupts, system control signals, peripheral interfaces, co-processor control and/or data signals (e.g., signals for a graphics processing unit, floating point coprocessor, physics processing unit, digital signal processor, or other co-processing components), clock signals, semaphores, or other suitable I/O signals. The I/O signals may be synchronous or asynchronous. In some examples, all or a portion of the I/O interface is implemented using memory-mapped I/O techniques in conjunction with the memory interface 140.
The multi-processor 100 can also include a control unit 160. The control unit 160 supervises operation of the multi-processor 100. Operations that can be performed by the control unit 160 can include allocation and de-allocation of cores for performing instruction processing, control of input data and output data between any of the cores, the register file 130, the memory interface 140, and/or the I/O interface 145. The control unit 160 can also process hardware interrupts, and control reading and writing of special system registers, for example the program counter. In some examples of the disclosed technology, the control unit 160 is at least partially implemented using one or more of the processing cores 110, while in other examples, the control unit 160 is implemented using a different processing core (e.g., a general-purpose RISC processing core). In some examples, the control unit 160 is implemented at least in part using one or more of: hardwired finite state machines, programmable microcode, programmable gate arrays, or other suitable control circuits. In alternative examples, control unit functionality can be performed by one or more of the cores 110.
The control unit 160 includes a scheduler that is used to allocate instructions for execution on one or more of the processor cores 110. The recited stages of instruction operation are for illustrative purposes, and in some examples of the disclosed technology, certain operations can be combined, omitted, separated into multiple operations, or additional operations added.
The multi-processor 100 also includes a clock generator 170, which distributes one or more clock signals to various components within the processor (e.g., the cores 110, interconnect 120, memory interface 140, and/or I/O interface 145). In some examples of the disclosed technology, all of the components share a common clock, while in other examples different components use different clocks, for example, clock signals having differing clock frequencies. In some examples, a portion of the clock is gated to allow power savings when some of the processor components are not in use. In some examples, the clock signals are generated using a phase-locked loop (PLL) to generate a signal of fixed, constant frequency and duty cycle. In some examples, circuitry that receives the clock signals is triggered on a single edge (e.g., a rising edge), while in other examples, at least some of the receiving circuitry is triggered by rising and falling clock edges. In some examples, the clock signal can be transmitted optically or wirelessly.
Also shown in
While
The generalized microarchitecture illustrated in the block diagram 200 includes a control unit 215, which generates control signals to regulate processor core operation and schedules the flow of instructions within the core. For example, the control unit 215 can initiate execution of processor instructions using an instruction fetch unit 220 which accesses the processor memory system 150 in order to fetch one or more processor instructions and store the fetched instructions in an instruction cache 225. Instructions stored in the instruction cache 225 in turn are decoded using an instruction decoder 227. The instruction decoder decodes opcodes specified within the machine language instructions in order to specify operations to be performed and controlled by the control unit 215.
The control unit 215 can be implemented using any suitable technology for generating control signals to regulate and schedule operation of the core. In some examples, the control unit 215 is implemented using hardwired logic to implement a finite state machine. In other examples, the control unit 215 is implemented using logic coupled to a storage unit storing microinstructions for implementing control unit functions. In some examples, the logic for the control unit 215 is implemented at least in part using programmable logic, while in other examples, the control unit is implemented at least in part using hardwired logic that cannot be easily modified after the control unit has been fabricated in an integrated circuit.
The instruction decoder 227 also specifies instruction operands, including input operands and output operands. The instruction operands can be specified using any suitable addressing modes which, depending on a particular processor implementation, can include register mode, immediate mode, displacement mode, indirect mode, indexed mode, absolute mode, memory indirect mode, auto increment mode, auto decrement mode, or scaled mode. In some examples, an instruction has one input operand and one output operand. In other examples, instructions can have more than one input operand and/or output operand. In other examples, one or more of the input operands, or the output operands, are inferred, instead of being explicitly specified within a particular instruction word.
Some instructions are used to load data into the processing unit 210 using the data fetch module 230. The data fetch module 230 uses the memory system 150 to access data stored in a cache, main memory, or virtual memory, and store the data received from the memory system 150 in a data cache 235. Data stored in the data cache 235 can in turn be loaded into a register file 240 that holds architecturally-defined registers for the processing unit 210.
Also shown in
The execution units can also access data stored in a lookup table (LUT) 270. The lookup table can be implemented using read only memory (ROM), random access memory (RAM), as a register file (e.g. a register file comprising latches and/or flip flops) or other suitable storage technology. In some examples, processing resources, including some or all of the memory accessible to the processing unit 210, including in the LUT 270, can be stored in embedded memory including within a System on Chip (SoC) integrated circuit. The LUT 270 can have one or more read ports and one or more write ports, depending on the particular configuration. For example, if the processing unit 210 is a SIMD processor processing four 16-bit words of data simultaneously, the LUT 270 can output data 64 bits in width, or 16 bits in width for each lane of SIMD data. In some examples, the LUT 270 can be programmed using one or more dedicated processor instructions. In other examples, the LUT can be pre-programmed (e.g. as in a ROM, flash memory, or other suitable means) by using a dedicated memory address and read/write memory operations, or by other suitable means. The particular configuration of the LUT 270 can be determined by the designer of the processing unit 210 in view of the apparatus and methods disclosed herein.
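The relationship between lane width and LUT output width described above can be sketched in software. The following is a minimal model, not the disclosed hardware: the helper names `pack_lanes`, `unpack_lanes`, and `lut_read_wide` are hypothetical, and placing lane 0 in the least significant bits is an assumption made only for illustration.

```python
LANES = 4        # lane count taken from the four-lane example in the text
LANE_BITS = 16   # lane width taken from the 16-bit example in the text
LANE_MASK = (1 << LANE_BITS) - 1


def pack_lanes(values):
    """Pack LANES unsigned lane values into one integer (lane 0 in the low bits, an assumption)."""
    if len(values) != LANES:
        raise ValueError("expected %d lane values" % LANES)
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack_lanes(word):
    """Split a packed word back into its LANES lane values."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def lut_read_wide(table, indices):
    """Model a LUT read that returns one 16-bit entry per lane as a single 64-bit word."""
    return pack_lanes([table[i] for i in indices])
```

With four 16-bit lanes, the packed word never needs more than 64 bits, which matches the 64-bit output width given in the example above.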
The execution units can be configured to form an interpolation module. For example, the control unit 215 can generate control signals for performing operation of a single instruction that cause some of the execution units to subtract one function value returned by the LUT 270 from a second function value, multiply the subtraction result by a scale portion of the input operand, shift the multiply result right, and add the shifted result to the first function value to generate an output value using, for example, the integer ALUs 251 and 253, and the shifter 257. In other examples, the interpolation module is implemented using dedicated adders, subtractors, multipliers, and/or shifters. In some examples, the control unit 215 pipelines a single instruction by performing some operations for the instruction in a first pipeline stage and performing other operations for the same instruction in one or more subsequent pipeline stages, such that execution of the other operations occurs during a different clock cycle than for the first pipeline stage operations. Intermediate results can be stored using the pipeline registers 265. In some examples, the control unit 215 is a general purpose control unit that also supervises operation of other instructions for the processor core 210. Thus, implementation of the single instruction can be integrated into a general-purpose processor core, reducing overhead and allowing for improved energy efficiency.
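The pipelining described above can be illustrated with a small cycle-by-cycle software model. This is a sketch under stated assumptions: the split into exactly two stages, the contents of the modelled pipeline register, the 8.8 operand format, and the function names `stage1`, `stage2`, and `run_pipeline` are all hypothetical choices made for illustration.

```python
def stage1(lut, index, frac):
    """First stage: two table reads and the subtract."""
    f0 = lut[index]
    f1 = lut[index + 1]
    # Values latched into the modelled pipeline register.
    return {"f0": f0, "delta": f1 - f0, "frac": frac}


def stage2(regs, frac_bits=8):
    """Second stage: multiply, shift right, and add."""
    scaled = (regs["delta"] * regs["frac"]) >> frac_bits
    return regs["f0"] + scaled


def run_pipeline(lut, operands, frac_bits=8):
    """Issue one operand per cycle; return (cycle, result) pairs as results retire."""
    frac_mask = (1 << frac_bits) - 1
    pipeline_reg = None
    results = []
    # One extra cycle drains the last operand out of the second stage.
    for cycle, operand in enumerate(list(operands) + [None]):
        if pipeline_reg is not None:
            results.append((cycle, stage2(pipeline_reg, frac_bits)))
        if operand is not None:
            pipeline_reg = stage1(lut, operand >> frac_bits, operand & frac_mask)
        else:
            pipeline_reg = None
    return results
```

In this model each result retires one cycle after its operand is issued, while a new operand can enter the first stage in every cycle.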
As shown in
As shown in
The examples of lookup tables disclosed herein (e.g. LUT 340) describe examples where a single index value is used to calculate an address for performing a table lookup. However, as will be readily understood to one of ordinary skill in the art, the lookup table can be addressed using multiple indices, for example two, three, or more indices, thereby forming a multi-dimensional lookup table.
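One way to address a table with two indices is to flatten them into a single address. The sketch below assumes row-major ordering, which the text does not specify, and the names `lut_address_2d`, `build_table_2d`, and `lookup_2d` are hypothetical.

```python
def lut_address_2d(row_index, col_index, num_cols):
    """Flatten two indices into one table address (row-major order, an assumption)."""
    return row_index * num_cols + col_index


def build_table_2d(func, num_rows, num_cols):
    """Fill a flat table with func(row, col) in the same row-major order."""
    return [func(r, c) for r in range(num_rows) for c in range(num_cols)]


def lookup_2d(table, row_index, col_index, num_cols):
    """Read the entry selected by the two indices."""
    return table[lut_address_2d(row_index, col_index, num_cols)]
```

The same pattern extends to three or more indices by multiplying each index by the product of the sizes of the dimensions that follow it.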
As shown in
It should be readily understood to one of ordinary skill in the art that the configuration of the functional units within each of the SIMD lanes (e.g. SIMD lane 380) can be varied. For example, instead of using general purpose ALUs such as ALUs 360, 365, and 375, dedicated adders, multipliers, or other circuits can be employed. Further, there are different circuit implementations that can be used to implement the shifter 370. Further, in some examples one or more sets of pipeline registers can be interposed between one or more of the functional units in order to add pipeline stages to the execution of the processing unit displayed in block diagram 300.
A second portion 420 of pseudocode describes performing table lookups and interpolations according to the disclosed technology. Two lookup table operations are performed: a first function value (LUT_A(x)) is looked up at a location specified by the index portion of a SIMD operand, and a second function value (LUT_B(x)) is looked up at an address specified by the index portion of the SIMD operand plus one. In some examples of the disclosed technology, a different offset can be used, for example, an offset specified by the user using a processor instruction, by storing a value in a particular register or memory location, or by using other suitable means for specifying the offset. Next, a delta (delta(x)) is calculated by subtracting the function value returned by the LUT_A lookup from the function value returned by the LUT_B lookup. The delta value, in turn, is multiplied by the fractional portion of the SIMD operand (scale(x)) (also referred to as the scale portion of the operand). The product is then shifted right by a number of bits based on the format of the input and stored as the scaled value (scaled(x)). For example, for an 8.8 format fixed-point value, the product is shifted right by 8 bits. The output value (output(x)) for the instruction is computed by adding the lookup value LUT_A to the result of the scaling operation.
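The per-lane steps of the pseudocode portion 420 can be traced in a few lines of Python. This is a sketch for one lane only, assuming an unsigned 16-bit operand in 8.8 format and a table with one extra entry so that the lookup at index plus offset stays in range; the function name `lut_interp` and the variable names are illustrative stand-ins for LUT_A, LUT_B, delta, scale, and scaled.

```python
FRAC_BITS = 8                      # 8.8 format: 8 index bits, 8 fractional bits
FRAC_MASK = (1 << FRAC_BITS) - 1


def lut_interp(lut, x, offset=1):
    """Interpolate one lane; x is an unsigned 8.8 fixed-point operand."""
    index = x >> FRAC_BITS         # index portion of the operand
    scale = x & FRAC_MASK          # fractional (scale) portion of the operand
    lut_a = lut[index]             # first function value
    lut_b = lut[index + offset]    # second function value
    delta = lut_b - lut_a
    # Python's >> floors negative products, like an arithmetic right shift.
    scaled = (delta * scale) >> FRAC_BITS
    return lut_a + scaled
```

For a table of squares (`lut[i] == i * i`) and the operand 0x0380 (3.5 in 8.8 format), the lookups return 9 and 16, the delta is 7, the scaled value is (7 × 128) >> 8 = 3, and the output is 12.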
A pseudocode portion 430 illustrates an example arrangement of output values that can be stored in a particular SIMD register. As will be readily understood to one of ordinary skill in the art, other arrangements of SIMD data are possible.
At process block 620, output values are generated by interpolating an output value based on the two or more function values for the input operand. For example an execution unit configured to include the interpolation module 387, as described above regarding
At process block 630, the method generates an output operand of the instruction based on the output value interpolated at process block 620. In some examples, additional processing is performed on the output value before generating an output operand. For example, additional shifting, sign calculation, or other suitable operations can be performed on the output value. The output operand can be stored in a number of different manners. For example, the output operand can be a register in the processor. Thus subsequent instructions executed by the processor can use the output value as stored in the corresponding register. In other examples, the output operand can be stored in memory, for example at an absolute, indexed, or indirect address, placed on a stack, or output as a signal.
Thus, the method outlined in the flowchart 600 can be used to perform a mathematical operation by executing a single processor instruction. For example, the function values produced by the lookup table are not visible at the architectural level. Similarly, intermediate values generated during interpolating of an output value can also be hidden from the programming model. Because the mathematical operation outlined in
At process block 710, an input operand of a single instruction is received, and a lookup table (LUT) offset is computed based on an index portion of the input operand. For example, for a 16-bit fixed-point number expressed in 8.8 format, the 8 most significant bits are used as the index portion. In some examples, the LUT offset is a constant (e.g. plus 1 or minus 1). In other examples, an offset is computed as a function of the index portion of the input operand, the fractional portion of the input operand, a mantissa of a floating point input operand based on a statically or dynamically configurable parameter, or by another operand of the single instruction. Once the LUT offset has been computed, the method proceeds to process block 720. In some examples, the single processor instruction includes a second operand specifying an offset from an index portion of the first input operand and that offset is used in performing at least one of the table lookups performed according to the disclosed method.
At process block 720, function values are generated by performing LUT lookups at addresses based on the index and on the index plus the offset computed at process block 710. For example, if an input operand is a fixed-point number 3.6, the LUT lookups can be performed at addresses corresponding to the numbers 3 and 4. As disclosed herein, the function values can be arbitrary, and in some examples can be set by the use of another processor instruction. Once two or more function values are generated by performing the LUT lookup, the method proceeds to process block 730. In some examples, an address used for performing a LUT lookup is based on an index portion of the input operand of a single processor instruction combined with the offset computed at process block 710. In some examples, the processor is configured to calculate an address for the lookup table based on additional considerations, which considerations can be specified by the control unit, by the single processor instruction, by configuring control registers of the processor, or other suitable methods for configuring lookup table address calculation. For example, an address calculated in performing a LUT lookup can be clamped above or below a certain value, wrapped past the end of the lookup table address range back to previous addresses of the lookup table, or limited such that only a portion but not all of the available address locations for the lookup table are used in addressing the lookup table. In some examples, the lookup table values can be updated dynamically as an execution thread is running.
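Splitting an operand into its index and fractional portions is a matter of a shift and a mask. The sketch below generalizes the 8.8 example to other field widths; the function names `split_operand` and `to_fixed`, the default widths, and the use of truncation when quantizing are assumptions made for illustration.

```python
def split_operand(x, total_bits=16, frac_bits=8):
    """Return the (index, fraction) fields of an unsigned fixed-point operand."""
    x &= (1 << total_bits) - 1                 # keep only the operand's bits
    index = x >> frac_bits                     # most significant bits
    fraction = x & ((1 << frac_bits) - 1)      # least significant bits
    return index, fraction


def to_fixed(value, frac_bits=8):
    """Quantize a non-negative real number to fixed point by truncation."""
    return int(value * (1 << frac_bits))
```

For instance, 3.5 quantizes to 0x0380 in 8.8 format, which splits into index 3 and fraction 128 (one half of 256).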
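The clamping, wrapping, and range-limiting behaviors mentioned above can be sketched as a single address calculation. The function name `lut_address`, the `mode` strings, and the `lo`/`hi` window parameters are hypothetical; the text does not prescribe how these options are encoded.

```python
def lut_address(index, offset, table_size, mode="clamp", lo=0, hi=None):
    """Compute a LUT address confined to the inclusive window [lo, hi].

    lo and hi default to the full table; a narrower window models using
    only a portion of the available address locations.
    """
    hi = table_size - 1 if hi is None else hi
    addr = index + offset
    if mode == "clamp":
        # Saturate at the edges of the window.
        return max(lo, min(addr, hi))
    if mode == "wrap":
        # Wrap past either end of the window back into it.
        return lo + (addr - lo) % (hi - lo + 1)
    raise ValueError("unknown mode: %r" % mode)
```

Clamping suits functions that saturate at the ends of the table, while wrapping suits periodic functions whose table covers exactly one period.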
The lookup table can be implemented using any suitable storage technology including DRAM, SRAM, registers, flip flops, latches, flash memory, or other suitable storage technology. As will be readily understood to one of ordinary skill in the relevant art, any arbitrary function can be programmed into the lookup table, for example trigonometric functions, including sine, cosine, tangent, as well as inverse versions of those trigonometric functions. Further, other mathematical functions such as square root, factorial, logarithms, or other suitable mathematical functions can be implemented. Furthermore, table lookups for use in applications such as audio or video processing, encryption, pattern recognition, image processing, or other suitable application can be used.
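As an illustration of programming a function into the table, the sketch below fills a table with one period of sine and evaluates it by linear interpolation. The table size of 256 intervals, the output scaling by 16384, the extra final entry, and the names `build_sine_lut` and `sine_interp` are assumptions chosen for this example, not values taken from the disclosure.

```python
import math

FRAC_BITS = 8
TABLE_INTERVALS = 256    # assumed: one table interval per 8-bit index value
OUTPUT_SCALE = 16384     # assumed output scaling (2**14)


def build_sine_lut():
    """Sample one period of sine; the extra last entry keeps index + 1 in range."""
    return [round(math.sin(2 * math.pi * i / TABLE_INTERVALS) * OUTPUT_SCALE)
            for i in range(TABLE_INTERVALS + 1)]


def sine_interp(lut, phase):
    """phase is an unsigned 8.8 value; 65536 phase units make one full turn."""
    index = phase >> FRAC_BITS
    frac = phase & ((1 << FRAC_BITS) - 1)
    f0 = lut[index]
    delta = lut[index + 1] - f0
    return f0 + ((delta * frac) >> FRAC_BITS)
```

Dividing the result by `OUTPUT_SCALE` gives an approximation of the sine of the phase angle; the accuracy depends on the table size and output scaling chosen, which illustrates the trade between accuracy and compute cost.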
At process block 730, a difference is computed between the two function values. For example, the function value returned by the lookup at index can be subtracted from the function value returned by the LUT lookup at the address corresponding to index plus offset. In other examples, different techniques for computing differences can be used, including but not limited to: bit-wise comparisons, addition, subtraction, multiplication, and/or division, or other mathematical operations. In some examples, the difference is computed by retrieving a value from a lookup table. Once the difference is computed, the method proceeds to process block 740. The difference in function values can be computed using an ALU, or a dedicated adder or subtractor.
At process block 740, the difference computed at process block 730 is multiplied by a scale portion of the input operand of the single instruction. For example, if the scale portion is designated as the fractional portion of the input operand, that portion is multiplied by the difference computed at process block 730. In some examples, the scale portion of the operand is expressed as a fractional binary number. In other examples, a different format of the scale portion is used. Once the difference is multiplied by the scale portion, the method proceeds to process block 750. The product computed at process block 740 can be computed using an ALU, a dedicated multiplier, a shifter, or other suitable logic circuit. After multiplying the difference by the scale portion of the operand, the result can then be shifted by a number of bits equal to one-half the width of the input operand. For example, if the input operand is a 16-bit, 8.8 fixed-point number, then the scaled result is logically shifted to the right by 8 bits. In other examples, a function other than logical right shift is applied to the scaled result (e.g., in examples where interpolation is non-linear). This scaled result can be used by the addition performed at process block 750.
At process block 750, the scaled result generated at process block 740 is added to the function value returned by the table lookup at the address corresponding to the index of the input operand of the single instruction. In other examples, a different mathematical function can be used, for example, subtraction or a bit-wise operation. By adding the scaled result to the function value corresponding to the input operand, an output result value is generated. Once one or more of these output result values are generated, the method proceeds to process block 760. The output result value can be generated using an ALU, a dedicated adder, or other suitable logic circuit.
At process block 760, the output result value generated at process block 750 is saved as at least one output operand of the single instruction. For example, the output result value can be stored in a processor register, or at a memory location, which location can be designated using an absolute, relative, indexed address, or other suitable manner of specifying a location to write the output operand. Thus, a complex mathematical operation can be performed using a single processor instruction.
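The arithmetic of process blocks 730 through 750 can be summarized in one expression. The symbols here are notational assumptions: f is the table, i the index portion, o the offset, s the scale portion, and n the number of bits shifted; the floor models the right shift. The worked line uses the operand 3.6 in 8.8 format (i = 3, s = 153) with an assumed table of squares, so f[3] = 9 and f[4] = 16.

```latex
\[
\text{output} \;=\; f[i] \;+\; \left\lfloor \frac{\bigl(f[i+o] - f[i]\bigr)\, s}{2^{n}} \right\rfloor
\]
\[
9 + \left\lfloor \frac{(16 - 9)\cdot 153}{2^{8}} \right\rfloor
\;=\; 9 + \left\lfloor \frac{1071}{256} \right\rfloor
\;=\; 9 + 4 \;=\; 13
\]
```

The exact square of 3.6 is 12.96, so the interpolated result of 13 is within one unit of the true value in this example.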
In some examples, the input operand is a scalar value of the single instruction while in other examples, multiple input operands, for example as in a vector processor or SIMD processor, are used so as to allow processing of multiple operands simultaneously for one single instruction. Similarly, the output operand of the method generated at process block 760 can also be a scalar, a vector, or a SIMD register value.
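The vector case can be modelled by applying the same lane computation to every lane of a packed register. In this sketch a Python loop stands in for lanes that hardware could process simultaneously; the function name `simd_lut_interp`, the four 16-bit lanes in 8.8 format, the lane ordering (lane 0 in the low bits), and the constant offset of one are all assumptions for illustration.

```python
def simd_lut_interp(lut, src_reg, lanes=4, lane_bits=16, frac_bits=8):
    """Apply lookup-and-interpolate to every lane of a packed source register."""
    lane_mask = (1 << lane_bits) - 1
    frac_mask = (1 << frac_bits) - 1
    dst_reg = 0
    for lane in range(lanes):            # stands in for lanes operating in parallel
        shift = lane * lane_bits
        x = (src_reg >> shift) & lane_mask
        index, frac = x >> frac_bits, x & frac_mask
        f0 = lut[index]
        delta = lut[index + 1] - f0
        result = f0 + ((delta * frac) >> frac_bits)
        # Truncate to the lane width and place in the same lane of the output.
        dst_reg |= (result & lane_mask) << shift
    return dst_reg
```

A caller sees only the packed source and destination values; the per-lane function values, deltas, and scaled products stay inside the function, much as the intermediate values of the single instruction are not exposed to the programmer.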
It will be readily understood to one of ordinary skill in the relevant art that intermediate values produced while performing the method outlined in the flowchart 700 may not be architecturally visible. In other words, certain values such as the function values generated at process block 720, the difference computed at process block 730, the multiply result produced at process block 740, or other intermediate values may not be visible to the programmer. This is because the method of
In some examples, after performing the method outlined in
In some examples, a method includes transforming one or more source code or assembly code instructions into processor instructions that are executable by the processor and emitting transformed processor instructions as object code for the processor. The object code includes at least one single processor instruction that when executed by the processor causes the processor to perform the method outlined in
With reference to
A computing system may have additional features. For example, the computing system 800 includes storage 840, one or more input devices 850, one or more output devices 860, and one or more communication connections 870. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system 800. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing system 800, and coordinates activities of the components of the computing system 800.
The tangible storage 840 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing system 800. The storage 840 stores instructions for the software 880 implementing one or more innovations described herein.
The input device(s) 850 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system 800. For video encoding, the input device(s) 850 may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system 800. The output device(s) 860 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing system 800.
The communication connection(s) 870 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.
The innovations can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing system on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing system.
The terms “system” and “device” are used interchangeably herein. Unless the context clearly indicates otherwise, neither term implies any limitation on a type of computing system or computing device. In general, a computing system or computing device can be local or distributed, and can include any combination of special-purpose hardware and/or general-purpose hardware with software implementing the functionality described herein.
For the sake of presentation, the detailed description uses terms like “determine” and “use” to describe computer operations in a computing system. These terms are high-level descriptions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.
The illustrated mobile device 900 can include a controller or processor 910 (e.g., signal processor, microprocessor, ASIC, or other control and processing logic circuitry) for performing such tasks as signal coding, data processing, input/output processing, power control, and/or other functions, including instructions for implementing lookup tables and single instructions for calculating using the lookup tables disclosed herein. An operating system 912 can control the allocation and usage of the components 902 and support for one or more application programs 914. The application programs can include common mobile computing applications (e.g., email applications, calendars, contact managers, web browsers, messaging applications), or any other computing application. Functionality 913 for accessing an application store can also be used for acquiring and updating application programs 914.
The illustrated mobile device 900 can include memory 920. Memory 920 can include non-removable memory 922 and/or removable memory 924. The non-removable memory 922 can include RAM, ROM, flash memory, a hard disk, or other well-known memory storage technologies. The removable memory 924 can include flash memory or a Subscriber Identity Module (SIM) card, which is well known in GSM communication systems, or other well-known memory storage technologies, such as “smart cards.” The memory 920 can be used for storing data and/or code for running the operating system 912 and the applications 914. Example data can include web pages, text, images, sound files, video data, or other data sets to be sent to and/or received from one or more network servers or other devices via one or more wired or wireless networks. The memory 920 can be used to store a subscriber identifier, such as an International Mobile Subscriber Identity (IMSI), and an equipment identifier, such as an International Mobile Equipment Identifier (IMEI). Such identifiers can be transmitted to a network server to identify users and equipment.
The mobile device 900 can support one or more input devices 930, such as a touchscreen 932, microphone 934, camera 936, physical keyboard 938, trackball 940, and/or motion sensor 942; and one or more output devices 950, such as a speaker 952 and a display 954. Other possible output devices (not shown) can include piezoelectric or other haptic output devices. Some devices can serve more than one input/output function. For example, touchscreen 932 and display 954 can be combined in a single input/output device.
The input devices 930 can include a Natural User Interface (NUI). An NUI is any interface technology that enables a user to interact with a device in a “natural” manner, free from artificial constraints imposed by input devices such as mice, keyboards, remote controls, and the like. Examples of NUI methods include those relying on speech recognition, touch and stylus recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, voice and speech, vision, touch, gestures, and machine intelligence. Other examples of a NUI include motion gesture detection using accelerometers/gyroscopes, facial recognition, 3-D displays, head, eye, and gaze tracking, immersive augmented reality and virtual reality systems, all of which provide a more natural interface, as well as technologies for sensing brain activity using electric field sensing electrodes (EEG and related methods). Thus, in one specific example, the operating system 912 or applications 914 can comprise speech-recognition software as part of a voice user interface that allows a user to operate the device 900 via voice commands. Further, the device 900 can comprise input devices and software that allows for user interaction via a user's spatial gestures, such as detecting and interpreting gestures to provide input to a gaming application.
A wireless modem 960 can be coupled to an antenna (not shown) and can support two-way communications between the processor 910 and external devices, as is well understood in the art. The modem 960 is shown generically and can include a cellular modem for communicating with the mobile communication network 904 and/or other radio-based modems (e.g., Bluetooth 964 or Wi-Fi 962). The wireless modem 960 is typically configured for communication with one or more cellular networks, such as a GSM network for data and voice communications within a single cellular network, between cellular networks, or between the mobile device and a public switched telephone network (PSTN).
The mobile device can further include at least one input/output port 980, a power supply 982, a satellite navigation system receiver 984, such as a Global Positioning System (GPS) receiver, an accelerometer 986, and/or a physical connector 990, which can be a USB port, IEEE 1394 (FireWire) port, and/or RS-232 port. The illustrated components 902 are not required or all-inclusive, as any components can be deleted and other components can be added.
In example environment 1000, the cloud 1010 provides services for connected devices 1030, 1040, 1050 with a variety of screen capabilities. Connected device 1030 represents a device with a computer screen 1035 (e.g., a mid-size screen). For example, connected device 1030 could be a personal computer such as desktop computer, laptop, notebook, netbook, or the like. Connected device 1040 represents a device with a mobile device screen 1045 (e.g., a small size screen). For example, connected device 1040 could be a mobile phone, smart phone, personal digital assistant, tablet computer, and the like. Connected device 1050 represents a device with a large screen 1055. For example, connected device 1050 could be a television screen (e.g., a smart television) or another device connected to a television (e.g., a set-top box or gaming console) or the like. One or more of the connected devices 1030, 1040, and/or 1050 can include touchscreen capabilities. Touchscreens can accept input in different ways. For example, capacitive touchscreens detect touch input when an object (e.g., a fingertip or stylus) distorts or interrupts an electrical current running across the surface. As another example, touchscreens can use optical sensors to detect touch input when beams from the optical sensors are interrupted. Physical contact with the surface of the screen is not necessary for input to be detected by some touchscreens. Devices without screen capabilities also can be used in example environment 1000. For example, the cloud 1010 can provide services for one or more computers (e.g., server computers) without displays.
Services can be provided by the cloud 1010 through service providers 1020, or through other providers of online services (not depicted). For example, cloud services can be customized to the screen size, display capability, and/or touchscreen capability of a particular connected device (e.g., connected devices 1030, 1040, 1050).
In example environment 1000, the cloud 1010 provides the technologies and solutions described herein to the various connected devices 1030, 1040, 1050 using, at least in part, the service providers 1020. For example, the service providers 1020 can provide a centralized solution for various cloud-based services. The service providers 1020 can manage service subscriptions for users and/or devices (e.g., for the connected devices 1030, 1040, 1050 and/or their respective users).
In some examples of the disclosed technology, an apparatus includes a processor configured to execute one processor instruction having an input operand with the processor by producing two or more function values by performing two or more table lookups based at least in part on the input operand, generating an output value based on the two or more function values, and producing the output value as an output operand of the one processor instruction. In some examples, the output value is generated based at least in part on interpolating the two or more function values.
In some examples of the apparatus, the input operand is expressed as a fixed-point number including an index portion and a fractional portion, and the generating includes interpolating the two or more function values and scaling, by the fractional portion, a difference computed between at least two of the two or more function values. In some examples, the input operand is expressed as a fixed-point number including an index portion and a fractional portion, and the index portion of the input operand is used to form an address for performing the two or more table lookups. In some examples, the input operand includes a portion of a vector of two or more input operands and the one processor instruction executes to process the vector, a respective set of two or more function values are produced for each of the two or more input operands of the vector, output values are interpolated and produced for each respective set of two or more function values, and the one processor instruction produces output values as a vector output operand.
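As a worked illustration of the fixed-point arithmetic described above, the following sketch splits an operand into its index and fractional portions, performs two lookups, and scales the difference by the fractional portion. The function name `lut_interp`, the 8-bit fractional width, and the table of squares are assumptions introduced here for illustration; the text does not fix any of them.

```python
FRAC_BITS = 8  # assumed: low 8 bits are the fractional portion


def lut_interp(table, x):
    """Software sketch of one lookup-and-interpolate operation."""
    index = x >> FRAC_BITS                # index portion forms the table address
    frac = x & ((1 << FRAC_BITS) - 1)     # fractional portion, in units of 1/256

    f0 = table[index]                     # first function value
    f1 = table[index + 1]                 # second function value (adjacent entry)

    diff = f1 - f0                        # difference between the function values
    scaled = (diff * frac) >> FRAC_BITS   # difference scaled by the fractional portion
    return f0 + scaled                    # interpolated output value


# Hypothetical table holding f(i) = i * i for i = 0..16.
SQUARES = [i * i for i in range(17)]

exact = lut_interp(SQUARES, 3 << FRAC_BITS)             # operand 3.0
halfway = lut_interp(SQUARES, (3 << FRAC_BITS) | 128)   # operand 3.5
```

Here the operand 3.5 yields 12, the integer part of the linear interpolation between the table entries 9 and 16. In this sketch the intermediate names `f0`, `f1`, `diff`, and `scaled` are local to the function, which loosely mirrors intermediate values that are not architecturally visible.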
In some examples, the one processor instruction includes a second operand specifying an offset from an index portion of the first input operand, and the offset is used to perform at least one of the two or more table lookups. In some examples, the two or more function values are not architecturally visible. In some examples, the processor is further configured to execute another processor instruction that stores values in a lookup table, the lookup table being used for providing the two or more function values produced by performing the two or more table lookups.
In some examples, the processor is further configured to, after executing the one processor instruction, execute one or more processor instructions that cause the processor to store at least one different value in a lookup table that was used for the two or more table lookups, and execute a third, single processor instruction having a second input operand with the processor by: producing two or more second function values by performing two or more table lookups in the lookup table based at least in part on the second operand, interpolating a second output value based on the two or more second function values, and producing the second output value as a second output operand of the third processor instruction.
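The sequence described above, in which a lookup table is updated between two executions of the interpolating instruction, can be sketched in software as follows. The class `LutUnit` and its methods `store` and `interp` are hypothetical stand-ins for the table-store instruction and the single interpolating instruction; the 8-bit fractional width is likewise an assumption.

```python
class LutUnit:
    """Hypothetical model of a lookup table plus an interpolating instruction."""

    def __init__(self, size, frac_bits=8):
        self.table = [0] * size
        self.frac_bits = frac_bits

    def store(self, address, value):
        """Stand-in for a separate instruction that stores a value in the table."""
        self.table[address] = value

    def interp(self, x):
        """Stand-in for the single lookup-and-interpolate instruction."""
        index = x >> self.frac_bits
        frac = x & ((1 << self.frac_bits) - 1)
        f0, f1 = self.table[index], self.table[index + 1]
        return f0 + (((f1 - f0) * frac) >> self.frac_bits)


unit = LutUnit(size=3)
for address, value in enumerate([0, 100, 200]):
    unit.store(address, value)      # preload the table

first = unit.interp(128)            # operand 0.5 against the original table

unit.store(1, 40)                   # store a different value in the same table
second = unit.interp(128)           # same operand, now using the updated entry
```

In this sketch the same operand produces 50 before the update and 20 after it, because the second execution reads the newly stored entry.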
In some examples of the disclosed technology, an apparatus including a processor includes: a lookup table configured to return one or more function values based on one or more input operands of a processor instruction, a control unit configured to execute the instruction by acts including addressing the lookup table based at least in part on the one or more input operands, and an interpolation module configured to interpolate at least one output value based on two or more of the returned function values.
In some examples, the apparatus further includes a load store unit configured to store the output value in memory and/or a processor register specified by an output operand of the processor instruction.
In some examples, the input operands are vector operands, and the at least one output value is stored in a processor register as a vector operand. In some examples, the processor is configured to execute one or more of the following: vector instructions, single instruction multiple data (SIMD) instructions, multiple instruction multiple data (MIMD) instructions, and/or graphics processing unit (GPU) instructions. In some examples, addressing the lookup table includes performing one or more of the following when calculating an address for the lookup table to return at least one of the function values: clamping the address, wrapping the address, or limiting the address to a portion but not all available address locations for the lookup table. In some examples, the interpolation module includes one or more of the following: an adder, a multiplier, and/or a shifter.
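The addressing options named above (clamping, wrapping, and limiting the address to a portion of the table) can be sketched as follows. The function `form_address`, its `mode` strings, and the `base`/`length` parameters describing the permitted portion are hypothetical; the text does not specify how these behaviors are selected or encoded, and the "limit" behavior shown is only one possible interpretation.

```python
def form_address(index, table_size, mode="clamp", base=0, length=None):
    """Map a raw index to a valid lookup-table address (illustrative only)."""
    if mode == "clamp":
        # Out-of-range indices saturate at the first or last entry.
        return max(0, min(index, table_size - 1))
    if mode == "wrap":
        # Indices wrap around modulo the table size (useful for periodic functions).
        return index % table_size
    if mode == "limit":
        # Assumed interpretation: restrict addresses to the window
        # [base, base + length - 1], a portion of the full table.
        if length is None or base < 0 or base + length > table_size:
            raise ValueError("window must lie inside the table")
        return base + max(0, min(index, length - 1))
    raise ValueError(f"unknown mode: {mode}")


clamped = form_address(9, 8, mode="clamp")
wrapped = form_address(9, 8, mode="wrap")
limited = form_address(6, 8, mode="limit", base=2, length=4)
```

Wrapping may suit tables that hold one period of a trigonometric function, while clamping may suit functions that saturate outside the tabulated range; both remarks are illustrative observations rather than requirements of the disclosed technology.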
In some examples of the disclosed technology, a method includes transforming one or more source code or assembly code instructions into processor instructions executable by the processor and emitting object code for the processor instructions, the processor instructions including the single instruction that, when executed by the processor, causes the processor to perform a method including producing two or more function values by performing two or more table lookups based at least in part on the input operand, generating an output value based on the two or more function values, and producing the output value as an output operand of the one processor instruction. In some examples of the method, the input operand and the output operand are vectors of fixed-point data. In some examples, the method further includes executing one or more instructions different than the single instruction to store values in one or more lookup tables, and the two or more table lookups produce function values based at least in part on the stored values in the one or more lookup tables.
In some examples of the disclosed technology, a method includes transforming one or more source code or assembly code instructions into processor instructions executable by the processor and emitting object code for the processor instructions, the processor instructions including the single instruction that, when executed by the processor, causes the processor to perform a method including producing two or more function values by performing two or more table lookups based at least in part on the input operand, generating an output value based on the two or more function values, and producing the output value as an output operand of the one processor instruction. For example, the processor instructions can be executed by any of the exemplary apparatus disclosed herein.
In some examples of the disclosed technology, one or more computer-readable storage media store computer-executable instructions that, when executed by a processor, cause the processor to perform a method including producing two or more function values by performing two or more table lookups based at least in part on an input operand of one processor instruction, generating an output value based on the two or more function values, and producing the output value as an output operand of the one processor instruction. In some examples, the computer-readable storage media store instructions for transforming one or more source code or assembly code instructions into processor instructions executable by the processor and emitting object code for the processor instructions, including a single instruction that causes a processor to perform a method including producing two or more function values by performing two or more table lookups based at least in part on the input operand, and generating an output value based on the two or more function values.
In view of the many possible embodiments to which the principles of the disclosed subject matter may be applied, it should be recognized that the illustrated embodiments are only preferred examples and should not be taken as limiting the scope of the claims to those preferred examples. Rather, the claimed subject matter is defined by the following claims. We therefore claim as our invention all that comes within the scope of these claims.