A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This application is related to commonly-assigned U.S. patent application Ser. No. ______ entitled “CELL PROCESSOR” to John P. Bates, Payton R. White and Attila Vass, which is filed the same day as the present application, the entire disclosures of which are incorporated herein by reference.
This application is also related to commonly-assigned U.S. patent application Ser. No. ______ entitled “CELL PROCESSOR TASK AND DATA MANAGEMENT” to Richard B. Stenson and John P. Bates, which is filed the same day as the present application, the entire disclosures of which are incorporated herein by reference.
This application is also related to commonly-assigned U.S. patent application Ser. No. ______ entitled “OPERATING CELL PROCESSORS OVER A NETWORK” to Tatsuya Iwamoto, which is filed the same day as the present application, the entire disclosures of which are incorporated herein by reference.
This application is also related to commonly-assigned U.S. patent application Ser. No. ______ entitled “METHOD AND SYSTEM FOR PERFORMING MEMORY COPY FUNCTION ON A CELL PROCESSOR” to Antoine Labour, John P. Bates and Richard B. Stenson, which is filed the same day as the present application, the entire disclosures of which are incorporated herein by reference.
This invention generally relates to parallel processing and more particularly to managing tasks in cell processors.
A major advance in electronic computation has been the development of systems that can perform multiple operations simultaneously. Such systems are said to perform parallel processing. Recently, cell processors have been developed to implement parallel processing on electronic devices ranging from handheld game devices to mainframe computers. A typical cell processor has a power processor unit (PPU) and up to 8 additional processors referred to as synergistic processing units (SPU). Each SPU is typically a single chip or part of a single chip containing a main processor and a co-processor. All of the SPUs and the PPU can access a main memory, e.g., through a memory flow controller (MFC). The SPUs can perform parallel processing of operations in conjunction with a program running on the main processor. A small local memory (typically about 256 kilobytes) is associated with each of the SPUs. This memory must be managed by software to transfer code and data to and from the local SPU memories.
The SPUs have a number of advantages in parallel processing applications. For example, the SPUs are independent processors that can execute code with minimal involvement from the PPU. Each SPU has a high direct memory access (DMA) bandwidth to RAM. An SPU can typically access the main memory faster than the PPU can. In addition, each SPU has relatively fast access to its associated local store. The SPUs also have limitations that can make it difficult to optimize SPU processing. For example, the SPUs cannot implement symmetric multiprocessing (SMP), have no shared memory and have no hardware cache. In addition, common programming models do not work well on the SPUs.
A typical SPU process involves retrieving code and/or data from the main memory, executing the code on the SPU to manipulate the data, and outputting the data to main memory or, in some cases, to another SPU. To achieve high SPU performance it is desirable to optimize the above SPU process in relatively complex processing applications. For example, in applications such as computer graphics processing, SPUs typically execute tasks thousands of times per frame. A given task may involve varying SPU code and varying numbers and sizes of data blocks. For high performance, it is desirable to manage the transfer of SPU code and data from SPU software with little PPU software involvement. There are many techniques for managing code and data from the SPU. Often, different techniques for managing code and data from the SPU need to operate simultaneously on a cell processor. There are many programming models for SPU-driven task management. Unfortunately, no single task system is right for all applications.
One prior art task management system used for cell processors is known as SPU Threads. A “thread” generally refers to a part of a program that can execute independently of other parts. Operating systems that support multithreading enable programmers to design programs whose threaded parts can execute concurrently. SPU Threads operates by regarding the SPUs in a cell as processors for threads. A context switch is the computing process of storing and restoring the state of an SPU or PPU (the context) such that multiple processes can share a single resource. A context switch may swap out the contents of an SPU's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. Context switches are usually computationally intensive, and much of the design of operating systems is directed to optimizing the use of context switches.
Unfortunately, interoperating with SPU Threads is not an option for high-performance applications. Applications based on SPU Threads have large bandwidth requirements and are processed from the PPU. Consequently, SPU Threads-based applications are not autonomous and tend to be slow. Because SPU Threads are managed from the PPU, SPU context switching (swapping out the current running process on an SPU for another waiting process) takes too long. Avoiding PPU involvement in SPU management can lead to much better performance for certain applications.
To overcome these problems, a system referred to as SPU Runtime System (SPURS) was developed. In SPURS, a kernel is loaded into the memory of each SPU that performs scheduling of the tasks handled by that SPU. Unfortunately, SPURS, like SPU Threads, uses context switches to swap work in and out of the SPUs. The work is performed on the SPUs rather than the PPU, so that, unlike in SPU Threads, there is autonomy of processing. However, SPURS suffers from the same context switch overhead as SPU Threads. Thus, although SPURS provides autonomy, it is not suitable for many use cases.
SPURS is just one example of an SPU task system. Middleware and applications will require various task systems for various purposes. Currently, SPURS runs as a group of SPU Threads, so that it can interoperate with other SPU Threads. Unfortunately, as stated above, SPU Threads has undesirable overhead, so using it for the interoperation of SPU task systems is not an option for certain high-performance applications.
In cell processing, it is desirable for middleware and applications to share SPUs using various task systems. It is desirable to provide resources to many task classes, e.g., audio, graphics, artificial intelligence (AI) or physics such as cloth modeling, fluid modeling, or rigid body dynamics. To do this efficiently the programming model needs to manage both code and data. It is a challenge to get SPU middleware to interoperate when there is no common task system. Unfortunately, SPU Threads and SPURS follow the same programming model and neither model provides enough performance for many use cases. Thus, application developers still have to figure out how to share the limited memory space on the SPUs between code and data.
Thus, there is a need in the art for a cell processor method and apparatus that overcomes the above disadvantages. It would be desirable to implement SPU task management using a software model that is easy to use and that stresses the SPUs' merits. It would also be desirable to be able to implement SMP with software code and/or data cached on the SPU.
Embodiments of the present invention are directed to task management in a cell processor having a main memory, one or more power processor units (PPU) and one or more synergistic processing units (SPU), each SPU having a processor and a local memory, and in particular to a method for managing tasks to be executed by one or more of the SPUs. An SPU task manager (STM) running on one or more of the SPUs reads one or more task definitions stored in the main memory into the local memory of a selected SPU. Based on information contained in the task definitions, the selected SPU loads code and/or data related to the task definitions from the main memory into its associated local memory. The selected SPU then performs one or more tasks using the code and/or data.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the invention. Accordingly, the exemplary embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention.
It is desirable for embodiments of the present invention to achieve high performance with a cell processor. Preferably, SPU task management according to embodiments of the present invention is complete, i.e., it works for all use cases, and is scalable, i.e., performance scales with the number of SPUs. In addition, it is desirable for embodiments of the present invention to implement SPU task management efficiently, with low PPU usage, low memory usage and low DMA bandwidth usage.
The PPU 102 acts as a controller for the SPUs 104, which handle most of the computational workload. The PPU 102 may also be used to run conventional operating systems if it is sufficiently similar to other 64-bit PowerPC processors, and if the SPUs 104 are designed for vectorized floating point code execution. By way of example, the PPU 102 may contain a 32 KiB instruction and data Level 1 cache and a 512 KiB level 2 cache.
The PPU 102, SPUs 104 and main memory 106 can exchange code and data with each other over an exchange interface bus (EIB) 103. The PPU 102 and SPUs 104 can also exchange code and data stored in the main memory 106, e.g., via the EIB 103 and a memory flow controller (MFC) 108 such as a direct memory access (DMA) unit or the like. The EIB 103 may be a circular bus having two channels in opposite directions. The EIB 103 may also be connected to the Level 2 cache, the MFC 108, and a system interface 105 such as a FlexIO for external communications.
Each SPU 104 includes a local memory 110. Code and data obtained from the main memory 106 can be loaded into the local memory 110 so that the SPU 104 can process tasks. As shown in the inset, a software manager referred to herein as an SPU Task Manager (STM) 112 resides in the local memory 110 of each SPU 104. Preferably, the STM 112 takes up only a small fraction of the total memory space available in each local memory 110. The heart of the STM 112 is referred to as an “STM Kernel”, which typically takes up about 16 KB resident on each SPU. For a 256 KB local store, this represents about 6% of the SPU local store usage.
By way of example, policy modules and work queues may be associated as follows. As shown in the lower inset in
When the task queues 116 are empty, the STM kernel on each SPU 104 waits on an atomic reservation lost event. The SPUs 104 notify the atomic mutex 117 of completion of “checked” tasks. By way of example, the atomic mutex may include 4 bytes used for a lock state, 2 bytes used for a completed task count and 122 bytes containing states for up to 488 tasks. The 122 bytes may include two bits per task: 1 for reservation and 1 for the state (e.g., waiting, processing or completed). Notification should be used sparingly. STM tasks can optionally notify a waiting PPU thread using the SPU Threads event queue. The latency for this technique (the time it takes from when the SPU sends the event to when the PPU thread is notified), however, can be significantly longer, e.g., about 100 times longer, than atomic notification.
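By way of illustration only, the 128-byte atomic described above might be laid out as in the following C++ sketch. The field names and bit packing are assumptions made for illustration, not the actual STM data structure.

#include <cstdint>

// Hypothetical layout of the 128-byte atomic mutex described above; field
// names and packing are illustrative only, not the actual STM structure.
struct TaskQueueAtomic {
    uint32_t lock;        // 4 bytes used for the lock state
    uint16_t completed;   // 2 bytes used for the completed task count
    uint8_t  state[122];  // 122 bytes: two bits per task, up to 488 tasks
};
static_assert(sizeof(TaskQueueAtomic) == 128, "fits one 128-byte atomic line");

// Extract the two-bit record for a task: bit 0 = reservation, bit 1 = state.
inline unsigned taskBits(const TaskQueueAtomic& a, unsigned task) {
    return (a.state[task / 4] >> ((task % 4) * 2)) & 0x3u;
}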
The task definitions 118 may include pointers to memory locations containing task parameters 120 and an SPU task code image 122. The code image 122 may be in the form of one or more executable and linkable format (ELF) images of the requisite code. The task parameters 120 may include information related to the task, including, but not limited to, input/output (I/O) addresses, I/O sizes, addresses for input and output task data 123 and the like. The STM kernel 112 loads the code 124 and parameters 120 into the SPU 104 using the code image 122, where they are stored as context data 126. The SPU 104 can then run the code 124 to load and process the task data 123. The main memory 106 may include an optional shared output buffer 115 to accommodate SPU programs having varying output data size. When such a task completes, the PPU 102 can retrieve its output data through the STM PPU application programming interface (API).
Many of the features described herein can be implemented through appropriate configuration of the SPU kernel 112. In embodiments of the present invention there is no PPU runtime for the STM kernel 112. In general the STM kernel 112 gets task definitions 118 from the shared task queues 116 in main memory 106. The size of a task queue 116 varies depending on usage. Each time a task is added to a queue, it will execute once without interruption. Multiple task queues 116 can be created and grouped into one or more task sets 114. Each task queue 116 can be assigned a priority. The STM kernel 112 can select higher priority queues for processing before lower priority queues. When processing queues of equal priority, the SPUs will try to work on different queues to reduce contention. If a higher priority queue becomes ready, the next available SPU will begin processing it.
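The queue selection policy just described may be sketched in C++ as follows. The structure fields and function are assumptions made for illustration, not part of the STM interface.

// Illustrative queue selection: choose the highest priority non-empty queue
// and, among queues of equal priority, prefer the queue with the fewest SPUs
// already working on it, to reduce contention.
struct QueueInfo {
    int  priority;      // lower value means higher priority
    int  spusWorking;   // number of SPUs currently pulling tasks from this queue
    bool empty;         // true if the queue currently holds no task definitions
};

int selectQueue(const QueueInfo* q, int count) {
    int best = -1;
    for (int i = 0; i < count; ++i) {
        if (q[i].empty) continue;
        if (best < 0 ||
            q[i].priority < q[best].priority ||
            (q[i].priority == q[best].priority &&
             q[i].spusWorking < q[best].spusWorking)) {
            best = i;
        }
    }
    return best;  // -1 means all queues are empty; wait on a reservation lost event
}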
Table I represents one possible task definition, among others. The particular contents of the task definition data structure may vary from that of Table I. For example, the task parameters are optional. Furthermore, if a task does not require synchronization, barrier tag group information is not required.
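By way of illustration, a task definition along the lines described above might be expressed as the following C++ structure. The field names, sizes and ordering here are assumptions; Table I gives one possible actual layout.

#include <cstdint>

// Illustrative task definition; field names and sizes are assumptions.
struct TaskDefinition {
    uint64_t codeImageAddr;   // main memory address of the SPU program ELF image 122
    uint64_t parametersAddr;  // main memory address of the task parameters 120
    uint64_t inputAddr;       // main memory address of the input task data 123
    uint64_t outputAddr;      // output address (or shared output buffer 115)
    uint32_t inputSize;       // input DMA size in bytes
    uint32_t outputSize;      // output DMA size in bytes
    uint32_t barrierTag;      // optional barrier tag group (omitted if no synchronization)
    uint32_t flags;           // optional flags, e.g., notify the PPU on completion
};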
When the STM kernel 112 needs more tasks, it DMAs a number of task definitions from the front of the task queue. The task queues 116 may be circular, and can dynamically grow when tasks are added from the PPU 102 or an SPU 104. In a circular queue, tasks are added to the end of the queue and taken from the beginning. The tasks fill up the space available and then “wrap around” to occupy memory space that becomes available as tasks are removed from the front of the queue. The task queue may use an atomic mutex 117 to synchronize access to each queue. By way of example, the atomic mutex may be a 128-byte atomic mutex. Pointers and indices for the task queue 116 can be stored in this atomic. The atomic mutex 117 generally includes one or more bits that indicate whether access to the task queue 116 is locked or not. The mutex 117 may also include one or more bytes of data that provide information about what other tasks in the task queue are in progress and/or the location of those tasks. The mutex 117 may also include one or more bytes for a counter that can be incremented or decremented to notify other SPUs 104 or the PPU 102 which tasks in the task queue 116 have been taken.
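By way of illustration, the wrap-around behavior of such a circular queue can be sketched in C++ as follows. In an actual implementation the indices would be read and updated under the atomic mutex 117; this single-threaded sketch shows only the index arithmetic.

#include <cstdint>

// Sketch of circular-queue index handling; in practice the head and tail
// indices live inside the atomic mutex 117 and are updated atomically.
struct QueueIndices {
    uint32_t head;      // index of the next task definition to be taken
    uint32_t tail;      // index at which the next task definition will be added
    uint32_t capacity;  // number of task definition slots in the ring
};

// Reserve up to `want` task definitions from the front of the queue; returns
// how many were actually taken and the index of the first one.
uint32_t takeTasks(QueueIndices& q, uint32_t want, uint32_t& firstIndex) {
    uint32_t available = (q.tail + q.capacity - q.head) % q.capacity;
    uint32_t n = (want < available) ? want : available;
    firstIndex = q.head;
    q.head = (q.head + n) % q.capacity;   // wrap around to reuse freed slots
    return n;
}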
With many independent tasks, the performance of the processor 100 tends to scale linearly with the number of SPUs 104. No change to application data management is necessary when changing the number of allocated SPUs 104. The SPUs 104 automatically load balance by getting more tasks whenever they run out. With multiple task queues 116, contention overhead is reduced.
Once the task queue 116 has been selected, the STM kernel 112 reads a task definition 118 from the task queue 116 at step 204. Task definitions may be taken in an order determined by the task queue. The STM kernel skips task definitions that have already been taken by other SPUs. Information in the task definition 118 directs the STM kernel to the main memory addresses corresponding to the SPU task parameters 120 and task code image 122. At 206 the SPU loads the SPU task code 124. The SPU 104 can use the parameters 120 and code 124 to load the task data 123 into the SPU local store 110 as input data 126. At 208 the SPU 104 uses the code 124 to process the input data 126 and generate output data 128. At 210, the output data 128 may be stored at an address in the main memory 106 or may be transferred to another SPU 104 for further processing.
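The per-task flow of steps 204-210 may be summarized by the following C++ sketch. The DMA and execution routines here are hypothetical stand-ins; a real STM kernel would use the SPU's memory flow controller facilities instead.

#include <cstdint>
#include <cstddef>

// Hypothetical stand-ins for DMA and task execution.
static void dmaGet(void*, uint64_t, size_t) { /* start a DMA read from main memory */ }
static void dmaPut(const void*, uint64_t, size_t) { /* start a DMA write to main memory */ }
static void dmaWait() { /* wait for outstanding DMA transfers to complete */ }
static void runTask(void*, void*, void*) { /* execute the loaded task code */ }

struct TaskDef {                       // trimmed-down task definition (see Table I)
    uint64_t codeAddr, inputAddr, outputAddr;
    uint32_t codeSize, inputSize, outputSize;
};

// Steps 204-210 for a single task, without the multi buffering overlap
// described later in this description.
void processOneTask(const TaskDef& d, void* codeLs, void* inLs, void* outLs) {
    dmaGet(codeLs, d.codeAddr, d.codeSize);    // step 206: load the SPU task code 124
    dmaGet(inLs, d.inputAddr, d.inputSize);    // load the task data 123 as input
    dmaWait();
    runTask(codeLs, inLs, outLs);              // step 208: process input, generate output
    dmaPut(outLs, d.outputAddr, d.outputSize); // step 210: send output data to main memory
}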
The code 124 may include one or more SPU programs. As used herein, an SPU program refers to code that can be used by the SPU to implement one or more SPU tasks. In certain embodiments of the present invention, multiple SPU programs can be cached for use by the SPU 104 in processing the data 123 or for processing data for subsequent tasks. Such caching of programs can be used to optimize DMA use and reduce the number of times that the SPU 104 must access the main memory 106 to load code. SPU programs may be dynamically loaded into main memory through a PPU API. SPU program ELF data may be loaded from main memory 106 (as a PPU symbol) or from a file. The SPU task definition 118 can be created with reference to SPU programs loaded in main memory. The SPU programs are loaded into main memory once, at the start of the application. They can then be transferred by DMA to the SPU local store 110 as needed by tasks.
In embodiments of the present invention SPU programs may be characterized as being of one of two types, referred to herein as Type-1 and Type-2 respectively. Type-1 SPU programs utilize Position Independent Code (PIC), i.e., code that can execute at different locations in memory. PIC is commonly used for shared libraries, so that the same library code can be mapped to a location in each application (e.g., using a virtual memory system) where it won't overlap the application or other shared libraries. Type-1 programs may be further characterized by static local store usage, i.e., the Type-1 code does not allocate memory for use during runtime. As shown in
Type-1 programs are higher-performance programs, though they tend to have more restrictions. An example of a Type-1 program 324 that can be cached is a MEM COPY program. This program takes advantage of the fact that memory transfers can be handled much faster by DMA using the SPU 104 than by the PPU 102. The MEM COPY program uses an available SPU to transfer data from one location in the main memory 106 to another location. Such SPU-based main memory management is particularly advantageous, e.g., where data needs to be aligned before DMA transfer from the main memory to an SPU or elsewhere. Examples of MEM COPY programs are described in commonly-assigned U.S. patent application Ser. No. ______ entitled “METHOD AND SYSTEM FOR PERFORMING MEMORY COPY FUNCTION ON A CELL PROCESSOR” to Antoine Labour, John P. Bates and Richard B. Stenson, which is filed the same day as the present application, the entire disclosures of which have been incorporated herein by reference.
Type-2 programs are characterized by the fact that they may use non-position independent code (non-PIC) and may dynamically allocate local store space at SPU runtime. Typically, only one Type-2 program is loaded on one SPU at a time, although exceptions to this feature are within the scope of embodiments of the present invention. As shown in
SPU programs of Type-1 and Type-2 have some common features. Specifically, the size of the task definitions 118 must be specified. In addition, the maximum local store space required for I/O DMA data must be specified. This enables the kernel 112 to manage the local store context data for tasks. SPU tasks typically share a context buffer for task definitions 118 and I/O data. Type-1 and/or Type-2 programs may be written in any suitable language, e.g., C or C++. Programs may be linked at runtime: undefined symbols in SPU programs that also exist in the STM kernel can be linked at runtime to the kernel symbols.
SPU Programs can have four customizable callbacks, referred to herein as prefetch, start, spumain and finish. Each callback receives a pointer to an SpuTaskContext, which is a local pointer to information about the current task, including the main memory address of the task definition 118 and a DMA tag for I/O data transfers. This data is necessary for the SPU Program to perform the task; the STM Kernel 112 prepares it and delivers it to each callback in the SPU Program. Because the SpuTaskContext contains the address in main memory 106 of this task's task definition, the task can use that address to DMA the task definition 118. The SpuTaskContext may also contain a temporary local store buffer that the SPU Program can use in each of the four stages of the task. The prefetch callback has the syntax prefetch(SpuTaskContext*). This callback directs the SPU 104 to start DMA transfer of the task definition 118 from the task queue. The start callback has the syntax start(SpuTaskContext*). This callback causes the SPU 104 to wait for completion of the task definition DMA and to start input DMA of code and/or data as determined by the task definition 118. The spumain callback has the syntax spumain(SpuTaskContext*), where the quantity in parentheses refers to the same data as in the previous callback. This callback causes the SPU 104 to wait for completion of the input DMA, process the input data and start DMA of the corresponding output data. The finish callback has the syntax finish(SpuTaskContext*), where the quantity in parentheses again refers to the same data as in the previous callbacks.
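A skeleton of an SPU program implementing these four callbacks might look as follows in C++. The SpuTaskContext members shown here are inferred from the description above and may differ from the actual interface.

#include <cstdint>

// Sketch of an SPU program's four callbacks; member names are assumptions.
struct SpuTaskContext {
    uint64_t taskDefAddr;  // main memory address of this task's task definition 118
    uint32_t dmaTag;       // DMA tag to use for this task's I/O transfers
    void*    lsBuffer;     // temporary local store buffer shared by the four stages
};

void prefetch(SpuTaskContext*) {
    // Start DMA transfer of the task definition 118 from the task queue.
}
void start(SpuTaskContext*) {
    // Wait for the task definition DMA, then start input DMA of code and/or data.
}
void spumain(SpuTaskContext*) {
    // Wait for the input DMA, process the input data, start DMA of the output data.
}
void finish(SpuTaskContext*) {
    // Complete the task, e.g., wait for the output DMA before buffers are reused.
}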
Embodiments of the present invention allow efficient management of code and data through a process referred to herein as multi buffering. Multi buffering takes advantage of certain characteristics of the SPU. Specifically, an SPU can perform more than one DMA operation at a time and can perform DMA operations while the SPU program is executing. In multi buffering, the STM Kernel interleaves task callbacks so that DMA operations will be in progress during main execution.
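By way of illustration, the interleaving might be arranged as in the following C++ sketch. The exact overlap pattern shown is an assumption; the actual scheduling performed by the STM Kernel may differ.

// Illustrative interleaving of callbacks of successive tasks, so that DMA
// started for task i is in flight while task i-1 is computing.
struct TaskCallbacks {
    void (*prefetch)();
    void (*start)();
    void (*spumain)();
    void (*finish)();
};

void runPipelined(TaskCallbacks* t, int n) {
    for (int i = 0; i < n; ++i) {
        t[i].prefetch();                 // begin DMA of task i's definition
        if (i > 0) t[i - 1].spumain();   // compute task i-1 while that DMA proceeds
        t[i].start();                    // begin task i's input DMA
        if (i > 0) t[i - 1].finish();    // complete task i-1's output DMA
    }
    if (n > 0) { t[n - 1].spumain(); t[n - 1].finish(); }  // drain the final task
}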
Where different portions of multiple tasks can be running in parallel on the same SPU, it is often important to be able to synchronize tasks. Such task synchronization is useful where one task set must be completed before a subsequent task set can begin, e.g., when output data from a first set of tasks is used as input data for the following set. To facilitate such synchronization, a barrier command can be inserted into the task queue to ensure that the former tasks are completed before the following tasks begin.
It is possible for multiple task sets to be processed in parallel. In such a case, it is important for the barrier command to distinguish between tasks that must be synchronized with each other and those that need not be. To facilitate this distinction, a barrier command may be characterized by a tag mask that identifies those task sets that need to be synchronized. The barrier command only synchronizes those tasks that are included in the tag mask. For example, a barrier mask of 0xFFFFFFFF may affect all tasks, while a barrier mask of 1<<2 (0x4) only affects tasks characterized by a tag value of 2.
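By way of illustration only, PPU-side use of such a tag-masked barrier might resemble the following C++ sketch. The function names and signatures are assumptions made for illustration, not the actual STM API.

#include <cstdint>

// Hypothetical PPU-side helpers for building a task queue.
struct TaskQueueHandle { int unused; };
static void addTask(TaskQueueHandle*, const char*, uint32_t) { /* enqueue a task definition */ }
static void addBarrier(TaskQueueHandle*, uint32_t) { /* enqueue a barrier command */ }

void buildQueue(TaskQueueHandle* q) {
    const uint32_t clothTag = 2;
    addTask(q, "cloth_step_a", clothTag);
    addTask(q, "cloth_step_b", clothTag);
    addBarrier(q, 1u << clothTag);        // mask 0x4: waits only on tag-2 tasks
    addTask(q, "cloth_step_c", clothTag); // may consume the outputs of steps a and b
    addBarrier(q, 0xFFFFFFFFu);           // a mask of all ones synchronizes every tag
}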
High performance processing can be achieved with embodiments that take advantage of code and/or data affinity. As used herein, “code affinity” refers to a situation where an SPU already has loaded in its associated local store the program code associated with a particular task. Where an SPU has code affinity with a particular task, it only has to DMA transfer the requisite data for the task. Similarly, “data affinity” refers to a situation where an SPU already has loaded in its associated local store the data associated with a particular task. Where an SPU has data affinity with a particular task, it need only DMA transfer the requisite code. Since it is more efficient to process a task for which code is already loaded, the STM kernels choose tasks that match their current SPU code. This reduces the occurrence of code switching. Note that it is possible to cache several Type-1 programs in the local store associated with an SPU and access them as needed. In such a case, code affinity is less important.
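Affinity-based task selection can be sketched in C++ as follows, under the assumption that each pending task records which SPU program it requires; the types and fields are illustrative only.

#include <cstdint>

// Illustrative affinity-aware selection: prefer a task whose SPU program is
// already resident in this SPU's local store, falling back to the first
// untaken task (which would then force a code switch).
struct PendingTask {
    uint32_t programId;  // identifies the SPU program this task requires
    bool     taken;      // true if already reserved by another SPU
};

int pickTask(const PendingTask* tasks, int count, uint32_t residentProgramId) {
    int fallback = -1;
    for (int i = 0; i < count; ++i) {
        if (tasks[i].taken) continue;
        if (tasks[i].programId == residentProgramId) return i;  // code affinity hit
        if (fallback < 0) fallback = i;
    }
    return fallback;  // -1 if nothing is available; otherwise a code switch is needed
}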
There may be times when no available tasks match the current code. In such a case the SPU can switch the program code. This is the situation illustrated in
In embodiments of the present invention it is often desirable when an SPU 104 has completed processing a task to notify the PPU 102 or other SPUs 104 that a given task has been completed. There are different ways to accomplish this task completion notification. For example, any task or barrier can be assigned an ID that can later be polled for completion from the PPU 102. A barrier with a task ID determines when a task group is complete. SPU tasks can also be configured to send a PPU interrupt upon finishing.
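By way of illustration, PPU-side polling for completion might resemble the following C++ sketch. The function names are hypothetical stand-ins for the STM PPU API, not its actual interface.

#include <cstdint>

// Hypothetical PPU-side completion polling.
struct TaskQueueHandle { int unused; };
static uint32_t addBarrierWithId(TaskQueueHandle*) { return 0; /* returns an ID that can be polled */ }
static bool isComplete(TaskQueueHandle*, uint32_t) { return true; /* true once the group has finished */ }

void waitForGroup(TaskQueueHandle* q) {
    uint32_t id = addBarrierWithId(q);   // the barrier marks the end of a task group
    while (!isComplete(q, id)) {
        // The PPU thread can perform other work here instead of blocking.
    }
}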
The overhead associated with the STM kernel may be about 650 SPU cycles per task. This includes an averaged cost of retrieving task definitions 118 from the shared task queue 116. Once definitions are retrieved, overhead is typically minimal although it can increase if the application uses many barriers.
The cost of code switch is dependent on the size of code being switched. For example a 200 KB code switch may require about 48,000 cycles, a 100 KB code switch may require about 27,000 cycles, a 50 KB code switch may require about 17,000 cycles and a 1 KB code switch may require about 2,400 cycles.
The overhead of such code switches is also partly dependent on the configuration of the task queue and the number of SPUs assigned to the task queue. In general, the worst case scenario is one where tasks requiring different code alternate in the task queue. If only one SPU is assigned to the task queue, the overhead may be about 1,840 cycles per task for a 200 KB code, about 1,520 cycles per task for a 100 KB code, about 1,360 cycles per task for a 50 KB code and about 1,200 cycles per task for a 1 KB code. If two SPUs are assigned to the same task queue, the code switching overhead is about 820 cycles per task for 200 KB, 100 KB, 50 KB and 1 KB code. It would appear that optimal performance may be achieved where the number of SPUs assigned to a given task queue is equal to the number of different codes in that task queue.
The advantages of embodiments of the present invention can be seen by comparison of task contention overhead for SPURS-based and STM-based handling of comparable task queues as shown, e.g., in
By comparison, an STM-based system operated on a task queue containing 4 STM SPU programs using an STM-based code 1006. The task queue was configured according to two different scenarios. In a worst case queue 1008, the four programs alternated such that no two successive tasks used the same code. In a best case queue 1010, tasks requiring the same program were always grouped together. The graph 1004 shows that even for the worst case queue 1008 the STM-based system required less than one third the number of cycles per yield call as the SPURS-based system. For the best case queue 1010, the STM-based system required less than a tenth as many cycles per yield. Furthermore, for both best and worst case queues, the number of cycles per yield call remained relatively constant.
Parallel processor units of the type depicted in
The system 1100 may also include well-known support functions 1110, such as input/output (I/O) elements 1111, power supplies (P/S) 1112, a clock (CLK) 1113 and cache 1114. The system 1100 may optionally include a mass storage device 1115 such as a disk drive, CD-ROM drive, tape drive, or the like to store programs and/or data. The controller may also optionally include a display unit 1116 and user interface unit 1118 to facilitate interaction between the controller 1100 and a user. The display unit 1116 may be in the form of a cathode ray tube (CRT) or flat panel screen that displays text, numerals, graphical symbols or images. The user interface 1118 may include a keyboard, mouse, joystick, light pen or other device. The cell processor module 1101, memory 1102 and other components of the system 1100 may exchange signals (e.g., code instructions and data) with each other via a system bus 1120 as shown in
As used herein, the term I/O generally refers to any program, operation or device that transfers data to or from the system 1100 and to or from a peripheral device. Every transfer is an output from one device and an input into another. Peripheral devices include input-only devices, such as keyboards and mice, output-only devices, such as printers, as well as devices such as a writable CD-ROM that can act as both an input and an output device. The term “peripheral device” includes external devices, such as a mouse, keyboard, printer, monitor, external Zip drive or scanner, as well as internal devices, such as a CD-ROM drive, CD-R drive or internal modem, or other peripherals such as a flash memory reader/writer or hard drive.
The processor module 1101 may manage the performance of tasks in the task queues 1106 in response to data and program code instructions of a main program 1103 stored and retrieved by the memory 1102 and executed by the PPU or SPU of the processor module 1101. Code portions of the program 1103 may conform to any one of a number of different programming languages such as Assembly, C++, JAVA or a number of other languages. The processor module 1101 forms a general-purpose computer that becomes a specific purpose computer when executing programs such as the program code 1103. Although the program code 1103 is described herein as being implemented in software and executed upon a general purpose computer, those skilled in the art will realize that the method of SPU task management could alternatively be implemented using hardware such as an application specific integrated circuit (ASIC) or other hardware circuitry. As such, it should be understood that embodiments of the invention can be implemented, in whole or in part, in software, hardware or some combination of both. In one embodiment, among others, the program code 1103 may include a set of processor readable instructions that implement a method having features in common with the method 200 of
Embodiments of the present invention provide a lower overhead of context switches, allow for parallel DMA and task execution and use code affinity to choose new tasks that match current SPU code and reduce DMA usage. These advantages of embodiments of the present invention over the prior art are summarized in Table II.
Embodiments of the present invention provide developers with a high performance, intuitive SPU programming model. This programming model allows many different tasks to be executed efficiently without as much context switch overhead as SPURS and SPU Threads. Embodiments of the present invention provide SPU task management methods and systems that can run on a varying number of SPUs without modifying application code. Embodiments of the invention are particularly useful in situations requiring many short tasks and many small SPU programs, where data is shared between programs and tasks. SPU code caching is also useful to optimize performance. Examples of situations where SPU task management according to embodiments of the invention may be useful include encoding or decoding of audio in situations requiring many different filter codes that must be swapped in and out of the SPU dynamically. Each filter code works on one or more data blocks from RAM. In some cases these cannot be statically defined with overlays. In such a case, the group of tasks may create a tree. Outputs from tasks lower down in the tree can become inputs for the following tasks as described herein.
While the above is a complete description of the preferred embodiment of the present invention, it is possible to use various alternatives, modifications and equivalents. Therefore, the scope of the present invention should be determined not with reference to the above description but should, instead, be determined with reference to the appended claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the indefinite article “A”, or “An” refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase “means for.”