Mohammad A. Abdallah, Intel Corp.

Abstract for “Cache storing data obtained by address calculation load instruction with label as associated name for the consuming instruction to refer”

A processor architecture contains a register hierarchy to implement virtual registers, allowing multiple copies of the same architecture register to be used by different processing threads. The register hierarchy comprises a plurality of hierarchy levels, and the processor architecture also includes a plurality of execution units.

Background for “Cache storing data obtained by address calculation load instruction with label as associated name for the consuming instruction to refer”

“Processors must be able to perform multiple tasks, which can be either dependent on each other or completely independent. The internal state of such processors is usually made up of registers that hold different values at any given moment of program execution; the image of the processor’s internal state at each moment of program execution is called the architecture state of the processor.

“When code execution switches to run another function (e.g., another thread, process, or program), the state of the machine/processor must be saved so that the new function can use the internal registers to build its new state. Once the new function terminates, its context can be discarded, the previous context restored, and execution resumed. This process is known as a context switch, and it usually involves tens to hundreds of cycles, especially in modern architectures that employ large numbers of registers (e.g., 64, 128, or 256) and/or out-of-order execution.

“Thread-aware hardware architectures support multiple context states for a limited number of hardware-supported threads. In this case the hardware duplicates all architecture state elements for each supported thread, which eliminates the need for a context switch when executing a new thread. This has its limitations, namely the area, power, and complexity of duplicating the architecture state elements (i.e., registers) for every additional thread supported in hardware. In addition, if the number of software threads exceeds the number of explicitly supported hardware threads, a context switch must still be performed. This is common, because parallelism is often needed on a fine-granularity basis requiring a large number of threads. Hardware thread-aware architectures with duplicate context-state storage do not help non-threaded software code; they only reduce the number of context switches for threaded software. Those threads are usually constructed for coarse-grain parallelism and cause heavy software overhead for initiation and synchronization, leaving fine-grain parallelism, such as function calls and loop parallel execution, without efficient threading initiation/generation. Such described overheads are accompanied by the difficulty of automatically parallelizing such code using state-of-the-art compiler or user parallelization techniques for non-explicitly/easily parallelized/threaded software code.”

“A unified architecture for dynamically generating, executing, synchronizing, and parallelizing complex instruction formats includes a virtual register file, a register cache, and a register file hierarchy. A self-generating and synchronizing dynamic threading architecture makes context switching inexpensive.

“One aspect of this invention reduces the context switch penalty: the time required to save the architecture state (e.g., data registers, control registers, program counters, etc.) and to restore the state of the prior context before execution can resume. For something as common and involved as a function call, this means loads/stores of large numbers of registers. An architecture state structure is disclosed on which a gradual context switch, as well as an instant context switch, can be performed. The concepts of a register cache and a virtual register file are also introduced as possible implementation components. A virtual register file and register cache allow support for a much larger number of threads or contexts than traditional hardware thread support, and a multi-hierarchical register file offers greater bandwidth to the register file.

The techniques can be combined with threading support. One portion of the architecture state undergoes an instant hardware/software-supported context switch, in which all of that architecture state is saved quickly/instantly using a LIFO (Last-In-First-Out) circuit. A LIFO circuit is simpler and more efficient than register file or random-access memory implementations. Every subsequent context switch in this category is stored on top of the previous context in the LIFO. This technique is well suited to recursive function calls and nested parallelized loops, where the context of one call in the recursion is entered and exited naturally in a LIFO fashion. FIG. 1 shows one example of hardware context-switch assistance.
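
The LIFO save/restore behavior can be modeled in software. The following is a minimal C sketch, assuming a 64-register architecture state and a 16-deep LIFO; the names ArchState, ctx_save, and ctx_restore are illustrative, not part of the disclosed hardware:

    #define NUM_REGS   64  /* assumed architectural register count */
    #define LIFO_DEPTH 16  /* assumed nesting depth of the circuit */

    typedef struct { unsigned long long r[NUM_REGS]; } ArchState;

    static ArchState lifo[LIFO_DEPTH]; /* the hardware LIFO storage */
    static int top = 0;                /* next free LIFO slot */

    /* Instant save: the whole architecture state enters the LIFO
       in one step rather than via per-register store instructions. */
    int ctx_save(const ArchState *s) {
        if (top == LIFO_DEPTH) return -1;  /* LIFO full */
        lifo[top++] = *s;
        return 0;
    }

    /* Instant restore: only the most recently saved context can be
       restored, matching recursive calls and nested parallel loops. */
    int ctx_restore(ArchState *s) {
        if (top == 0) return -1;           /* LIFO empty */
        *s = lifo[--top];
        return 0;
    }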

Another way to save the architecture state quickly is to save the register file content as one block in memory or cache (e.g., one cache line or block). A register batch is a sequence of consecutive register numbers (e.g., registers 0-7 might be a subset of a larger register set of 64 registers) that can be saved as one 64-byte cache line when a context switch occurs. The register file content is transported directly over wide buses 101 and 102, which carry all of it in one cycle or in a small number of cycles, in contrast to the typical use of loads and stores to save and restore each register individually. Buses 101 and 102 can connect to the fill buffers that are ordinarily used to fill cache lines, allowing registers to move directly to cache. Another alternative is to use write/read-combining buffers, which normally accumulate partial stores and loads into a buffer before writing to memory in a single transaction; a side-door bus connection through 101 and 102 lets such a combining buffer gather several registers into one buffer, which can then transfer all of the registers into the cache or memory, while a read-combining buffer performs the opposite transfer. Yet another option is a dedicated local memory with a LIFO circuit or a random-access memory buffer. The LIFO circuit is simpler and has fewer implementation constraints, but it requires data to be saved and restored in strict last-in-first-out order. A random access memory, like a direct bus to the processor cache, uses local memory with the usual hardware implementation constraints, but allows contexts to be saved and restored flexibly in random order.
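
As a rough illustration, a batch of eight 64-bit registers is exactly one 64-byte cache line, so a whole batch can move in one wide transfer. A minimal C sketch follows, with save_batch/restore_batch standing in for the wide-bus transfer (the names and the memcpy are illustrative only):

    #include <stdint.h>
    #include <string.h>

    /* A batch of 8 64-bit registers is exactly one 64-byte cache line. */
    typedef struct { uint64_t r[8]; } RegBatch;

    /* Models the wide buses 101/102: the batch moves as one block
       instead of eight separate store instructions. */
    void save_batch(const RegBatch *batch, uint8_t cache_line[64]) {
        memcpy(cache_line, batch, sizeof *batch); /* one wide transfer */
    }

    void restore_batch(RegBatch *batch, const uint8_t cache_line[64]) {
        memcpy(batch, cache_line, sizeof *batch);
    }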

“Another portion of the architecture state can be saved and restored gradually as the new context replaces the old one: the hardware swaps individual registers in and out based on each register’s use in the new or old context. This gradual context switch works well for global variables and for variables that are passed from one function to another through the context switch. It also fits variables passed through nested loops when the loops are threaded such that the inner loop is assigned to a different thread than the outer loop, as well as cases where different iterations of a loop are assigned to different threads and those iterations pass variables from one iteration to the next (loop-carried dependency). In all those cases it is beneficial to perform a gradual context switch of the registers: the registers used between the threads (the ones that pass data between the contexts) can be shared and need not undergo a context switch, while the rest are swapped as necessary.

“The invention may also use virtual registers to permit the containment of a larger set of registers than is directly supported by the instruction set architecture. Those registers are part of the hardware-supported context switch and of hardware/software threads. The concept of a register cache and register hierarchy applies to these scenarios as well. A virtual register file is a register architecture in which the instruction set registers are extended and accessed through virtual register storage, which makes it possible to keep multiple copies of the same architecture register belonging to different context switches or threads. This enables a larger or variable number of registers than can be accessed directly in a physical register file. These copies of the architecture registers are visible to the hardware when the hardware generates the threads/contexts, and they can also be used by software threads if the program creates them.

“FIG. 2 shows one embodiment of such a register hierarchy. The execution units access the lower hierarchy level 201 with very high bandwidth, while the higher hierarchy levels 200 support lower-bandwidth access. Some levels of the register hierarchy contain the basic register set that is supported by the instruction set architecture, while other levels contain virtual copies of it. The basic register set can also be duplicated to support multiple threads.

It is possible to include a register cache in the hierarchy, which caches registers in and out of the structure according to their relevance to the current context and their immediate need. One way to implement this virtual register file and register cache is to add tags to the register locations, so that registers are accessed by tags that include the actual register number together with other information, such as a context ID number, a static thread number, and possibly the memory address to which a register should be saved at a context switch. Other techniques are also possible: an index map 202 can hold position-based information about each architecture register, so that the position (location) of each register copy is determined as part of the register read access of register file 203. Another mechanism passes the location of a register from its producer instruction to its consumer instructions, as part of a scoreboard mechanism, so that the consumer instructions know where to read the register. In such location/position-based schemes no tags are required; a register is accessed by its register number together with its location in the multi-hierarchy register file.
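
A tag-based register cache of this kind can be sketched as an array searched by (register number, context ID). This is a minimal C sketch assuming a fully associative 64-entry cache; RegCacheEntry, lookup, and the linear scan are illustrative only, not the hardware’s actual organization (a real design would use a set-associative CAM):

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_ENTRIES 64  /* assumed register-cache capacity */

    typedef struct {
        bool     valid;
        uint8_t  reg_num;     /* architectural register number */
        uint16_t context_id;  /* which context/thread owns this copy */
        uint64_t save_addr;   /* memory address used on spill */
        uint64_t value;
    } RegCacheEntry;

    static RegCacheEntry reg_cache[CACHE_ENTRIES];

    /* Tag match: a register is found only if both its number and its
       context ID match, so many copies of, say, R5 can coexist. */
    RegCacheEntry *lookup(uint8_t reg_num, uint16_t context_id) {
        for (int i = 0; i < CACHE_ENTRIES; i++) {
            if (reg_cache[i].valid &&
                reg_cache[i].reg_num == reg_num &&
                reg_cache[i].context_id == context_id)
                return &reg_cache[i];
        }
        return NULL; /* miss: fetch from a higher hierarchy level */
    }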

The mechanism of the gradual context switch uses mechanisms similar to those of a typical context switch, namely loading and storing individual registers, but with some differences. The hardware swaps/reads a register only when a new context attempts to use a register whose local copy belongs to the old context: the old context’s register is swapped with the global copy in 200 that belongs to the new context, which is brought into the local register file 201. Multiple registers with the same register number can coexist, each carrying a different context ID tag. One implementation of the tag scheme yields a virtual register file in which different copies of the same register name are present simultaneously. Virtual register files are interesting because they allow a large virtual register file to be implemented as a hierarchy of multiple register files with different capacities and bandwidths.

“The following example shows different situations in which function calls occur. Assume that, within a single program/task, multiple function calls are made at different times, as shown below:

“Function 1 (values, references)
{
    a-d: local variables
    X-Z: global variables
    Code
}

Function 2 (values, references)
{
    a-d: local variables
    X-Z: global variables
    Code
}

Function 3 (values, references)
{
    a-d: local variables
    X-Z: global variables
    If (condition) then call Function 3( )
}

Main
{
    Function 1( )
    ….
    Function 2( )
    Function 3( )
    ….
}”

The local variables of the old context must be saved if that context will be needed again. Other cases in which the entire context must be saved include when the operating system initiates an independent process.

“By contrast, in a gradual context switch the registers of the old context are not saved up front; registers are gradually replaced with the new context’s register values only as they are needed. Function 1 and function 2 can then coexist in the same local register file and share the same global registers in the global register file. This is the case, for example, with fine-grain threads. When a particular register is needed in a given context, its context ID is checked against the tag in the local register file; if they match, it can be used directly. If not, it must be read/brought from a higher storage hierarchy (such as the global register file or the register cache) and swapped with the one in the local hierarchy storage (e.g., the temporary local register).
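
The check-and-swap step can be sketched as follows. This is a minimal C model assuming a 32-entry local register file 201 backed by a flat global file 200; MAX_CTX, read_reg, and the arrays are illustrative stand-ins for the real hierarchy, and ctx/reg are assumed to be within bounds:

    #include <stdint.h>

    #define MAX_CTX 64  /* assumed number of contexts held globally */

    typedef struct { uint16_t ctx; uint64_t val; } LocalReg;

    static uint64_t global_file[MAX_CTX][32]; /* higher-level storage 200 */
    static LocalReg local[32];                /* local register file 201  */

    /* On a context-ID mismatch, swap the old context's copy out to the
       global file and bring the new context's copy in.
       Assumes ctx < MAX_CTX and reg < 32. */
    uint64_t read_reg(uint16_t ctx, uint8_t reg) {
        if (local[reg].ctx != ctx) {
            global_file[local[reg].ctx][reg] = local[reg].val; /* swap out */
            local[reg].val = global_file[ctx][reg];            /* swap in  */
            local[reg].ctx = ctx;
        }
        return local[reg].val;
    }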

“At the call to a function, new variables that are to be written within the function’s scope are assigned new register copies (with a new context/thread ID). These register-name copies can be assigned either by a hardware management unit (HMU) or by a compiler/programmer generating code for this architecture. The HMU assigns a new context/thread ID number to every register name written within the function scope, thereby creating the copies of the register names. Once the return instruction is reached, that context/thread ID becomes free for reassignment and all its register-name copies can be reused. The hardware creates and synchronizes these temporary contexts automatically.

If the instruction architecture allows register names to be extended with context/thread ID numbers, the compiler can manage the register names of the various contexts itself. Each instruction that writes or reads a particular register name also specifies the context/thread to which that register belongs. Consider a typical instruction with three sources (S1, S2, S3) and one destination (Dest):

“Dest/S1 ← ADD (S2, S3)”

“In this example, registers and their association with threads will look like:

“Thy:Rx ← ADD (Thz:Ri, Thw:Rk)”

This shows how the instruction architecture permits specifying the context/thread ID: “Thy:Rx” refers to register Rx within thread y. The compiler or programmer can specify the context/thread number to which each register name belongs. When compiling code for a function, the compiler can increment the context/thread ID number with an increment instruction, and decrement it with a decrement instruction upon exiting the function call. Another mechanism that the instruction architecture can provide is a free-thread-selector instruction, which the compiler/programmer uses to poll the hardware for a free context/thread ID; on returning from the function, another instruction, such as a context/thread-free instruction, releases it. The two methods are illustrated in the following:

“Increment thread ID counter
Function A: (Th+:Ry ← Move Th:Ry)

    Function body

    Return (Th-:Ri ← Move Th:Ri)
Decrement thread ID counter”

“The increment is placed before the function call, and the decrement following the return to allow for the passing of values between caller and callee. The increment creates a new context/thread number, and the decrement removes it.

“An alternative method to accomplish the same task uses one instruction that selects a free context/thread number and another instruction that frees that context/thread number:”

“J = current thread number
I = Select free thread ID number
Function B: (ThI:Ry ← Move ThJ:Ry)

    Function body

    Return (ThJ:Rx ← Move ThI:Rx)
Free thread ID number I”

“To reduce the space required to encode the context/thread numbers associated with register names, the instruction architecture can specify that a register is associated with either the parent thread or the current child thread; this can be encoded with one bit (which we will call the parent/child bit, “P/C”). Every new thread identifies its parent thread (the thread it was spun from), and the parent ID number is saved along with the thread state. A hybrid scheme can be used in which one or more sources are specified from the current child or parent thread using the parent/child bit, while one or more other sources come from another thread specified with an explicit thread number. The following instruction set architecture illustrates such a hybrid scheme:

“P/C:Rx ← ADD (P/C:Ri, P/C:Rj, Thw:Rk)”

“The P/C bit indicates whether the register belongs to the parent thread or the child thread. This scheme can be generalized so that more bits encode the active threads within a higher-level context, as illustrated in the following: the higher-level context defines only a few threads within its state, allowing the instruction set to reduce the number of thread-state bits the processor architecture must encode. If the processor hardware can support N threads overall, the instruction set can abbreviate thread references so that a smaller number M of active threads intercommunicate through registers in the instruction encodings:

“T1T2:Rx ← ADD (T1T2:Ri, T1T2:Rj, Thw:Rk)”

“Mapping active threads in a program area to the overall set supported by the hardware can easily be done using mapping or assignment instructions, as the example below shows.

“Map (T1T2=00), TH=0001101”

“Map (T1T2=01), TH=1001000”

“Map (T1T2=10), TH=1010101”

“Map (T1T2=11), TH=0111100”

“01:Rx ← ADD (01:Ri, 10:Rj, 1000110:Rk)”

“In the code above, the map instructions map the hardware threads to the abbreviated encoding represented by the T1T2 bits. The add instruction then encodes just 2 bits to abbreviate the thread of each of its first two sources and of its destination, while the third source encodes its thread explicitly. The number of bits used to abbreviate threads in the instruction set is an implementation decision that can vary between architectures, as is whether to include a third source or to encode its thread explicitly.
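
The semantics of the map instructions can be modeled as a small translation table consulted at decode time. A minimal C sketch, assuming 7-bit full hardware thread numbers; map_thread and expand are illustrative names:

    #include <stdint.h>

    static uint8_t t1t2_map[4]; /* 4 abbreviation slots -> full thread */

    /* "Map (T1T2=xx), TH=nnnnnnn": load one abbreviation slot. */
    void map_thread(uint8_t t1t2, uint8_t full_thread) {
        t1t2_map[t1t2 & 3] = full_thread;
    }

    /* Decode stage: expand the 2-bit abbreviation in an operand such
       as "01:R5" into the full hardware thread number. */
    uint8_t expand(uint8_t t1t2_bits) {
        return t1t2_map[t1t2_bits & 3];
    }

With the mappings above, expand(1) would return thread 1001000, so “01:Ri” resolves to Th1001000:Ri.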

The processor pipeline, which executes microoperations (a microcode implementation) of the instructions, stores these maps and uses them to access the correct thread state when it executes an abbreviated instruction. The mapping states allow the compiler to extend its register allocation algorithm, which maps single-thread variables to registers: the compiler allocates a larger number of threads onto the mapping states and then performs register allocation within each thread. The mapping states can be reallocated to new threads using thread live-ranges, analogous to register live-ranges:

“Map (T1T2=00), TH=0001101”

“Map (T1T2=01), TH=1001000”

“Map (T1T2=10), TH=1010101”

“Map (T1T2=11), TH=0111100”

“01:R5 ← ADD (01:R2, 10:R3, 1000110:R10)”

“01:R8 ← MUL (01:R4, 01:R5, 1000110:R12)”

“Map (T1T2=10), TH=1011111”

“01:R5 ← ADD (01:R2, 10:R3, 1000110:R10)”

“01:R8 ← MUL (01:R4, 01:R5, 1000110:R12)”

Out-of-order processors can additionally rename the abbreviation maps to enable more threads from the active region to execute simultaneously. In such an out-of-order processor, the code above would be decoded by renaming the abbreviated bits inside the actual microoperation encodings to the full thread numbers. Temporary speculative registers are also used in such out-of-order machines to hold the renamed registers and states.

The general scheme described allows fine-grain threads to communicate, cooperate, and synchronize execution at the register level, without going through memory, while keeping the instruction set encoding extremely compact through the thread associations and minimal encoding space. The schemes above require the compiler to insert instructions to manage the threads and set the extension bits; alternatively, those actions can be performed by a hardware management unit that executes similar instructions and manages the setting of those extension bits (as described throughout this invention).

“We generalize the above mechanisms to handle fine-grain threads beyond function calls, including threads across loop boundaries and threads created from the code inside a loop. Take the following code as an example:

“Loop (j = 1 to 100)
{
    Instruction A
    Instruction B
    Loop (i = 1 to j)
    {
        Instruction 1
        Instruction 2
    }
    Instruction C
}”

“Using the compiler (or the hardware management unit), the thread that comprises the inner loop body (instructions 1 and 2) and the thread that comprises the instructions outside the inner loop can each be mapped to hardware-supported threads.”

“The following example illustrates one way fine-grain threads can live in the register file/register cache and be swapped into the higher register file and memory hierarchies.

“It is assumed that the compiler uses some convention that saves context depending on the number of registers used in the called function, and that it can save a number of registers in increments. For example, if register batches come in increments of 8 registers, then a function that needs to use registers 1 through 8 will save registers 1 through 8 of the old context and restore them when it restores the old context, while a function that needs to use registers 1 through 16 will save registers 1 through 16, and so on. Another way to build a virtual register file is to alias the registers of different contexts/threads that require fewer registers than the architecture provides. For example, a small function call or a worker thread might need only a subset of all 32 registers, in which case the hardware/compiler aliases its batch of 8 logical registers on top of another free physical register batch. The batch onto which the thread is mapped does not have to use the same register numbers: logical registers 0-7 could be mapped onto physical registers 8-15, 16-23, or 24-31 that another context is not using.
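
The batch aliasing amounts to base-plus-offset arithmetic. A minimal C sketch, assuming 16 physical batches of 8 registers (128 physical registers) and a per-thread base table; batch_of_thread and physical_reg are illustrative names:

    #include <stdint.h>

    #define NUM_BATCHES 16 /* 128 physical registers / 8 per batch */

    /* Base physical batch assigned to each small-state thread; a
       32-register thread would reserve four consecutive batches.
       Assumes thread < NUM_BATCHES. */
    static uint8_t batch_of_thread[NUM_BATCHES];

    /* Alias a thread's logical registers 0-7 onto whatever free
       physical batch it was given. */
    uint8_t physical_reg(uint8_t thread, uint8_t logical_reg) {
        return (uint8_t)(batch_of_thread[thread] * 8 + (logical_reg & 7));
    }

For instance, if batch_of_thread[5] is 1, that thread’s logical register 2 lands on physical register 10 (hypothetical values).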

“FIG. 4 shows one such implementation. Thread 5 state 402, which is made up of registers 0-7, is aliased onto a free context register batch 8-15. This allows more threads to coexist than the typical case in which each thread is mapped onto a full 32-register state. If the majority of the additional threads beyond the main thread have small register requirements and can each live with an 8-register state, then we can process 13 threads simultaneously (the main thread occupying four batches and twelve small threads occupying one batch each), while the 8-register-batch aliasing configuration allows a maximum of 16 simultaneous small-state threads to coexist. As the figure shows, a thread tag need not be stored per register: one thread tag can be stored per batch (8 registers in this example), or alternatively threads can be assigned according to a number-aliasing scheme in which each thread’s register numbering begins at a multiple of 8, and this scheme is used to allocate the registers. FIG. 4 shows software threads 0, 1, 3, 4, 5, 6, 7, and 9 mapped onto the 16 hardware threads using the aliasing scheme. Software thread 0 requires the entire 32-register large state, so it occupies hardware threads 0, 1, 2, and 3, using four batches of 8 registers each. Software thread 1, however, needs only one hardware-thread state of storage and is therefore mapped onto hardware thread 4; software thread 5 is mapped onto hardware thread 5, software thread 3 onto hardware thread 6, and software thread 6 onto hardware thread 12 (shown as point 400).

“When referring to a thread-register pair in the instruction set or microinstruction encoding, the pair is encoded as one continuous bit field: the top 4 bits give the hardware thread number (indicating which batch of 8 registers is referenced) and the bottom 3 bits identify the individual register within the batch (registers 0-7).
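
A minimal C sketch of this 4+3-bit packing (encode, hw_thread_of, and reg_of are illustrative names):

    #include <stdint.h>

    /* 7-bit thread-register pair: top 4 bits select the hardware
       thread (the 8-register batch), bottom 3 bits the register. */
    uint8_t encode(uint8_t hw_thread, uint8_t reg) {
        return (uint8_t)(((hw_thread & 0xF) << 3) | (reg & 0x7));
    }

    uint8_t hw_thread_of(uint8_t code) { return code >> 3; }
    uint8_t reg_of(uint8_t code)       { return code & 0x7; }

For example, encode(1, 5) yields 0001101, the hardware-thread-1/register-5 encoding that appears in the worked example below.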

As an example, suppose the software must perform a high-level threaded task of adding registers that belong to different software threads, writing the result into register 2 of software thread 1 (403 in FIG. 4), as follows:

“Th1:R2 ← ADD (Th0:R13, Th3:R2, Th6:R3)”

The compiler or the hardware management unit then maps the software threads onto the hardware threads, such that a software thread using more than 8 registers reserves more than one hardware-thread storage, and no other software thread can map onto those reserved hardware threads. The actual instruction encoding of the task is:

“0100010 ← ADD (0001101, 0110010, 1100011)”, which decodes to the following:

“HWTh4:R2 ← ADD (HWTh1:R5, HWTh6:R2, HWTh12:R3)”

“Note that software thread 0’s register 13 was encoded as hardware thread 1 (401), register 5 (0001 101: the first 4 bits are the hardware thread number, the last 3 bits the register number), because software thread 0 was aliased onto the four small-state hardware threads 0, 1, 2, 3, each holding a batch of 8 registers. The same encoding can equally be read as large-state hardware thread 0, register 13 (00 01101), where the first 2 bits select among large-state hardware threads (each with a 32-register state) and the last 5 bits give the register number (among 32 registers).

“Another scheme, which can be implemented on top of the previous register cache scheme, provides an automatic store/load mechanism between the register cache and the memory cache, allowing the register cache to spill out and reload registers automatically upon context switches and thread context swapping.

The following illustrates how the hardware described in FIG. 5 allows seamless multithreaded context switching. The register cache is built similarly to an associative cache with sets and ways and has two parts: the tag array and the data array. The data array is accessed like any threaded register array: a register read occurs whenever an instruction accesses that thread register. The tag array, however, is accessed by load or store instructions that function as context saves and restores. The tag array is updated with the current stack memory address of a particular register only when the current thread’s context is being saved (a context-save store); since the register file cache does not write the data out at that point, the tag is used later to actually evict the register when there are no free registers left to hold a new thread. The following example illustrates this:

The compiler intends to save to the memory stack registers R0 through R7 of thread 0, so that they can be overwritten by the next threads 1, 4, and 6, and this scheme allocates the memory region for register batch 0-7 of thread 0 accordingly. However, the hardware does not actually store thread 0’s registers to memory/memory cache when threads 1, 4, and 6 are encountered: the multi-context register file cache can store (501) the additional registers 0-7 that these threads require without destroying registers 0-7 of the original thread (thread 0). As long as the multi-context register file cache can accommodate those registers, it only updates the tag array of the register cache with the stack memory addresses that the store instructions intended. Notice what happens when thread 8 is encountered (#function foo): because all the ways holding register batch 0-7 are now in use, one batch must be evicted, and the eviction performs the deferred stores to the stack addresses held in the tag array; the register file cache then holds registers 0-7 of thread 8.
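
The lazy spill can be sketched as a small set-associative structure whose tag write and data write are decoupled. A minimal C sketch, assuming a 4-way set for batch 0-7 and a stubbed memory write; Way, lazy_save, and evict_for are illustrative names, and the replacement policy is omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4 /* assumed ways per register-batch set */

    typedef struct {
        bool     valid;
        uint8_t  thread;     /* which thread's batch 0-7 this way holds */
        uint64_t stack_addr; /* tag: where a context-save store meant
                                to put this batch in memory */
        uint64_t data[8];    /* the register batch itself */
    } Way;

    static Way set0[WAYS];   /* the set holding copies of batch 0-7 */

    static void mem_write(uint64_t addr, const uint64_t *d, int n) {
        (void)addr; (void)d; (void)n; /* memory-hierarchy stub */
    }

    /* A context-save store only records its target address in the
       tag array; the register data stays in the cache. */
    void lazy_save(int way, uint64_t stack_addr) {
        set0[way].stack_addr = stack_addr;
    }

    /* Eviction performs the deferred store, then reuses the way. */
    int evict_for(uint8_t new_thread) {
        int victim = 0; /* stand-in for a real replacement policy */
        if (set0[victim].valid)
            mem_write(set0[victim].stack_addr, set0[victim].data, 8);
        set0[victim].thread = new_thread;
        set0[victim].valid  = true;
        return victim;
    }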

“The context switch mechanisms described above can also facilitate other techniques, such as dynamic self-generation and synchronization of threads. Concurrent execution of different parts of a program has traditionally required the implementation of an out-of-order “super-scalar” processor architecture, but out-of-order execution is complex and requires a large hardware investment in the typical components of an out-of-order micro-architecture.

“No architecture has taken advantage of the super-scalar nature of the out-of-order machine, its ability to dispatch multiple streams of instructions simultaneously, to enable multiple software threads to execute in one machine. Simultaneous multi-threading requires duplicating the architecture state for each hardware-supported thread, partitioning the micro-architecture resources between the threads, and sharing memory structures, such as caches, among those threads.

“Despite the fact that those machines are out-of-order, they do not initiate, self-dispatch, or synchronize threads using the out-of-order hardware. Software does all of those tasks: it initiates/generates the threads, dispatches them when their dispatch condition/dependency is met (e.g., using barriers), synchronizes them when they have common control or dependencies (using locks, for example), and determines their concurrency. The software could be the master thread code or the compiler attempting to statically parallelize non-threaded code. Those threads are usually implemented with coarse-grain parallelism, leaving fine-grain parallelism without efficient threading initiation because of the difficulty of automatic parallelization of such code.

The following discussion shows that, with a new paradigm of hardware/software, automatic parallelization can be achieved even without complex out-of-order micro-architecture support. The new hardware generates, synchronizes, parallelizes, and concurrently executes fine/coarse-grain static and dynamic threads, by dynamically detecting that the control conditions and/or dependencies are resolved, and/or by allowing registers to be cross-referenced among threads and by disambiguating memory references across those threads. The hardware monitors either a control condition that triggers the thread or the write-back of cross-referenced register values on which the current thread depends.

“The initiation/generation of a thread dynamically by hardware is triggered when all its dependency registers and/or control conditions, particularly those of a cross-reference nature (e.g., inter-thread registers), are satisfied. This can be tracked through an explicit list of the registers that the thread depends on. Such a hardware-initiated dynamic thread can be thought of or described as a function call, method, or subroutine (a loop instance can also take this form) whose execution depends on a parameter list. Those parameter registers are checked at the write-back stage: a flag is set for each parameter/register as it is written back by its updating instruction, and once all the flags of those registers/parameters are set, the thread can be dynamically dispatched by the hardware. The cross-referencing of registers, virtual registers, and/or memory references among the different threads (both software and hardware) allows those threads to be generated, synchronized, parallelized, and executed concurrently using natural dependency-resolution methods, such as the ones programs already use when they communicate dependencies through registers to achieve consistent execution.
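
The write-back flag tracking can be sketched as a per-thread scoreboard. A minimal C sketch; ThreadDesc, on_writeback, and ready_to_dispatch are illustrative names, and MAX_DEPS is an assumed bound on the explicit dependency list:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_DEPS 8 /* assumed bound on a thread's dependency list */

    typedef struct {
        uint8_t dep_regs[MAX_DEPS];     /* explicit list of registers
                                           this thread depends on */
        int     ndeps;
        bool    written_back[MAX_DEPS]; /* one flag per parameter */
    } ThreadDesc;

    /* Called at the write-back stage of an updating instruction. */
    void on_writeback(ThreadDesc *t, uint8_t reg) {
        for (int i = 0; i < t->ndeps; i++)
            if (t->dep_regs[i] == reg)
                t->written_back[i] = true;
    }

    /* The hardware dispatches the thread once every flag is set. */
    bool ready_to_dispatch(const ThreadDesc *t) {
        for (int i = 0; i < t->ndeps; i++)
            if (!t->written_back[i]) return false;
        return true;
    }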

“The lazy gradual context switch mentioned earlier also helps the new state of those dynamically generated or program-static threads to be constructed/swapped with the other dynamic/static threads, as illustrated earlier.”

“FIG. 6 illustrates one possible mechanism for resolving false dependencies on same-name registers during the dynamic execution of those threads, while preserving the sequential order of register updates. It works by augmenting each architecture register with a differentiating bit field that is tagged at instruction allocation time in the machine. For each architecture register there are two counter values at the front end corresponding to this bit field: a lead counter 601, incremented each time the register is assigned as a destination to a new instruction, and a lag counter 602, incremented each time a corresponding register commits. The lag counter can never overtake the lead counter. The lead counter wraps around to its initial value after reaching its maximum; the lag counter behaves the same way. The two counters can be extended per thread, so that each thread has its own pair of counters.

The lead counter 601 of a particular register keeps incrementing as that register is repeatedly assigned as a destination to new instructions 603: each time, the incremented bit field is attached, in the front end of the machine, to the hardware instruction 605, and every subsequent instruction that uses the register as a source 606 uses the new extended register number. Whenever the register commits to the final architecture state, the lag counter 602 at the front end is incremented, at retirement pipeline stage 604. The reason for maintaining the lag counter is to determine whether a particular register value must be read from the speculative reorder buffer or from the retired register file: the extended register field of the hardware instruction (at the instruction register-read stage) is compared with the lag counter. This mechanism can also apply to individual instructions, not just to function calls or threads.
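
A minimal C sketch of the lead/lag counter pair for a single register, assuming a 3-bit differentiating field; the function names are illustrative, and the comparison rule is one plausible reading of the scheme:

    #include <stdint.h>

    #define FIELD_MASK 0x7 /* assumed 3-bit differentiating field */

    typedef struct { uint8_t lead, lag; } RegCounters;

    /* Allocation: a new instruction takes this register as its
       destination; the advanced lead value tags that instruction. */
    uint8_t on_allocate(RegCounters *c) {
        c->lead = (uint8_t)((c->lead + 1) & FIELD_MASK); /* wraps */
        return c->lead;
    }

    /* Retirement stage 604: one more copy commits to architecture
       state, so the lag counter catches up by one. */
    void on_commit(RegCounters *c) {
        c->lag = (uint8_t)((c->lag + 1) & FIELD_MASK);
    }

    /* Register-read stage: a source tag equal to the lag counter
       refers to the committed copy; anything newer must be read
       from the speculative reorder buffer. */
    int read_from_speculative(const RegCounters *c, uint8_t tag) {
        return tag != c->lag;
    }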

The lead counter and lag counter of FIG. 6 can be replaced by an inheritance vector that has one field for each architectural register. At allocation time, this vector forwards the incremented values from one instruction to the next, while each instruction increments the value in the bit field related to its own destination register. A similar concept appears in FIGS. 8.a/8.b, applied to a group of instructions instead of a single instruction.

“In the following paragraphs, a preferred method is described that unifies automatic/dynamic threading generation and synchronization with the handling of context switches discussed earlier. In addition, the concept of a physically segmented, architecturally unified register file is disclosed, with a register hierarchy that is well suited to such an implementation.”

“The architecture may also include an instruction matrix/block/bucket structure (matrix, block, and bucket are used interchangeably), where instructions are part of an “Ultra Large Instruction Matrix”, referred to as the ULIM architecture in a prior invention by the same inventor, PCT/US2007/066536, which is incorporated herein by reference. The instruction matrix/bucket/block is a collection of instructions that are either totally dependent on each other or completely independent; some instructions within a matrix may depend on others while the rest are independent of them. Usually the instruction matrix/block encapsulates instructions in a matrix format, where dependent instructions must occupy a column of the matrix while independent instructions can occupy rows or columns, and a few configuration options allow the matrix dimensions (rows, columns) to be mapped onto the architecture’s hardware execution units. The matrix/block/bucket architecture concept allows dependencies to be passed at the level of an instruction block/bucket identity instead of at the level of individual sources, so the architecture efficiently removes the complexity associated with CAM (content-addressable match) matching of sources and destinations in large-scale concurrent instruction execution architectures. The disclosed inventions can also be designed without the ULIM architecture or instruction blocks, using instead a virtual identifier that serves a similar role to the bucket/block, tracking and resolving dependencies at the source-group level; a group of simple instructions then forms a matrix or block without being physically assembled as one. This implementation advantage stands in contrast to current schemes that track and resolve individual source dependencies per instruction.

Instructions within an instruction bucket can be assembled in a VLIW, SIMD, or MIMD fashion, and the buckets can then be dispatched and executed as dynamic threads. In addition, static software threads share the same mechanisms as dynamic threads, except that they are generated based on the software application threads.

“FIG. 7 shows a traditional super-scalar out-of-order architecture as described in the related art. Individual instructions are renamed using a rename table (not shown in the figure) that maps architecture registers into a pool of physical registers, and are then scheduled, executed, and retired. Shown in FIG. 7 is a typical instruction scheduler 701 with 3 dispatch ports 704: up to 3 instructions can be scheduled and dispatched out of program order through those ports each cycle to execute on the 3 execution units 702, and they write back their results through write-back ports 705 to the reorder buffer 703. Up to 3 instructions can then be retired from the instruction reorder/retirement buffer 703 each cycle, so that their destination registers permanently and non-speculatively update the architecture state.

“In contrast to the out-of-order hardware implementation of FIG. 7, the current invention describes a dynamic scheduling out-of-order architecture that scales much better, using the concepts of the instruction matrix/bucket and of group-level dependency checking and instruction-group dispatch, as shown in FIG. 8.a. An instruction matrix buffer and scheduler 801 stores those instruction matrices/blocks/buckets.”

“In this example, instruction bucket 806 is determined to have satisfied its dependency sources and is therefore ready to be dispatched, as are buckets 808 and 809. Those buckets can execute in a single cycle if the hardware allows it, or they can be executed in execution units 802 over multiple cycles, pipelined or not. After they finish executing, their results are written into the register file 803 using the write-back ports 805, as a group of registers representing the bucket. Those bucket destination registers stay in the bucket retirement buffer until the bucket as a whole can update the architecture state in the original sequential program order.

“One possible implementation of this concept is described with three components:

“1 - The front end”

“2 - The Execution and Scheduler”

“3 - The Back end of retirement”

The front end contains the speculative thread bucket pointers, the bucket sources list, and the destinations list. The scheduler and execution buckets include a bucket dispatch selector, the virtual register match-and-read, and possibly a register hierarchy/register cache. The back end is where executed buckets are logged and exception ordering is enforced before retirement; the register hierarchy/cache also serves as intermediate storage for the executed bucket results until they become non-speculative and can update the architecture state, similar to 803 in FIG. 8.a.

“The following describes one possible implementation of the front end, the dispatch stage, and the back end where executed buckets are logged, as shown in FIG. 8.b.

The process begins by fetching a new thread matrix/bucket/block, after which the new bucket is assigned into a vacant bucket slot in the bucket buffer. Each of the thread allocation pointers in the thread allocation pointer array 852 delimits the interval of buckets in which its thread is allowed to physically place its blocks/buckets of instructions; each thread keeps allocating buckets into the bucket buffer array inside its corresponding interval of contiguous space, in round-robin fashion. The buckets/blocks inside each thread space are assigned a new number 852 that is incremented each time a new bucket/block is assigned. Each valid source in the bucket has a valid read bit “Rv” indicating that this source is needed by the instructions inside the bucket. By the same convention, each destination register that will be written back by instructions in this bucket has a valid bit “Wv” set in the bucket, and it has a field in a destination inheritance vector 853. When a new bucket is fetched into the bucket buffer, it inherits the destination inheritance vector from the previously allocated bucket, pointed to by the thread bucket allocation pointer 852: the inheritance vector is copied from the previously allocated bucket, and then the valid destination fields corresponding to the registers that this bucket’s instructions will update are overwritten with the current bucket number, while the invalid destinations keep the values copied from the inheritance vector. The thread bucket pointer is then updated for the new bucket by incrementing it (it wraps around within its interval).
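
A minimal C sketch of the allocation-time inheritance-vector update, assuming 32 architectural registers and a Wv bit mask; InheritVec and allocate_bucket are illustrative names:

    #include <stdint.h>

    #define NUM_ARCH_REGS 32

    /* One field per architectural register; each field holds the
       number of the latest bucket that writes that register. */
    typedef struct { uint16_t field[NUM_ARCH_REGS]; } InheritVec;

    /* Allocation: copy the previous bucket's vector, then overwrite
       the fields of the registers this bucket writes (Wv bits set). */
    InheritVec allocate_bucket(InheritVec prev, uint32_t wv_bits,
                               uint16_t bucket_num) {
        for (int r = 0; r < NUM_ARCH_REGS; r++)
            if ((wv_bits >> r) & 1u)
                prev.field[r] = bucket_num;  /* valid destination */
        return prev; /* invalid destinations keep inherited labels */
    }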

“In the bucket dispatch-and-execute stage, whenever a bucket is executed without an exception, the bucket execution flag 854 (containing the bucket number) is set and broadcast through the bucket buffer, where it is latched/monitored within every bucket that has that bucket number as one of its sources. It is also possible to pass other information along with the bucket number, such as the locations of virtual registers. Once all the execution flags of a bucket’s source buckets are set within that bucket, the bucket ready bit 855 is set, and the bucket is ready to be dispatched and executed. When the bucket executes without any exception and is ready to update the architecture state in the sequential order of the program, it retires, and the retirement thread pointer 857 is incremented to the next bucket. The retired bucket location can then be assigned to a new bucket.
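
A minimal C sketch of the group-level broadcast and ready bit, assuming each bucket lists up to four source buckets; Bucket, broadcast_exec, and bucket_ready are illustrative names:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_SRCS 4 /* assumed sources per bucket */

    typedef struct {
        uint16_t src_bucket[MAX_SRCS]; /* bucket each source comes from */
        bool     src_ready[MAX_SRCS];
        int      nsrcs;
    } Bucket;

    /* Execution-flag broadcast: every bucket compares its sources
       against the executed bucket's number (group-level CAM match). */
    void broadcast_exec(Bucket *buf, int nbuckets, uint16_t done_bucket) {
        for (int b = 0; b < nbuckets; b++)
            for (int s = 0; s < buf[b].nsrcs; s++)
                if (buf[b].src_bucket[s] == done_bucket)
                    buf[b].src_ready[s] = true;
    }

    /* Ready bit 855: set once all source buckets have executed. */
    bool bucket_ready(const Bucket *b) {
        for (int s = 0; s < b->nsrcs; s++)
            if (!b->src_ready[s]) return false;
        return true;
    }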

“Closely related threads can co-exist inside the matrix/bucket/block buffer, each thread occupying an interval of buckets that belongs to it. That thread’s allocation pointer moves inside its bucket interval in a round-robin fashion, fetching new instruction buckets and allocating them inside the thread’s interval in the described round-robin fashion. With such interval sectioning, the bucket buffer can be divided dynamically using different interval lengths of buckets for different threads.

“The inheritance vector concept is introduced here for the instruction bucket as well as for the thread. Each instruction matrix/block/bucket writes into particular registers among the architectural registers, and each new bucket, at allocation, updates the inheritance vector by writing its thread and bucket number into the fields of the registers it writes, leaving the fields of the registers it does not write unchanged. This bucket inheritance vector B_iv 856 is forwarded from each bucket to the next one in program order. In FIG. 8.b, each matrix writes its own number into the architecture destination registers that its instructions write to; for the other registers, it inherits the values from the B_iv of the previous bucket in that thread.”

This concept scales from a bucket buffer managing a small number of closely coupled threads to hardware circuits that manage multiple bucket buffers and threads, as illustrated in FIG. 9. A global front end handles the larger number of threads that interact less closely: it processes only each thread’s header 902 and does not have to process the actual instructions of the thread to enforce the dependency checks across those distant threads. The header of the thread, together with the sub-headers of its buckets, contains only the information about the architecture registers that those threads and buckets write into (the destination registers of those instructions); there is no need for the headers to include the actual instructions or the instruction sources. It is sufficient to list the destination registers, or a bit vector in which a bit is set for each register that is a destination of an instruction. The header does not have to be physically placed with the instructions; it can be any formatted packet or compact representation of the destination registers of the instructions, and it may or may not be stored together with the rest of the instruction information.

“This global front end fetches only the headers of the threads/blocks in program order and generates the dynamic thread and/or bucket inheritance vectors 901 (Tiv and/or Biv). Each time a new thread is allocated, those inheritance vectors are forwarded, keeping unchanged the fields that the current thread’s buckets will not write to or update. The inheritance vectors are distributed to a large number of engines/cores/processors 904, each of which may include a local front end and fetch unit (which fetches the actual instructions and produces the dependency vector for each bucket) and a local matrix/block/bucket buffer with local register files 905. The local front ends then fetch the actual instructions and use the information from the inheritance vectors obtained from the global front end to fill in the dependency information for the instruction sources brought into those engines for execution. FIG. 9 illustrates such a global front-end implementation, which distributes the inheritance vectors to the different engines 904 using only concise information about the instructions (namely, just the registers those instructions write into). Other information that is useful to include in the header is information about changes in the control path within or across the threads: a global branch predictor can be used to predict the flow of control across those threads, so such headers can include the branching destinations and offsets. In addition, the hardware/compiler can decide to dispatch independent threads across the two control paths of a branch and later merge those two paths using the inheritance vector, as illustrated in FIG. 11. FIG. 9 also shows how the global front end, upon fetching the header of a new thread, e.g., thread 2 (906), updates the inheritance vector 901, resulting in vector 910, in which registers 1, 2, 3, 4, 6, 0, and 7 are updated with T2 labels. Note that register 5 in vector 910 was not written by T2’s buckets, so its label was inherited from the previous inheritance vector.

“One interesting observation is that the register files allow cross-communication among the cores/engines. To reduce access latency, the registers needed from other engines can be requested early, as soon as the thread’s instruction buckets are fetched: at that point the source dependency information is populated, so cross-engine thread references can be issued, probably long before the actual instructions are dispatched for execution. In any case, an instruction will not be dispatched until its cross-referenced source has been forwarded and has arrived. The cross-referenced source can be stored in the local multi-threaded register file or in the register cache. Alternatively, it can be stored in a buffer similar to the load buffer; it can even reuse the load buffer’s physical storage and dependency-check mechanisms, used for a register load instead of a memory load. Many topologies can be used to connect the register files across the engines/cores, such as a ring topology, a cross-bar topology, or a mesh-routed interconnect.

The following discussion shows how register file segmentation can be used inside an engine and across engines. When a bucket is dispatched, its sources are sent simultaneously to the register file and the register cache. If the register file is physically unified and has direct thread support, the operand is read directly from the corresponding thread register section. If the register file is a virtual register file that includes a physically segmented register file using tags, a tag match must be done as part of the virtual register read; if the tag matches, the read takes place from the segmented register file.

“FIG. 9.b displays the overall virtual register file and register cache, and also shows the segmentation of the execution and the low-level register files. A SIMD/MIMD instruction accesses the same register location in each of the SIMD/MIMD register segments, while a VLIW instruction accesses each register individually. In threading mode, each section separately has its own instructions accessing registers independently, plus the ability to access cross-segment registers. More ways in which threads can use the register file hierarchies are shown in FIGS. 1-5.”

A bit or flag can be used to lock the 4 buckets together as a super bucket. If the bit or flag is set, all 4 buckets within the super bucket execute in parallel within the same cycle, which allows the instruction scheduling to be maintained as composed by compiler auto-parallelization or by VLIW/SIMD/MIMD composition. If the flag is not set, those buckets are not locked together and can execute independently at various times.

The increased parallelism that results from threading, out-of-order execution, and VLIW architectures puts the limited-bandwidth register file under pressure. Register files are usually designed as a unified resource to allow access to all registers. Segmented register files have been designed before, but they have required cross reads/writes to be handled at the architecture/software level, which prevents them from being utilized as a unified set of resources and adds overhead to the cross reads/writes.”

“Disclosed is a register architecture that supports software threads, hardware-generated threads, SIMD and MIMD execution, and emulation of out-of-order super-scalar execution. Although it is physically segmented, it appears as a unified architecture resource. This segmented register file is part of the virtual register file, which might include a register hierarchy and a register cache, as well as mechanisms to store and check register tags. The tag access can be eliminated if we use a location-based scheme that takes advantage of the dependency inheritance vector. The scheme works as follows: when an executed bucket number is broadcast during the dispatch stage, all the sources of the instructions in the bucket buffer perform a content-addressable match (CAM) comparing it against their source buckets to set the ready flag for each source. Here, the physical location of where the bucket executed can be propagated along with the register number. For example, in FIG. 9.c there are 4 register file segments, each containing 16 registers. When bucket #x is dispatched to segment 2, the bucket number x is broadcast to the bucket buffer together with the segment number 2, so that all sources that depend on bucket x record that it wrote all its registers in segment 2; when those instructions are dispatched, they know that they must read their register from segment 2. This also applies to the register cache, to avoid using tags there as well. This concept can be extended to the global front end, where the inheritance vector can also specify in which engine the instruction bucket writing into a register was allocated.
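
A minimal C sketch of the location-based read, assuming 4 segments of 16 registers; seg_file and read_source are illustrative names, with the segment number coming from the dispatch-stage broadcast described above:

    #include <stdint.h>

    #define SEGMENTS     4
    #define REGS_PER_SEG 16

    static uint64_t seg_file[SEGMENTS][REGS_PER_SEG];

    /* The dispatch-stage broadcast told this consumer which segment
       its producer bucket executed in, so the read goes straight to
       that segment; no tag lookup is needed. */
    uint64_t read_source(uint8_t segment, uint8_t reg) {
        return seg_file[segment % SEGMENTS][reg % REGS_PER_SEG];
    }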

The above describes one implementation of a unified dynamic execution architecture that can issue SIMD/MIMD instructions, VLIW bucket instructions, and ULIM bucket instructions, as well as dynamic and static threading. The architecture also allows emulating a super-scalar out-of-order implementation without explicitly supporting out-of-order components. The invention can further include a physically segmented, architecturally unified register file and a possible register cache hierarchy.

Summary for “Cache storing data obtained by address calculation load instruction with label as associated name for the consuming instruction to refer”

“Processors must be able to perform multiple tasks, which can either be dependent on each other or completely independent. These processors’ internal state is usually made up of registers with different values at any given moment of program execution. The architecture state of the processor is the state in which the processor’s internal state image appears at each moment of program execution.

“When code execution switches to run another function (e.g. another thread, process, or program), the state of the machine/processor must be saved in order for the new function to use the internal registers to create its new state. After the function has been terminated, the current context can be deleted and the previous context restored. Execution will resume after the new function has ended. This process is known as a context switch. It usually involves tens to hundreds of cycles, especially in modern architectures that use large numbers of registers (e.g. 64, 128, or 256) and/or out-of-order execution.

“Thread-aware hardware architectures support multiple context states to support a limited number hardware-supported threads. The hardware copies all architecture elements for each thread in this situation. This eliminates the need to switch context when executing new threads. This has its limitations, including the complexity, area and power of duplicated architecture state elements (i.e. registers), for every additional thread that is supported by hardware. The context switch must be used if there are more software threads than hardware threads that can be supported. This is common because parallelism requires a high number of threads and on a fine-granularity basis. Hardware thread-aware architectures that store duplicate context-state data do not support non-threaded code. They only reduce the number of context switches required for threaded software. These threads are often designed for coarse grain parallelism and cause heavy Software overhead when initiating and syncronizing. Fine grain parallelism such as function calls, loops, and auto generation, is left without efficient threading initiations/auto generation. Such described overheads are accompanied with the difficulty of auto parallelization of such codes using sate of the art compiler or user parallelization techniques for non-explicitly/easily parallelized/threaded software codes.”

“A unified architecture to dynamically generate, execute, synchronize and parallelize complex instruction formats includes a virtual registry file, register cache, and register hierarchy. Context switching is made easy by a self-generating, synchronizing dynamic threading architecture.

“One aspect of this invention reduces the context switch penalty. The time required to save an architecture state (e.g. data registers and control registers, program counters etc.) is a context switch penalty. Before execution can be resumed, the state of the prior context must be restored. For something as common and involved as a function call, this means loads/stores must occur in large numbers of registers. An architecture state structure is disclosed to perform a gradual context-switch and an instant context switch on its architecture state. As possible implementation components, are also introduced the concepts of a register cache and a virtual register. A virtual register file and register cache allow for support for a much larger number of threads or contexts than traditional hardware thread support. Multi hierarchal register files support offers greater bandwidth for the register file.

The techniques can be combined with threading support. A portion of the architecture state will be subject to an instant hardware/software supported context switching. All architecture state can be saved quickly/instantly using a LIFO circuit (Last In First Out). A LIFO circuit is simpler and more efficient than register file or random-access memory implementations. Every subsequent context switch in this category is stored on top of the previous context in the LIFO. This is a good technique for recursive function calls and nested parallelized loops, where the context of one call in recursion can be entered and exited naturally in a LIFO fashion. FIG. FIG. 1 shows one example of hardware context-switch assistance.

Another way to quickly save architecture state is to save register file content in one block in memory (or cache) (e.g. one cache line, or block). Register batches are a sequence of register numbers. (e.g. registers 0-7 might be a subset from a larger register set, 0-64). These register numbers can be saved as one cache line of 64 bytes when there is a context switch. Register file is transported directly by wide buses 101 and102. These buses carry all its content in one cycle or in a limited number of cycles. This is in contrast to the typical use of loads and stores to store and restore each register. The buses 101 and 102 can be used to connect to fill buffers, which are often used to fill cache lines. This allows registers to be moved directly to cache. Another alternative is to use write/read-combining buffers. These buffers are used to accumulate partial stores, load data into buffers and then write them to the memory in a single transaction. A side door bus connection can be made through 101 and102 to allow a combining buffer to combine several registers into one buffer. This buffer can then transfer all of the registers in the cache into memory, while the read combining buffer can do the opposite. Another option is to use a specialized local storage device with a LIFO or random access memory buffer. Although the LIFO circuit is simpler, it has fewer implementation constraints and requires data to be saved and restored in a strict order (last in first out). A random access memory, like a direct bus-to-processor cache, has the flexibility to use local memory but also imposes hardware implementation limitations, while allowing for flexible saving and restoring contexts in random order.

“Another aspect of the architecture state can be saved and gradually restored as the new context replaces the old context. This means that the hardware swaps/reads in and out individual registers based on the particular register’s use in the new or older context. This gradual context switch works well for global variables as well as variables that are passed from one function into another using the context switch. If the loops are threaded such that the inner loop is assigned a different thread from the outer loop, it will also work for variables passed through nested Loops. This can be used in cases where multiple iterations of a loop are assigned to different threads. These iterations pass multiple variables from one iteration (loop-carried dependency). It is beneficial to perform a gradual context switching for registers in these cases. The number of registers used between threads (the registers which pass data between the contexts), is large. These registers can be shared, and don’t need to undergo a context change. While the rest can be swapped as necessary.

“The invention can also use virtual registers to permit larger register sets than the instruction set architecture directly supports. These extra registers are part of the hardware-supported context switch and of hardware/software threads. The concept of a register cache and a register hierarchy applies to these scenarios as well. A virtual register file is a register architecture in which the instruction set registers are extended and accessed through virtual register storage. This makes it possible to keep multiple copies of the same architecture register belonging to different context switches or threads, and thereby enables a larger (or smaller) number of registers than can be accessed directly through a physical register file. These copies of the architecture registers are visible to the hardware when the hardware generates the threads/contexts, and they can also be used by software threads if the software creates them.

“FIG. 2 illustrates such a register hierarchy. The execution units can access the lower hierarchy levels 201 at very high bandwidth, while the higher hierarchy levels 200 support only lower-bandwidth access. Some levels of the register hierarchy contain the basic register set that is supported by the instruction set architecture, while other levels contain virtual copies of it. The basic register set can also be duplicated to support multiple threads.

It is possible to include a register cache in the hierarchy that caches registers in and out of the structure according to their relevance to the current context and their immediacy of need. One way to implement this virtual register file and register cache is to tag the register locations, so that registers are accessed by tags that include the actual register number together with other information, such as the context ID number or static thread number, and possibly the memory address at which they should be saved for a context switch. Other techniques are possible: in an alternative, position-based scheme, an index map 202 holds location information for each architecture register, and the position (location) of each register copy is looked up as part of the register read access of register file 203. Another mechanism passes the location of a register from its producer instruction to its consumer instructions, as part of a scoreboard mechanism, so that the consumer instructions know where to read the register once they need to access it. In such location/position-based schemes, no tags are needed, since the register is accessed by its register number together with its location in the multi-hierarchy register file.
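
As an illustration of the tag-based access just described, here is a minimal sketch in Python, assuming a tag formed from the register number and a context ID; the dictionary lookup stands in for the hardware tag match, and all names are illustrative:

    # Sketch of a tagged register cache: registers are looked up by a tag
    # combining the architecture register number with a context/thread ID.
    class TaggedRegisterCache:
        def __init__(self):
            self.entries = {}    # (reg_num, context_id) -> value

        def write(self, reg_num, context_id, value):
            self.entries[(reg_num, context_id)] = value

        def read(self, reg_num, context_id):
            # A miss means this register copy lives higher in the hierarchy
            # (e.g., the global register file) and must be brought in.
            return self.entries.get((reg_num, context_id))

    cache = TaggedRegisterCache()
    cache.write(reg_num=3, context_id=7, value=42)
    assert cache.read(3, 7) == 42       # hit: same register, same context
    assert cache.read(3, 8) is None     # miss: same register, other context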

The mechanism of gradual context switching uses mechanisms similar to those of a current context switch, namely loading and storing individual registers, but with some differences. The hardware swaps a register only when a new context attempts to use it while the local copy still belongs to the old context: the register belonging to the old context is swapped out to the global register file 200, which holds the copy belonging to each context, and the copy belonging to the new context is brought into the local register file 201. Multiple copies with the same register number can coexist, distinguished by different context ID tags. One implementation of the tag scheme yields a virtual register file in which the different copies of a register are present under the same register name. A virtual register file is attractive because it allows a large virtual register file to be implemented as a hierarchy of multiple register files with different capacities and bandwidths.

“The following example shows different situations in which functions are called. Assume that multiple function calls within a single program/task are called at different times.

“Function 1 (values, references)
{
a-d: local variables
X-Z: global variables
Code
}

Function 2 (values, references)
{
a-d: local variables
X-Z: global variables
Code
}

Function 3 (values, references)
{
a-d: local variables
X-Z: global variables
If (condition) then call Function 3( )
}

Main
{
Function 1( )
…
Function 2( )
Function 3( )
…
}”

When execution leaves a function whose context will be needed again, the local variables must be saved. Other cases in which the entire context must be saved include when the operating system initiates an independent process.

“By contrast, if the old context's registers are not saved all at once, the registers needed by the new context gradually replace the old context's register values. Function 1 and function 2 can then coexist in the same local register file and share the same global registers in the global register file; fine-grain threads are an example of this. When a register is needed in a particular context, its context ID is checked against the one in the local register file. If they match, it can be used directly; if not, it must be read/brought from higher up in the storage hierarchy (such as the global register file or the register cache) and swapped with the copy in the local hierarchy storage (e.g., the temporary local register).
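
A small behavioral sketch of this check-and-swap, assuming one local copy per register number backed by a global file; the data layout and names are illustrative, not the patent's:

    # Sketch of gradual context switching: the local register file holds one
    # copy per register number; on a context-ID mismatch the old copy is
    # swapped out to the global file and the needed copy is swapped in.
    class GradualSwapFile:
        def __init__(self):
            self.local = {}        # reg_num -> (context_id, value)
            self.global_file = {}  # (reg_num, context_id) -> value

        def read(self, reg_num, context_id):
            entry = self.local.get(reg_num)
            if entry and entry[0] == context_id:
                return entry[1]                        # local hit, no swap
            if entry:                                  # swap the old copy out
                old_ctx, old_val = entry
                self.global_file[(reg_num, old_ctx)] = old_val
            value = self.global_file.pop((reg_num, context_id), 0)
            self.local[reg_num] = (context_id, value)  # swap the new copy in
            return value

    f = GradualSwapFile()
    f.local[5] = (0, 111)            # register 5 currently holds context 0's copy
    f.global_file[(5, 1)] = 222      # context 1's copy lives in the global file
    assert f.read(5, 1) == 222       # the mismatch triggers a swap
    assert f.global_file[(5, 0)] == 111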

“At the call to a function, new variables that are to be written within the function's scope are assigned new register copies (with a new context/thread ID). These new register-name copies can be assigned either by a hardware management unit (HMU) or by a compiler/programmer that generates code for this architecture. The HMU assigns a new context/thread ID number to the register names written within the function scope, thereby generating fresh copies of those register names. Once the return instruction is reached, this context/thread ID becomes free for reassignment and all of its register-name copies can be reused. The hardware creates and synchronizes these temporary contexts automatically.

If the instruction architecture allows register names to be extended with context/thread ID numbers, the compiler can itself manage the register names of the various contexts: an instruction that writes or reads a particular register name also indicates the context/thread to which that register belongs. Below is an example of a typical instruction with three sources (S1, S2, S3) and one destination (Dest):

“Dest/S1 ← ADD (S2, S3)”

“In this example, registers and their association with threads will look like:

“Thy:Rx ← ADD (Thz:Ri, Thw:Rk)”

This shows how the instruction architecture permits specifying the context/thread ID of each register: Thy is the thread to which register Rx belongs, i.e., register name x used within thread y. The compiler or programmer specifies the context/thread number to which each register name belongs. When compiling code for a function, the compiler can increment the context/thread ID number with an increment instruction, and decrement it with a decrement instruction after exiting the function call. Another mechanism the instruction architecture can offer is a free-thread-selector instruction, with which the compiler/programmer polls the hardware for a context/thread ID it can use; a companion context/thread-free instruction then releases that ID upon return from the function. The two methods are illustrated in the following:

“Increment thread ID counter
Function A: ( Th+:Ry ← Move Th:Ry )
{
Function body
Return ( Th−:Ri ← Move Th:Ri )
}
Decrement thread ID counter”

“The increment is placed before the function call and the decrement after the return, to allow values to be passed between caller and callee. The increment creates a new context/thread number and the decrement frees it.

“An alternative method for the same task uses one instruction that selects a free context/thread number and another instruction that frees that context/thread number:

“J = current thread number
I = Select free thread ID number
Function B: ( ThI:Ry ← Move ThJ:Ry )
{
Function body
Return ( ThJ:Rx ← Move ThI:Rx )
}
Free thread ID number I”
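
Both conventions reduce to simple allocator bookkeeping. A sketch, with an illustrative pool size and method names:

    # Sketch of the two thread-ID management conventions described above.
    # The pool size is an illustrative assumption.
    class ThreadIdManager:
        def __init__(self, num_ids=16):
            self.counter = 0                 # for the increment/decrement style
            self.free = set(range(num_ids))  # for the select/free style

        # Convention 1: increment before the call, decrement after the return.
        def increment(self):
            self.counter += 1
            return self.counter

        def decrement(self):
            self.counter -= 1

        # Convention 2: poll the hardware for a free ID, free it on return.
        def select_free(self):
            return self.free.pop()

        def release(self, thread_id):
            self.free.add(thread_id)

    mgr = ThreadIdManager()
    callee_id = mgr.increment()   # caller is counter-1, callee is counter
    mgr.decrement()               # function returned: the ID is reusable
    i = mgr.select_free()         # alternative: explicit select/free instructions
    mgr.release(i)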

“To reduce the space required to encode the context/thread numbers associated with register names, the instruction architecture can specify that a register is associated with either the parent thread or the current child thread; this can be encoded with a single bit (call it the parent/child bit, 'P/C'). Every newly generated thread identifies its parent thread (the thread it was spun from), and the parent ID number is saved along with the thread state. A hybrid scheme can then be used, in which one or more sources of an instruction are specified from the parent or child thread with the parent/child bit, while one or more other sources are specified from an arbitrary thread with an explicit thread number. The following instruction set architecture illustrates such a hybrid scheme:

“P/C:Rx ← ADD (P/C:Ri, P/C:Rj, Thw:Rk)”

“The P/C bit indicates whether the register belongs to the parent thread or to the child thread. This scheme can be extended with more bits to encode more active threads within a higher-level context, where the high-level context defines only a few threads within its state. In this way the instruction set reduces the amount of thread state the processor architecture must encode: if the processor hardware can support N threads, the instruction set can still allow abbreviated thread references that enable intercommunication among M of those threads through registers in the instruction encodings.

“T1T2:Rx ← ADD (T1T2:Ri, T1T2:Rj, Thw:Rk)”

“Mapping active threads in a program area to the overall set supported by the hardware can easily be done using mapping or assignment instructions, as the example below shows.

“Map (T1T2=00), TH=0001101”

“Map (T1T2=01), TH=1001000”

“Map (T1T2=10), TH=1010101”

“Map (T1T2=11), TH=0111100”

“01:Rx ← ADD (01:Ri, 10:Rj, 1000110:Rk)”

“In the code above, the map instructions bind the hardware thread numbers to the abbreviated encodings represented by the T1T2 bits. The add instruction then encodes just 2 bits to abbreviate the thread of each of its first two sources and of its destination, while the third source encodes its thread explicitly. The number of bits an instruction set devotes to abbreviating threads is an implementation decision that can vary between architectures, as is whether to include a third source or to encode its thread explicitly.

The processor pipeline (e.g., one that executes micro-operations in a microcode implementation) stores these maps and uses them to access the correct thread state whenever it executes an instruction with an abbreviated thread reference. These mapping states also let the compiler extend its register allocation algorithm, which normally maps a single thread's variables onto registers: the compiler first allocates a larger set of threads onto the mapping states, and then performs register allocation within each thread. Mapping states can subsequently be reallocated to new threads using thread live ranges, analogous to register live ranges.

“Map (T1T2=00), TH=0001101”

“Map (T1T2=01), TH=1001000”

“Map (T1T2=10), TH=1010101”

“Map (T1T2=11), TH=0111100”

“01:R5 ← ADD (01:R2, 10:R3, 1000110:R10)”

“01:R8 ← MUL (01:R4, 01:R5, 1000110:R12)”

“Map (T1T2=10), TH=1011111”

“01:R5 ← ADD (01:R2, 10:R3, 1000110:R10)”

“01:R8 ← MUL (01:R4, 01:R5, 1000110:R12)”

Out-of-order processors can additionally rename the abbreviation maps, allowing even more threads of an active region to execute simultaneously. In such an out-of-order processor, the code above would be decoded to rename the abbreviated bits in the internal micro-operation encodings into full thread numbers. Temporary speculative registers can also be used in such out-of-order machines to keep the renamed registers and states.
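
To make the abbreviation concrete, the following sketch decodes a T1T2-abbreviated source against the mapping state of the example above; the 2-bit/7-bit field widths follow that example, and the function name is an illustrative assumption:

    # Sketch: decoding abbreviated thread fields using the Map instructions'
    # state. T1T2 is a 2-bit abbreviation; the map holds full thread numbers.
    t1t2_map = {
        0b00: 0b0001101,
        0b01: 0b1001000,
        0b10: 0b1010101,
        0b11: 0b0111100,
    }

    def decode_source(abbrev_bits, reg_num, explicit_thread=None):
        """Return (full_thread_number, reg_num) for one source operand."""
        if explicit_thread is not None:    # third source: thread encoded fully
            return explicit_thread, reg_num
        return t1t2_map[abbrev_bits], reg_num

    # "01:R5 <- ADD (01:R2, 10:R3, 1000110:R10)" from the example above:
    assert decode_source(0b01, 2) == (0b1001000, 2)
    assert decode_source(0b10, 3) == (0b1010101, 3)
    assert decode_source(None, 10, explicit_thread=0b1000110) == (0b1000110, 10)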

The general scheme described here allows fine-grain threads to communicate, cooperate, and synchronize execution at the register level, without going through memory, while keeping the instruction set encoding extremely compact by providing register-thread association with minimal instruction encoding space. All of this requires the compiler to insert the instructions that manage the threads and set the extension bits; alternatively, a hardware management unit can perform these actions, executing equivalent operations and managing the setting of the extension bits (as described throughout this invention).

“We use the above mechanisms to handle fine-grain threads beyond function calls, such as threads across loop boundaries and threads created among the code inside a loop body. Take the following code as an example:

“Loop (j = 1 to 100)
{
Instruction A
Instruction B
Loop (i = 1 to j)
{
Instruction 1
Instruction 2
}
Instruction C
}”

“Using the compiler or the hardware management unit, the thread consisting of the inner loop body (instructions 1 and 2) and the thread consisting of the instructions outside the inner loop can each be mapped onto hardware-supported threads.

“The following example illustrates one way fine-grain threads can live in the register file/register cache and be swapped out into higher register-file and memory hierarchies.

“It is assumed that the compiler uses some convention for saving context that depends on how many registers the called function uses, saving registers in incremental batches. If register batches are saved in increments of 8, then a function that needs registers in the range 1 through 8 saves registers 1-8 of the old context (and restores them when the old context is restored), while a function that needs registers in the range 1 through 16 saves registers 1-16. Another way to build a virtual register file is to alias the registers of contexts/threads that require fewer registers than the architecture provides. A small function call or worker thread might need only a subset, say 8, of the 32 registers; in that case the hardware/compiler aliases this 8-register logical batch on top of some other free physical register batch. The batch onto which the thread is mapped need not carry the same register numbers: logical registers 0-7 could be mapped onto physical registers 8-15, 16-23, or 24-31 held by another context.
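
A sketch of the aliasing bookkeeping, assuming 8-register batches as in the text's example; the allocator structure and names are illustrative:

    # Sketch: aliasing the 8-register logical batch of a small thread onto any
    # free physical batch. Batch size follows the text's example.
    BATCH = 8

    class BatchAliaser:
        def __init__(self, num_batches=4):
            self.free_batches = list(range(num_batches))  # physical batch indices
            self.base = {}                                # thread -> physical base

        def alias(self, thread_id):
            # Map the thread's logical registers 0-7 onto a free physical batch.
            batch = self.free_batches.pop(0)
            self.base[thread_id] = batch * BATCH
            return self.base[thread_id]

        def physical(self, thread_id, logical_reg):
            return self.base[thread_id] + logical_reg

    a = BatchAliaser()
    a.alias(thread_id=5)           # thread 5 lands on batch 0 (registers 0-7)
    a.alias(thread_id=7)           # thread 7 lands on batch 1 (registers 8-15)
    assert a.physical(7, 0) == 8   # logical R0 of thread 7 -> physical R8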

“FIG. 4 illustrates this register aliasing. The state 402 of thread 5, made up of registers 0-7, is aliased onto a free context register batch 8-15. This allows more threads to coexist than in the typical case where each thread is mapped onto a full 32-register state. With 16 batches of 8 registers, this aliasing configuration allows at most 16 simultaneous small-state threads to coexist; if the main thread uses the full 32-register state (four batches) and each additional thread lives with a small 8-register state, 13 threads can be processed simultaneously. As the figure shows, there is no need to store a thread tag per register: one thread tag per batch of 8 registers suffices, or alternatively threads can be assigned batches by a number-aliasing scheme in which each hardware thread number starts at a multiple of 8, and registers are allocated accordingly. FIG. 4 shows software thread numbers 0, 1, 3, 4, 5, 6, 7, and 9 mapped onto the 16 hardware threads using this aliasing scheme. Software thread 0 requires the whole 32-register large state and therefore consumes four 8-register batches, the storage of hardware threads 0, 1, 2, and 3. Software thread 1 needs only one small-state hardware-thread storage and is therefore mapped onto hardware thread 4. Software thread 5 is mapped onto hardware thread 5, software thread 3 onto hardware thread 6, and software thread 6 onto hardware thread 12 (shown as point 400).

“In the instruction set (or microinstruction) encoding, a reference to a thread-register pair can be one contiguous bit field: the top 4 bits give the hardware thread number (indicating which batch of 8 registers is referenced) and the bottom 3 bits identify the individual register within the batch (registers 0-7).

As an example, suppose the software must perform a high-level threaded task of adding registers belonging to different threads and writing the result into register 2 of software thread 1 (403 in FIG. 4), as follows:

“Th1:R2 ← ADD (Th0:R13, Th3:R2, Th6:R3)”

The compiler or the hardware management unit then maps the software threads onto the hardware threads. The mapping ensures that a software thread using more than 8 registers reserves more than one hardware-thread storage, and that no other software thread can map onto those reserved hardware threads. The actual instruction encoding for the task is:

“0100010 ← ADD (0001101, 0110010, 1100011)”, which decodes to the following:

“HWTh4:R2 ← ADD (HWTh1:R5, HWTh6:R2, HWTh12:R3)”

“Note that register 13 of software thread 0 was encoded as hardware thread 1 (401), register 5 (0001 101: the first 4 bits are the hardware thread number, the last 3 bits the register number), because software thread 0 was aliased onto the four small-state hardware threads 0, 1, 2, and 3, each with a batch of 8 registers. The same encoding can equally be read as hardware thread 0, register 13 (00 01101), where the first 2 bits name a large-state hardware thread (one with a 32-register state) and the last 5 bits give the register number (among 32 registers).
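
The dual reading of that 7-bit field can be checked in a few lines; the helper names are illustrative:

    # Sketch: the same 7 bits read two ways, per the example above.
    def as_small_state(code):
        # top 4 bits: hardware thread (8-register batch); bottom 3: register
        return code >> 3, code & 0b111

    def as_large_state(code):
        # top 2 bits: large-state (32-register) thread; bottom 5: register
        return code >> 5, code & 0b11111

    code = 0b0001101                          # software thread 0, register 13
    assert as_small_state(code) == (1, 5)     # hardware thread 1, register 5
    assert as_large_state(code) == (0, 13)    # hardware thread 0, register 13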

“Another scheme that can be implemented on top of the previous register cache scheme provides an automatic store/load mechanism between the register cache and the memory cache, allowing the register cache to spill and reload registers transparently upon context switches and thread swapping.

The following illustrates how the hardware described in FIG. 5 allows seamless multithreaded context switching. The register cache is built similarly to a set-associative cache, with sets and ways, and has two parts: a tag array and a data array. The data array is accessed like any threaded register array: a register read occurs whenever an instruction accesses that thread register. The tag array, by contrast, is accessed by load or store instructions that restore or save a context. The register cache tag of a register is updated with its current stack memory address only when the current thread's context is saved; the data itself stays in the register cache, which does not keep empty register slots. The tag is used later to actually write the register out to memory when no free registers are left to accommodate a new thread, as the following illustrates:

The compiler intends to save registers R0 through R7 of thread 0 to the memory stack, because those registers will be overwritten by threads 1, 4, and 6. The scheme allocates the memory region for register batch 0-7 of thread 0, but the hardware does not actually store thread 0's registers to the memory/memory cache when threads 1, 4, and 6 are encountered: the multi-context register file cache can hold (501) the additional 8-register batches (0-7) that those threads require without destroying registers 0-7 of the original thread (thread 0). As long as the multi-context register file cache can accommodate these registers, it merely updates the register cache tag array with the stack memory addresses intended by the store instructions. Notice what happens when thread 8 is encountered (function foo in the code): because all the ways holding register batch 0-7 are in use, the register cache must evict a batch; registers 0-7 of thread 0 are then actually written out to their stack addresses, and the register file cache afterwards contains registers 0-7 of thread 8.
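
A behavioral sketch of this lazy-spill behavior, assuming a 4-way structure matching the thread 0/1/4/6/8 narrative above; the data layout is illustrative, not the patent's tag/data array design:

    # Sketch of the lazy-spill scheme: saving a context only records the stack
    # address in the tag array; data reaches memory only upon eviction.
    class LazyRegisterCache:
        def __init__(self, ways=4):
            self.ways = ways
            self.lines = []      # each entry: {thread, regs, stack_addr}
            self.memory = {}     # stack_addr -> regs (memory/memory cache)

        def save_context(self, thread, regs, stack_addr):
            # The "store" instructions only update the tag with the address.
            if len(self.lines) >= self.ways:
                victim = self.lines.pop(0)   # evict: the real write happens now
                self.memory[victim["stack_addr"]] = victim["regs"]
            self.lines.append({"thread": thread, "regs": regs,
                               "stack_addr": stack_addr})

    rc = LazyRegisterCache(ways=4)
    rc.save_context(0, list(range(8)), stack_addr=0x1000)   # thread 0's batch
    for t, addr in [(1, 0x2000), (4, 0x3000), (6, 0x4000)]:
        rc.save_context(t, [0] * 8, stack_addr=addr)
    assert rc.memory == {}                            # nothing written yet
    rc.save_context(8, [0] * 8, stack_addr=0x5000)    # thread 8: forces evict
    assert 0x1000 in rc.memory                        # thread 0 finally spilled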

“The context switch mechanism can also facilitate other techniques, such as the dynamic self-generation and synchronization of threads. Concurrent execution of different parts of a program has traditionally required an out-of-order 'super-scalar' processor architecture, and out-of-order execution is complex and requires heavy hardware investment. The typical components required for concurrent execution in an out-of-order micro-architecture are illustrated by the related art of FIG. 7, discussed below.

“Few architectures have taken advantage of the super-scalar nature of the out-of-order machine and its ability to dispatch multiple streams of instructions simultaneously in order to execute multiple software threads in one machine. Such simultaneous multi-threading requires duplicating the architecture state for each hardware-supported thread, partitioning the micro-architecture resources between threads, and sharing memory structures, such as caches, among those threads.

“Despite being out-of-order, those machines do not initiate, self-dispatch, or synchronize threads with the out-of-order hardware; software does all of those tasks. Software initiates/generates the threads and dispatches them when their dispatch conditions/dependencies are met (e.g., using barriers); software synchronizes them when they share control or dependencies (using locks, for example); and software determines their concurrency. The software can be either the master thread code or a compiler that attempts to statically parallelize non-threaded code. Those threads are usually implemented with coarse-grain parallelism, leaving fine-grain parallelism without efficient threading initiation, because of the difficulty of automatically parallelizing such code.

The following discussion shows that, with a new hardware/software paradigm, automatic parallelization can be achieved even without complex out-of-order micro-architecture support. The new hardware generates, synchronizes, parallelizes, and concurrently executes fine/coarse-grain static and dynamic threads by dynamically detecting that control conditions and/or dependencies are resolved, and/or by allowing registers to be cross-referenced among threads, and also by disambiguating memory references across those threads. The hardware monitors either a control condition that triggers a thread or the write-back of cross-referenced register values on which the current thread depends.

“The initiation/generation of a thread is triggered dynamically by hardware when all its dependency registers and/or control conditions, particularly those of a cross-reference nature (e.g., inter-thread registers), are satisfied. This can be tracked through an explicit list of the registers on which the thread depends. Such a hardware-initiated dynamic thread can be thought of as a method, subroutine, or function call whose execution depends on a parameter list; a loop instance can also take this form. The parameter-list registers are monitored at the write-back stage: a flag is set for each parameter/register once the instruction updating it has executed and written back. When all flags of those registers/parameters are set, the hardware dispatches the thread. Cross-referencing registers, virtual registers, and/or memory references among the threads (software and hardware alike) allows those threads to be created, synchronized, parallelized, and executed concurrently through a natural dependency-resolution method, just as programs that communicate dependencies through registers execute consistently.
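
As a sketch, the write-back flag tracking reduces to a per-thread scoreboard over the explicit dependency list; names are illustrative:

    # Sketch: a dynamic thread is dispatched once every register in its
    # explicit dependency list has been written back.
    class DynamicThread:
        def __init__(self, name, dependency_regs):
            self.name = name
            self.pending = set(dependency_regs)   # explicit dependency list

        def on_writeback(self, reg):
            # Called at the write-back stage of the instruction updating `reg`;
            # returns True when all flags are set and dispatch can proceed.
            self.pending.discard(reg)
            return not self.pending

    t = DynamicThread("loop_body_instance", dependency_regs=["R2", "R7"])
    assert t.on_writeback("R2") is False   # still waiting on R7
    assert t.on_writeback("R7") is True    # all flags set: hardware dispatches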

“The lazy, gradual context switching mentioned earlier also allows the new state of those dynamically generated threads, or of program-static threads, to be constructed and swapped in alongside other dynamic/static threads, as illustrated earlier.

“FIG. 6 illustrates one way to resolve false dependencies between same-name registers during the dynamic execution of those threads, while preserving sequential-order register updates. It does so by augmenting each register number with a differentiating bit field that is tagged at instruction-allocation time in the machine's front end. Each register is associated with two counter values corresponding to this bit field: a lead counter 601, which is incremented each time the register is assigned as a destination to a new instruction, and a lag counter 602, which is incremented each time a corresponding register value commits to architecture state. The lag counter can never bypass the lead counter. The lead counter wraps back to its initial value after reaching its maximum; the lag counter behaves the same way. The counters can be replicated per thread, so that each thread has its own pair of counters.

The lead counter 601 for a register keeps incrementing as that register is assigned as a destination to new instructions 603. Each time the register is reassigned as a destination, its bit field is incremented in the front end of the machine and attached to the hardware instruction 605, and each subsequent instruction that uses the register as a source 606 uses the register number extended with this incremented field. Whenever the register commits to final architecture state, the lag counter 602 at the front end is incremented, at the retirement pipeline stage 604. To determine whether a particular register value must be read from the speculative reorder buffer or from the retired register file, the extended register field of the hardware instruction (at the instruction's register-read stage) is compared with the lag counter. This mechanism can be applied to individual instructions, not just to function calls or threads.
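
A behavioral sketch of the lead/lag counters for one register, with an illustrative 3-bit wrap-around width:

    # Behavioral sketch of the lead/lag counter pair for one architectural
    # register. The 3-bit wrap-around width is an illustrative assumption.
    WRAP = 8

    class RegisterCounters:
        def __init__(self):
            self.lead = 0   # incremented per new destination assignment
            self.lag = 0    # incremented per commit; never bypasses lead

        def allocate_destination(self):
            # Front end: the new version tag travels with the instruction
            # and with every later consumer that sources this register.
            self.lead = (self.lead + 1) % WRAP
            return self.lead

        def commit(self):
            # Retirement stage 604: the lag counter chases the lead counter.
            self.lag = (self.lag + 1) % WRAP

    r5 = RegisterCounters()
    tag = r5.allocate_destination()   # producer tagged with version 1
    # While lag has not reached `tag`, a consumer carrying `tag` reads the
    # speculative reorder buffer rather than the retired register file.
    assert r5.lag != tag
    r5.commit()                       # producer retires
    assert r5.lag == tag              # now the retired register file holds it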

The lead counter and lag counter tables of FIG. 6 can be replaced by an inheritance vector that has a field for each architectural register. At allocation time, this vector is forwarded from one instruction to the next, with each instruction incrementing the value in the field corresponding to its destination register. A similar concept appears in FIGS. 8.a/8.b, but applied to a group of instructions rather than to a single instruction.

“This paragraph describes a preferred method for unifying automatic/dynamic thread generation and synchronization with the context-switch handling discussed earlier. In addition, the concept of a physically segmented, architecturally unified register file is disclosed, with a register hierarchy well suited to such an implementation.

“The architecture may also include an instruction matrix/block/bucket structure (matrix, block, and bucket are used interchangeably here), in which instructions are part of an 'Ultra Large Instruction Matrix,' referred to as the ULIM architecture in a prior invention by the same inventor, PCT/US2007/066536, which is incorporated herein by reference. An instruction matrix/bucket/block is a collection of instructions that may be totally dependent on each other or completely independent; more commonly, some instructions depend on others while the rest are independent. The instruction matrix/block usually encapsulates the instructions in a matrix format in which dependent instructions must occupy a column while independent instructions may occupy rows or columns, and a few configuration choices determine how the matrix dimensions (rows and columns) map onto the architecture's hardware execution units. The matrix/block/bucket architectural concept allows dependencies to be passed at the level of an instruction block/bucket identity instead of at the level of individual sources, which effectively removes the complexity of CAM (content-addressable match) matching of sources and destinations in large-scale concurrent instruction execution architectures. The disclosed inventions can also be designed without the ULIM architecture or instruction blocks, by using a virtual identifier that serves a similar role to the bucket/block, tracking and resolving dependencies at a source-group level without actually forming a matrix or block out of the grouped simple instructions. This is an implementation advantage over schemes that track and resolve dependencies per individual source of each instruction.

Instructions can be assembled as VLIW, SIMD, or MIMD within an instruction bucket, and the buckets are then dispatched and executed as dynamic threads. Static software threads can share the same mechanisms as dynamic threads, except that they are generated from the software application threads.

“FIG. 7 shows a traditional super-scalar out-of-order architecture as described in the related art. Each instruction is renamed using a rename table (not shown in the figure) that maps architecture registers onto a pool of physical registers; the instruction is then scheduled, executed, and retired. FIG. 7 shows a typical instruction scheduler 701 with 3 dispatch ports 704: up to 3 instructions can be scheduled and dispatched out of program order through those ports each cycle to execute on the 3 execution units 702, which write back their results through write-back ports 705 to the reorder buffer 703. Up to 3 instructions can then retire each cycle from the instruction reorder/retirement buffer 703, so that their destination registers update the architecture state permanently and non-speculatively.

“In contrast to the out-of-order hardware implementation of FIG. 7, the current invention describes a dynamic scheduling out-of-order architecture that scales much better, using the concepts of the instruction matrix/bucket, group-level dependency checking, and instruction-group dispatch, shown in FIG. 8.a. An instruction matrix buffer and scheduler 801 stores those instruction matrices/blocks/buckets.

“In this example, instruction bucket 806 has determined that its dependency sources are satisfied, so it is ready for dispatch, as are buckets 808 and 809. Those buckets can execute in a single cycle, if the hardware allows it, or in the execution units 802 over multiple cycles, in a pipelined or non-pipelined fashion. Once execution finishes, their results are written into the register file 803 through the write-back ports 805, as a group of registers representing the bucket. Those bucket destination registers stay in the bucket retirement buffer until the bucket as a whole can update the architecture state in the original sequential order of the program.

“One possible implementation of this concept is described in terms of three components:

“1. The front end”

“2. The scheduler and execution”

“3. The back end retirement”

The front end contains the speculative thread bucket pointers and the bucket sources and destinations lists. The scheduler and execution buckets include a bucket dispatch selector, the virtual register match-and-read logic, and possibly a register hierarchy and register cache. The back end is where executed buckets are logged and where exception ordering is enforced before retirement; the register hierarchy/cache also serves as intermediate storage for executed bucket results until they become non-speculative and can update the architecture state, similar to 803 in FIG. 8.a.

“The following describes a possible implementation of the front end, the dispatch stage, and the back end where executed buckets are logged, as shown in FIG. 8.b.

The process begins with fetching a new thread matrix/bucket/block; the new bucket is then assigned into a vacant bucket slot in the bucket buffer. Each of the thread allocation pointers in the thread allocation pointer array 852 delimits the bucket interval in which its thread is allowed to physically place its blocks/buckets of instructions. Each thread keeps allocating buckets into the bucket buffer array inside its corresponding contiguous interval in round-robin fashion, and the buckets/blocks inside each thread space are assigned an increasing bucket number 852 each time a new one is allocated. For each valid source in the bucket, a valid read bit 'Rv' is set, indicating that this source is needed by the instructions inside the bucket; by the same convention, each destination register that is to be written back by instructions in this bucket has a valid bit 'Wv' set in the bucket, and a field in the destination inheritance vector 853. When a new bucket is fetched, it inherits the destination inheritance vector of the previously allocated bucket, pointed to by the thread bucket allocation pointer 852. The inheritance vector is copied from that earlier allocated bucket, and then the valid destination fields corresponding to the registers that the new bucket's instructions will update are overwritten and labeled with the current bucket number, while the remaining fields are copied unchanged from the previous bucket's inheritance vector. The thread bucket pointer is then updated for the new bucket by incrementing it (wrapping around within the interval).
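
The allocation-time update of the destination inheritance vector can be sketched as follows; the register count and field encoding are illustrative assumptions:

    # Sketch: each newly allocated bucket copies the previous bucket's
    # inheritance vector, then overwrites the fields of the registers its
    # own instructions will write, labeling them with its bucket number.
    NUM_REGS = 8   # illustrative

    def allocate_bucket(prev_vector, bucket_number, dest_regs):
        vector = list(prev_vector)          # inherit from the previous bucket
        for reg in dest_regs:
            vector[reg] = bucket_number     # this bucket now "owns" that reg
        return vector

    v0 = [None] * NUM_REGS
    v1 = allocate_bucket(v0, bucket_number=1, dest_regs=[0, 2])
    v2 = allocate_bucket(v1, bucket_number=2, dest_regs=[2, 5])
    assert v2 == [1, None, 2, None, None, 2, None, None]  # reg 0 inherited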

“In the bucket dispatch-and-execute stage, whenever a bucket is executed without an exception, the bucket execution flag (containing the bucket number) 854 is set and broadcast through the bucket buffer, and is latched/monitored within every bucket that has a source carrying that bucket number. It is also possible to pass other information along with the bucket number, such as the virtual-register locations. When all the execution flags of a bucket's source buckets are set, the bucket ready bit 855 is set, and the bucket is ready to be dispatched and executed. Once the bucket has executed without exceptions and is ready to update the architecture state in the sequential order of the program, it retires and the retirement thread pointer 857 is incremented to the next bucket in the array. The retired bucket's location can then be assigned to a new bucket.
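
The broadcast-and-ready logic reduces to set bookkeeping, as the following sketch shows; structure names are illustrative:

    # Sketch: when a bucket finishes, its number is broadcast; buckets whose
    # sources name that bucket latch the event, and a bucket becomes ready
    # once all of its source buckets have executed.
    class Bucket:
        def __init__(self, number, source_buckets):
            self.number = number
            self.waiting_on = set(source_buckets)
            self.ready = not self.waiting_on

        def observe_broadcast(self, executed_bucket):
            self.waiting_on.discard(executed_bucket)
            self.ready = not self.waiting_on

    b9 = Bucket(9, source_buckets=[6, 8])
    b9.observe_broadcast(6)
    assert not b9.ready          # still waiting on bucket 8
    b9.observe_broadcast(8)
    assert b9.ready              # all execution flags set: dispatch bucket 9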

“Closely related threads can co-exist inside the matrix/bucket/block buffer, each occupying an interval of buckets that belongs exclusively to that thread. The thread's allocation pointer moves inside its bucket interval in round-robin fashion, fetching new instruction buckets and allocating them inside the thread's interval in the described round-robin order. With such interval sectioning, the bucket buffer can be divided dynamically using different bucket-interval lengths for different threads.

“The inheritance vector concept applies to both the instruction bucket and the thread. Each instruction matrix/block/bucket writes into particular registers among the architectural registers. At allocation, each new bucket updates the inheritance vector by writing its own thread and bucket number into the fields of the registers it updates, leaving the fields of the registers it does not write to unchanged. This bucket inheritance vector B_iv 856 is forwarded from each bucket to the next in program order. In FIG. 8.b, each matrix writes its own number into the architecture destination registers written by its instructions; for all other registers it inherits the values from the B_iv of the previous bucket in that thread.

This concept scales from a bucket buffer managing a small number of closely coupled threads up to hardware circuits that manage multiple bucket buffers and threads, as illustrated in FIG. 9. The circuitry that handles the larger number of less tightly interacting threads is called the global front end. It processes only a thread header 902, and does not need to process the actual instructions of a thread to enforce dependency checks across those distant threads. The thread header, and the sub-headers of its buckets, contain only information about the architecture registers that those threads and buckets write into (the destination registers of their instructions); the headers do not need to contain the actual instructions or their sources. It is enough to list the destination registers, or a bit vector in which a bit is set for each register that is a destination of an instruction. The header does not have to be physically placed with the instructions; it can be any formatted packet or compact representation of the destination registers of the instructions, and it may or may not be stored together with the rest of the instructions.

“This global front end fetches only the headers of the threads/blocks in program order and generates dynamic thread and/or bucket inheritance vectors 901 (Tiv and/or Biv). Each time a new thread is allocated, those inheritance vectors are forwarded, retaining the fields that the current thread's buckets do not write to or update. The inheritance vectors are distributed to a large number of engines/cores/processors 904, each of which might include a local back end and a fetch unit (which fetches the actual instructions and produces the dependency vector for each bucket) as well as a local matrix/block/bucket buffer with local register files 905. The local front ends then fetch the actual instructions and use the information from the inheritance vectors obtained from the global front end to fill in the dependency information for the instruction sources brought into those engines for execution. FIG. 9 illustrates such a global front-end implementation, which distributes the inheritance vectors to the different engines 904 using only concise information about the instructions (namely, which registers those instructions write into). Other information that is useful to place in the header concerns changes in the control path within or across the threads: a global branch predictor can be used to predict the flow of control across those threads, so such headers can include the branching destinations and offsets. In addition, the hardware/compiler can decide to dispatch independent threads across the two control paths of a branch and later merge the two paths using the inheritance vector, as shown in FIG. 11. FIG. 9 also shows how the global front end, upon fetching the header of a new thread, e.g., thread 2 (906), updates the inheritance vector 901, producing vector 910 in which registers 1, 2, 3, 4, 6, 0, and 7 are updated with T2 labels. Note that register 5 in vector 910 was not written by T2's buckets, so its label is inherited from the previous inheritance vector.

“One interesting observation is that the register files naturally allow cross-communication among the cores/engines. To reduce access latency, the registers needed from cross-engine threads can be requested as early as the moment the thread's instruction buckets are fetched: at that point the source dependency information is populated, so cross-engine thread references can be issued, probably long before the actual instructions are dispatched for execution. In any case, an instruction is not dispatched until its cross-referenced source has been forwarded and has arrived. The cross-referenced source can be stored in the local multi-threaded register file or in the register cache. Alternatively, it can be stored in a buffer similar to the load-store buffer; it can even reuse the load buffer's physical storage and dependency-check mechanisms, performing a register load instead of a memory load. Many topologies can be used to connect the register files across the engines/cores, including a ring topology, a crossbar topology, or a mesh-routed interconnect.

The following discussion shows how register file segmentation can work within and across engines. When a bucket is dispatched, its sources are sent simultaneously to the register file and to the register cache. If the register file physically unifies the state and has direct thread support, the operand is read directly from the corresponding thread register section. If the register file is a virtual register file that includes a physically segmented register file using tags, a tag match must be performed as part of the virtual register read; if the tag matches, the read takes place from the segmented register file.

“FIG. 9.b shows the overall virtual register file and register cache, together with the segmented register files and their low-level hierarchy. A SIMD/MIMD instruction accesses the corresponding register in each SIMD/MIMD register segment, while a VLIW instruction accesses each register individually. In threading mode, each section independently executes its own instructions, which access their registers individually; cross-segment registers are accessed through the threading mechanisms. FIGS. 1-5 show further ways in which threads can use the register file hierarchies.

A bit or flag can lock the 4 buckets together as one super bucket. If the flag or bit is set, all 4 buckets within the super bucket execute in the same cycle, which makes it possible to preserve the instruction schedule as composed by compiler auto-parallelization or by SIMD/MIMD composition. If the flag is not set, the buckets are not locked together and can execute independently at different times.

The increased parallelism that threading, out-of-order execution, and VLIW architectures create puts great pressure on the limited-bandwidth register file. Register files are usually designed as a unified resource to allow access to all registers. Segmented register files have been designed before, but they required cross-segment reads/writes to be handled at the architecture/software level, which prevents them from being used as a unified set of resources and adds overhead to the cross reads/writes.

“Disclosed is a register architecture that supports software threads, hardware-generated threads, SIMD and MIMD execution, and emulation of super-scalar out-of-order execution. Although it is physically segmented, it appears as a unified architecture resource. The segmented register file is part of the virtual register file, which might include a register hierarchy and a register cache, as well as mechanisms to store and check register tags. The tag access can be eliminated by using a location-based scheme that takes advantage of the dependency inheritance vector. In this scheme, when an executed bucket number is broadcast during the dispatch stage, all instruction sources perform a content-addressable match (CAM) comparing their source buckets against it to set the ready flag for each source; here, the physical location where the bucket executed can be propagated along with the register number. For example, in FIG. 9.c there are 4 register file segments, each containing 16 registers. When bucket #x is dispatched to segment 2, the bucket number x is broadcast to the bucket buffer together with the segment number 2, so that all sources that depend on bucket x record that it wrote all its registers in segment 2. When those instructions are later dispatched, they know to read their registers from segment 2 and from no other segment. The same applies to the register cache, to avoid the use of tags. This concept can be extended to the global front end, where, in addition to thread information, the inheritance vector can specify in which engine the instruction bucket writing to a given register was allocated.
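
A sketch of this location-based scheme, following the FIG. 9.c example of 4 segments of 16 registers; the data structures are illustrative, standing in for the hardware CAM:

    # Sketch: the dispatch broadcast pairs the executed bucket number with the
    # register file segment it wrote to (4 segments x 16 registers, as in
    # FIG. 9.c), so dependent sources record where to read without tag checks.
    source_location = {}    # (consumer_bucket, source_bucket) -> segment

    def broadcast(executed_bucket, segment, consumers):
        for consumer in consumers:
            # CAM match: each consumer compares its source buckets against
            # the broadcast number and latches the segment on a hit.
            if executed_bucket in consumer["source_buckets"]:
                source_location[(consumer["number"], executed_bucket)] = segment

    consumer = {"number": 7, "source_buckets": {3}}
    broadcast(executed_bucket=3, segment=2, consumers=[consumer])
    # Bucket 7, when dispatched, reads its bucket-3 sources from segment 2.
    assert source_location[(7, 3)] == 2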

The foregoing describes one implementation of a unified, dynamic execution architecture able to issue SIMD/MIMD instructions, VLIW bucket instructions, and ULIM bucket instructions, as well as dynamic and static threading. The architecture also emulates a super-scalar out-of-order implementation without explicitly supporting all the out-of-order components. The invention can further include a physically segmented, architecturally unified register file and a register cache hierarchy.
