Artificial Intelligence – Joydeep Ray, Abhishek R. Appu, Altug Koker, Kamal Sinha, Balaji Vembu, Rajkishore Barik, Eriko Nurvitadhi, Nicolas Galoppo Von Borries, Tsung-Han Lin, Sanjeev Jahagirdar, Vasanth Ranganathan, Intel Corp

Abstract for “Efficient thread group scheduling”

“A mechanism for intelligent thread scheduling at autonomous machines is described. One method, as described in this document, involves detecting dependency information by identifying a plurality of threads corresponding to a plurality of workloads associated with tasks relating to a processor including a graphics processor. To avoid dependency conflicts, the method can also include scheduling together one or more thread groups that share a similar dependency.”

Background for “Efficient thread group scheduling”

“Current parallel graphics data processing refers to systems and methods that perform specific operations on graphics data, such as linear interpolation, tessellation, rasterization, texture mapping, and depth testing. Graphics processors were traditionally based on fixed-function computational units for processing graphics data. More recently, however, portions of graphics processors have been made programmable, allowing them to perform a greater variety of operations when processing vertex and fragment data.

“To further increase performance, graphics processors typically implement processing techniques such as pipelining, which attempt to process as much graphics data as possible in parallel throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013), and/or Nicholas Wilt, CUDA Handbook, A Comprehensive Guide to GPU Programming, Sections 2.6.2 to 3.1.2 (June 2013).
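As a concrete illustration of the SIMT model just described (a sketch of the general programming model, not of any specific hardware in this document), the following minimal CUDA program launches a large number of threads that all execute the same instruction stream, each on a different data element; the kernel name, sizes, and scaling operation are arbitrary choices for the example.

    #include <cuda_runtime.h>

    // Each thread runs the same program on a different data element (SIMT).
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
        if (i < n)
            data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        // 256 threads per thread group (block); enough groups to cover n elements.
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }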

“Machine learning has been successful at solving many kinds of tasks. The complexity of machine learning algorithms, such as neural networks, lends itself naturally to efficient parallel implementations. Accordingly, parallel processors such as general-purpose graphics processing units (GPGPUs) have played a significant role in the practical implementation of deep neural networks. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. Parallel machine learning algorithm implementations provide high efficiency and allow the use of large networks.

“Conventional techniques for handling thread groups, such as scheduling, prioritizing, and dealing with dependencies, are inefficient in terms of consumption of system resources such as time, bandwidth, and power.”

“Embodiments provide a novel technique for employing an intelligent thread dispatch mechanism for data distribution across compute clusters. Embodiments also allow for prefetching thread group input data into caches when threads are loaded, in addition to vectorization of atomic operations.
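A rough host-side sketch of the idea of dispatching thread groups that share a dependency together is given below. The ThreadGroup descriptor, the integer dependency tag, and the bucketing policy are assumptions made purely for illustration; they are not the patented mechanism, which is implemented in hardware and driver logic.

    #include <map>
    #include <vector>
    #include <cstdio>

    // Hypothetical descriptor for a thread group awaiting dispatch.
    struct ThreadGroup {
        int id;
        int dependencyTag;   // identifier of the buffer or prior task this group waits on
    };

    // Illustrative scheduler: bucket thread groups by their dependency tag so that
    // groups sharing a dependency are dispatched together, avoiding repeated stalls.
    void dispatchByDependency(const std::vector<ThreadGroup>& pending) {
        std::map<int, std::vector<int>> buckets;
        for (const ThreadGroup& g : pending)
            buckets[g.dependencyTag].push_back(g.id);

        for (const auto& bucket : buckets) {
            // Resolve the shared dependency once per bucket (placeholder).
            printf("resolving dependency %d\n", bucket.first);
            for (int id : bucket.second)
                printf("  dispatching thread group %d\n", id);  // launch the group's work here
        }
    }

    int main() {
        dispatchByDependency({{0, 7}, {1, 3}, {2, 7}, {3, 3}});
        return 0;
    }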

“It should be noted that terms such as “convolutional neural network”, “CNN”, “neural network”, “NN”, and the like may be used interchangeably throughout this document. Similarly, terms such as “autonomous machine” or simply “machine”, “autonomous vehicle” or simply “vehicle”, “autonomous agent” or simply “agent”, “autonomous device” or simply “device”, as well as “computing device”, “robot”, and/or the like, may be used interchangeably throughout this document.

“In certain embodiments, a graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU can be communicatively coupled to the host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”

“The following description sets forth numerous specific details; however, embodiments described herein may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure this description.

“System Overview I”

“FIG. 1 shows a computing system 100 according to an embodiment. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 that communicate via an interconnection path, which may include a memory hub 105. The memory hub 105 can be a separate component within a chipset or can be integrated into the processor(s) 102. Through a communication link 106, the memory hub 105 connects with an I/O subsystem 111. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. The I/O hub 107 can also enable a display controller, which may be included in the one or more processor(s) 102, to provide outputs to one or more display device(s) 110A. In one embodiment, the one or more display device(s) 110A coupled with the I/O hub 107 can include a local, internal, or embedded display device.

“In one embodiment, the processing subsystem 101 includes one or more parallel processor(s) 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 may be one of any number of standards-based communication link technologies or protocols, such as PCI Express, or may be a vendor-specific communications interface or fabric. In one embodiment, the one or more parallel processor(s) 112 form a computationally focused parallel or vector processing system that can include a large number of processing cores and/or processing clusters, such as multiple integrated core (MIC) processors. In one embodiment, the one or more parallel processor(s) 112 form a graphics processing subsystem that can output pixels to one of the one or more display device(s) 110A coupled via the I/O hub 107. The one or more parallel processor(s) 112 can also include a display controller and display interface (not shown) to enable a direct connection to one or more display device(s) 110B.

“Within the I/O subsystem 111, a system storage unit 114 can connect to the I/O hub 107 to provide a storage mechanism for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism to enable connections between the I/O hub 107 and other components, such as a network adapter 118 and/or wireless network adapter 119 that may be integrated into the platform, and various other devices that can be added via one or more add-in device(s) 120. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.

“The computing system 100 may include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, and the like. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCI-Express), or any other bus or point-to-point communication interfaces and/or protocol(s), such as the NVLink high-speed interconnect or interconnect protocols known in the art.

“In one embodiment, the one or more parallel processor(s) 112 incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In another embodiment, the one or more parallel processor(s) 112 incorporate circuitry optimized for general-purpose processing while preserving the underlying computational architecture. In yet another embodiment, components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processor(s) 112, the memory hub 105, and the processor(s) 102 can be integrated into a system-on-chip (SoC) integrated circuit. Alternatively, the components of the computing system 100 can be integrated into a single package to form a system-in-package (SIP) configuration. In one embodiment, at least a portion of the components of the computing system 100 can be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system.

“It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to the processor(s) 102 directly rather than through a bridge, while other devices communicate with system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In still other embodiments, the I/O hub 107 and memory hub 105 may be integrated into a single chip. Some embodiments may include two or more sets of processor(s) 102 attached via multiple sockets, which can couple with two or more instances of the parallel processor(s) 112.

“Some of the particular components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of add-in cards or peripherals may be supported, or some components may be eliminated. Furthermore, some architectures may use different terminology for components similar to those illustrated in FIG. 1. For example, the memory hub 105 may be referred to as a Northbridge in some architectures, while the I/O hub 107 may be referred to as a Southbridge.

“FIG. 2A illustrates a parallel processor 200, according to an embodiment. The various components of the parallel processor 200 may be implemented using one or more integrated circuit devices, such as programmable processors, field-programmable gate arrays (FPGAs), or application-specific integrated circuits (ASICs). The illustrated parallel processor 200 is a variant of the one or more parallel processor(s) 112 shown in FIG. 1, according to an embodiment.

“In one embodiment, the parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 may be directly connected to other devices. In one embodiment, the I/O unit 204 connects with other devices via a hub or switch interface, such as the memory hub 105. The connections between the memory hub 105 and the I/O unit 204 form a communication link 113. Within the parallel processing unit 202, the I/O unit 204 connects with a host interface 206 and a memory crossbar 216, where the host interface 206 receives commands directed to performing processing operations and the memory crossbar 216 receives commands directed to performing memory operations.

“When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment, the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. In one embodiment, the scheduler 210 ensures that the processing cluster array 212 is properly configured and in a valid state before tasks are distributed to the processing clusters of the processing cluster array 212.

“The processing cluster array 212 can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the type of program or computation being executed. The scheduling can be handled dynamically by the scheduler 210, or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212.

“In one embodiment, different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs.”

“The processing cluster array 212 can be configured to perform various types of parallel processing operations. In one embodiment, the processing cluster array 212 is configured to perform general-purpose parallel compute operations. For example, the processing cluster array 212 can include logic to execute processing tasks such as filtering of video and/or audio data, modeling operations including physics operations, and performing data transformations.

“In one embodiment, the processing cluster array 212 is configured to perform parallel graphics processing operations. In embodiments in which the parallel processor 200 is configured to perform graphics processing operations, the processing cluster array 212 can include additional logic to support the execution of such operations, including texture sampling logic to perform texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 212 can be configured to execute graphics processing-related shader programs such as vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 202 can transfer data from system memory via the I/O unit 204 for processing. During processing, the transferred data can be stored to on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.

“In one embodiment, when the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 can be configured to divide the processing workload into approximately equal-sized tasks to better enable distribution of the graphics processing operations to the multiple clusters 214A-214N of the processing cluster array 212. In some embodiments, portions of the processing cluster array 212 can be configured to perform different types of processing. For example, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. Intermediate data produced by one or more of the clusters 214A-214N may be stored in buffers to allow the intermediate data to be transmitted between clusters 214A-214N for further processing.

“During operation, the processing cluster array 212 can receive processing tasks to be executed via the scheduler 210, which receives commands defining processing tasks from the front end 208. For graphics processing operations, processing tasks can include indices of data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The scheduler 210 may be configured to fetch the indices corresponding to the tasks or may receive the indices from the front end 208. The front end 208 can also be configured to ensure that the processing cluster array 212 is configured to a valid state before the workload specified by incoming command buffers (e.g., batch buffers, push buffers, etc.) is initiated.”

“Each of the one or more instances of the parallel processing unit 202 can couple with parallel processor memory 222. The parallel processor memory 222 can be accessed via the memory crossbar 216, which can receive memory requests from the processing cluster array 212 as well as the I/O unit 204. The memory crossbar 216 can access the parallel processor memory 222 via a memory interface 218. The memory interface 218 can include multiple partition units (e.g., partition unit 220A, partition unit 220B, through partition unit 220N) that can each couple to a portion (e.g., memory unit) of parallel processor memory 222. In one implementation, the number of partition units 220A-220N is configured to equal the number of memory units, such that a first partition unit 220A has a corresponding first memory unit 224A, a second partition unit 220B has a corresponding memory unit 224B, and an Nth partition unit 220N has a corresponding Nth memory unit 224N. In other embodiments, the number of partition units 220A-220N may not equal the number of memory devices.

“In various embodiments, the memory units 224A-224N can include various types of memory devices, including dynamic random-access memory (DRAM) or graphics random-access memory, such as synchronous graphics random-access memory (SGRAM), including graphics double data rate (GDDR) memory. The memory units 224A-224N may also include 3D stacked memory, including but not limited to high-bandwidth memory (HBM). The specific implementation of the memory units 224A-224N can vary and can be selected from one of various conventional designs. Render targets, such as frame buffers or texture maps, may be stored across the memory units 224A-224N, allowing the partition units 220A-220N to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processor memory 222. In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory.

“In one embodiment, any one of the clusters 214A-214N of the processing cluster array 212 can process data that will be written to any of the memory units 224A-224N within parallel processor memory 222. The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 through the memory crossbar 216 to read from or write to various external memory devices. In one embodiment, the memory crossbar 216 has a connection to the memory interface 218 to communicate with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, enabling the processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment, the memory crossbar 216 can use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N.

While a single instance of the parallel processing unit 202 is illustrated within the parallel processor 200, any number of instances of the parallel processing unit 202 can be included. For example, multiple instances of the parallel processing unit 202 can be provided on a single add-in card, or multiple add-in cards can be interconnected. The different instances of the parallel processing unit 202 can be configured to interoperate even if they have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in one embodiment, some instances of the parallel processing unit 202 can include higher precision floating point units relative to other instances. Systems incorporating one or more instances of the parallel processing unit 202 or the parallel processor 200 can be implemented in a variety of configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.

“FIG. 2B is a block diagram of a partition unit 220, according to an embodiment. In one embodiment, the partition unit 220 is an instance of one of the partition units 220A-220N of FIG. 2A. As illustrated, the partition unit 220 includes an L2 cache 221, a frame buffer interface 225, and a ROP 226 (raster operations unit). The L2 cache 221 is a read/write cache that is configured to perform load and store operations received from the memory crossbar 216 and the ROP 226. Read misses and urgent write-back requests are output by the L2 cache 221 to the frame buffer interface 225 for processing. Dirty updates can also be sent to the frame buffer via the frame buffer interface 225 for opportunistic processing. In one embodiment, the frame buffer interface 225 interfaces with one of the memory units in parallel processor memory, such as the memory units 224A-224N of FIG. 2A (e.g., within parallel processor memory 222).

The ROP 226 is a processing unit that performs raster operations such as stencil, z test, blending, and the like. The ROP 226 then outputs processed graphics data that is stored in graphics memory. In some embodiments, the ROP 226 includes compression logic to compress z or color data that is written to memory and decompress z or color data that is read from memory. In some embodiments, the ROP 226 is included within each processing cluster (e.g., clusters 214A-214N of FIG. 2A) instead of within the partition unit 220. In such an embodiment, read and write requests for pixel data are transmitted over the memory crossbar 216 instead of pixel fragment data.

“The processed graphics data may be displayed on a display device, such as one of the one or more display device(s) 110 of FIG. 1, routed for further processing by the processor(s) 102, or routed for further processing by one of the processing entities within the parallel processor 200 of FIG. 2A.”

“FIG. 2C is a block diagram of a processing cluster 214 within a parallel processing unit, according to an embodiment. In one embodiment, the processing cluster is an instance of one of the processing clusters 214A-214N of FIG. 2A. The processing cluster 214 can be configured to execute many threads in parallel, where the term “thread” refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the processing clusters. Unlike a SIMD execution regime, in which all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
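The difference between SIMD and SIMT execution can be illustrated with a small CUDA kernel in which threads of the same thread group take different branches. This is a generic sketch of divergence, with arbitrary kernel and data names chosen for the example, not code from this document.

    #include <cuda_runtime.h>

    // Threads in the same thread group may take different branches (SIMT divergence).
    // Under SIMT the two paths are serialized for the group; under pure SIMD every
    // lane would have to execute identical instructions.
    __global__ void clampOrScale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] < 0.0f)
            x[i] = 0.0f;          // path A: some threads clamp
        else
            x[i] = x[i] * 0.5f;   // path B: other threads scale
    }

    int main() {
        float *d;
        cudaMalloc(&d, 1024 * sizeof(float));
        clampOrScale<<<4, 256>>>(d, 1024);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }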

“Operation of the processing cluster 214 can be controlled via a pipeline manager 232 that distributes processing tasks to SIMT parallel processors. The pipeline manager 232 receives instructions from the scheduler 210 of FIG. 2A and manages execution of those instructions via a graphics multiprocessor 234 and/or a texture unit 236. The illustrated graphics multiprocessor 234 is an exemplary instance of a SIMT parallel processor; however, various types of SIMT parallel processors of differing architectures may be included within the processing cluster 214. One or more instances of the graphics multiprocessor 234 can be included within a processing cluster 214. The graphics multiprocessor 234 can process data, and a data crossbar 240 can be used to distribute the processed data to one of multiple possible destinations, including other shader units. The pipeline manager 232 can facilitate the distribution of processed data by specifying destinations for the processed data to be distributed via the data crossbar 240.

“Each graphics multiprocessor 234 within the processing cluster 214 can include an identical set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner in which new instructions can be issued before previous instructions are complete. The functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and computation of various algebraic functions. In one embodiment, the same functional-unit hardware can be leveraged to perform different operations, and any combination of functional units may be present.

“The instructions transmitted to the processing cluster 214 constitute a thread. A set of threads executing across the set of parallel processing engines is a thread group. A thread group executes the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within a graphics multiprocessor 234. A thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 234. When a thread group includes fewer threads than the number of processing engines, one or more of the processing engines may be idle during cycles in which that thread group is being processed. A thread group may also include more threads than the number of processing engines within the graphics multiprocessor 234. When the thread group includes more threads than the number of processing engines, processing can be performed over consecutive clock cycles. In one embodiment, multiple thread groups can be executed concurrently on a graphics multiprocessor 234.
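The arithmetic relating thread-group size to the number of processing engines can be made concrete with a small host-side calculation. The engine count of 32 and the workload sizes below are assumptions chosen only for illustration; they do not describe any particular hardware in this document.

    #include <cstdio>

    // Illustrative only: relates thread-group size to a hypothetical count of
    // processing engines per multiprocessor (32 here, an assumption).
    int main() {
        const int enginesPerMultiprocessor = 32;
        const int totalThreads = 1000;
        const int threadsPerGroup = 100;

        // Ceiling division: number of thread groups needed to cover all threads.
        int groups = (totalThreads + threadsPerGroup - 1) / threadsPerGroup;

        // A group larger than the engine count is processed over several cycles;
        // a smaller group leaves some engines idle while that group is processed.
        int cyclesPerGroup = (threadsPerGroup + enginesPerMultiprocessor - 1) / enginesPerMultiprocessor;
        int idleEnginesLastCycle = cyclesPerGroup * enginesPerMultiprocessor - threadsPerGroup;

        printf("groups=%d, cycles per group=%d, idle engines in last cycle=%d\n",
               groups, cyclesPerGroup, idleEnginesLastCycle);
        return 0;
    }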

“In one embodiment, the graphics multiprocessor 234 includes an internal cache memory to perform load and store operations. In one embodiment, the graphics multiprocessor 234 can forgo an internal cache and use a cache memory (e.g., L1 cache 308) within the processing cluster 214. Each graphics multiprocessor 234 also has access to L2 caches within the partition units (e.g., partition units 220A-220N of FIG. 2A) that are shared among all processing clusters 214 and may be used to transfer data between threads. The graphics multiprocessor 234 may also access off-chip global memory, which can include one or more of local parallel processor memory and/or system memory. Any memory external to the parallel processing unit 202 may be used as global memory. Embodiments in which the processing cluster 214 includes multiple instances of the graphics multiprocessor 234 can share common instructions and data, which may be stored in the L1 cache 308.

“Each processing cluster 214 may include an MMU 245 (memory management unit) that is configured to map virtual addresses into physical addresses. In other embodiments, one or more instances of the MMU 245 may reside within the memory interface 218 of FIG. 2A. The MMU 245 may include address translation lookaside buffers (TLBs) or caches that may reside within the graphics multiprocessor 234 or the L1 cache or processing cluster 214. The physical address is processed to distribute surface data access locality to allow efficient request interleaving among partition units. A cache line index may be used to determine whether a request for a cache line is a hit or miss.

In graphics and computing applications, a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache within the graphics multiprocessor 234, and is fetched from an L2 cache, local parallel processor memory, or system memory as needed. Each graphics multiprocessor 234 outputs processed tasks to the data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 216. A preROP 242 (pre-raster operations unit) is configured to receive data from the graphics multiprocessor 234 and direct data to ROP units, which may be located with the partition units described herein (e.g., partition units 220A-220N of FIG. 2A). The preROP 242 can perform optimizations for color blending, organize pixel color data, and perform address translations.

“It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing units, e.g., graphics multiprocessors 234, texture units 236, preROPs 242, etc., may be included within a processing cluster 214. Further, while only one processing cluster 214 is shown, a parallel processing unit as described herein may include any number of instances of the processing cluster 214. In one embodiment, each processing cluster 214 can be configured to operate independently of other processing clusters 214 using separate and distinct processing units, L1 caches, and so on.

“FIG. 2D depicts a graphics multiprocessor 234, according to one embodiment. In this embodiment, the graphics multiprocessor 234 couples with the pipeline manager 232 of the processing cluster 214. The graphics multiprocessor 234 has an execution pipeline including an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more GPGPU cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.

“In one embodiment, the instruction cache 252 receives a stream of instructions to execute from the pipeline manager 232. The instructions are cached in the instruction cache 252 and dispatched for execution by the instruction unit 254. The instruction unit 254 can dispatch instructions as thread groups (e.g., warps), with each thread of the thread group assigned to a different execution unit within the GPGPU cores 262. An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.
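The notion of a unified address space that an instruction can use to reach either local (shared) or global memory can be illustrated with a CUDA sketch in which a single generic pointer parameter is resolved to either memory space; the kernel and helper names below are illustrative assumptions, not part of this document.

    #include <cuda_runtime.h>
    #include <cstdio>

    __device__ float deviceGlobal[32];

    // A single "generic" pointer parameter can reference global or shared memory;
    // the hardware resolves the unified address into the proper memory space,
    // loosely analogous to the address mapping described above.
    __device__ float firstElement(const float *p) { return p[0]; }

    __global__ void unifiedAddressDemo() {
        __shared__ float tile[32];
        tile[threadIdx.x] = (float)threadIdx.x;
        deviceGlobal[threadIdx.x] = 100.0f + threadIdx.x;
        __syncthreads();
        if (threadIdx.x == 0)
            printf("shared: %f  global: %f\n", firstElement(tile), firstElement(deviceGlobal));
    }

    int main() {
        unifiedAddressDemo<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }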

“The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units of the graphics multiprocessor 234. In one embodiment, the register file 258 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 258. In one embodiment, the register file 258 is divided between the different warps being executed by the graphics multiprocessor 234.

“The GPGPU cores 262 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234. The GPGPU cores 262 can be similar in architecture or can differ in architecture, according to embodiments. For example, in one embodiment, a first portion of the GPGPU cores 262 includes a single-precision FPU and an integer ALU, while a second portion includes a double-precision FPU. In one embodiment, the FPUs can implement the IEEE 754-2008 standard for floating point arithmetic or enable variable precision floating point arithmetic. The graphics multiprocessor 234 can additionally include one or more fixed function or special function units to perform specific functions such as copy rectangle or pixel blending operations. In one embodiment, one or more of the GPGPU cores can also include fixed or special function logic.

“The memory and cache interconnect 268 is an interconnect network that connects each of the functional units of the graphics multiprocessor 234 to the register file 258 and to the shared memory 270. In one embodiment, the memory and cache interconnect 268 is a crossbar interconnect that allows the load/store units 266 to implement load and store operations between the shared memory 270 and the register file 258. The register file 258 can operate at the same frequency as the GPGPU cores 262, so data transfer between the GPGPU cores 262 and the register file 258 has very low latency. The shared memory 270 can be used to enable communication between threads that execute on the functional units within the graphics multiprocessor 234. The cache memory 272 can be used as a data cache, for example, to cache texture data communicated between the functional units and the texture unit 236. The shared memory 270 can also be used as a program-managed cache. Threads executing on the GPGPU cores 262 can programmatically store data within the shared memory, in addition to the automatically cached data stored within the cache memory 272.
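Using shared memory as a program-managed cache, as described above, typically looks like the following CUDA sketch: a thread group stages a tile of global data into shared memory, synchronizes, and then operates on the staged copy. The kernel name, tile size, and 3-point filter are arbitrary illustrative choices.

    #include <cuda_runtime.h>

    // Stage a tile of global data into the program-managed shared memory, operate
    // on it, then write results back; the shared tile acts as a software-managed
    // cache for the thread group.
    __global__ void blur1d(const float *in, float *out, int n) {
        __shared__ float tile[258];                 // 256 elements + 2 halo values
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int l = threadIdx.x + 1;

        tile[l] = (g < n) ? in[g] : 0.0f;
        if (threadIdx.x == 0)              tile[0]     = (g > 0)     ? in[g - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1) tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();                            // make all staged loads visible to the group

        if (g < n)
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
    }

    int main() {
        const int n = 1 << 16;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        blur1d<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }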

“FIGS. 3A-3B illustrate additional graphics multiprocessors, according to embodiments. The illustrated graphics multiprocessors 325, 350 are variants of the graphics multiprocessor 234 of FIG. 2D.

“FIG. 3A illustrates a graphics multiprocessor 325 according to an additional embodiment. The graphics multiprocessor 325 includes multiple additional instances of execution resource units relative to the graphics multiprocessor 234 of FIG. 2D. For example, the graphics multiprocessor 325 can include multiple instances of the instruction unit 332A-332B, register file 334A-334B, and texture unit(s) 344A-344B. The graphics multiprocessor 325 also includes multiple sets of graphics or compute execution units (e.g., GPGPU cores 336A-336B, GPGPU cores 337A-337B, GPGPU cores 338A-338B) and multiple sets of load/store units 340A-340B. In one embodiment, the execution resource units have a common instruction cache 333, texture and/or data cache memory 342, and shared memory 346. The various components can communicate via an interconnect fabric 327. In one embodiment, the interconnect fabric 327 includes one or more crossbar switches to enable communication between the various components of the graphics multiprocessor 325.

“FIG. 3B illustrates a graphics multiprocessor 350 according to an additional embodiment. The graphics processor includes multiple sets of execution resources 356A-356D, as illustrated in FIG. 2D and FIG. 3A. In one embodiment, the execution resources 356A-356D can share an instruction cache 354 and shared memory 362, as well as multiple instances of a texture and/or data cache memory 358A-358B. The various components can communicate via an interconnect fabric 352 similar to the interconnect fabric 327 of FIG. 3A.”

“Persons skilled in the art will understand that the architecture described in FIGS. 1, 2A-2D, and 3A-3B is descriptive and not limiting as to the scope of the present embodiments. The techniques described herein may be implemented on any properly configured processing unit, including, without limitation, one or more mobile application processors, one or more desktop or server central processing units (CPUs) including multi-core CPUs, one or more parallel processing units, such as the parallel processing unit 202 of FIG. 2A, as well as one or more graphics processors or special purpose processing units, without departing from the scope of the embodiments described herein.

“In certain embodiments, a parallel processing unit or GPGPU as described herein is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU can be communicatively coupled to the host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”

“Techniques to Interconnect GPU and Host Processors”

“FIG. 4A illustrates an exemplary architecture in which a plurality of GPUs 410-413 are communicatively coupled to a plurality of multi-core processors 405-406 over high-speed links 440-443 (e.g., buses, point-to-point interconnects, etc.). In one embodiment, the high-speed links 440-443 support a communication throughput of 4 GB/s, 30 GB/s, 80 GB/s, or higher, depending on the implementation. Various interconnect protocols may be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. However, the underlying principles of the invention are not limited to any particular communication protocol or throughput.

“In addition, in one embodiment, two or more of the GPUs 410-413 are interconnected over high-speed links 444-445, which may be implemented using the same or different protocols/links than those used for the high-speed links 440-443. Similarly, two or more of the multi-core processors 405-406 may be connected over a high-speed link 433, which may be a symmetric multi-processor (SMP) bus operating at 20 GB/s, 30 GB/s, 120 GB/s, or higher. Alternatively, all communication between the various system components shown in FIG. 4A may be accomplished using the same protocols/links (e.g., over a common interconnection fabric). As mentioned, however, the underlying principles of the invention are not limited to any particular type of interconnect technology.

“In one embodiment, each multi-core processor 405-406 is communicatively coupled to a processor memory 401-402 via memory interconnects 430-431, respectively, and each GPU 410-413 is communicatively coupled to GPU memory 420-423 over GPU memory interconnects 450-453, respectively. The memory interconnects 430-431 and 450-453 may utilize the same or different memory access technologies. By way of example, the processor memories 401-402 and GPU memories 420-423 may be volatile memories such as dynamic random-access memories (DRAMs), Graphics DDR SDRAM (GDDR), or High Bandwidth Memory (HBM), and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In one embodiment, some portion of the memories may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

“Although the processors 405-406 and GPUs 410-413 may be physically coupled to particular memories 401-402 and 420-423, respectively, a unified memory architecture may be implemented in which the same virtual system address space (also referred to as the “effective address” space) is distributed among all of the various physical memories. For example, the processor memories 401-402 may each comprise 64 GB of the system memory address space and the GPU memories 420-423 may each comprise 32 GB of the system memory address space (resulting in a total of 256 GB of addressable memory in this example).
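The general idea of one virtual/effective address space spanning processor and GPU memories can be demonstrated, at the software level, with CUDA managed memory. This is only an analogy to the hardware partitioning described above; the allocation size and kernel below are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void increment(int *x) { x[threadIdx.x] += 1; }

    int main() {
        int *data;
        // One allocation visible at the same virtual address to both the CPU and GPU,
        // illustrating a shared virtual/effective address space.
        cudaMallocManaged(&data, 32 * sizeof(int));
        for (int i = 0; i < 32; ++i) data[i] = i;      // written by the host
        increment<<<1, 32>>>(data);                    // read and modified by the GPU
        cudaDeviceSynchronize();
        printf("data[5] = %d\n", data[5]);             // read back by the host: 6
        cudaFree(data);
        return 0;
    }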

“FIG. 4B illustrates additional details of an interconnection between a multi-core processor 407 and a graphics acceleration module 446, in accordance with one embodiment. The graphics acceleration module 446 may include one or more GPU chips integrated on a line card that is coupled to the processor 407 via the high-speed link 440. Alternatively, the graphics acceleration module 446 may be integrated on the same package or chip as the processor 407.

“The illustrated processor 407 includes a plurality of cores 460A-460D, each with a translation lookaside buffer 461A-461D and one or more caches 462A-462D. The cores may include various other components for executing instructions and processing data, which are not illustrated to avoid obscuring the underlying principles of the invention. The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 456 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches, where one of the L2 and L3 caches is shared by two adjacent cores. The processor 407 and the graphics accelerator integration module 446 connect with system memory 441, which may include processor memories 401-402.

“Coherency is maintained for data and instructions stored in the various caches 462A-462D and 456 via inter-core communication over a coherence bus 464. For example, each cache may have cache coherency logic/circuitry associated with it to communicate over the coherence bus 464 in response to detected reads or writes to particular cache lines. In one implementation, a cache snooping protocol is implemented over the coherence bus 464 to snoop cache accesses. Cache snooping/coherency techniques are well understood by those of skill in the art and will not be described in detail here to avoid obscuring the underlying principles of the invention.

“In one embodiment, a proxy circuit 425 communicatively couples the graphics acceleration module 446 to the coherence bus 464, allowing the graphics acceleration module 446 to participate in the cache coherence protocol as a peer of the cores. In particular, an interface 435 provides connectivity to the proxy circuit 425 over the high-speed link 440 (e.g., a PCIe bus or NVLink), and an interface 437 connects the graphics acceleration module 446 to the link 440.

“In one implementation, an accelerator integration circuit 436 provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines 431, 432, N of the graphics acceleration module 446. The graphics processing engines 431, 432, N may each comprise a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 431, 432, N may comprise different types of graphics processing engines within a GPU, such as media processing engines (e.g., video encoders/decoders), samplers, and blit engines. In other words, the graphics acceleration module may be a GPU with a plurality of graphics processing engines 431-432, N, or the graphics processing engines 431-432, N may be individual GPUs integrated on a common package, line card, or chip.

“In one embodiment, the accelerator integration circuit 436 includes a memory management unit (MMU) 439 for performing various memory management functions such as virtual-to-physical memory translations (also referred to as effective-to-real memory translations) and memory access protocols for accessing system memory 441. The MMU 439 may also include a translation lookaside buffer (TLB) for caching virtual/effective to physical/real address translations. In one implementation, a cache 438 stores commands and data for efficient access by the graphics processing engines 431-432, N. In one embodiment, the data stored in the cache 438 and graphics memories 433-434, M is kept coherent with the core caches 462A-462D, 456 and system memory 411. As mentioned, this may be accomplished via the proxy circuit 425, which takes part in the cache coherency mechanism on behalf of the cache 438 and memories 433-434, M (e.g., sending updates to the cache 438 related to modifications/accesses of cache lines on processor caches 462A-462D, 456 and receiving updates from the cache 438).”

A set of registers 445 stores context data for threads executed by the graphics processing engines 431-432, N, and a context management circuit 448 manages the thread contexts. For example, the context management circuit 448 may perform save and restore operations to save and restore contexts of the various threads during context switches (e.g., where a first thread is saved and a second thread is restored so that the second thread can be executed by a graphics processing engine). On a context switch, for instance, the context management circuit 448 may store current register values to a designated region in memory (e.g., identified by a context pointer) and then restore those register values when returning to the context. In one embodiment, an interrupt management circuit 447 receives and processes interrupts received from system devices.

“In one implementation, virtual/effective addresses from a graphics processing engine 431 are translated to real/physical addresses in system memory 411 by the MMU 439. One embodiment of the accelerator integration circuit 436 supports multiple (e.g., 4, 8, 16) graphics acceleration modules 446 and/or other accelerator devices. The graphics acceleration module 446 may be dedicated to a single application executed on the processor 407 or may be shared between multiple applications. In one embodiment, a virtualized graphics execution environment is presented in which the resources of the graphics processing engines 431-432, N are shared with multiple applications or virtual machines (VMs). The resources may be subdivided into “slices” that are allocated to different VMs and/or applications based on the processing requirements and priorities associated with them.”

“Thus, the accelerator integration circuit acts as a bridge to the system for the graphics acceleration module 446 and provides address translation and system memory cache services. In addition, the accelerator integration circuit 436 may provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines and interrupts.

“Because the hardware resources of the graphics processing engines 431-432, N are mapped explicitly to the real address space seen by the host processor 407, any host processor can address these resources directly using an effective address value. One function of the accelerator integration circuit 436, in one embodiment, is the physical separation of the graphics processing engines 431-432, N so that they appear to the system as independent units.

“As mentioned, one or more graphics memories 433-434, M are coupled to each of the graphics processing engines 431-432, N, respectively. The graphics memories 433-434, M store instructions and data being processed by each of the graphics processing engines 431-432, N.

“In one embodiment, to reduce data traffic over link 443, biasing techniques are used to ensure that the data stored in the graphics memories 433-434, M is data that will be used most frequently by the graphics processing engines 431-432, N, and preferably not used by the cores 460A-460D (at least not frequently). Similarly, the biasing mechanism attempts to keep data needed by the cores (and preferably not by the graphics processing engines 431-432, N) within the caches 462A-462D, 456 of the cores and within system memory 411.

“FIG. 4C illustrates another embodiment in which the accelerator integration circuit 436 is integrated within the processor 407. In this embodiment, the graphics processing engines 431-432, N communicate directly over the high-speed link 440 with the accelerator integration circuit 436 via interface 437 and interface 435 (which, again, may utilize any form of bus or interface protocol). The accelerator integration circuit 436 may perform the same operations as those described with respect to FIG. 4B, but potentially at a higher throughput given its close proximity to the coherence bus 464 and caches 462A-462D, 456.

“One embodiment supports different programming models, including a dedicated-process programming model (no graphics acceleration module virtualization) and shared programming models (with virtualization). The latter may include programming models controlled by the accelerator integration circuit 436 and programming models controlled by the graphics acceleration module 446.

“In one embodiment of the dedicated-process model, the graphics processing engines 431-432, N are dedicated to a single application or process under a single operating system. The single application can funnel other application requests to the graphics engines 431-432, N, providing virtualization within a VM/partition.

“In the shared programming models, the graphics processing engines 431-432, N may be shared by multiple VM/application partitions. The shared models require a system hypervisor to virtualize the graphics processing engines 431-432, N so that each operating system can access them. For single-partition systems without a hypervisor, the graphics processing engines 431-432, N are owned by the operating system. In both cases, the operating system can virtualize the graphics processing engines 431-432, N to provide access to each process or application.

“For the shared programming model, the graphics acceleration module 446 or an individual graphics processing engine 431-432, N selects a process element using a process handle. In one embodiment, process elements are stored in system memory 411 and are addressable using the effective-address-to-real-address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 431-432, N (that is, calling system software to add the process element to the process element linked list). The lower 16 bits of the process handle may be the offset of the process element within the process element linked list.
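If the lower 16 bits of the process handle encode the offset of the process element within the linked list, recovering that offset is a simple mask, as in the sketch below; the handle value and layout here are assumptions made purely for illustration.

    #include <cstdint>
    #include <cstdio>

    // Illustrative only: extract a 16-bit offset from the low bits of a
    // hypothetical 64-bit process handle.
    int main() {
        uint64_t processHandle = 0x00ABCDEF00012345ULL;          // placeholder value
        uint16_t elementOffset = (uint16_t)(processHandle & 0xFFFFu);
        printf("process element offset = 0x%04X\n", elementOffset);
        return 0;
    }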

“FIG. 4D illustrates an exemplary accelerator integration slice 490. As used herein, a “slice” comprises a specified portion of the processing resources of the accelerator integration circuit 436. Application effective address space 482 within system memory 411 stores process elements 483. In one embodiment, the process elements 483 are stored in response to GPU invocations 481 from applications 480 executed on the processor 407. A process element 483 contains the process state for the corresponding application 480. A work descriptor (WD) 484 contained in the process element 483 can be a single job requested by an application or may contain a pointer to a queue of jobs. In the latter case, the WD 484 is a pointer to the job request queue in the application's address space 482.

“The graphics acceleration module 446 and/or the individual graphics processing engines 431-432, N can be shared by all or a subset of the processes in the system. Embodiments include an infrastructure for setting up the process state and sending a WD 484 to a graphics acceleration module 446 to start a job in a virtualized environment.

“In one implementation, the dedicated-process programming model is implementation-specific. In this model, a single process owns the graphics acceleration module 446 or an individual graphics processing engine 431. Because the graphics acceleration module 446 is owned by a single process, the hypervisor initializes the accelerator integration circuit 436 for the owning partition, and the operating system initializes the accelerator integration circuit 436 for the owning process at the time the graphics acceleration module 446 is assigned.

“In operation, a WD fetch unit 491 in the accelerator integration slice 490 fetches the next WD 484, which includes an indication of the work to be done by one of the graphics processing engines of the graphics acceleration module 446. Data from the WD 484 may be stored in registers 445 and used by the MMU 439, interrupt management circuit 447, and/or context management circuit 448 as illustrated. For example, one embodiment of the MMU 439 includes segment/page walk circuitry for accessing segment/page tables within the OS virtual address space 485. The interrupt management circuit 447 may process interrupt events 492 received from the graphics acceleration module 446. When performing graphics operations, an effective address 493 generated by a graphics processing engine 431-432, N is translated to a real address by the MMU 439.

“In one embodiment, the same set of registers 445 is duplicated for each graphics processing engine 431-432, N and/or graphics acceleration module 446, and may be initialized by the hypervisor or the operating system. Each of these duplicated registers may be included in an accelerator integration slice 490. Exemplary registers that may be initialized by the hypervisor are shown in Table 1.

“TABLE 1 — Hypervisor Initialized Registers
1 Slice Control Register
2 Real Address (RA) Scheduled Processes Area Pointer
3 Authority Mask Register
4 Interrupt Vector Table Entry Offset
5 State Register
6 Logical Partition ID
8 Hypervisor Accelerator Utilization Record Pointer
9 Storage Description Register

Table 2 shows examples of registers that could be initialized by an operating system.

“TABLE 2 — Operating System Initialized Registers
1 Process and Thread Identification
2 Effective Address (EA) Context Save/Restore Pointer
3 Virtual Address (VA) Accelerator Utilization Record Pointer
4 Virtual Address (VA) Storage Segment Table Pointer
5 Authority Mask
6 Work Descriptor

“In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N. It can also be a pointer to a memory location where an application has set up a command queue of work to be completed.

“FIG. 4E illustrates additional details for one embodiment of a shared model. This embodiment includes a hypervisor real address space 498 in which a process element list 499 is stored. The hypervisor real address space 498 is accessible via a hypervisor 496, which virtualizes the graphics acceleration module engines for the operating system 495.

“The shared programming models allow for all or a subset of processes from all or a subset of partitions in the system to use a graphics acceleration module 446. There are two programming models in which the graphics acceleration module 446 is shared by multiple processes and partitions: time-sliced shared and graphics-directed shared.

In this model, the system hypervisor 496 owns the graphics acceleration module 446 and makes its function available to all operating systems 495. For the graphics acceleration module 446 to support virtualization by the system hypervisor 496, the graphics acceleration module 446 may adhere to the following requirements: an application's job request must be guaranteed by the graphics acceleration module 446 to complete in a specified amount of time, including any translation faults, and the graphics acceleration module 446 must guarantee fairness between processes when operating in the directed shared programming model.

“In one embodiment, for the shared model, the application 480 is required to make an operating system 495 system call with a graphics acceleration module 446 type, a work descriptor (WD), an authority mask register (AMR) value, and a context save/restore area pointer (CSRP). The graphics acceleration module 446 type describes the targeted acceleration function for the system call and may be a system-specific value. The WD is formatted specifically for the graphics acceleration module 446 and can be in the form of a graphics acceleration module 446 command, an address pointer to a user-defined structure, a pointer to a queue of commands, or any other data structure that describes the work to be done by the graphics acceleration module 446. The AMR value is the AMR state to use for the current process; the value passed to the operating system is similar to an application setting the AMR. If the accelerator integration circuit 436 and graphics acceleration module 446 implementations do not support a User Authority Mask Override Register (UAMOR), the operating system may apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. The hypervisor 496 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element 483. In one embodiment, the CSRP is one of the registers 445 containing the effective address of an area in the application's address space 482 for the graphics acceleration module 446 to save and restore the context state. This pointer is optional if no state is required to be saved between jobs or when a job is preempted. The context save/restore area may be pinned system memory.
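For illustration only, the parameter bundle passed on that system call might be modeled as a plain structure like the one below; the field names, widths, and values are hypothetical assumptions and are not drawn from this document.

    #include <cstdint>

    // Hypothetical layout of the parameters an application passes on the OS system
    // call described above; field names and widths are illustrative assumptions.
    struct GraphicsAccelCall {
        uint32_t moduleType;            // graphics acceleration module type (system-specific value)
        uint64_t workDescriptor;        // WD: a command, a pointer to a user structure, or a command-queue pointer
        uint64_t amrValue;              // Authority Mask Register state for the current process
        uint64_t contextSaveRestorePtr; // CSRP: effective address of the save/restore area
    };

    int main() {
        GraphicsAccelCall call{ 1u, 0x1000u, ~0ull, 0x2000u };  // placeholder values
        (void)call;
        return 0;
    }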

“Upon receiving the system call, the operating system 495 may verify that the application 480 has registered and been given the authority to use the graphics acceleration module 446. The operating system 495 then calls the hypervisor 496 with the information shown in Table 3.

“TABLE 3 — OS to Hypervisor Call Parameters

“Upon receiving the hypervisor call, the hypervisor 496 verifies that the operating system 495 has registered and been given the authority to use the graphics acceleration module 446. The hypervisor 496 then puts the process element 483 into the process element linked list for the corresponding graphics acceleration module 446 type. The process element may include the information shown in Table 4.

“TABLE 4 — Process Element Information
1 An Authority Mask Register (AMR) value (potentially masked)
3 An effective address (EA) Context Save/Restore Area Pointer (CSRP)
5 A virtual address (VA) hypervisor accelerator utilization record pointer (AURP)
6 An optional thread ID (TID)
7 A logical partition ID (LPID)
11 A real address (RA) hypervisor accelerator utilization record pointer and Storage Descriptor Register (SDR)

“In one embodiment, the hypervisor initializes a plurality of accelerator integration slice 490 registers 445.

“As illustrated in FIG. 4F, one embodiment employs a unified memory addressable via a common virtual memory address space used to access the physical processor memories 401-402 and GPU memories 420-423. In this implementation, operations executed on the GPUs 410-413 utilize the same virtual/effective memory address space to access the processor memories 401-402 and vice versa, which simplifies programmability. In one embodiment, a first portion of the virtual/effective address space is allocated to the processor memory 401, a second portion to the second processor memory 402, a third portion to the GPU memory 420, and so on. The entire virtual/effective memory space (sometimes referred to as the effective address space) is thereby distributed across each of the processor memories 401-402 and GPU memories 420-423, allowing any processor or GPU to access any physical memory with a virtual address mapped to that memory.

“In one embodiment, bias/coherence management circuitry 494A-494E within one or more of the MMUs 439A-439E ensures cache coherence between the caches of the host processors (e.g., 405) and the GPUs 410-413 and implements biasing techniques indicating the physical memories in which certain types of data should be stored. While multiple instances of bias/coherence management circuitry 494A-494E are illustrated in FIG. 4F, the bias/coherence circuitry may be implemented within the MMU of one or more host processors 405 and/or within the accelerator integration circuit 436.

One embodiment allows GPU-attached memory 420-423 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology, but without suffering the typical performance drawbacks associated with full system cache coherence. The ability to access GPU-attached memory 420-423 as system memory without onerous cache coherence overhead provides a beneficial operating environment for GPU offload. This arrangement allows the host processor 405 to set up operands and access computation results without the overhead of traditional I/O DMA data copies. Such traditional copies involve driver calls, interrupts, and memory-mapped I/O (MMIO) accesses, all of which are inefficient relative to simple memory accesses. At the same time, the ability to access GPU-attached memory 420-423 without cache coherence overheads can be critical to the execution time of an offloaded computation. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by a GPU 410-413. The efficiency of operand setup, the efficiency of results access, and the efficiency of GPU computation all play a role in determining the effectiveness of GPU offload.

A selection between GPU bias and host processor bias is driven by a bias tracker data structure in one implementation. A bias table may be used, for example, which may be a page-granular structure (i.e., controlled at the granularity of a memory page) that includes 1 or 2 bits per GPU-attached memory page. The bias table may be implemented in a stolen memory range of one or more GPU-attached memories 420-423, with or without a bias cache in the GPU 410-413 (e.g., to cache frequently/recently used entries of the bias table). Alternatively, the entire bias table may be maintained within the GPU.

“In one implementation, the bias table entry associated with each access to the GPU-attached memory 420-423 is accessed prior to the actual access to the GPU memory, causing the following operations. First, local requests from the GPU 410-413 that find their page in GPU bias are forwarded directly to a corresponding GPU memory 420-423. Local requests from the GPU that find their page in host bias are forwarded to the processor 405 (e.g., over a high-speed link as discussed above). In one embodiment, requests from the processor 405 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page may be forwarded to the GPU 410-413, which may then transition the page to host processor bias if it is not currently using the page.
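The page-granular lookup and routing described above can be pictured with a short sketch. The following CUDA/C++ fragment is a minimal illustration only; the structure names, the one-bit-per-page encoding, and the routing helper are assumptions made for clarity and do not correspond to the actual hardware tables of the embodiment.

#include <cstdint>
#include <cstddef>

enum class PageBias : uint8_t { HostBias = 0, GpuBias = 1 };

// Hypothetical page-granular bias table: one bit per GPU-attached memory page,
// e.g. kept in a stolen range of the GPU-attached memory.
struct BiasTable {
    const uint64_t* bits;        // packed bias bits
    std::size_t     page_shift;  // log2 of the page size (e.g. 12 for 4 KiB pages)

    PageBias lookup(uint64_t gpu_phys_addr) const {
        uint64_t page = gpu_phys_addr >> page_shift;
        uint64_t word = bits[page / 64];
        return ((word >> (page % 64)) & 1ull) ? PageBias::GpuBias : PageBias::HostBias;
    }
};

// Illustrative routing decision made before the actual memory access.
enum class Route { LocalGpuMemory, ForwardToHost };

Route route_gpu_request(const BiasTable& table, uint64_t gpu_phys_addr) {
    // A GPU-local request whose page is in GPU bias goes straight to the
    // attached GPU memory; a page in host bias is forwarded to the host
    // processor over the high-speed link.
    return table.lookup(gpu_phys_addr) == PageBias::GpuBias
               ? Route::LocalGpuMemory
               : Route::ForwardToHost;
}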

“The bias state of a page can be changed either by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism.”

One mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU’s device driver, which in turn sends a message (or enqueues a command descriptor) to the GPU directing it to change the bias state and, for some transitions, to perform a cache flushing operation in the host. The cache flushing operation is required for a transition from host processor 405 bias to GPU bias, but is not required for the opposite transition.
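A driver-mediated bias change can be sketched as follows. Every name here is hypothetical: flush_host_cache_range() and enqueue_bias_update() merely stand in for whatever flush primitive and command path a real driver would use. The sketch only reproduces the asymmetry described above, where the host-to-GPU transition flushes host caches and the reverse transition does not.

#include <cstdint>
#include <cstddef>
#include <cstdio>

enum class PageBias { HostBias, GpuBias };
constexpr std::size_t kPageSize = 4096;   // page size assumed for the sketch

// Stubs standing in for driver/hardware actions (hypothetical).
void flush_host_cache_range(uint64_t addr, std::size_t len) {
    std::printf("flush host caches: %llu bytes at %llx\n",
                (unsigned long long)len, (unsigned long long)addr);
}
void enqueue_bias_update(uint64_t page_addr, PageBias new_bias) {
    std::printf("command descriptor: page %llx bias=%d\n",
                (unsigned long long)page_addr, (int)new_bias);
}

// Hypothetical driver entry point reached via a runtime API call; not a
// real OpenCL or driver interface.
void set_page_bias(uint64_t page_addr, PageBias old_bias, PageBias new_bias) {
    if (old_bias == new_bias) return;
    // Host-to-GPU transitions require a host cache flush; the reverse does not.
    if (old_bias == PageBias::HostBias && new_bias == PageBias::GpuBias)
        flush_host_cache_range(page_addr, kPageSize);
    enqueue_bias_update(page_addr, new_bias);
}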

“In one embodiment, cache coherency is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 405. To access these pages, the processor 405 may request access from the GPU 410, which may or may not grant access right away depending on the implementation. Thus, to reduce communication between the processor 405 and the GPU 410, it is beneficial to ensure that GPU-biased pages are those which are required by the GPU but not the host processor 405, and vice versa.

“Graphics Processing Pipeline.”

“FIG. 5 illustrates a graphics processing pipeline 500, according to an embodiment. In one embodiment, a graphics processor can implement the illustrated graphics processing pipeline 500. The graphics processor can be included within the parallel processing subsystems described herein, such as the parallel processor 200 of FIG. 2A, which, in one embodiment, is a variant of the parallel processor(s) 112 of FIG. 1. The various parallel processing systems can implement the graphics processing pipeline 500 via one or more instances of the parallel processing unit (e.g., parallel processing unit 202 of FIG. 2A) as described herein. For example, a shader unit (e.g., the graphics multiprocessor 234 of FIG. 2D) may be configured to perform the functions of one or more of a vertex processing unit 504, a tessellation control processing unit 508, a tessellation evaluation processing unit 512, a geometry processing unit 516, and a fragment/pixel processing unit 524. The functions of the data assembler 502 may also be performed by other processing engines within a processing cluster (e.g., processing cluster 214 of FIG. 2C) and a corresponding partition unit (e.g., partition unit 220A-220N of FIG. 2B). The graphics processing pipeline 500 may also be implemented using dedicated processing units for one or more functions. In one embodiment, one or more portions of the graphics processing pipeline 500 can be performed by parallel processing logic within a general purpose processor (e.g., CPU). In one embodiment, one or more portions of the graphics processing pipeline 500 can access on-chip memory (e.g., parallel processor memory 222 as in FIG. 2A) via a memory interface 528, which may be an instance of the memory interface 218 of FIG. 2A.”

Summary for “Efficient thread group scheduling

“Current parallel graphic data processing” refers to systems and methods that can perform specific operations on graphics data, such as linear interpolation (linear interpolation), tessellation (rasterization), texture mapping, depth test, etc. Graphic processors were traditionally based on fixed-function computational units for processing graphics data. However, recent developments have made portions of the graphics processors programmable. This allows them to perform a greater variety of operations to process vertex and fragment data.

“To increase their performance, graphics processors often implement processing techniques like pipelining, which attempt to process as many graphics data in parallel as possible across the various parts of the graphics pipeline. Parallel graphics processors that use single instruction, multiple thread architectures (SIMT), are designed to maximize parallel processing within the graphics pipeline. SIMT architectures are composed of multiple threads that attempt to execute program instructions simultaneously as many times as possible in order to improve processing efficiency. You can find a general overview of hardware and software for SIMT architectures in Shane Cook’s CUDA Programming Chapter 3, pages 37-51 (2013), and/or Nicholas Wilt’s CUDA Handbook: A Comprehensive Guide to GPU Programming Sections 2.6.2 through 3.1.2 (June 2013).

“Machine Learning has been successful in solving many types of tasks. Parallel implementations are possible because of the complexity of machine learning algorithms, such as neural networks. Parallel processors, such as general-purpose graphics processing units (GPGPUs), have been a key part in the implementation of deep neural network. The parallel graphics processors that have a single instruction, multiple thread architecture (SIMT), are intended to maximize the number of parallel processing within the graphics pipeline. SIMT architectures are composed of multiple parallel threads that attempt to execute program instructions as quickly as possible. This increases processing efficiency. Parallel machine learning algorithm implementations provide high efficiency and allow for the use of large networks.

“Conventional methods for handling thread groups such as scheduling, prioritizing and dealing with dependencies are inefficient in terms consumption of system processing resource resources such as time, bandwidth and power.”

“Embodiments are a new technique to employ and use an intelligent thread dispatch mechanism for data disruption across compute clusters.” Embodiments also allow for prefetching thread group input data to caches when threads are loaded. This is in addition to vectorization of atomic operation.

“It should be noted that terms such as?convolutional neuronet?, CNN? or?neural net?, etc., may be used interchangeably throughout this document. You may also see terms such as?autonomous device? Or simply?machine’,?autonomous vehicule? Or simply?vehicle,?autonomous agent or simply ?agent?, ?autonomous device? This document may use interchangeably the terms?computing device’,?robot? and/or similar.

“In certain embodiments, a graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”
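As a rough, hedged illustration of the work-descriptor idea, the sketch below shows a host-side structure holding a pointer to a command sequence together with a stub submission routine. All names are invented for this example; in a real system the descriptor would be handed to the GPU through a device driver rather than a direct function call.

#include <cstdint>
#include <cstdio>

// Hypothetical work descriptor: either a single command or a pointer to a
// queue of commands the GPU should process.
struct WorkDescriptor {
    const uint32_t* commands;   // sequence of commands/instructions
    uint32_t        count;      // number of command words
    uint64_t        context_id; // identifies the submitting process/context
};

// Stub standing in for the driver path that hands the descriptor to the GPU.
void submit_to_gpu(const WorkDescriptor& wd) {
    std::printf("submitting %u command words for context %llu\n",
                wd.count, static_cast<unsigned long long>(wd.context_id));
}

int main() {
    uint32_t cmds[4] = {0x1, 0x2, 0x3, 0x4};   // placeholder command words
    WorkDescriptor wd{cmds, 4, /*context_id=*/7};
    submit_to_gpu(wd);                         // cores may allocate work this way
    return 0;
}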

“In the following description, numerous specific details are set forth; however, the embodiments described herein may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.

“System Overview I”

“FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset or may be integrated within the one or more processor(s) 102. The memory hub 105 couples with an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. Additionally, the I/O hub 107 can enable a display controller, which may be included in the one or more processor(s) 102, to provide outputs to one or more display device(s) 110A. In one embodiment, the one or more display device(s) 110A coupled with the I/O hub 107 can include a local, internal, or embedded display device.

“In one embodiment, the processing subsystem 101 includes one or more parallel processor(s) 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 may be one of any number of standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or may be a vendor-specific communications interface or communications fabric. In one embodiment, the one or more parallel processor(s) 112 form a computationally focused parallel or vector processing system that may include a large number of processing cores and/or processing clusters, such as many integrated core (MIC) processors. In one embodiment, the one or more parallel processor(s) 112 form a graphics processing subsystem that can output pixels to one of the one or more display device(s) 110A coupled via the I/O hub 107. The one or more parallel processor(s) 112 can also include a display controller and display interface (not shown) to enable a direct connection to one or more display device(s) 110B.

“Within the I/O subsystem 111, a system storage unit 114 can connect to the I/O hub 107 to provide a storage mechanism for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism to enable connections between the I/O hub 107 and other components, such as a network adapter 118 and/or a wireless network adapter 119 that may be integrated into the platform, and various other devices that can be added via one or more add-in device(s) 120. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.

“The computing system 100 can include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, and the like. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCI-Express), or any other bus or point-to-point communication interfaces and/or protocol(s), such as the NVLink high-speed interconnect, or interconnect protocols known in the art.

“In one embodiment, the one or more parallel processor(s) 112 incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In another embodiment, the one or more parallel processor(s) 112 incorporate circuitry optimized for general purpose processing, while preserving the underlying computational architecture. In yet another embodiment, components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processor(s) 112, memory hub 105, and processor(s) 102 can be integrated into a single integrated circuit. Alternatively, the components of the computing system 100 can be integrated into a single package to form a system in package (SIP) configuration. In one embodiment, at least a portion of the components of the computing system 100 can be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system.

“It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to the processor(s) 102 directly rather than through a bridge, while other devices communicate with system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and memory hub 105 may be integrated into a single chip. Some embodiments may include two or more sets of processor(s) 102 attached via multiple sockets, which can couple with two or more instances of the parallel processor(s) 112.

“Some of the particular components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of add-in cards or peripherals may be supported, or some components may be eliminated. Furthermore, some architectures may use different terminology for components similar to those illustrated in FIG. 1. For example, the memory hub 105 may be referred to as a Northbridge in some architectures, while the I/O hub 107 may be referred to as a Southbridge.

“FIG. 2A illustrates a parallel processor 200, according to an embodiment. The various components of the parallel processor 200 may be implemented using one or more integrated circuit devices, such as programmable processors, application specific integrated circuits (ASICs), or field programmable gate arrays (FPGAs). The illustrated parallel processor 200 is a variant of the one or more parallel processor(s) 112 shown in FIG. 1, according to an embodiment.

“In one embodiment, the parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 may be directly connected to other devices. In one embodiment, the I/O unit 204 connects with other devices via the use of a hub or switch interface, such as the memory hub 105. The connections between the memory hub 105 and the I/O unit 204 form the communication link 113. Within the parallel processing unit 202, the I/O unit 204 connects with a host interface 206 and a memory crossbar 216, where the host interface 206 receives commands directed to performing processing operations and the memory crossbar 216 receives commands directed to performing memory operations.

“When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment, the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. In one embodiment, the scheduler 210 ensures that the processing cluster array 212 is properly configured and in a valid state before tasks are distributed to the processing clusters of the processing cluster array 212.

“The processing cluster array 212 can include up to ‘N’ processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. The scheduling can be handled dynamically by the scheduler 210, or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212.
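A simple software analogue of such work distribution is round-robin assignment of pending thread groups to clusters, sketched below. This is not the scheduler 210 itself, only an illustration of one of the simpler allocation policies the paragraph alludes to; the types are invented for the example.

#include <cstddef>
#include <queue>
#include <vector>

struct ThreadGroup { int id; };               // hypothetical unit of work

// Round-robin distribution of pending thread groups over N clusters.
std::vector<std::vector<ThreadGroup>>
distribute(std::queue<ThreadGroup> pending, std::size_t num_clusters) {
    std::vector<std::vector<ThreadGroup>> per_cluster(num_clusters);
    std::size_t next = 0;
    while (!pending.empty()) {
        per_cluster[next].push_back(pending.front());
        pending.pop();
        next = (next + 1) % num_clusters;     // rotate across clusters 214A-214N
    }
    return per_cluster;
}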

“In one embodiment, different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations.”

“The processing cluster array 212 can be configured to perform various types of parallel processing operations. In one embodiment, the processing cluster array 212 is configured to perform general-purpose parallel compute operations. For example, the processing cluster array 212 can include logic to execute processing tasks including filtering of video and/or audio data, performing modeling operations, including physics operations, and performing data transformations.

“In one embodiment, the processing cluster array 212 is configured to perform parallel graphics processing operations. In embodiments in which the parallel processor 200 is configured to perform graphics processing operations, the processing cluster array 212 can include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic to perform texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 212 can be configured to execute graphics processing related shader programs such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 202 can transfer data from system memory via the I/O unit 204 for processing. During processing, the transferred data can be stored to on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.

“In one embodiment, when the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 can be configured to divide the processing workload into approximately equal-sized tasks, to better enable distribution of the graphics processing operations to the multiple clusters 214A-214N of the processing cluster array 212. In some embodiments, portions of the processing cluster array 212 can be configured to perform different types of processing. For example, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen space operations, to produce a rendered image for display. Intermediate data produced by one or more of the clusters 214A-214N may be stored in buffers to allow the intermediate data to be transmitted between clusters 214A-214N for further processing.

“During operation, the processing cluster array 212 can receive processing tasks to be executed via the scheduler 210, which receives commands defining processing tasks from the front end 208. For graphics processing operations, processing tasks can include indices of data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The scheduler 210 may be configured to fetch the indices corresponding to the tasks or may receive the indices from the front end 208. The front end 208 can also be configured to ensure the processing cluster array 212 is configured to a valid state before the workload specified by incoming command buffers (e.g., batch buffers, push buffers, etc.) is initiated.”

“Each of the one or more instances of the parallel processing unit 202 can couple with parallel processor memory 222. The parallel processor memory 222 can be accessed via the memory crossbar 216, which can receive memory requests from the processing cluster array 212 as well as from the I/O unit 204. The memory crossbar 216 can access the parallel processor memory 222 via a memory interface 218. The memory interface 218 can include multiple partition units (e.g., partition unit 220A, partition unit 220B, through partition unit 220N) that can each couple to a portion (e.g., memory unit) of parallel processor memory 222. In one implementation, the number of partition units 220A-220N is configured to be equal to the number of memory units, such that a first partition unit 220A has a corresponding first memory unit 224A, a second partition unit 220B has a corresponding memory unit 224B, and an Nth partition unit 220N has a corresponding Nth memory unit 224N. In other embodiments, the number of partition units 220A-220N may not be equal to the number of memory devices.

“In various embodiments, the memory units 224A-224N can include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. The memory units 224A-224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). The specific implementation of the memory units 224A-224N can vary and can be selected from one of various conventional designs. Render targets, such as frame buffers or texture maps, may be stored across the memory units 224A-224N, allowing the partition units 220A-220N to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processor memory 222. In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory.
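The render-target striping across partition units can be made concrete with a small address-swizzling sketch. The 256-byte interleave granularity chosen here is an arbitrary assumption for the example, not a parameter disclosed by the embodiment.

#include <cstdint>

// Hypothetical interleaving of a linear address across N partition units,
// each backed by its own memory unit 224A-224N.
struct Interleave {
    uint32_t num_partitions;   // e.g. N partition units 220A-220N
    uint32_t granularity;      // bytes per stripe, assumed 256 here

    // Which partition unit a byte address lands on.
    uint32_t partition_of(uint64_t addr) const {
        return static_cast<uint32_t>((addr / granularity) % num_partitions);
    }
    // Byte offset of that address within its partition's memory unit.
    uint64_t offset_within(uint64_t addr) const {
        uint64_t stripe = addr / granularity;
        return (stripe / num_partitions) * granularity + (addr % granularity);
    }
};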

“In one embodiment, any one of the clusters 214A-214N of the processing cluster array 212 can process data that will be written to any of the memory units 224A-224N within parallel processor memory 222. The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 through the memory crossbar 216 to read from or write to various external memory devices. In one embodiment, the memory crossbar 216 has a connection to the memory interface 218 to communicate with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, enabling the processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment, the memory crossbar 216 can use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N.

While a single instance of the parallel processing unit 202 is illustrated within the parallel processor 200, any number of instances of the parallel processing unit 202 can be included. For example, multiple instances of the parallel processing unit 202 can be provided on a single add-in card, or multiple add-in cards can be interconnected. The different instances of the parallel processing unit 202 can be configured to interoperate even if the instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in one embodiment, some instances of the parallel processing unit 202 can include higher precision floating point units relative to other instances. Systems incorporating one or more instances of the parallel processing unit 202 or the parallel processor 200 can be implemented in a variety of configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.

“FIG. 2B is a block diagram of a partition unit 220, according to an embodiment. In one embodiment, the partition unit 220 is an instance of one of the partition units 220A-220N of FIG. 2A. As illustrated, the partition unit 220 includes an L2 cache 221, a frame buffer interface 225, and a ROP 226 (raster operations unit). The L2 cache 221 is a read/write cache that is configured to perform load and store operations received from the memory crossbar 216 and the ROP 226. Read misses and urgent write-back requests are output by the L2 cache 221 to the frame buffer interface 225 for processing. Dirty updates can also be sent to the frame buffer via the frame buffer interface 225 for opportunistic processing. In one embodiment, the frame buffer interface 225 interfaces with one of the memory units in parallel processor memory, such as the memory units 224A-224N of FIG. 2A (e.g., within parallel processor memory 222).

The ROP 226 is a processing unit that performs raster operations such as stencil, z-test, blending, and the like. The ROP 226 then outputs processed graphics data that is stored in graphics memory. In some embodiments, the ROP 226 includes compression logic to compress z or color data that is written to memory and decompress z or color data that is read from memory. In some embodiments, the ROP 226 is included within each processing cluster (e.g., cluster 214A-214N of FIG. 2A) instead of within the partition unit 220. In such an embodiment, read and write requests for pixel data are transmitted over the memory crossbar 216 instead of pixel fragment data.

“The processed graphics data can be displayed on a display device, such as one of the one or more display device(s) 110 of FIG. 1, routed for further processing by the processor(s) 102, or routed for further processing by one of the processing entities within the parallel processor 200 of FIG. 2A.”

“FIG. 2C is a block diagram of a processing cluster 214 within a parallel processing unit, according to an embodiment. In one embodiment, the processing cluster is an instance of one of the processing clusters 214A-214N of FIG. 2A. The processing cluster 214 can be configured to execute many threads in parallel, where the term ‘thread’ refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each one of the processing clusters. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
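Because the document repeatedly contrasts SIMD and SIMT execution, a tiny CUDA kernel may help: threads of the same warp take different branches based on their own data, which SIMT hardware handles without requiring identical instructions on every lane. This is a generic illustration, not code from the embodiment.

#include <cuda_runtime.h>
#include <cstdio>

// Each thread follows one of two divergent paths based on its own input value.
__global__ void divergent_scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] < 0.0f)
        out[i] = -in[i] * 2.0f;   // path A: some lanes of the warp
    else
        out[i] = in[i] * 0.5f;    // path B: the remaining lanes
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (i % 2) ? -1.0f * i : 1.0f * i;

    divergent_scale<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    std::printf("out[1]=%f out[2]=%f\n", out[1], out[2]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}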

“Operation of the processing cluster 214 can be controlled via a pipeline manager 232 that distributes processing tasks to SIMT parallel processors. The pipeline manager 232 receives instructions from the scheduler 210 of FIG. 2A and manages execution of those instructions via a graphics multiprocessor 234 and/or a texture unit 236. The illustrated graphics multiprocessor 234 is an exemplary instance of a SIMT parallel processor; however, various types of SIMT parallel processors of differing architectures may be included within the processing cluster 214. One or more instances of the graphics multiprocessor 234 can be included within a processing cluster 214. The graphics multiprocessor 234 can process data, and a data crossbar 240 can be used to distribute the processed data to one of multiple possible destinations, including other shader units. The pipeline manager 232 can facilitate the distribution of processed data by specifying destinations for the processed data to be distributed via the data crossbar 240.

“Each graphics multiprocessor 234 within the processing cluster 214 can include an identical set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner in which new instructions can be issued before previous instructions are complete. The functional execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, Boolean operations, bit shifting, and computation of various algebraic functions. In one embodiment, the same functional-unit hardware can be leveraged to perform different operations, and any combination of functional units may be present.

“The instructions transmitted to the processing cluster 214 constitute a thread. A set of threads executing across the set of parallel processing engines is a thread group. A thread group executes the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within a graphics multiprocessor 234. A thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 234; when it does, one or more of the processing engines may be idle during cycles in which that thread group is being processed. A thread group may also include more threads than the number of processing engines within the graphics multiprocessor 234; when it does, processing can be performed over consecutive clock cycles. In one embodiment, multiple thread groups can be executed concurrently on a graphics multiprocessor 234.
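The relationship between thread-group size and the number of processing engines reduces to simple arithmetic; the sketch below computes how many issue passes a group needs, under the stated assumption that each engine executes one thread per pass. The engine counts used are example values, not properties of the graphics multiprocessor 234.

#include <cstdio>

// Passes needed to execute a thread group on a fixed number of engines,
// assuming one thread per engine per pass (illustrative only).
unsigned passes_needed(unsigned threads_in_group, unsigned processing_engines) {
    return (threads_in_group + processing_engines - 1) / processing_engines;
}

int main() {
    std::printf("%u\n", passes_needed(24, 32));  // 1 pass, 8 engines idle
    std::printf("%u\n", passes_needed(96, 32));  // 3 passes over consecutive cycles
    return 0;
}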

“In one embodiment, the graphics multiprocessor 234 includes an internal cache memory to perform load and store operations. In one embodiment, the graphics multiprocessor 234 can forgo an internal cache and use a cache memory (e.g., L1 cache 308) within the processing cluster 214. Each graphics multiprocessor 234 also has access to the L2 caches within the partition units (e.g., partition units 220A-220N of FIG. 2A) that are shared among all processing clusters 214 and that may be used to transfer data between threads. The graphics multiprocessor 234 may also access off-chip global memory, which can include one or more of local parallel processor memory and/or system memory. Any memory external to the parallel processing unit 202 may be used as global memory. Embodiments in which the processing cluster 214 includes multiple instances of the graphics multiprocessor 234 can share common instructions and data, which may be stored in the L1 cache 308.

“Each processing cluster 214 may include an MMU 245 (memory management unit) that is configured to map virtual addresses into physical addresses. In other embodiments, one or more instances of the MMU 245 may reside within the memory interface 218 of FIG. 2A. The MMU 245 may include address translation lookaside buffers (TLBs) or caches that may reside within the graphics multiprocessor 234, the L1 cache, or the processing cluster 214. The physical address is processed to distribute surface data access locality to allow efficient request interleaving among partition units. A cache line index may be used to determine whether a request for a cache line is a hit or a miss.

A processing cluster 214 can be configured for both graphics and compute applications. In graphics and computing applications, each graphics multiprocessor 234 may couple to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache within the graphics multiprocessor 234, and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed. Each graphics multiprocessor 234 outputs processed tasks to the data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 216. A preROP 242 (pre-raster operations unit) is configured to receive data from the graphics multiprocessor 234 and direct data to ROP units, which may be located with the partition units described herein (e.g., partition units 220A-220N of FIG. 2A). The preROP 242 unit can perform optimizations for color blending, organize pixel color data, and perform address translations.

“It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing units may be included within a processing cluster 214. Further, while only one processing cluster 214 is shown, a parallel processing unit as described herein may include any number of instances of the processing cluster 214. In one embodiment, each processing cluster 214 can be configured to operate independently of other processing clusters 214 using separate and distinct processing units, L1 caches, etc.

“FIG. 2D shows a graphics multiprocessor 234, according to one embodiment. In such an embodiment, the graphics multiprocessor 234 couples with the pipeline manager 232 of the processing cluster 214. The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.

“In one embodiment, the instruction cache 252 receives a stream of instructions to execute from the pipeline manager 232. The instructions are cached in the instruction cache 252 and dispatched for execution by the instruction unit 254. The instruction unit 254 can dispatch instructions as thread groups (e.g., warps), with each thread of the thread group assigned to a different execution unit within the GPGPU cores 262. An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.

“The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units of the graphics multiprocessor 234. In one embodiment, the register file 258 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 258. In one embodiment, the register file 258 is divided between the different warps being executed by the graphics multiprocessor 234.

“The GPGPU cores 262 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234. The GPGPU cores 262 can be similar in architecture or can differ in architecture, according to embodiments. For example, in one embodiment, a first portion of the GPGPU cores 262 includes a single precision FPU and an integer ALU, while a second portion includes a double precision FPU. In one embodiment, the FPUs can implement the IEEE 754-2008 standard for floating point arithmetic or enable variable precision floating point arithmetic. The graphics multiprocessor 234 can additionally include one or more fixed function or special function units to perform specific functions such as copy rectangle or pixel blending operations. In one embodiment, one or more of the GPGPU cores can also include fixed or special function logic.

“The memory and cache interconnect 268 is an interconnect network that connects each of the functional units of the graphics multiprocessor 234 to the register file 258 and to the shared memory 270. In one embodiment, the memory and cache interconnect 268 is a crossbar interconnect that allows the load/store unit 266 to implement load and store operations between the shared memory 270 and the register file 258. The register file 258 can operate at the same frequency as the GPGPU cores 262, so data transfer between the GPGPU cores 262 and the register file 258 is very low latency. The shared memory 270 can be used to enable communication between threads that execute on the functional units within the graphics multiprocessor 234. The cache memory 272 can be used as a data cache, for example, to cache texture data communicated between the functional units and the texture unit 236. The shared memory 270 can also be used as a program managed cache. Threads executing on the GPGPU cores 262 can programmatically store data within the shared memory in addition to the automatically cached data that is stored within the cache memory 272.
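The notion of shared memory as a program-managed cache alongside an automatic data cache maps directly onto the familiar CUDA __shared__ qualifier; the kernel below is a standard, generic illustration of that usage rather than vendor-specific code for this embodiment.

#include <cuda_runtime.h>

// Block-wide sum using shared memory explicitly managed by the program,
// while ordinary global loads may still be cached automatically.
// Assumes blockDim.x is a power of two and no larger than 256.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the thread block using the shared tile.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}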

“FIGS. 3A-3B illustrate additional graphics multiprocessors, according to embodiments. The illustrated graphics multiprocessors 325 and 350 are variants of the graphics multiprocessor 234 of FIG. 2D.

“FIG. 3A shows a graphics multiprocessor 325 according to an additional embodiment. The graphics multiprocessor 325 includes multiple additional instances of execution resource units relative to the graphics multiprocessor 234 of FIG. 2D. For example, the graphics multiprocessor 325 can include multiple instances of the instruction unit 332A-332B, register file 334A-334B, and texture unit(s) 344A-344B. The graphics multiprocessor 325 also includes multiple sets of graphics or compute execution units (e.g., GPGPU core 336A-336B, GPGPU core 337A-337B, GPGPU core 338A-338B) and multiple sets of load/store units 340A-340B. In one embodiment, the execution resource units have a common instruction cache 333, texture and/or data cache memory 342, and shared memory 346. The various components can communicate via an interconnect fabric 327. In one embodiment, the interconnect fabric 327 includes one or more crossbar switches to enable communication between the various components of the graphics multiprocessor 325.

“FIG. 3B shows a graphics multiprocessor 350 according to an additional embodiment. The graphics multiprocessor 350 includes multiple sets of execution resources 356A-356D, where each set of execution resources includes multiple instruction units, register files, GPGPU cores, and load/store units, as illustrated in FIG. 2D and FIG. 3A. In one embodiment, the execution resources 356A-356D can share an instruction cache 354 and shared memory 362, as well as multiple instances of a texture and/or data cache memory 358A-358B. The various components can communicate via an interconnect fabric 352 similar to the interconnect fabric 327 of FIG. 3A.”

“Persons skilled in the art will understand that the architectures described in FIGS. 1, 2A-2D, and 3A-3B are descriptive only and not limiting as to the scope of the present embodiments. Thus, the techniques described herein may be implemented on any properly configured processing unit, including, without limitation, one or more mobile application processors, one or more desktop or server central processing units (CPUs), one or more parallel processing units, such as the parallel processing unit 202 of FIG. 2A, as well as one or more graphics processors or special purpose processing units, without departing from the scope of the embodiments described herein.

“In certain embodiments, a parallel processing unit or GPGPU as described herein is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”

“Techniques to Interconnect GPU and Host Processors”

“FIG. 4A illustrates an exemplary architecture in which a plurality of GPUs 410-413 are communicatively coupled to a plurality of multi-core processors 405-406 over high-speed links 440-443 (e.g., buses, point-to-point interconnects, etc.). In one embodiment, the high-speed links 440-443 support a communication throughput of 4 GB/s, 30 GB/s, 80 GB/s, or higher, depending on the implementation. Various interconnect protocols may be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. However, the underlying principles of the invention are not limited to any particular communication protocol or throughput.

“In addition, in one embodiment, two or more of the GPUs 410-413 are interconnected over high-speed links 444-445, which may be implemented using the same or different protocols/links than those used for the high-speed links 440-443. Similarly, two or more of the multi-core processors 405-406 may be connected over a high-speed link 433, which may be a symmetric multi-processor (SMP) bus operating at 20 GB/s, 30 GB/s, 120 GB/s, or higher. Alternatively, all communication between the various system components shown in FIG. 4A may be accomplished using the same protocols/links (e.g., over a common interconnection fabric). As mentioned, however, the underlying principles of the invention are not limited to any particular type of interconnect technology.

“In one embodiment, each multi-core processor 405-406 is communicatively coupled to a processor memory 401-402 via memory interconnects 430-431, respectively, and each GPU 410-413 is communicatively coupled to GPU memory 420-423 over GPU memory interconnects 450-453, respectively. The memory interconnects 430-431 and 450-453 may utilize the same or different memory access technologies. By way of example, and not limitation, the processor memories 401-402 and GPU memories 420-423 may be volatile memories such as dynamic random access memories (DRAMs), Graphics DDR SDRAM (GDDR), or High Bandwidth Memory (HBM), and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In one embodiment, some portion of the memories may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

“Although the various processors 405-406 and GPUs 410-413 may be physically coupled to a particular memory 401-402 and 420-423, respectively, a unified memory architecture may be implemented in which the same virtual system address space (also referred to as the ‘effective address’ space) is distributed among all of the various physical memories. For example, the processor memories 401-402 may each comprise 64 GB of the system memory address space and the GPU memories 420-423 may each comprise 32 GB of the system memory address space, resulting in a total of 256 GB of addressable memory in this example (2×64 GB+4×32 GB).
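Using the example allocation above (two 64 GB processor memories followed by four 32 GB GPU memories), a flat effective address can be mapped to its backing physical memory with simple range arithmetic. The ordering of the memories within the address space is assumed for this sketch and is not specified by the embodiment.

#include <cstdint>
#include <cstdio>

constexpr uint64_t GiB = 1ull << 30;

// Which physical memory backs a given effective address in the example
// 2 x 64 GB + 4 x 32 GB layout (ordering assumed for this sketch).
const char* backing_memory(uint64_t ea) {
    if (ea < 64 * GiB)  return "processor memory 401";
    if (ea < 128 * GiB) return "processor memory 402";
    uint64_t gpu_off = ea - 128 * GiB;
    if (gpu_off < 4 * 32 * GiB) {
        static const char* names[4] = {"GPU memory 420", "GPU memory 421",
                                       "GPU memory 422", "GPU memory 423"};
        return names[gpu_off / (32 * GiB)];
    }
    return "unmapped";
}

int main() {
    std::printf("%s\n", backing_memory(70 * GiB));   // processor memory 402
    std::printf("%s\n", backing_memory(200 * GiB));  // GPU memory 422
    return 0;
}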

“FIG. 4B illustrates additional details for an interconnection between a multi-core processor 407 and a graphics acceleration module 446 in accordance with one embodiment. The graphics acceleration module 446 may include one or more GPU chips integrated on a line card that is coupled to the processor 407 via the high-speed link 440. Alternatively, the graphics acceleration module 446 may be integrated on the same package or chip as the processor 407.

“The illustrated processor 407 includes a plurality of cores 460A-460D, each with a translation lookaside buffer 461A-461D and one or more caches 462A-462D. The cores may include various other components for executing instructions and processing data, which are not illustrated to avoid obscuring the underlying principles of the invention. The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 426 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches; in this embodiment, one L2 cache and one L3 cache are shared by two adjacent cores. The processor 407 and the graphics accelerator integration module 446 connect with system memory 441, which may include the processor memories 401-402.

“Coherency is maintained for data and instructions stored in the various caches 462A-462D, 456 via inter-core communication over a coherence bus 464. For example, each cache may have cache coherency logic/circuitry associated with it to communicate over the coherence bus 464 in response to detected reads or writes to particular cache lines. In one implementation, a cache snooping protocol is implemented over the coherence bus 464 to snoop cache accesses. Cache snooping/coherency techniques are well understood by those of skill in the art and will not be described in detail here, to avoid obscuring the underlying principles of the invention.

“In one embodiment, a proxy circuit 425 communicatively couples the graphics acceleration module 446 to the coherence bus 464, allowing the graphics acceleration module 446 to participate in the cache coherence protocol as a peer of the cores. In particular, an interface 435 provides connectivity to the proxy circuit 425 over the high-speed link 440 (e.g., a PCIe bus or NVLink), and an interface 437 connects the graphics acceleration module 446 to the link 440.

“In one implementation, an accelerator integration circuit 436 provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines 431, 432, N of the graphics acceleration module 446. The graphics processing engines 431, 432, N may each comprise a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 431, 432, N may comprise different types of graphics processing engines within a GPU, such as media processing engines (e.g., video encoders/decoders), samplers, and blit engines. In other words, the graphics acceleration module may be a GPU with a plurality of graphics processing engines 431-432, N, or the graphics processing engines 431-432, N may be individual GPUs integrated on a common package, line card, or chip.

“In one embodiment, the accelerator integration circuit 436 includes a memory management unit (MMU) 439 for performing various memory management functions such as virtual-to-physical memory translations (also referred to as effective-to-real memory translations) and memory access protocols for accessing system memory 441. The MMU 439 may also include a translation lookaside buffer (TLB) for caching virtual/effective to physical/real address translations. In one implementation, a cache 438 stores commands and data for efficient access by the graphics processing engines 431-432, N. In one embodiment, the data stored in the cache 438 and graphics memories 433-434, N is kept coherent with the core caches 462A-462D, 456 and system memory 411. As mentioned, this may be accomplished via the proxy circuit 425, which takes part in the cache coherency mechanism on behalf of the cache 438 and memories 433-434, N (e.g., sending updates to the cache 438 related to modifications/accesses of cache lines on processor caches 462A-462D, 456 and receiving updates from the cache 438).”

A set of registers 445 stores context data for threads executed by the graphics processing engines 431-432, N, and a context management circuit 448 manages the thread contexts. For example, the context management circuit 448 may perform save and restore operations to save and restore the contexts of the various threads during context switches (e.g., where a first thread is saved and a second thread is stored so that the second thread can be executed by a graphics processing engine). On a context switch, the context management circuit 448 may store the current register values to a designated region in memory (e.g., identified by a context pointer), and may then restore the register values when returning to the context. In one embodiment, an interrupt management circuit 447 receives and processes interrupts received from system devices.
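A software caricature of this save/restore flow is given below. The register layout, the context image, and the pointer handling are all invented for the sketch; they only mirror the two-step pattern described above (save the outgoing thread's state, then load the incoming thread's state).

#include <cstdint>
#include <cstring>

// Hypothetical per-thread context image referenced by a context pointer.
struct EngineContext {
    uint64_t regs[16];      // stand-in for the engine register state
    uint64_t program_ctr;
};

// Save the live register state to the designated memory region, then load
// the state of the next thread, mirroring the described save/restore flow.
void context_switch(EngineContext* live, EngineContext* save_area,
                    const EngineContext* next) {
    std::memcpy(save_area, live, sizeof(EngineContext));  // save first thread
    std::memcpy(live, next, sizeof(EngineContext));       // restore second thread
}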

“In one implementation, virtual/effective addresses from a graphics processing engine 431 are translated to real/physical addresses in system memory 411 by the MMU 439. One embodiment of the accelerator integration circuit 436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 446 and/or other accelerator devices. The graphics accelerator module 446 may be dedicated to a single application executed on the processor 407 or may be shared between multiple applications. In one embodiment, a virtualized graphics execution environment is presented in which the resources of the graphics processing engines 431-432, N are shared with multiple applications or virtual machines (VMs). The resources may be subdivided into ‘slices’ which are allocated to different VMs and/or applications based on their processing requirements and priorities.

“The accelerator integration circuit 436 thus acts as an interface to the system for the graphics acceleration module 446, providing address translation and system memory cache services. In addition, the accelerator integration circuit 436 may provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines, interrupts, and memory management.

“Because the hardware resources of the graphics processing engines 431-432, N are mapped explicitly to the real address space seen by the host processor 407, any host processor can address these resources directly using an effective address value. One function of the accelerator integration circuit 436, in one embodiment, is the physical separation of the graphics processing engines 431-432, N so that they appear to the system as independent units.

“As noted, one or more graphics memories 433-434, M are coupled to each of the graphics processing engines 431-432, N, respectively. The graphics memories 433-434, M store instructions and data being processed by each of the graphics processing engines 431-432, N.

“In one embodiment, to reduce data traffic over the link 440, biasing techniques are used to ensure that the data stored in the graphics memories 433-434, M is data which will be used most frequently by the graphics processing engines 431-432, N and preferably not used (at least not frequently) by the cores 460A-460D. Similarly, the biasing mechanism attempts to keep data needed by the cores (and preferably not the graphics processing engines 431-432, N) within the caches 462A-462D, 456 of the cores and system memory 411.

“FIG. 4C illustrates another embodiment in which the accelerator integration circuit 436 is integrated within the processor 407. In this embodiment, the graphics processing engines 431-432, N communicate directly over the high-speed link 440 with the accelerator integration circuit 436 via interface 437 and interface 435 (which, again, may utilize any form of bus or interface protocol). The accelerator integration circuit 436 may perform the same operations as those described with respect to FIG. 4B, but potentially at a higher throughput given its close proximity to the coherence bus 464 and the caches 462A-462D, 456.

“One embodiment supports different programming models, including a dedicated-process programming model (no graphics acceleration module virtualization) and shared programming models (with virtualization). The latter may include programming models which are controlled by the accelerator integration circuit 436 and programming models which are controlled by the graphics acceleration module 446.

“In one embodiment of the dedicated-process model, the graphics processing engines 431-432, N are dedicated to a single application or process under a single operating system. The single application can funnel other application requests to the graphics engines 431-432, N, providing virtualization within a VM/partition.

“In the dedicated-process programming models, the graphics processing engines 431-432, N may be shared by multiple VM/application partitions. The shared models require a system hypervisor to virtualize the graphics processing engines 431-432, N to allow access by each operating system. For single-partition systems without a hypervisor, the graphics processing engines 431-432, N are owned by the operating system. In both cases, the operating system can virtualize the graphics processing engines 431-432, N to provide access to each process or application.

“For the shared programming model, the graphics acceleration module 446 or an individual graphics processing engine 431-432, N selects a process element using a process handle. In one embodiment, process elements are stored in system memory 411 and are addressable using the effective address to real address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 431-432, N (that is, calling system software to add the process element to the process element linked list). The lower 16 bits of the process handle may be the offset of the process element within the process element linked list.
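The handle-to-offset relationship in the last sentence is literally a 16-bit mask; a one-line helper makes it explicit. The 64-bit width of the handle is an assumption made for this sketch.

#include <cstdint>

// Offset of the process element within the process element linked list,
// taken from the lower 16 bits of an (assumed 64-bit) process handle.
inline uint32_t process_element_offset(uint64_t process_handle) {
    return static_cast<uint32_t>(process_handle & 0xFFFFull);
}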

“FIG. 4D illustrates an exemplary accelerator integration slice 490. As used herein, a ‘slice’ comprises a specified portion of the processing resources of the accelerator integration circuit 436. Application effective address space 482 within system memory 411 stores process elements 483. In one embodiment, the process elements 483 are stored in response to GPU invocations 481 from applications 480 executed on the processor 407. A process element 483 contains the process state for the corresponding application 480. A work descriptor (WD) 484 contained in the process element 483 can be a single job requested by an application or may contain a pointer to a queue of jobs. In the latter case, the WD 484 is a pointer to the job request queue in the application’s address space 482.

“The graphics acceleration module 446 and/or the individual graphics processing engines 431-432, N can be shared by all or a subset of the processes in the system. Embodiments of the invention include an infrastructure for setting up the process state and sending a WD 484 to a graphics acceleration module 446 to start a job in a virtualized environment.

“In one implementation, the dedicated-process programming model is implementation-specific. In this model, a single process owns the graphics acceleration module 446 or an individual graphics processing engine 431. Because the graphics acceleration module 446 is owned by a single process, the hypervisor initializes the accelerator integration circuit 436 for the owning partition, and the operating system initializes the accelerator integration circuit 436 for the owning process at the time the graphics acceleration module 446 is assigned.

“In operation, a WD fetch unit 491 in the accelerator integration slice 490 fetches the next WD 484, which includes an indication of the work to be done by one of the graphics processing engines of the graphics acceleration module 446. Data from the WD 484 may be stored in registers 445 and used by the MMU 439, interrupt management circuit 447, and/or context management circuit 448, as illustrated. For example, one embodiment of the MMU 439 includes segment/page walk circuitry for accessing segment/page tables within the OS virtual address space 485. The interrupt management circuit 447 may process interrupt events 492 received from the graphics acceleration module 446. When performing graphics operations, an effective address 493 generated by a graphics processing engine 431-432, N is translated to a real address by the MMU 439.

“In one embodiment, the same set of registers 445 is duplicated for each graphics processing engine 431-432, N and/or graphics acceleration module 446, and the registers may be initialized by the hypervisor or the operating system. Each of these duplicated registers may be included in an accelerator integration slice 490. Exemplary registers that may be initialized by the hypervisor are shown in Table 1.

“TABLE 1
Hypervisor Initialized Registers
1 Slice Control Register
2 Real address (RA) Scheduled Processes Area Pointer
3 Authority Mask Register
4 Interrupt Vector Table Entry Offset
5 State Register
6 Logical Partition ID
8 Hypervisor Accelerator Utilization Record Pointer
9 Storage Description Register

Table 2 shows examples of registers that could be initialized by an operating system.

“TABLE 2
Operating System Initialized Registers
1 Process and Thread Identification
2 Effective address (EA) Context Save/Restore Pointer
3 Virtual address (VA) Accelerator Utilization Record Pointer
4 Virtual address (VA) Storage Segment Table Pointer
5 Authority Mask
6 Work descriptor

“In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N. It contains all the information a graphics processing engine 431-432, N requires to do its work, or it can be a pointer to a memory location where the application has set up a command queue of work to be completed.

“FIG. 4E illustrates additional details for one embodiment of a shared model. This embodiment includes a hypervisor real address space 498 in which a process element list 499 is stored. The hypervisor real address space 498 is accessible via a hypervisor 496, which virtualizes the graphics acceleration module engines for the operating system 495.

“The shared programming models allow all or a subset of processes from all or a subset of partitions in the system to use a graphics acceleration module 446. There are two programming models in which the graphics acceleration module 446 is shared by multiple processes and partitions: time-sliced shared and graphics-directed shared.

In this model, the system hypervisor 496 owns the graphics acceleration module 446 and makes its function available to all operating systems 495. For the graphics acceleration module 446 to support virtualization by the system hypervisor 496, the graphics acceleration module 446 may be required to guarantee that an application’s job request completes within a specified amount of time, including any translation faults, and, when operating in the directed shared programming model, to guarantee fairness between processes.

“In one embodiment, the shared model requires the application 480 to make an operating system 495 system call with a graphics acceleration module 446 type, a work descriptor (WD), an Authority Mask Register (AMR) value, and a context save/restore area pointer (CSRP). The graphics acceleration module 446 type describes the targeted acceleration function for the system call and may be a system-specific value. The WD is formatted specifically for the graphics acceleration module 446 and can be in the form of a graphics acceleration module 446 command, an address pointer to a user-defined structure, an address pointer to a queue of commands, or any other data structure that describes the work to be done by the graphics acceleration module 446. In one embodiment, the AMR value is the AMR state to use for the current process, and the value passed to the operating system is similar to an application setting the AMR. If the accelerator integration circuit 436 and graphics acceleration module 446 implementations do not support a User Authority Mask Override Register (UAMOR), the operating system may apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. The hypervisor 496 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element 483. In one embodiment, the CSRP is one of the registers 445 containing the effective address of an area in the application’s address space 482 for the graphics acceleration module 446 to save and restore the context state. This pointer is optional if no state is required to be saved between jobs or when a job is preempted. The context save/restore area may be pinned system memory.
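To make the parameter list concrete, the sketch below bundles the values the application hands to the operating system in this shared model. Field names and widths are assumptions made for illustration; the real mechanism is an OS-specific system call rather than this struct, and the masking semantics in apply_uamor() are likewise assumed.

#include <cstdint>

// Hypothetical bundle of arguments for the shared-model system call.
struct GfxAccelSyscallArgs {
    uint32_t module_type;   // graphics acceleration module 446 type (system-specific)
    uint64_t wd;            // work descriptor: command or pointer to a command queue
    uint64_t amr;           // Authority Mask Register value for the current process
    uint64_t csrp;          // effective address of the context save/restore area
};

// Sketch of the OS-side step: without UAMOR support, the OS may fold the
// current UAMOR value into the AMR before calling the hypervisor.
uint64_t apply_uamor(uint64_t amr, uint64_t current_uamor) {
    return amr & ~current_uamor;   // masking semantics assumed for illustration
}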

Upon receiving the system call, the operating system 495 verifies that the application 480 is registered and has been given authority to use the graphics acceleration module 446. The operating system 495 then calls the hypervisor 496 with the information shown in Table 3.

TABLE 3
OS to Hypervisor Call Parameters

The hypervisor 496 verifies that the operating system 495 is registered and has been given authority to use the graphics acceleration module 446. The hypervisor 496 then places the process element 483 into the process element linked list for the corresponding graphics acceleration module 446 type. The process element may include the information shown in Table 4.

TABLE 4
Process Element Information
1 An Authority Mask Register (AMR) value (potentially masked)
3 An effective address (EA) Context Save/Restore Area Pointer (CSRP)
5 A virtual address (VA) accelerator utilization record pointer (AURP)
6 An optional thread ID (TID)
7 A logical partition ID (LPID)
11 A real address (RA) hypervisor accelerator utilization record pointer (SDR)
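To illustrate the bookkeeping implied by Tables 2-4, here is a minimal sketch of a process element record and of the hypervisor-side insertion into a per-module-type linked list. The structure layout, the choice of fields, and the list discipline are assumptions made for illustration.

```cuda
#include <cstdint>

// Hypothetical process element carrying a subset of the Table 4 fields.
struct ProcessElement {
    uint64_t amr;          // Authority Mask Register value (potentially masked)
    uint64_t csrp_ea;      // effective address of the context save/restore area
    uint64_t aurp_va;      // virtual address of the accelerator utilization record
    uint32_t thread_id;    // optional thread ID (TID)
    uint32_t lpid;         // logical partition ID (LPID)
    ProcessElement* next;  // link to the next element for this module type
};

// Assumed per-module-type list heads kept in the hypervisor real address
// space (cf. process element list 499); 16 types is an arbitrary choice.
ProcessElement* g_process_element_list[16] = {};

// Hypervisor-side step sketched above: after validating the operating system,
// prepend the new process element to the list for the requested module type.
void hypervisor_add_process_element(uint32_t gam_type, ProcessElement* pe) {
    pe->next = g_process_element_list[gam_type % 16];
    g_process_element_list[gam_type % 16] = pe;
}
```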

In one embodiment, the hypervisor initializes a plurality of accelerator integration slice 490 registers 445.

As illustrated in FIG. 4F, this implementation allows operations executed on the GPUs 410-413 to use the same virtual/effective memory address space to access the processor memories 401-402 and vice versa, which simplifies programmability. In one embodiment, a first portion of the virtual/effective address space is allocated to the processor memory 401, a second portion to the second processor memory 402, a third portion to the GPU memory 420, and so on. The entire virtual/effective memory space (sometimes called the effective address space) is thus distributed across each of the processor memories 401-402 and GPU memories 420-423, allowing any processor or GPU to access any physical memory with a virtual address mapped to that memory.
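The range-based partitioning described above can be sketched as a simple address-to-backing lookup. The base addresses, the 1 TiB region size, and the helper name resolve_backing are purely illustrative assumptions.

```cuda
#include <cstdint>

// Illustrative partitioning of a single virtual/effective address space across
// two processor memories and one GPU memory.
enum class Backing { ProcessorMemory401, ProcessorMemory402, GpuMemory420 };

constexpr uint64_t kTiB         = 1ull << 40;
constexpr uint64_t kProc402Base = 1 * kTiB;  // processor memory 401 backs [0 TiB, 1 TiB)
constexpr uint64_t kGpu420Base  = 2 * kTiB;  // processor memory 402 backs [1 TiB, 2 TiB)
constexpr uint64_t kGpu420End   = 3 * kTiB;  // GPU memory 420 backs       [2 TiB, 3 TiB)

// Map a virtual/effective address to the physical memory assumed to back it.
Backing resolve_backing(uint64_t va) {
    if (va < kProc402Base) return Backing::ProcessorMemory401;
    if (va < kGpu420Base)  return Backing::ProcessorMemory402;
    return Backing::GpuMemory420;  // assumes va < kGpu420End
}
```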

In one embodiment, bias/coherence management circuitry 494A-494E within one or more of the MMUs 439A-439E ensures cache coherence between the caches of the host processors (e.g., 405) and the GPUs 410-413, and implements biasing techniques indicating the physical memories in which certain types of data should reside. While multiple instances of bias/coherence circuitry 494A-494E are illustrated in FIG. 4F, the bias/coherence circuitry may be implemented within the MMU of one or more host processors 405 and/or within the accelerator integration circuit 436.

One embodiment allows GPU-attached memory 420-423 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology, without suffering the typical performance drawbacks associated with full system cache coherence. The ability to access GPU-attached memory 420-423 as system memory without onerous cache coherence overhead eases GPU offload. This arrangement allows the host processor 405 to set up operands and access computation results without the overhead of traditional I/O DMA data copies, which involve driver calls, interrupts, and memory-mapped I/O (MMIO) accesses that are all inefficient relative to simple memory accesses. The ability to access GPU-attached memory 420-423 without cache coherence overheads can also be critical to the execution time of an offloaded computation. When there is substantial streaming write traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by the GPUs 410-413. The efficiency of operand setup, the efficiency of results access, and the efficiency of GPU computation all play a role in determining the effectiveness of GPU offload.
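The contrast drawn above, between staging data with explicit DMA copies and simply touching a shared address space, can be illustrated with CUDA managed memory. This is only an analogy to the shared-virtual-memory arrangement in the description, not the mechanism it claims; the kernel, sizes, and scaling factor are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // GPU writes directly into the shared allocation
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both host and device: the host sets up operands
    // and reads results in place, with no explicit cudaMemcpy staging copies.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // host sets up operands

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // offloaded computation
    cudaDeviceSynchronize();                          // wait before the host reads results

    printf("data[0] = %f\n", data[0]);                // host accesses the result directly
    cudaFree(data);
    return 0;
}
```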

In one implementation, the selection between GPU bias and host processor bias is driven by a bias tracker data structure. For example, a bias table may be used, which may be a page-granular structure (i.e., controlled at the granularity of a memory page) that includes 1 or 2 bits per GPU-attached memory page. The bias table may be implemented in a stolen memory range of one or more GPU-attached memories 420-423, with or without a bias cache in the GPU 410-413 (e.g., to cache frequently/recently used entries of the bias table). Alternatively, the entire bias table may be maintained within the GPU.

In one implementation, the bias table entry associated with each access to the GPU-attached memory 420-423 is accessed prior to the actual access to the GPU memory, causing the following operations. First, local requests from the GPU 410-413 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 420-423. Local requests from the GPU that find their page in host bias are forwarded to the processor 405. In one embodiment, requests from the processor 405 that find the requested page in host processor bias are completed normally. Alternatively, requests directed to a GPU-biased page may be forwarded to the GPU 410-413, which may then transition the page to host processor bias if it is not currently using the page.
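Below is a minimal sketch of the page-granular bias lookup and the routing decision described in the last two paragraphs. The 4 KiB page size, the one-bit-per-page packing, and the function names are assumptions made for illustration.

```cuda
#include <cstdint>

// Assumed encoding: one bias bit per GPU-attached memory page (0 = host bias,
// 1 = GPU bias), packed eight pages per byte in a stolen memory range.
enum class Bias : uint8_t { Host = 0, Gpu = 1 };

constexpr uint64_t kPageSize = 4096;  // assumed page granularity

Bias lookup_bias(const uint8_t* bias_table, uint64_t gpu_mem_offset) {
    uint64_t page = gpu_mem_offset / kPageSize;
    uint8_t bit = (bias_table[page / 8] >> (page % 8)) & 1u;
    return bit ? Bias::Gpu : Bias::Host;
}

// Routing decision sketched above: a local GPU request goes straight to GPU
// memory when the page is GPU-biased, otherwise it is forwarded to the host.
enum class Route { LocalGpuMemory, ForwardToHostProcessor };

Route route_gpu_request(const uint8_t* bias_table, uint64_t gpu_mem_offset) {
    return lookup_bias(bias_table, gpu_mem_offset) == Bias::Gpu
               ? Route::LocalGpuMemory
               : Route::ForwardToHostProcessor;
}
```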

The bias state can be changed by a software-based or hardware-assisted mechanism or, in some cases, by a purely hardware-based mechanism.

One way to change the bias state is through an API call (e.g., OpenCL), which in turn calls the GPU’s device driver, which in turn sends a message (or enqueues a command descriptor) to the GPU directing it to change the bias state and, for some transitions, to perform a cache flushing operation in the host. The cache flushing operation is required for a transition from host processor 405 bias to GPU bias, but it is not required for the reverse transition.
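A hypothetical driver-side sketch of the transition rule just described follows: host caches are flushed only when a page moves from host bias to GPU bias. The functions flush_host_caches_for_page and send_bias_change_to_gpu are invented placeholders, not real driver or OpenCL APIs.

```cuda
#include <cstdint>

enum class PageBias : uint8_t { Host = 0, Gpu = 1 };

// Invented placeholders standing in for the host cache-flush operation and for
// the message/command descriptor sent to the GPU; neither is a real API.
void flush_host_caches_for_page(uint64_t page_addr);
void send_bias_change_to_gpu(uint64_t page_addr, PageBias new_bias);

// Transition rule from the paragraph above: host -> GPU requires a host cache
// flush; GPU -> host does not.
void change_page_bias(uint64_t page_addr, PageBias old_bias, PageBias new_bias) {
    if (old_bias == new_bias) return;
    if (old_bias == PageBias::Host && new_bias == PageBias::Gpu) {
        flush_host_caches_for_page(page_addr);  // required for host-to-GPU transitions
    }
    send_bias_change_to_gpu(page_addr, new_bias);
}
```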

In one embodiment, cache coherency is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 405. Depending on the implementation, the processor 405 may request access from the GPU 410 in order to access these pages. Thus, to reduce communication between the processor 405 and the GPU 410, it is beneficial to ensure that GPU-biased pages are those required by the GPU but not the host processor 405.

Graphics Processing Pipeline

FIG. 5 illustrates a graphics processing pipeline 500, according to an embodiment. In one embodiment, a graphics processor can implement the illustrated graphics processing pipeline 500. The graphics processor can be included within the parallel processing subsystems described herein, such as the parallel processor 200 of FIG. 2A, which, in one embodiment, is a variant of the parallel processor(s) 112 of FIG. 1. Parallel processing systems can implement the graphics processing pipeline 500 via one or more instances of the parallel processing unit (e.g., parallel processing unit 200 of FIG. 2A) as described herein. For example, a shader unit (e.g., the graphics multiprocessor 234 of FIG. 2D) may be configured to perform the functions of one or more of a vertex processing unit 504, a tessellation control processing unit 508, a tessellation evaluation processing unit 512, a geometry processing unit 516, and a fragment/pixel processing unit 524. The functions of the data assembler 502 may also be performed by other processing engines within a processing cluster (e.g., processing cluster 214 of FIG. 3A) and a corresponding partition unit (e.g., partition unit 220A-220N of FIG. 2C). The graphics processing pipeline 500 can also be implemented using dedicated processing units for one or more functions. In one embodiment, a portion of the graphics processing pipeline 500 can be executed by parallel processing logic within a general-purpose processor (e.g., CPU). In one embodiment, one or more portions of the graphics processing pipeline 500 can access on-chip memory (e.g., parallel processor memory 222 as in FIG. 2A) via a memory interface 528, which may be an instance of the memory interface 218 of FIG. 2A.
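As a reading aid for the stage numbering above, here is a tiny sketch listing the programmable stages in pipeline order, together with the idea that a single shader unit can be configured to execute any of them. The enum, the function, and the mapping of stages to programs are illustrative assumptions; only the reference numerals come from the description.

```cuda
// Illustrative enumeration of the programmable stages named above, in pipeline
// order; the values mirror the reference numerals used in the description.
enum class PipelineStage : int {
    VertexProcessing        = 504,
    TessellationControl     = 508,
    TessellationEvaluation  = 512,
    GeometryProcessing      = 516,
    FragmentPixelProcessing = 524
};

// Sketch of the idea that one shader unit (e.g., a graphics multiprocessor)
// can be configured to perform any of these stage functions.
void run_on_shader_unit(PipelineStage stage) {
    switch (stage) {
        case PipelineStage::VertexProcessing:        /* run vertex program       */ break;
        case PipelineStage::TessellationControl:     /* run tess control program */ break;
        case PipelineStage::TessellationEvaluation:  /* run tess eval program    */ break;
        case PipelineStage::GeometryProcessing:      /* run geometry program     */ break;
        case PipelineStage::FragmentPixelProcessing: /* run fragment program     */ break;
    }
}
```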


How to Search for Patents

A patent search is the first step toward getting your patent. You can do a Google patent search or a USPTO search. “Patent pending” describes a product covered by a pending patent application; you can search Public PAIR to find the application. After the patent office approves the application, you can do a patent number lookup to locate the issued patent, and your product is then patented. You can also use the USPTO search engine, as described below, and you can get help from a patent attorney. Patents in the United States are granted by the United States Patent and Trademark Office (USPTO), which also reviews trademark applications.

Are you interested in similar patents? These are the steps to follow:

1. Brainstorm terms to describe your invention, based on its purpose, composition, or use.

Write down a brief but precise description of the invention. Don’t use generic terms such as “device,” “process,” or “system.” Consider synonyms for the terms you chose initially, and take note of important technical terms and keywords.

Use the questions below to help you identify keywords or concepts.

  • What is the purpose of the invention? Is it a utilitarian device or an ornamental design?
  • Is the invention a way to make something or perform a function, or is it a product?
  • What is the invention made of? What is its physical composition?
  • What is the invention used for?
  • What technical terms and keywords describe the invention’s nature? A technical dictionary can help you find the right terms.

2. Use these terms to look for relevant Cooperative Patent Classifications with the Classification Search Tool. If you are unable to find the right classification for your invention, scan through the classification’s class schemes (class schedules) and try again. If you don’t get any results from the Classification Text Search, consider substituting synonyms for the words you used to describe your invention.

3. Check the CPC Classification Definition to confirm the CPC classification you found. If the selected classification title has a blue box with a “D” to its left, the hyperlink will take you to a CPC classification definition. CPC classification definitions will help you determine the scope of the classification so that you can choose the most relevant one. These definitions may also include search tips or other suggestions that could be helpful for further research.

4. Use the Patents Full-Text and Image database to retrieve patent documents that include the CPC classification you selected. By focusing on the abstracts and representative drawings, you can narrow your search to the most relevant patent publications.

5. Review this selection of patent publications closely for any similarities to your invention, paying particular attention to the claims and the specification. Additional patents can be found in the references cited by the applicant and the patent examiner.

6. Retrieve published patent applications that match the CPC classification you chose in Step 3. Use the same search strategy as in Step 4, narrowing your results to the most relevant patent applications by reviewing the abstracts and representative drawings on each page. Then examine the published patent applications carefully, paying special attention to the claims and the other drawings.

7. You can broaden your search with keyword searching of US patent publications in the AppFT or PatFT databases, with classification searching of non-US patents as described below, and with web searches for non-patent literature disclosures about inventions. Here are some examples:

  • Add keywords to your search. Keyword searches may turn up documents that are not well categorized or whose classifications were missed in Step 2; US patent examiners, for example, often supplement their classification searches with keyword searches. Use technical engineering terminology rather than everyday words.
  • Search foreign patents using the CPC classification by re-running the search in international patent office search engines such as Espacenet, the European Patent Office’s worldwide database of over 130 million patent publications. Other national patent office databases are also available.
  • Search non-patent literature. Inventions are often disclosed in non-patent publications, so search journals, books, websites, technical catalogs, conference proceedings, and other print and electronic publications.

To review your search, you can hire a registered patent attorney to assist. A preliminary search will help you prepare to discuss your invention and related prior art with the attorney, so that the attorney does not need to spend as much time or money on the basics.
