Artificial Intelligence – Pradeep Janedula, Bijoy Pazhanimala, Bharat Daga, Saurabh Dhoble, Intel Corp

Abstract for “System and Method for an Optimized Winograd Convolution Accelerator”

“One embodiment includes a compute apparatus for performing machine learning operations. The compute apparatus comprises a hardware accelerator that includes a compute unit to perform a Winograd convolution. The compute unit is configured to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size.
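
To make the kernel-size reuse concrete, here is a minimal NumPy sketch of one-dimensional Winograd convolution F(2,3), using the standard transform matrices from the Winograd convolution literature (Lavin & Gray). The same fixed transform built for a 3-tap kernel also handles a 2-tap kernel once the kernel is zero-padded, illustrating in software the general idea of computing one kernel size with a transform associated with another; it is not the claimed hardware datapath.

```python
import numpy as np

# Standard 1D Winograd F(2,3) transform matrices: two outputs of a
# 3-tap filter from a 4-sample input tile, using 4 multiplies.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of correlating a 4-sample tile d with a 3-tap kernel g."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])          # input tile
g3 = np.array([0.5, 1.0, -1.0])             # native 3-tap kernel
print(winograd_f23(d, g3))                  # [-0.5  0.]
print(np.correlate(d, g3, mode='valid'))    # matches the direct result

# Kernel-size reuse: a 2-tap kernel zero-padded to 3 taps goes through
# the same fixed transforms, so one datapath serves multiple kernel sizes.
g2 = np.array([0.5, 1.0])
print(winograd_f23(d, np.append(g2, 0.0)))      # [2.5 4.]
print(np.correlate(d, g2, mode='valid')[:2])    # first two direct outputs
```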

Background for “System and Method for an Optimized Winograd Convolution Accelerator”

“Machine learning has been successful at solving many kinds of tasks. The computations that arise in machine learning algorithms, such as neural networks, lend themselves naturally to parallel implementations. Parallel processors, such as general-purpose graphics processing units (GPGPUs), have played a significant role in the practical implementation of deep neural networks. Implementing machine learning systems based on deep learning may require large amounts of memory and computing power. Deep neural network models can occupy many megabytes of storage and can require billions of floating-point operations per second to process. These requirements may prevent the deployment of many neural network models to low-power computing devices. This is especially true for devices suited to the Internet of Things (IoT), which consists mainly of low-end embedded devices.

“In some embodiments, a GPU is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or other interconnect. In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”

FIGS. 1-14 provide an overview of an exemplary data processing system and graphics processor logic that incorporates or relates to the various embodiments. FIGS. 15-26 provide specific details of the various embodiments. FIGS. 27-31 provide an overview of machine learning hardware and software architecture. Some aspects of the following embodiments are described with reference to a graphics processor, while other aspects are described with respect to a general-purpose processor, such as a central processing unit (CPU). Similar techniques and teachings can be applied to other types of circuits or semiconductor devices, including but not limited to a GPU cluster, a many integrated core processor, or one or more instances of a field-programmable gate array (FPGA). In general, the teachings are applicable to any processor or machine that manipulates or processes image (e.g., sample, pixel), vertex, or geometry data. The embodiments described herein may be practiced without some of these specific details, and in other instances well-known features are not described in order to avoid obscuring the details of the present embodiments.

“System Overview”

“FIG. 1 is a block diagram of a processing system 100, according to an embodiment. The system 100 may include one or more processors 102 and one or more graphics processors 108, and may be a single-processor desktop system, a multiprocessor workstation, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the system 100 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

“In one embodiment, the system 100 may include, or be incorporated within, a server-based gaming platform or a game console, including a game and media console, a mobile gaming console, a handheld game console, or an online game console. In some instances, the system 100 is a smart phone, mobile phone, tablet computing device, or mobile Internet device. The processing system 100 may also include, couple with, or be integrated within a wearable device, such as a smart watch wearable device, smart eyewear device, augmented reality device, or virtual reality device. In some embodiments, the processing system 100 is a television or set-top box device having one or more processors 102 and a graphical interface generated by the one or more graphics processors 108.

“In some embodiments, the one or more processors 102 each include one or more processor cores 107 to process instructions which, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 107 is configured to process a specific instruction set 109. In some embodiments, the instruction set 109 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 107 may each process a different instruction set 109, which may include instructions to facilitate the emulation of other instruction sets. Processor core 107 may also include other processing devices, such as a Digital Signal Processor (DSP).

“In some embodiments, the processor 102 also includes cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 102. In some embodiments, the processor 102 also uses an external cache (e.g., a Level-3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among processor cores 107 using known cache coherency techniques. A register file 106 may additionally be included in processor 102 and may include different types of registers for storing different types of data (e.g., integer registers and floating-point registers). Some registers may be general-purpose registers, while other registers may be specific to the design of the processor 102.

“In some embodiments, one or more processors 102 are coupled with a processor bus 110 to transmit communication signals such as address, data, or control signals between the processors 102 and other components in the system 100. In one embodiment, the processor bus 110 is a Direct Media Interface (DMI) bus. In one embodiment, the processor(s) 102 include an integrated memory controller 116 and a peripheral controller hub 130. The memory controller 116 facilitates communication between a memory device and other components of the system 100, while the peripheral controller hub (PCH) 130 provides connections to I/O devices via a local I/O bus.

The memory device 120 can be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. In one embodiment, the memory device 120 can operate as system memory for the system 100, to store data 122 and instructions 121 for use when the one or more processors 102 execute an application or process. An optional external graphics processor 112 may also couple with the memory controller 116 to perform graphics and media operations. In some embodiments, a display device 111 can connect to the processor(s) 102. The display device 111 can be an internal display device or an external display device attached via a display interface. In one embodiment, the display device 111 can be a head-mounted display (HMD), such as a stereoscopic display device for use in virtual reality (VR) or augmented reality (AR) applications.

“In some embodiments, the peripheral controller hub 130 enables peripherals to connect to memory device 120 and processor 102 via a high-speed I/O bus. The I/O peripherals include, but are not limited to, an audio controller 146, a network controller 134, a firmware interface 128, a wireless transceiver 126, touch sensors 125, and a data storage device 124 (e.g., hard disk drive, flash memory, etc.). The data storage device 124 can connect via a storage interface (e.g., SATA) or via a peripheral bus (e.g., PCI, PCI Express). The touch sensors 125 can include touch screen sensors, pressure sensors, or fingerprint sensors. The wireless transceiver 126 can be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver such as a 4G or Long Term Evolution (LTE) transceiver. The firmware interface 128 enables communication with system firmware and can be, for example, a unified extensible firmware interface (UEFI). The network controller 134 can enable a network connection to a wired network. In some embodiments, a high-performance network controller (not shown) couples with the processor bus 110. The audio controller 146, in one embodiment, is a multi-channel high-definition audio controller. In one embodiment, the system 100 includes an optional legacy I/O controller 140 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. The peripheral controller hub 130 can also connect to one or more Universal Serial Bus (USB) controllers 142 to connect input devices, such as keyboard and mouse combinations, a camera, or other USB input devices.

It will be appreciated that the system 100 shown is exemplary and not limiting, as other types of data processing systems that are differently configured may also be used. For example, an instance of the memory controller 116 and peripheral controller hub 130 may be integrated into a discrete external graphics processor, such as the external graphics processor 112. In one embodiment, the peripheral controller hub 130 and/or memory controller 116 may be external to the one or more processors 102. For example, the system 100 can include an external memory controller 116 and peripheral controller hub 130, which may be configured as a memory controller hub and peripheral controller hub within a system chipset that is in communication with the processor(s) 102.

“FIG. 2 is a block diagram of an embodiment of a processor 200 having one or more processor cores 202A-202N, an integrated memory controller 214, and an integrated graphics processor 208. Elements of FIG. 2 having the same reference numbers or names as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. Processor 200 can include additional cores, represented by the dashed-lined boxes. Each of the processor cores 202A-202N includes one or more internal cache units 204A-204N. In some embodiments, each processor core also has access to one or more shared cache units 206.

The internal cache units 204A-204N and shared cache units 206 represent a cache memory hierarchy within the processor 200. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as Level 2 (L2), Level 3 (L3), or Level 4 (L4), where the highest level of cache before external memory is classified as the LLC. In some embodiments, cache coherency logic maintains coherency between the various cache units 206 and 204A-204N.

“In some embodiments, processor 200 may also include a set of one or more bus controller units 216 and a system agent core 210. The one or more bus controller units 216 manage a set of peripheral buses, such as one or more PCI or PCI Express buses. System agent core 210 provides management functionality for the various processor components. In some embodiments, system agent core 210 includes one or more integrated memory controllers 214 to manage access to various external memory devices.

“In some embodiments, one or more of the processor cores 202A-202N include support for simultaneous multi-threading. In such embodiments, the system agent core 210 includes components for coordinating and operating cores 202A-202N during multi-threaded processing. System agent core 210 may additionally include a power control unit (PCU), which includes logic and components to regulate the power state of processor cores 202A-202N and graphics processor 208.

“In some embodiments, processor 200 additionally includes graphics processor 208 to execute graphics processing operations. In some embodiments, the graphics processor 208 couples with the set of shared cache units 206 and the system agent core 210, including the one or more integrated memory controllers 214. In some embodiments, the system agent core 210 also includes a display controller 211 to drive graphics processor output to one or more coupled displays. Display controller 211 may be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within the graphics processor.

“In some embodiments, a ring-based interconnect unit 212 is used to couple the internal components of the processor 200. However, an alternative interconnect unit may be used, such as a point-to-point interconnect, a switched interconnect, or other techniques well known in the art. In some embodiments, graphics processor 208 couples with the ring interconnect 212 via an I/O link 213.

“The exemplary I/O link 213 represents at least one of multiple varieties of I/O interconnects, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 218, such as an eDRAM module. In some embodiments, each of the processor cores 202A-202N and graphics processor 208 use the embedded memory module 218 as a shared Last Level Cache.

“In some embodiments, processor cores 202A-202N are homogeneous cores executing the same instruction set architecture. In another embodiment, processor cores 202A-202N are heterogeneous in terms of instruction set architecture (ISA), where one or more of the processor cores 202A-202N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, processor cores 202A-202N are heterogeneous in terms of microarchitecture, where one or more cores having a relatively higher power consumption couple with one or more cores having a lower power consumption. Additionally, processor 200 can be implemented on one or more chips or as an SoC integrated circuit having the illustrated components.

“FIG. 3 is a block diagram of a graphics processor 300, which may be a discrete graphics processing unit or a graphics processor integrated with a plurality of processing cores. In some embodiments, the graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor and with commands placed into the processor memory. In some embodiments, graphics processor 300 includes a memory interface 314 to access memory. Memory interface 314 can be an interface to local memory, one or more internal caches, or system memory.

“In some embodiments, graphics processor 300 also includes a display controller 302 to drive display output data to a display device 320. Display controller 302 includes hardware for one or more overlay planes for the display and composition of multiple layers of video or user interface elements. The display device 320 can be an internal or external display device. In one embodiment, the display device 320 is a head-mounted display device, such as a virtual reality (VR) display device or an augmented reality (AR) display device. In some embodiments, graphics processor 300 includes a video codec engine 306 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group formats such as MPEG-2, Advanced Video Coding formats such as H.264/MPEG-4 AVC, the Society of Motion Picture & Television Engineers 421M/VC-1, and Joint Photographic Experts Group formats such as JPEG and Motion JPEG.

“In some embodiments, graphics processor 300 includes a block image transfer (BLIT) engine 304 to perform two-dimensional (2D) rasterizer operations including, for example, bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 310. In some embodiments, GPE 310 is a compute engine for performing graphics operations, including media operations and three-dimensional (3D) graphics operations.

“In some embodiments, GPE 310 includes a 3D pipeline 312 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangle, triangle, etc.). The 3D pipeline 312 includes programmable and fixed-function elements that perform various tasks and/or spawn execution threads to a 3D/Media sub-system 315. While the 3D pipeline 312 can be used to perform media operations, an embodiment of GPE 310 also includes a media pipeline 316 that is specifically used to perform media operations, such as video post-processing and image enhancement.

“In some embodiments, media pipeline 316 includes fixed-function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration in place of, or on behalf of, the video codec engine 306. In some embodiments, media pipeline 316 additionally includes a thread-spawning unit to spawn threads for execution on the 3D/Media sub-system 315. The spawned threads perform computations for the media operations on one or more graphics execution units included in the 3D/Media sub-system 315.

“In some embodiments, the 3D/Media sub-system 315 includes logic for executing threads spawned by the 3D pipeline 312 and media pipeline 316. In one embodiment, the pipelines send thread execution requests to the 3D/Media sub-system 315, which includes thread dispatch logic for arbitrating and dispatching the various requests to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D and media threads. In some embodiments, the 3D/Media sub-system 315 includes one or more internal caches for thread instructions and data. In some embodiments, the sub-system also includes shared memory, including registers and addressable memory, to share data between threads and to store output data.

“Graphics Processing Engine”

“FIG. 4 is a block diagram of a graphics processing engine 410 of a graphics processor in accordance with some embodiments. In one embodiment, the graphics processing engine (GPE) 410 is a version of the GPE 310 shown in FIG. 3. Elements of FIG. 4 having the same reference numbers or names as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. For example, the 3D pipeline 312 and media pipeline 316 of FIG. 3 are illustrated. In some embodiments of the GPE 410, the media pipeline 316 is optional and may not be explicitly included within the GPE 410. For example, and in at least one embodiment, a separate image and/or media processor is coupled to the GPE 410.

“In some embodiments, GPE 410 couples with or includes a command streamer 403, which provides a command stream to the 3D pipeline 312 and/or media pipeline 316. In some embodiments, command streamer 403 is coupled with memory, which can be system memory, or one or more of internal cache memory and shared cache memory. In some embodiments, command streamer 403 receives commands from the memory and sends the commands to the 3D pipeline 312 and/or media pipeline 316. The commands are directives fetched from a ring buffer, which stores commands for the 3D pipeline 312 and media pipeline 316. In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. The commands for the 3D pipeline 312 can also include references to data stored in memory, such as, but not limited to, vertex and geometry data for the 3D pipeline 312 and/or image data and memory objects for the media pipeline 316. The 3D pipeline 312 and media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to a graphics core array 414. In one embodiment, the graphics core array 414 includes one or more blocks of graphics cores (e.g., graphics core(s) 415A, graphics core(s) 415B), each block including one or more graphics cores. Each graphics core includes a set of graphics execution resources that includes general-purpose and graphics-specific execution logic to perform graphics and compute operations, as well as fixed-function texture processing and/or machine learning and artificial intelligence acceleration logic.
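
As a rough software analogy for the fetch-and-dispatch flow just described, the sketch below models a command streamer that drains a ring buffer and routes each command to a 3D or media pipeline, with batch buffers expanded inline. All names and command fields are illustrative; the real ring buffer is a hardware structure with head/tail registers.

```python
from collections import deque

class CommandStreamer:
    def __init__(self):
        self.ring = deque()              # stand-in for the hardware ring buffer

    def submit(self, command):           # driver side: write at the tail
        self.ring.append(command)

    def fetch_and_dispatch(self, pipelines):
        while self.ring:
            cmd = self.ring.popleft()    # streamer side: fetch at the head
            if cmd["op"] == "batch":     # batch buffer: drain its commands inline
                self.ring.extendleft(reversed(cmd["buffer"]))
                continue
            pipelines[cmd["pipe"]].append(cmd)

streamer = CommandStreamer()
pipes = {"3d": [], "media": []}
streamer.submit({"op": "draw", "pipe": "3d"})
streamer.submit({"op": "batch", "buffer": [{"op": "decode", "pipe": "media"}]})
streamer.fetch_and_dispatch(pipes)
print(pipes)   # draw routed to the 3D pipeline, decode to the media pipeline
```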

“In various embodiments, the 3D pipeline 312 includes fixed-function and programmable logic to process one or more shader programs, such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, or compute shaders, by processing the instructions and dispatching execution threads to the graphics core array 414. The graphics core array 414 provides a unified block of execution resources for use in processing these shader programs. Multi-purpose execution logic (e.g., execution units) within the graphics core(s) 415A-415B of the graphics core array 414 can execute multiple simultaneous execution threads associated with multiple shaders.

“In some embodiments, the graphics core array 414 also includes execution logic to perform media functions, such as video and/or image processing. In one embodiment, the execution units additionally include general-purpose logic that is programmable to perform parallel general-purpose computational operations, in addition to graphics processing operations. The general-purpose logic can perform processing operations in parallel or in conjunction with general-purpose logic within the processor core(s) 107 of FIG. 1 or cores 202A-202N as in FIG. 2.”

“Output data generated by threads executing on the graphics core array 414 can be output to memory in a unified return buffer (URB) 418. The URB 418 can store data for multiple threads. In some embodiments, the URB 418 may be used to send data between different threads executing on the graphics core array 414. In some embodiments, the URB 418 may additionally be used for synchronization between threads on the graphics core array 414 and fixed-function logic within the shared function logic 420.

“In some embodiments, graphics core array 414 is scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution units based on the target power and performance level of GPE 410. In one embodiment, the execution resources are dynamically scalable, such that execution resources may be enabled or disabled as needed.

“The graphics core array 414 couples with shared function logic 420 that includes multiple resources that are shared between the graphics cores in the graphics core array. The shared functions within the shared function logic 420 are hardware logic units that provide specialized supplemental functionality to the graphics core array 414. In various embodiments, shared function logic 420 includes, but is not limited to, sampler 421, math 422, and inter-thread communication (ITC) logic. Additionally, some embodiments implement one or more cache(s) 425 within the shared function logic 420.

A shared function is implemented where the demand for a given specialized function is insufficient for inclusion within the graphics core array 414. Instead, a single instantiation of that specialized function is implemented as a stand-alone entity in the shared function logic 420 and shared among the execution resources within the graphics core array 414. The precise set of functions that are shared among the graphics core array 414 and included within the graphics core array 414 varies across embodiments. In some embodiments, specific shared functions within the shared function logic 420 that are used extensively by the graphics core array 414 may be included within shared function logic 416 within the graphics core array 414. In various embodiments, the shared function logic 416 within the graphics core array 414 can include some or all logic within the shared function logic 420. In one embodiment, all logic elements within the shared function logic 420 may be duplicated within the shared function logic 416 of the graphics core array 414. In one embodiment, the shared function logic 420 is excluded in favor of the shared function logic 416 within the graphics core array 414.

“FIG. 5 is a block diagram of hardware logic of a graphics processor core 500, according to some embodiments described herein. Elements of FIG. 5 having the same reference numbers or names as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. The illustrated graphics processor core 500, in some embodiments, is included within the graphics core array 414 of FIG. 4. The graphics processor core 500, sometimes referred to as a core slice, can be one or multiple graphics cores within a modular graphics processor. The graphics processor core 500 is exemplary of one graphics core slice, and a graphics processor as described herein may include multiple graphics core slices based on target power and performance envelopes. Each graphics core 500 may include a fixed function block 530 coupled with multiple sub-cores 501A-501F, also referred to as sub-slices, that include modular blocks of general-purpose and fixed-function logic.

“In some embodiments, the fixed function block 530 includes a geometry/fixed function pipeline 536 that can be shared by all sub-cores in the graphics processor 500, for example, in lower-performance and/or lower-power graphics processor implementations. In various embodiments, the geometry/fixed function pipeline 536 includes a 3D fixed function pipeline (e.g., 3D pipeline 312 as in FIG. 3 and FIG. 4).”

“In one embodiment, the fixed function block 530 also includes a graphics SoC interface 537, a graphics microcontroller 538, and a media pipeline 539. The graphics SoC interface 537 provides an interface between the graphics processor core 500 and other processor cores within a system-on-a-chip integrated circuit. The graphics microcontroller 538 is a programmable sub-processor that is configurable to manage various functions of the graphics processor 500, including thread dispatch, scheduling, and pre-emption. The media pipeline 539 (e.g., media pipeline 316 of FIG. 3 and FIG. 4) implements media operations via requests to compute or sampling logic within the sub-cores 501A-501F.

“In one embodiment, the SoC interface 537 enables the graphics core 500 to communicate with general-purpose application processor cores (e.g., CPUs) and/or other components within an SoC, including memory hierarchy elements such as a shared last-level cache memory, system RAM, and/or embedded on-chip or on-package DRAM. The SoC interface 537 can also enable communication with fixed-function devices within the SoC, such as camera imaging pipelines, and enables the use of and/or implements global memory atomics that may be shared between the graphics core 500 and CPUs within the SoC. The SoC interface 537 can also implement power management controls for the graphics core 500 and enable an interface between a clock domain of the graphics core 500 and other clock domains within the SoC. In one embodiment, the SoC interface 537 enables receipt of command buffers from a command streamer and global thread dispatcher that are configured to provide commands and instructions to each of one or more graphics cores within a graphics processor. The commands and instructions can be dispatched to the media pipeline 539 when media operations are to be performed, or to a geometry and fixed-function pipeline (e.g., geometry and fixed-function pipeline 536, geometry and fixed-function pipeline 514) when graphics processing operations are to be performed.

The graphics microcontroller 538 can be configured to perform various scheduling and management tasks for the graphics core 500. In one embodiment, the graphics microcontroller 538 can perform graphics and/or compute workload scheduling on the various parallel graphics engines within the execution unit (EU) arrays 502A-502F, 504A-504F within the sub-cores 501A-501F. In this scheduling model, host software can submit workloads to one of multiple graphics processor doorbells, which invokes a scheduling operation on the appropriate graphics engine. Scheduling operations include determining which workload to run next, submitting the workload to a command streamer, monitoring progress of the workload, and notifying host software when the workload is complete. In one embodiment, the graphics microcontroller 538 can also facilitate low-power or idle states for the graphics core 500, providing the graphics core 500 with the ability to save and restore registers within the graphics core 500 across low-power state transitions independently of the operating system and/or graphics driver software on the system.
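
A loose software model of the doorbell-driven scheduling loop described above might look as follows; the queue names, workload format, and the immediate completion notification are all simplifications invented for illustration.

```python
import queue

# One doorbell queue per engine; host software rings a doorbell to submit.
doorbells = {"3d": queue.Queue(), "media": queue.Queue()}

def ring_doorbell(engine, workload):          # host-software side
    doorbells[engine].put(workload)

def scheduler_step(command_streamers, completions):
    # Microcontroller side: pick the next workload for each engine, hand
    # it to that engine's command streamer, and (immediately, for brevity)
    # record the completion notification for the host.
    for engine, q in doorbells.items():
        if not q.empty():
            work = q.get()
            command_streamers[engine].append(work)
            completions.append((engine, work))

streamers = {"3d": [], "media": []}
done = []
ring_doorbell("3d", "draw_batch_0")
scheduler_step(streamers, done)
print(streamers, done)
```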

The graphics core 500 may have more or fewer than the illustrated sub-cores 501A-501F, up to N modular sub-cores. For each set of N sub-cores, the graphics core 500 can also include shared function logic 510, shared and/or cache memory 512, a geometry/fixed function pipeline 514, and additional fixed function logic 516 to accelerate various graphics and compute processing operations. The shared function logic 510 can include logic units associated with the shared function logic 420 of FIG. 4 (e.g., sampler, math, and/or inter-thread communication logic) that can be shared by each of the N sub-cores within the graphics core 500. The shared and/or cache memory 512 can be a last-level cache for the set of N sub-cores 501A-501F within the graphics core 500, and can also serve as shared memory that is accessible by multiple sub-cores. The geometry/fixed function pipeline 514 can be included instead of the geometry/fixed function pipeline 536 within the fixed function block 530 and can include the same or similar logic units.

“In one embodiment, the graphics core 500 includes additional fixed function logic 516 that can include various fixed-function acceleration logic for use by the graphics core 500. In one embodiment, the additional fixed function logic 516 includes an additional geometry pipeline for use in position-only shading. In position-only shading, two geometry pipelines exist: the full geometry pipeline within the geometry/fixed function pipelines 514, 536, and a cull pipeline, which is an additional geometry pipeline that may be included within the additional fixed function logic 516. In one embodiment, the cull pipeline is a trimmed-down version of the full geometry pipeline. The full pipeline and the cull pipeline can execute different instances of the same application, each instance having a separate context. Position-only shading can hide long cull runs of discarded triangles, enabling the shading to be completed earlier in some instances. For example, and in one embodiment, the cull pipeline logic within the additional fixed function logic 516 can execute position shaders in parallel with the main application, as the cull pipeline fetches and shades only the position attribute of the vertices, without rendering the pixels to the frame buffer. The cull pipeline can use the generated critical results to compute visibility information for all the triangles, without regard to whether those triangles are culled. The full pipeline (which in this instance may be referred to as a replay pipeline) can consume the visibility information to skip the culled triangles and shade only the visible triangles that are finally passed to the rasterization stage.
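
Conceptually, position-only shading splits rendering into a cheap visibility pass and a replay pass, as in the toy sketch below. The 2D triangle tests stand in for the real position-only vertex shading and culling logic, which operates on full 3D vertex streams in hardware.

```python
# Cull pass: "shade" positions only and record per-triangle visibility.
def cull_pass(triangles):
    visible = []
    for (x0, y0), (x1, y1), (x2, y2) in triangles:
        area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        on_screen = all(0 <= x <= 1 and 0 <= y <= 1
                        for x, y in ((x0, y0), (x1, y1), (x2, y2)))
        visible.append(area != 0 and on_screen)
    return visible

# Replay pass: consume the visibility results and process only survivors.
def replay_pass(triangles, visibility):
    return [tri for tri, vis in zip(triangles, visibility) if vis]

tris = [((0, 0), (1, 0), (0, 1)),   # on screen, non-degenerate: kept
        ((0, 0), (1, 0), (2, 0)),   # zero area: culled
        ((5, 5), (6, 5), (5, 6))]   # outside the viewport: culled
print(replay_pass(tris, cull_pass(tris)))
```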

“In one embodiment, the additional fixed function logic 516 can also include machine learning acceleration logic, such as fixed-function matrix multiplication logic, for implementations including optimizations for machine learning training or inferencing.”

“Within each graphics sub-core 501A-501F is a set of execution resources that may be used to perform graphics, media, and compute operations in response to requests by graphics pipeline, media pipeline, or shader programs. The graphics sub-cores 501A-501F include multiple EU arrays 502A-502F, 504A-504F, thread dispatch and inter-thread communication (TD/IC) logic 503A-503F, a 3D (e.g., texture) sampler 505A-505F, a media sampler 506A-506F, a shader processor 507A-507F, and shared local memory (SLM) 508A-508F. The EU arrays 502A-502F, 504A-504F each include multiple execution units, which are general-purpose graphics processing units capable of performing floating-point and integer/fixed-point logic operations in service of a graphics, media, or compute operation, including graphics, media, or compute shader programs. The TD/IC logic 503A-503F performs local thread dispatch and thread control operations for the execution units within a sub-core and facilitates communication between threads executing on the execution units of the sub-core. The 3D sampler 505A-505F can read texture or other 3D graphics-related data into memory. The 3D sampler can read texture data differently based on a configured sample state and the texture format associated with a given texture. The media sampler 506A-506F can perform similar read operations based on the type and format associated with media data. In one embodiment, each graphics sub-core 501A-501F can alternately include a unified 3D and media sampler. Threads executing on the execution units within each of the sub-cores 501A-501F can make use of the shared local memory within each sub-core, to enable threads executing within a thread group to execute using a common pool of on-chip memory.

“Execution Units”

“FIGS. 6A-6B illustrate thread execution logic 600, including an array of processing elements employed in a graphics processor core according to embodiments described herein. Elements of FIGS. 6A-6B having the same reference numbers or names as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. FIG. 6A illustrates an overview of thread execution logic 600, which can include a variant of the hardware logic illustrated with each sub-core 501A-501F of FIG. 5. FIG. 6B illustrates exemplary internal details of an execution unit.”

“As illustrated in FIG. 6A, in some embodiments thread execution logic 600 includes a shader processor 602, a thread dispatcher 604, an instruction cache 606, a scalable execution unit array including a plurality of execution units 608A-608N, a sampler 610, a data cache 612, and a data port 614. In one embodiment, the scalable execution unit array can dynamically scale by enabling or disabling one or more execution units (e.g., any of execution units 608A, 608B, 608C, 608D, through 608N-1 and 608N) based on the computational requirements of a workload. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. In some embodiments, thread execution logic 600 includes one or more connections to memory, such as system memory or cache memory, through one or more of instruction cache 606, data port 614, sampler 610, and execution units 608A-608N. In some embodiments, each execution unit (e.g., 608A) is a stand-alone programmable general-purpose computational unit capable of processing multiple data elements in parallel for multiple simultaneous threads. In various embodiments, the array of execution units 608A-608N is scalable to include any number of individual execution units.

“In some embodiments, the execution units 608A-608N are primarily used to execute shader programs. A shader processor 602 can process the various shader programs and dispatch execution threads associated with the shader programs via a thread dispatcher 604. In one embodiment, the thread dispatcher includes logic to arbitrate thread initiation requests from the graphics and media pipelines and instantiate the requested threads on one or more execution units in the execution units 608A-608N. For example, a geometry pipeline can dispatch vertex, tessellation, or geometry shaders to the thread execution logic for processing. In some embodiments, thread dispatcher 604 can also process runtime thread spawning requests from the executing shader programs.

“In some embodiments, the execution units 608A-608N support an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct 3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders). The execution units 608A-608N are capable of multi-issue single-instruction multiple-data (SIMD) execution, and multi-threaded operation enables an efficient execution environment in the face of higher-latency memory accesses. Each hardware thread within each execution unit has a dedicated high-bandwidth register file and associated independent thread state. Execution is multi-issue per clock to pipelines capable of integer and single- and double-precision floating-point operations, with SIMD branch capability. While waiting for data from memory or one of the shared functions, dependency logic within the execution units 608A-608N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader or fragment shader.

“Each execution unit in execution units 608A-608N operates on arrays of data elements. The number of data elements is the ‘execution size’, or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical Arithmetic Logic Units (ALUs) or Floating Point Units (FPUs) for a particular graphics processor. In some embodiments, execution units 608A-608N support integer and floating-point data types.

“The execution unit instruction set includes SIMD instructions. The various data elements can be stored as a packed data type in a register, and the execution unit will process the various elements based on the data size of the elements. For example, when operating on a 256-bit-wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible.
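
The packed-data interpretation can be demonstrated directly in NumPy: a 256-bit register is 32 bytes, and the element count follows from the element width chosen by the instruction.

```python
import numpy as np

reg = np.arange(32, dtype=np.uint8)   # 32 bytes = one 256-bit register

print(reg.view(np.uint64).size)   # 4  x 64-bit Quad Word (QW) elements
print(reg.view(np.uint32).size)   # 8  x 32-bit Double Word (DW) elements
print(reg.view(np.uint16).size)   # 16 x 16-bit Word (W) elements
print(reg.view(np.uint8).size)    # 32 x 8-bit byte (B) elements
```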

“In one embodiment, one or more execution units can be combined into a fused execution unit 609A-609N having thread control logic (607A-607N) that is common to the fused EUs. Multiple EUs can be fused into an EU group. Each EU in the fused EU group can be configured to execute a separate SIMD hardware thread. The number of EUs in a fused EU group can vary according to embodiments. Additionally, various SIMD widths can be performed per EU, including but not limited to SIMD8, SIMD16, and SIMD32. Each fused graphics execution unit 609A-609N includes at least two execution units. For example, fused execution unit 609A includes a first EU 608A, a second EU 608B, and thread control logic 607A that is common to the first EU 608A and the second EU 608B. The thread control logic 607A controls threads executed on the fused graphics execution unit 609A, allowing each EU within the fused execution units 609A-609N to execute using a common instruction pointer register.

“One or more internal instruction caches (e.g., 606) are included in the thread execution logic 600 to cache thread instructions for the execution units. In some embodiments, one or more data caches (e.g., 612) are included to cache thread data during thread execution. In some embodiments, a sampler 610 is included to provide texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 610 includes specialized texture or media sampling functionality to process texture or media data during the sampling process before providing the sampled data to an execution unit.

“During execution, the graphics and media pipelines send thread initiation requests to thread execution logic 600 via thread spawning and dispatch logic. Once a group of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within the shader processor 602 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, a pixel shader or fragment shader calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some embodiments, pixel processor logic within the shader processor 602 then executes an API-supplied pixel or fragment shader program. To execute the shader program, the shader processor 602 dispatches threads via thread dispatcher 604. In some embodiments, shader processor 602 uses texture sampling logic in the sampler 610 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.

“In some embodiments, the data port 614 provides a memory access mechanism for the thread execution logic 600 to output processed data to memory for further processing on a graphics processor output pipeline. In some embodiments, the data port 614 includes or couples to one or more cache memories (e.g., data cache 612) to cache data for memory access via the data port.

“As illustrated in FIG. 6B, a graphics execution unit 608 may include an instruction fetch unit 637, a general register file array (GRF) 624, an architectural register file array (ARF) 626, a thread arbiter 622, a send unit 630, a branch unit 632, a set of SIMD floating-point units (FPUs) 634, and, in one embodiment, a set of dedicated integer SIMD ALUs 635. The GRF 624 and ARF 626 include the set of general register files and architectural register files associated with each simultaneous hardware thread that may be active in the graphics execution unit 608. In one embodiment, per-thread architectural state is maintained in the ARF 626, while data used during thread execution is stored in the GRF 624. The execution state of each thread, including the instruction pointers for each thread, can be held in thread-specific registers in the ARF 626.

“In one embodiment, the graphics execution unit 608 has an architecture that is a combination of Simultaneous Multi-Threading (SMT) and fine-grained Interleaved Multi-Threading (IMT). The architecture has a modular configuration that can be fine-tuned at design time based on a target number of simultaneous threads and a target number of registers per execution unit, where execution unit resources are divided across logic used to execute multiple simultaneous threads.

“In one embodiment, the graphics execution unit 608 can co-issue multiple instructions, which may each be different instructions. The thread arbiter 622 of the graphics execution unit 608 can dispatch the instructions to one of the send unit 630, branch unit 632, or SIMD FPU(s) 634 for execution. Each execution thread can access 128 general-purpose registers within the GRF 624, where each register can store 32 bytes, accessible as a SIMD 8-element vector of 32-bit data elements. In one embodiment, each execution unit thread has access to 4 Kbytes within the GRF 624, although embodiments are not so limited, and greater or fewer register resources may be provided in other embodiments. In one embodiment, up to seven threads can execute simultaneously, although the number of threads per execution unit can also vary according to embodiments. In an embodiment in which seven threads may each access 4 Kbytes, the GRF 624 can store a total of 28 Kbytes. Flexible addressing modes can permit registers to be addressed together to build effectively wider registers or to represent strided rectangular block data structures.
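
The register-file figures above are self-consistent, as a quick calculation shows:

```python
regs_per_thread = 128      # general-purpose registers per hardware thread
bytes_per_reg = 32         # 32 bytes, i.e. a SIMD8 vector of 32-bit elements
threads_per_eu = 7         # simultaneously resident threads (one embodiment)

grf_per_thread = regs_per_thread * bytes_per_reg   # 4096 bytes = 4 Kbytes
grf_total = grf_per_thread * threads_per_eu        # 28672 bytes = 28 Kbytes
print(grf_per_thread, grf_total)
```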

“In one embodiment, memory operations and sampler operations are dispatched via ‘send’ instructions that are executed by the message-passing send unit 630. In one embodiment, branch instructions are dispatched to a dedicated branch unit 632 to facilitate SIMD divergence and eventual convergence.

“In one embodiment, the graphics execution unit 608 includes one or more SIMD floating-point units (FPU(s)) 634 to perform floating-point operations. In one embodiment, the FPU(s) 634 also support integer computation. In one embodiment, the FPU(s) 634 can SIMD execute up to M number of 32-bit floating-point (or integer) operations, or SIMD execute up to 2M 16-bit integer or 16-bit floating-point operations. In one embodiment, extended math capability is provided to support high-throughput transcendental math functions and double-precision 64-bit floating point. In some embodiments, a set of 8-bit integer SIMD ALUs 635 is also present, and may be specifically optimized to perform operations associated with machine learning computations.

“In one embodiment, arrays of multiple instances of the graphics execution unit 608 can be instantiated in a graphics sub-core grouping (e.g., a sub-slice). For scalability, product architects can choose the exact number of execution units per sub-core grouping. In one embodiment, the execution unit 608 can execute instructions across a plurality of execution channels. In a further embodiment, each thread executed on the graphics execution unit 608 is executed on a different channel.

“FIG. 7 is a block diagram illustrating a graphics processor instruction format 700 according to some embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid-lined boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are only included in a subset of the instructions. In some embodiments, the instruction format 700 described and illustrated are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed.

“In some embodiments, the graphics processor execution units natively support instructions in a 128-bit instruction format 710. A 64-bit compacted instruction format 730 is available for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit instruction format 710 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 730. The native instructions available in the 64-bit format 730 vary by embodiment. In some embodiments, the instruction is compacted in part using a set of index values in an index field 713. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit instruction format 710.
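
A hedged sketch of index-based compaction: a compacted instruction carries small indices into per-field tables, and decode looks the fields back up to rebuild the native encoding. The table contents and field names below are invented for illustration; the actual compaction tables are hardware-defined.

```python
# Invented per-field tables; real compaction tables are hardware-defined.
CONTROL_TABLE = ["ctrl_a", "ctrl_b", "ctrl_c", "ctrl_d"]
DATATYPE_TABLE = ["f32", "f16", "s32", "u8"]

def compact(native):
    """Return small table indices, or None if the instruction
    cannot be compacted and must stay in the 128-bit native form."""
    try:
        return (CONTROL_TABLE.index(native["control"]),
                DATATYPE_TABLE.index(native["dtype"]))
    except ValueError:
        return None

def expand(compacted):
    """Decode side: look the fields back up to rebuild the native form."""
    ctrl_idx, dt_idx = compacted
    return {"control": CONTROL_TABLE[ctrl_idx], "dtype": DATATYPE_TABLE[dt_idx]}

inst = {"control": "ctrl_b", "dtype": "f16"}
assert expand(compact(inst)) == inst   # round-trips through the indices
```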

“For each format, instruction opcode 712 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit performs each instruction across all data channels of the operands. In some embodiments, instruction control field 714 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For instructions in the 128-bit instruction format 710, an exec-size field 716 limits the number of data channels that will be executed in parallel. In some embodiments, the exec-size field 716 is not available for use in the 64-bit compact instruction format 730.

“Some execution unit instructions have up to three operands, including two source operands, src0 720 and src1 722, and one destination 718. In some embodiments, the execution units support dual-destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 724), where the instruction opcode 712 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.

“In some embodiments, the 128-bit instruction format 710 includes an access/address mode field 726 specifying, for example, whether direct register addressing mode or indirect register addressing mode is used. When direct register addressing mode is used, the register address of one or more operands is directly provided by bits in the instruction.

“In some embodiments, the 128-bit instruction format 710 includes an access/address mode field 726, which specifies an address mode and/or an access mode for the instruction. In one embodiment, the access mode is used to define a data access alignment for the instruction. Some embodiments support access modes including a 16-byte-aligned access mode and a 1-byte-aligned access mode, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction may use byte-aligned addressing for source and destination operands and, when in a second mode, the instruction may use 16-byte-aligned addressing for all source and destination operands.

“In one embodiment, the address mode portion of the access/address mode field 726 determines whether the instruction is to use direct or indirect addressing. When direct register addressing mode is used, bits in the instruction directly provide the register address of one or more operands. When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.
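
The two addressing modes can be modeled in a few lines; the field names and register numbering here are illustrative rather than the actual instruction encoding.

```python
address_registers = {0: 40}   # e.g. address register a0 holds 40

def operand_register(instr):
    if instr["mode"] == "direct":
        return instr["reg"]   # register address is directly in the bits
    # indirect: address register value plus the address immediate field
    return address_registers[instr["addr_reg"]] + instr["imm"]

print(operand_register({"mode": "direct", "reg": 17}))                  # 17
print(operand_register({"mode": "indirect", "addr_reg": 0, "imm": 2}))  # 42
```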

“In some embodiments, instructions are grouped based on opcode 712 bit fields to simplify opcode decode 740. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. In some embodiments, a move and logic opcode group 742 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, move and logic group 742 shares the five most significant bits (MSBs), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 744 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 746 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 748 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 748 performs the arithmetic operations in parallel across data channels. A vector math group 750 includes arithmetic instructions in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic, such as dot product calculations, on vector operands.
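
Since the group is determined by the high bits of the opcode, decode reduces to a mask and a table lookup, as in this small sketch (group labels are paraphrased from the text above):

```python
# The upper opcode bits select the group (labels paraphrased from the text).
GROUPS = {
    0x00: "move/logic (mov forms)",     # 0000xxxxb
    0x10: "move/logic (logic forms)",   # 0001xxxxb
    0x20: "flow control",               # 0010xxxxb
    0x30: "miscellaneous",              # 0011xxxxb
    0x40: "parallel math",              # 0100xxxxb
    0x50: "vector math",                # 0101xxxxb
}

def opcode_group(opcode):
    return GROUPS.get(opcode & 0xF0, "other/reserved")

print(opcode_group(0x41))   # parallel math (e.g. an add)
print(opcode_group(0x25))   # flow control (e.g. a jump)
```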

“Graphics Pipeline”

“FIG. 8 is a block diagram of another embodiment of a graphics processor 800. Elements of FIG. 8 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

“In some embodiments, graphics processor 800 includes a geometry pipeline 820, a media pipeline 830, a display engine 840, thread execution logic 850, and a render output pipeline 870. In some embodiments, graphics processor 800 is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or via commands issued to graphics processor 800 via a ring interconnect 802. In some embodiments, ring interconnect 802 couples graphics processor 800 to other processing components, such as other graphics processors or general-purpose processors. Commands from ring interconnect 802 are interpreted by a command streamer 803, which supplies instructions to individual components of the geometry pipeline 820 or the media pipeline 830.

“In some embodiments, command streamer 803 directs the operation of a vertex fetcher 805 that reads vertex data from memory and executes vertex-processing commands provided by command streamer 803. In some embodiments, vertex fetcher 805 provides vertex data to a vertex shader 807, which performs coordinate-space transformation and lighting operations on each vertex. In some embodiments, vertex fetcher 805 and vertex shader 807 execute vertex-processing instructions by dispatching execution threads to execution units 852A-852B via a thread dispatcher 831.

“In some embodiments, execution units 852A-852B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 852A-852B have an attached L1 cache 851 that is specific to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.

“In some embodiments, geometry pipeline 820 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 811 configures the tessellation operations. A programmable domain shader 817 provides back-end evaluation of tessellation output. A tessellator 813 operates at the direction of hull shader 811 and contains special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to geometry pipeline 820. In some embodiments, if tessellation is not used, tessellation components (e.g., hull shader 811, tessellator 813, and domain shader 817) can be bypassed.

“In some embodiments, complete geometric objects can be processed by a geometry shader 819 via one or more threads dispatched to execution units 852A-852B, or can proceed directly to the clipper 829. In some embodiments, the geometry shader operates on entire geometric objects, rather than vertices or patches of vertices as in previous stages of the graphics pipeline. If the tessellation is disabled, the geometry shader 819 receives input from the vertex shader 807. In some embodiments, geometry shader 819 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.

“Before rasterization, a clipper 829 processes vertex data. The clipper 829 may be a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. In some embodiments, a rasterizer and depth test component 873 in the render output pipeline 870 dispatches pixel shaders to convert the geometric objects into per-pixel representations. In some embodiments, pixel shader logic is included in thread execution logic 850. In some embodiments, an application can bypass the rasterizer and depth test component 873 and access un-rasterized vertex data via a stream out unit 823.

“The graphics processor 800 has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the processor. In some embodiments, execution units 852A-852B and associated logic units (e.g., L1 cache 851, sampler 854, texture cache 858, etc.) interconnect via a data port 856 to perform memory access and to communicate with render output pipeline components of the processor. In some embodiments, sampler 854, caches 851, 858, and execution units 852A-852B each have separate memory access paths. In one embodiment, the texture cache 858 can also be configured as a sampler cache.

“In some embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances pixel operations associated with 2D operations (e.g., bit block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.

“In some embodiments, the graphics processor media pipeline 830 includes a media engine 837 and a video front-end 834. In some embodiments, video front-end 834 receives pipeline commands from the command streamer 803. In some embodiments, media pipeline 830 includes a separate command streamer. In some embodiments, video front-end 834 processes media commands before sending the commands to the media engine 837. In some embodiments, media engine 837 includes thread spawning functionality to spawn threads for dispatch to thread execution logic 850 via thread dispatcher 831.

“In some embodiments, graphics processor 800 includes a display engine 840. In some embodiments, display engine 840 is external to processor 800 and couples with the graphics processor via the ring interconnect 802, or some other interconnect bus or fabric. In some embodiments, display engine 840 includes a 2D engine 841 and a display controller 843. In some embodiments, display engine 840 contains special-purpose logic capable of operating independently of the 3D pipeline. In some embodiments, display controller 843 couples with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.

“In some embodiments, the geometry pipeline 820 and media pipeline 830 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL), Open Computing Language (OpenCL), and/or Vulkan graphics and compute API, all from the Khronos Group. In some embodiments, support may also be provided for the Direct3D library from Microsoft Corporation. In some embodiments, a combination of these libraries may be supported. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

“Graphics Pipeline Programming”

“FIG. 9A is a block diagram illustrating a graphics processor command format 900 according to certain embodiments. FIG. 9B is a block diagram illustrating a graphics processor command sequence 910 according to an embodiment. The solid-lined boxes in FIG. 9A illustrate the components generally included in a graphics command, while the dashed lines indicate components that are optional or that are only included in a subset of the graphics commands. The exemplary graphics processor command format 900 of FIG. 9A includes data fields to identify a client 902, a command operation code (opcode) 904, and data 906 for the command. A sub-opcode 905 and a command size 908 are also included in some commands.

Client 902 in some embodiments specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition the further processing of the command and route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 904 and, if present, the sub-opcode 905 to determine the operation to perform. The client unit performs the command using information in data field 906. For some commands an explicit command size 908 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments commands are aligned via multiples of a double word.
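To make the routing described above concrete, the following C++ sketch models a command whose fields mirror client 902, opcode 904, sub-opcode 905, data 906, and command size 908, with a parser that dispatches on the client field. The struct layout, enum values, and function names are hypothetical illustrations, not the hardware's actual encoding.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical, simplified command layout; the real bit positions are
    // not given in the document, so these field widths are assumptions.
    struct GfxCommand {
        uint8_t  client;      // client unit that should process the command (902)
        uint8_t  opcode;      // operation to perform (904)
        uint8_t  sub_opcode;  // optional refinement of the operation (905)
        uint8_t  size_dwords; // explicit command size (908); 0 = derive from opcode
        uint32_t data;        // first dword of the command payload (906)
    };

    enum Client : uint8_t { MEMORY_IF = 0, RENDER_2D, RENDER_3D, MEDIA };

    // The parser conditions further processing on the client field and
    // routes the command to the matching client unit.
    void route_command(const GfxCommand& cmd) {
        switch (cmd.client) {
            case RENDER_3D:
                std::printf("3D unit: op=%u.%u\n", unsigned(cmd.opcode), unsigned(cmd.sub_opcode));
                break;
            case MEDIA:
                std::printf("media unit: op=%u\n", unsigned(cmd.opcode));
                break;
            default:
                std::printf("client %u: op=%u\n", unsigned(cmd.client), unsigned(cmd.opcode));
                break;
        }
    }

    int main() {
        route_command(GfxCommand{RENDER_3D, 0x32, 0x01, 4, 0}); // values are illustrative
    }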

“The flow diagram in FIG. 9B illustrates an exemplary graphics processor command sequence 910. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the command sequence 910 shown to set up, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for purposes of example only, as embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in at least partial concurrence.

“In some instances, the graphics processor command sequence 910 may begin with a pipeline flush command 912 to cause any active graphics pipeline to complete the currently pending commands for the pipeline. In some embodiments, the 3D pipeline 922 and the media pipeline 924 do not operate concurrently; the pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor will pause command processing until the active drawing engines complete pending operations. Optionally, any data in the render cache that is marked ‘dirty’ can be flushed to memory. In some embodiments, pipeline flush command 912 can be used for pipeline synchronization or before placing the graphics processor into a low-power state.

“In some embodiments, a pipeline select command 913 is used when a command sequence requires the graphics processor to explicitly switch between pipelines. A pipeline select command 913 may be required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 912 is required immediately before a pipeline switch via the pipeline select command 913.

“In certain embodiments, the pipeline control command 914 configures the graphics pipeline for operation and is used to program both the 3D pipeline 922 and the media pipeline 924. In some embodiments, pipeline control command 914 configures the pipeline state for the active pipeline. In one embodiment, the pipeline control command 914 is used for pipeline synchronization and to clear data from one or more cache memories within the active pipeline.

“In some embodiments, return buffer state commands 916 are used to configure a set of return buffers for the respective pipelines to write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and to perform cross-thread communication. In some embodiments, the return buffer state 916 includes selecting the size and number of return buffers to use for a set of pipeline operations.

“The remaining commands in the command sequence differ based on the active pipeline for operations. Based on a pipeline determination 920, the command sequence is tailored to the 3D pipeline 922 beginning with the 3D pipeline state 930, or to the media pipeline 924 beginning at the media pipeline state 940.
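The following C++ sketch strings the commands described above into the two paths of the sequence 910: flush, pipeline select, pipeline control, and return buffer state, followed by either the 3D path (3D pipeline state 930, 3D primitive 932) or the media path (media pipeline state 940, media object 942), and finally an execute. The opcode values and helper function are invented for illustration only.

    #include <cstdint>
    #include <vector>

    // Hypothetical opcode tags echoing the reference numerals in the text;
    // the real command encodings are not specified in the document.
    enum Op : uint32_t {
        PIPELINE_FLUSH = 912, PIPELINE_SELECT = 913, PIPELINE_CONTROL = 914,
        RETURN_BUFFER_STATE = 916, STATE_3D = 930, PRIMITIVE_3D = 932,
        EXECUTE = 934, MEDIA_STATE = 940, MEDIA_OBJECT = 942, EXECUTE_MEDIA = 944
    };

    // Follows the flow of FIG. 9B: flush, select a pipeline, program
    // pipeline control and return buffer state, then the 3D or media path.
    std::vector<uint32_t> build_sequence(bool use_3d) {
        std::vector<uint32_t> cmds{PIPELINE_FLUSH, PIPELINE_SELECT,
                                   PIPELINE_CONTROL, RETURN_BUFFER_STATE};
        if (use_3d) {
            cmds.insert(cmds.end(), {STATE_3D, PRIMITIVE_3D, EXECUTE});
        } else {
            cmds.insert(cmds.end(), {MEDIA_STATE, MEDIA_OBJECT, EXECUTE_MEDIA});
        }
        return cmds;
    }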

“Commands to configure the 3D pipeline state 930 include 3D state setting commands for vertex buffer state, vertex element state, depth buffer state, and other state variables that must be configured before 3D primitive commands are processed. The values of these commands are determined at least in part based on the particular 3D API in use. In certain embodiments, 3D pipeline state 930 commands are also able to selectively disable or bypass certain pipeline elements.

“In certain embodiments, the 3D primitive 932 command is used to submit 3D primitives to be processed by the 3D pipeline. Commands and associated parameters that are passed to the graphics processor via the 3D primitive 932 command are forwarded to the vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 932 command data to generate vertex data structures. The vertex data structures are stored in one or more return buffers. In some embodiments, the 3D primitive 932 command is used to perform vertex operations on 3D primitives via vertex shaders. To process vertex shaders, 3D pipeline 922 dispatches shader execution threads to graphics processor execution units.

“Some embodiments trigger 3D pipeline 922 via an execute 934 command or event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered via a ‘go’ or ‘kick’ command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline will perform geometry processing for the 3D primitives. Once operations are complete, the resulting geometric objects are rasterized and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.

“In certain embodiments, the graphics processor command sequence 910 follows the media pipeline 924 path when performing media operations. In general, the specific use and manner of programming for the media pipeline 924 depends on the media or compute operations to be performed. Specific media decode operations may be offloaded to the media pipeline during media decode. In some embodiments, media decode can instead be performed in whole or in part using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processor unit (GPGPU) operations, where the graphics processor is used to perform SIMD vector operations using computational shader programs that are not explicitly related to the rendering of graphics primitives.

“Media pipeline 924 may be configured in a similar manner as the 3D pipeline 922 in some embodiments. A set of commands to configure the media pipeline state 940 is dispatched or placed into a command queue before the media object commands 942. In some embodiments, commands for the media pipeline state 940 include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as the encode or decode format. In some embodiments, commands for the media pipeline state 940 also support the use of one or more pointers to ‘indirect’ state elements that contain a batch of state settings.”

“Media object commands 942 supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing video data to be processed. In certain embodiments, all media pipeline states must be valid before issuing a media object command 942. Once the pipeline state is configured and media object commands 942 are queued, the media pipeline 924 is triggered via an execute command 944 or an equivalent execute event (e.g., register write). Output from media pipeline 924 may then be post-processed by operations provided by the 3D pipeline 922 or the media pipeline 924. In some embodiments, GPGPU operations are configured and executed in a similar manner as media operations.

“Graphics Software Architecture.”

“FIG. 10 illustrates an exemplary graphics software architecture for a data processing system according to some embodiments. The software architecture may include a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor cores 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system.

“In some instances, 3D graphics application 1010 contains one or more shader programs including shader instructions 1012. The shader language instructions may be in a high-level shader language, such as the High-Level Shader Language (HLSL) or the OpenGL Shader Language (GLSL). The application also includes executable instructions 1014 that can be executed by the general-purpose processor core 1034, as well as graphics objects 1016 defined by vertex data.

Summary for “System and Method for an Optimized Winograd Convolution Accelerator”


“System Overview”

“FIG. 1 is a block diagram of a processing system 100, according to an embodiment. The system 100 may include one or more processors 102 and one or more graphics processors 108, and may be a single-processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the system 100 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

“In one embodiment, the system 100 may include, or be integrated within, a server-based gaming platform or a gaming console, including a game and media console, a mobile gaming console, a handheld game console, or an online game console. In some instances, the system 100 is a smart phone, mobile phone, tablet computing device, or mobile Internet device. The processing system 100 may also be coupled with, or integrated into, a wearable device, such as a smart watch wearable device, smart eyewear device, augmented reality device, or virtual reality device. In some embodiments, the processing system 100 is a television or set-top box device having one or more processors 102 and a graphical interface generated by one or more graphics processors 108.

“In some embodiments, the one or more processors 102 each include one or more processor cores 107 to execute instructions which, when executed, perform operations for system and user software. In some embodiments, each of the processor cores 107 is configured to process a specific instruction set 109. In some cases, instruction set 109 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 107 may each process a different instruction set 109, which may include instructions to facilitate the emulation of other instruction sets. Processor core 107 may also include other processing devices, such as a Digital Signal Processor (DSP).

“In some embodiments the processor 102 also includes cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 102. In some embodiments, the processor 102 also uses an external cache (e.g., a Level-3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among processor cores 107 using known cache coherency techniques. A register file 106 may additionally be included in processor 102 and may contain different types of registers for storing different types of data, such as integer registers and floating point registers. While some registers may be general-purpose registers, other registers may be specific to the design of the processor 102.

“Some embodiments couple one or more processors 102 with a processor bus 110 to transmit communication signals, such as address, data, or control signals, between the processors 102 and other components in the system 100. In one embodiment, the processor bus 110 is a Direct Media Interface (DMI) bus. In one embodiment, the processor(s) 102 include an integrated memory controller 116 and a peripheral controller 130. The memory controller 116 facilitates communication between a memory device and other components of the system 100, while the peripheral controller hub (PCH) 130 provides connections to I/O devices via a local I/O bus.

The memory device 120 could be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. In one embodiment, the memory device 120 can operate as system memory for the system 100, to store data 122 and instructions 121 for use when the one or more processors 102 execute an application or process. An optional external graphics processor 112 may also connect with the memory controller 116 to perform graphics and media operations. In some embodiments a display device 111 may be connected to the processor(s) 102, and can be one or more internal display devices or an external display device. In one embodiment the display device 111 can be a head-mounted display (HMD), such as a stereoscopic display device for use in virtual reality (VR) or augmented reality (AR) applications.

“Some embodiments of the peripheral controller 130 enable peripherals to connect to memory device 120 and processor 102 via a high-speed I/O bus. The I/O peripherals can include an audio controller 146, a network controller 134, a firmware interface 128, a wireless transceiver 126, touch sensors 125, and a data storage device 124 (e.g., hard disk drive, flash memory, etc.). The data storage device 124 can connect via a storage interface (e.g., SATA) or via a peripheral bus (e.g., PCI Express). The touch sensors 125 can include touch screen sensors, pressure sensors, or fingerprint sensors. The wireless transceiver 126 can be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver such as a 4G or Long Term Evolution (LTE) transceiver. The firmware interface 128 enables communication with system firmware and can be, for example, a unified extensible firmware interface (UEFI). The network controller 134 can enable a network connection to a wired network. In some embodiments, a high-performance network controller (not shown) couples with the processor bus 110. The audio controller 146, in one embodiment, is a multi-channel high-definition audio controller. In one embodiment the system 100 includes an optional legacy I/O controller 140 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. The peripheral controller 130 can also connect to one or more Universal Serial Bus (USB) controllers 142 to connect input devices, such as keyboard and mouse 143 combinations, a camera, or other USB input devices.

It will be appreciated that the system 100 shown is exemplary and not limiting, as other types of data processing systems that are differently configured may also be used. For example, an instance of the memory controller 116 and peripheral controller 130 may be integrated into a discrete external graphics processor, such as the external graphics processor 112. In one embodiment the peripheral controller 130 and/or memory controller 116 may be external to the one or more processors 102. For example, the system 100 can include an external memory controller 116 and peripheral controller 130, which may be configured as a memory controller hub and peripheral controller hub within a system chipset that is in communication with the processor(s) 102.

“FIG. 2 is a block diagram of an embodiment of a processor 200 having one or more processor cores 202A-202N, an integrated memory controller 214, and an integrated graphics processor 208. Elements of FIG. 2 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. Processor 200 can include additional cores up to and including additional core 202N, represented by the dashed-lined boxes. Each of processor cores 202A-202N includes one or more internal cache units 204A-204N. In some embodiments, each processor core also has access to one or more shared cache units 206.

The internal cache units 204A-204N and shared cache units 206 represent a cache memory hierarchy within the processor 200. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as Level 2 (L2), Level 3 (L3), or Level 4 (L4), where the highest level of cache before external memory is classified as the LLC. Some embodiments maintain coherency between the various cache units 206 and 204A-204N.

“In some cases, processor 200 may include a set of one or more bus controller units 216 and a system agent core 210. The one or more bus controller units 216 manage a set of peripheral buses, such as one or more PCI or PCI express buses. System agent core 210 provides management functionality for the various processor components. In some embodiments, system agent core 210 includes one or more integrated memory controllers 214 to manage access to external memory devices.

“In certain embodiments, one or more of the processor cores 202A-202N include support for simultaneous multi-threading. In such an embodiment, the system agent core 210 includes components for coordinating and operating cores 202A-202N during multi-threaded processing. The system agent core 210 may additionally include a power control unit (PCU), which includes logic and components to regulate the power state of processor cores 202A-202N and graphics processor 208.

“In some embodiments, processor 200 additionally includes graphics processor 208 to execute graphics processing operations. In some embodiments, graphics processor 208 couples with the set of shared cache units 206 and the system agent core 210, including the one or more integrated memory controllers 214. Some embodiments include a display controller 211 to drive graphics processor output to one or more coupled displays. Display controller 211 may be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within the graphics processor 208.

“Some embodiments use a ring-based interconnect unit 212 to couple the internal components of the processor 200. However, an alternative interconnect unit may be used, such as a point-to-point interconnect, a switched interconnect, or other techniques, including techniques well known in the art. In some embodiments, graphics processor 208 couples with the ring interconnect 212 via an I/O link 213.

“The exemplary I/O link 213 represents at least one of multiple varieties of I/O interconnects, and facilitates communication between various processor components and a high-performance embedded memory module 218, such as an eDRAM module. In some embodiments, each of the processor cores 202A-202N and graphics processor 208 can use the embedded memory module 218 as a shared Last Level Cache.

“In some embodiments, processor cores 202A-202N are homogenous cores executing the same instruction set architecture. In another embodiment, processor cores 202A-202N are heterogeneous in terms of instruction set architecture (ISA), where one or more of processor cores 202A-202N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, processor cores 202A-202N are heterogeneous in terms of microarchitecture, where one or more cores having relatively higher power consumption couple with one or more cores having lower power consumption. Additionally, processor 200 can be implemented on one or more chips or as an SoC integrated circuit having the illustrated components, in addition to other components.

“FIG. 3 is a block diagram of a graphics processor 300, which may be a discrete graphics processing unit or a graphics processor integrated with a plurality of processing cores. In some embodiments, the graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor, and with commands placed into the processor memory. Graphics processor 300 may include a memory interface 314 to access memory. The memory interface 314 may be an interface to local memory, one or more internal caches, or system memory.

“In some embodiments, the graphics processor 300 includes a display controller 302 to drive display output data to a display device 320. Display controller 302 includes hardware for one or more overlay planes for the display and composition of multiple layers of video or user interface elements. Display device 320 may be an internal or external display device. In one embodiment, the display device 320 is a head-mounted display device, such as a virtual reality (VR) display device or an augmented reality (AR) display device. Some embodiments include a video codec engine 306 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2, Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC, the Society of Motion Picture & Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG).

“In some embodiments, the graphics processor 300 includes a block image transfer (BLIT) engine 304 to perform two-dimensional (2D) rasterizer operations, including bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of graphics processing engine (GPE) 310. In some embodiments, GPE 310 is a compute engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.

“In certain embodiments, GPE 310 contains a 3D pipeline 312 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangle, triangle, etc.). The 3D pipeline 312 includes programmable and fixed function elements that perform various tasks and/or spawn execution threads to a 3D/Media sub-system 315. While 3D pipeline 312 can be used to perform media operations, GPE 310 also includes a media pipeline 316 that is specifically used to perform media operations, such as video post-processing and image enhancement.

“In some instances, media pipeline 316 includes fixed function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration in place of, or on behalf of, video codec engine 306. Media pipeline 316 may additionally include a thread spawning unit to spawn threads for execution on 3D/Media sub-system 315. The spawned threads perform computations for the media operations on the graphics execution units included in 3D/Media sub-system 315.

“In certain embodiments, 3D/Media sub-system 315 includes logic for executing threads spawned by 3D pipeline 312 and media pipeline 316. In one embodiment, the pipelines send thread execution requests to 3D/Media sub-system 315, which includes thread dispatch logic for arbitrating and dispatching the various requests to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D and media threads. In some embodiments, 3D/Media sub-system 315 includes one or more internal caches for thread instructions and data. The sub-system may also include shared memory, including registers and addressable memory, to share data between threads and to store output data.

“Graphics Processing Engine.”

“FIG. 4 is a block diagram of a graphics processing engine 410 of a graphics processor in accordance with some embodiments. In one embodiment, the graphics processing engine (GPE) 410 is a version of the GPE 310 shown in FIG. 3. Elements of FIG. 4 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. For example, the 3D pipeline 312 and media pipeline 316 of FIG. 3 are illustrated. In some embodiments of the GPE 410, the media pipeline 316 is optional and may not be explicitly included within the GPE 410. In at least one embodiment, a separate media and/or image processor is coupled to the GPE 410.

“In some embodiments, GPE 410 is coupled with or includes a command streamer 403, which provides a command stream to the 3D pipeline 312 and/or media pipeline 316. In some embodiments, command streamer 403 is coupled with memory, which can be system memory, or one or more of internal cache memory and shared cache memory. In some embodiments, command streamer 403 receives commands from the memory and sends the commands to 3D pipeline 312 and/or media pipeline 316. The commands are directives fetched from a ring buffer, which stores commands for the 3D pipeline 312 and media pipeline 316. In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. Commands for the 3D pipeline 312 can also include references to data stored in memory, such as vertex and geometry data for the 3D pipeline 312 and/or image data and memory objects for the media pipeline 316. The 3D pipeline 312 and media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines, or by dispatching one or more execution threads to a graphics core array 414. In one embodiment, the graphics core array 414 includes one or more blocks of graphics cores (e.g., graphics core(s) 415A, graphics core(s) 415B), each block including one or more graphics cores. Each graphics core includes a set of graphics execution resources, including general-purpose and graphics-specific execution logic to perform graphics and compute operations, as well as fixed function texture processing and/or machine learning and artificial intelligence acceleration logic.
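As a rough illustration of the ring buffer the command streamer 403 fetches from, the C++ sketch below models a host producer writing command dwords and a streamer consumer fetching them in order. The fixed size, the single-producer/single-consumer assumption, and all names are assumptions for this sketch, not the hardware design.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    // Single-producer/single-consumer ring of command dwords.
    class CommandRing {
        std::array<uint32_t, 256> buf_{};
        std::size_t head_ = 0;  // next slot the host writes
        std::size_t tail_ = 0;  // next slot the streamer fetches
    public:
        bool submit(uint32_t cmd) {                  // host side
            std::size_t next = (head_ + 1) % buf_.size();
            if (next == tail_) return false;         // ring full
            buf_[head_] = cmd;
            head_ = next;
            return true;
        }
        std::optional<uint32_t> fetch() {            // command streamer side
            if (tail_ == head_) return std::nullopt; // ring empty
            uint32_t cmd = buf_[tail_];
            tail_ = (tail_ + 1) % buf_.size();
            return cmd;
        }
    };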

“In various embodiments, the 3D pipeline 312 includes fixed function and programmable logic to process one or more shader programs, such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, or compute shaders, by processing the instructions and dispatching execution threads to the graphics core array 414. The graphics core array 414 provides a unified block of execution resources for use in processing these shader programs. Multi-purpose execution logic (e.g., execution units) within the graphics core(s) 415A-415B of the graphics core array 414 can execute multiple simultaneous execution threads associated with multiple shaders.

“In some embodiments, the graphics core array 414 also includes execution logic to perform media functions, such as video and/or image processing. In one embodiment, the execution units additionally include general-purpose logic that is programmable to perform parallel general-purpose computational operations, in addition to graphics processing operations. The general-purpose logic can perform processing operations in parallel or in conjunction with general-purpose logic within the processor core(s) 107 of FIG. 1 or cores 202A-202N as in FIG. 2.”

“Output data generated by threads executing on the graphics core array 414 can be output to memory in a unified return buffer (URB) 418. The URB 418 can store data for multiple threads. In some instances, the URB 418 may be used to send data between different threads executing on the graphics core array 414. In some embodiments, the URB 418 may additionally be used for synchronization between threads on the graphics core array and fixed function logic within the shared function logic 420.

“In some embodiments, graphics core array 414 is scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution units based on the target power and performance level of GPE 410. In one embodiment, the execution resources are dynamically scalable, such that execution resources may be enabled or disabled as needed.

“The graphics core array 414 is coupled with shared function logic 420 that includes multiple resources that are shared between the graphics cores in the graphics core array. The shared functions within the shared function logic 420 are hardware logic units that provide specialized supplemental functionality to the graphics core array 414. Shared function logic 420 can include sampler 421 logic, math 422 logic, and inter-thread communication (ITC) logic. Additionally, some embodiments implement one or more caches 425 within the shared function logic 420.

A shared function is implemented where the demand for a given specialized function is insufficient for inclusion within the graphics core array 414. Instead, a single instantiation of that specialized function is implemented in the shared function logic 420 and shared among the execution resources within the graphics core array 414. The precise set of functions that are shared between, and included within, the graphics core array 414 varies across embodiments. In some embodiments, specific shared functions within the shared function logic 420 that are used extensively by the graphics core array 414 may be included within shared function logic 416 in the graphics core array 414. In various embodiments, the shared function logic 416 within the graphics core array 414 can include some or all of the logic within the shared function logic 420. In one embodiment, all logic elements within the shared function logic 420 may be duplicated within the shared function logic 416 of the graphics core array 414. In one embodiment, the shared function logic 420 is excluded in favor of the shared function logic 416 within the graphics core array 414.

“FIG. 5 is a block diagram of hardware logic of a graphics processor core 500, according to some embodiments described herein. Elements of FIG. 5 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. In some embodiments, the illustrated graphics processor core 500 is included within the graphics core array 414 of FIG. 4. The graphics processor core 500, sometimes referred to as a core slice, can be one or multiple graphics cores within a modular graphics processor. The graphics processor core 500 is exemplary of one graphics core slice, and a graphics processor as described herein may include multiple graphics core slices based on target power and performance envelopes. Each graphics core 500 can include a fixed function block 530 coupled with multiple sub-cores 501A-501F, also referred to as sub-slices, that include modular blocks of general-purpose and fixed function logic.

“In some embodiments, the fixed function block 530 includes a geometry/fixed function pipeline 536 that can be shared by all sub-cores in the graphics processor 500, for example, in lower performance and/or lower power graphics processor implementations. In various embodiments, the geometry/fixed function pipeline 536 includes a 3D fixed function pipeline (e.g., 3D pipeline 312 as in FIG. 3 and FIG. 4).”

“In one embodiment, the fixed function block 530 also includes a graphics SoC interface 537, a graphics microcontroller 538, and a media pipeline 539. The graphics SoC interface 537 provides an interface between the graphics processor core 500 and other processor cores within a system on a chip integrated circuit. The graphics microcontroller 538 is a programmable sub-processor that is configurable to manage various functions of the graphics processor 500, including thread dispatch, scheduling, and pre-emption. The media pipeline 539 (e.g., media pipeline 316 as in FIG. 3 and FIG. 4) implements media operations via requests to compute and/or sampling logic within the sub-cores 501A-501F.

“In one embodiment, the SoC interface 537 enables the graphics core 500 to communicate with general-purpose processor cores (e.g., CPUs) and/or other components within an SoC, including memory hierarchy elements such as a shared last level cache memory, the system RAM, and/or embedded on-chip or on-package DRAM. The SoC interface 537 can also enable communication with fixed function devices, such as camera imaging pipelines, and enables the use of and/or implements global memory atomics that may be shared between the graphics processor 500 and CPUs within the SoC. The SoC interface 537 can also implement power management controls for the graphics core 500 and enables an interface between a clock domain of the graphics core 500 and other clock domains within the SoC. In one embodiment, the SoC interface 537 enables receipt of command buffers from a command streamer and global thread dispatcher that are configured to provide commands and instructions to each of one or more graphics cores within a graphics processor. The commands and instructions can be dispatched to the media pipeline 539 when media operations are to be performed, or to a geometry/fixed function pipeline (e.g., geometry and fixed function pipeline 536, geometry and fixed function pipeline 514) when graphics processing operations are to be performed.

The graphics microcontroller 538 can be configured to perform various scheduling and management tasks for the graphics core 500. In one embodiment, the graphics microcontroller 538 can perform graphics and/or compute workload scheduling on the various parallel graphics engines within execution unit (EU) arrays 502A-502F, 504A-504F within the sub-cores 501A-501F. In this scheduling model, host software can submit workloads to one of multiple graphics processor doorbells, which invokes a scheduling operation on the appropriate graphics engine. Scheduling operations include determining which workload to run next, submitting the workload to a command streamer, monitoring the progress of a workload, and notifying host software when a workload is complete. In one embodiment, the graphics microcontroller 538 can also facilitate low-power or idle states for the graphics core 500, providing the graphics core 500 with the ability to save and restore registers within the graphics core 500 across low-power state transitions independently of the operating system and/or graphics driver software on the system.
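A toy model of the doorbell flow just described: host software submits a workload and rings a doorbell, and the scheduler side hands the next pending workload to a command streamer. All names and types here are illustrative assumptions, not the microcontroller's actual interface.

    #include <cstdint>
    #include <queue>

    struct Workload { uint32_t id; };

    // Host software rings a doorbell to submit; the microcontroller-side
    // scheduler picks the next workload and hands it to a command streamer.
    class EngineScheduler {
        std::queue<Workload> pending_;
    public:
        void ring_doorbell(const Workload& w) { pending_.push(w); }
        bool dispatch_next(Workload& out) {
            if (pending_.empty()) return false;
            out = pending_.front();
            pending_.pop();        // 'out' now goes to the command streamer
            return true;
        }
    };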

The graphics core 500 may have more or fewer than the illustrated sub-cores 501A-501F, up to N modular sub-cores. For each set of N sub-cores, the graphics core 500 can also include shared function logic 510, shared and/or cache memory 512, a geometry/fixed function pipeline 514, as well as additional fixed function logic 516 to accelerate various graphics and compute processing operations. The shared function logic 510 can include logic units associated with the shared function logic 420 of FIG. 4 (e.g., sampler, math, and/or inter-thread communication logic) that can be shared by each of the N sub-cores within the graphics core 500. The shared and/or cache memory 512 can be a last level cache for the set of N sub-cores 501A-501F within the graphics core 500, and can also serve as shared memory that is accessible by multiple sub-cores. The geometry/fixed function pipeline 514 can be included instead of the geometry/fixed function pipeline 536 within the fixed function block 530 and can include the same or similar logic units.

“In one embodiment, the graphics core 500 includes additional fixed function logic 516 that can include various fixed function acceleration logic for use by the graphics core 500. In one embodiment, the additional fixed function logic 516 includes an additional geometry pipeline for use in position-only shading. In position-only shading, two geometry pipelines exist: the full geometry pipeline within the geometry/fixed function pipelines 536, 514, and a cull pipeline, an additional geometry pipeline that may be included within the additional fixed function logic 516. In one embodiment, the cull pipeline is a trimmed-down version of the full geometry pipeline. The full pipeline and the cull pipeline can execute different instances of the same application, each instance having a separate context. Position-only shading can hide long cull runs of discarded triangles, enabling shading to be completed earlier in some instances. In one embodiment, the cull pipeline logic within the additional fixed function logic 516 can execute position shaders in parallel with the main application, as the cull pipeline fetches and shades only the position attribute of the vertices, without rendering the pixels to the frame buffer. The cull pipeline can use the generated results to compute visibility information for all the triangles, without regard to whether those triangles are culled. The full pipeline, which in this instance may be referred to as a replay pipeline, can consume the visibility information to skip the culled triangles and shade only the visible triangles that are finally passed to the rasterization phase.
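The two-pass structure described above can be sketched in C++ as follows: a position-only pass computes per-triangle visibility without writing pixels, and a replay pass consumes that visibility to shade only the surviving triangles. The visibility test shown is a stand-in; a real cull pipeline performs clip-space frustum, backface, and zero-area tests.

    #include <cstddef>
    #include <vector>

    struct Vec4 { float x, y, z, w; };
    struct Triangle { Vec4 v[3]; };

    // Stand-in visibility test; illustrative only.
    static bool trivially_culled(const Triangle& t) {
        return t.v[0].w <= 0 && t.v[1].w <= 0 && t.v[2].w <= 0;
    }

    // Pass 1 (cull pipeline): fetch and shade positions only; no pixels
    // are rendered, only per-triangle visibility is recorded.
    std::vector<bool> position_only_pass(const std::vector<Triangle>& tris) {
        std::vector<bool> visible(tris.size());
        for (std::size_t i = 0; i < tris.size(); ++i)
            visible[i] = !trivially_culled(tris[i]);
        return visible;
    }

    // Pass 2 (replay pipeline): consume the visibility information and
    // shade only the triangles passed on to rasterization.
    template <typename ShadeFn>
    void replay_pass(const std::vector<Triangle>& tris,
                     const std::vector<bool>& visible, ShadeFn shade) {
        for (std::size_t i = 0; i < tris.size(); ++i)
            if (visible[i]) shade(tris[i]);
    }

    int main() {
        std::vector<Triangle> tris(3);
        auto vis = position_only_pass(tris);
        replay_pass(tris, vis, [](const Triangle&) { /* full shading here */ });
    }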

“In one embodiment, the additional fixed function logic 516 may also include machine learning acceleration logic, such as fixed function matrix multiplication logic, for implementations including optimizations for machine learning training or inferencing.”

“Within each graphics sub-core 501A-501F is a set of execution resources that may be used to perform graphics, media, and compute operations in response to requests by the graphics pipeline, media pipeline, or shader programs. The graphics sub-cores 501A-501F include multiple EU arrays 502A-502F, 504A-504F, thread dispatch and inter-thread communication (TD/IC) logic 503A-503F, a 3D sampler 505A-505F, a media sampler 506A-506F, a shader processor 507A-507F, and shared local memory (SLM) 508A-508F. The EU arrays 502A-502F, 504A-504F each include multiple execution units, which are general-purpose graphics processing units capable of performing floating-point and integer/fixed-point logic operations in service of a graphics, media, or compute operation, including graphics, media, or compute shader programs. The TD/IC logic 503A-503F performs local thread dispatch and thread control operations for the execution units within a sub-core, and facilitates communication between threads executing on the execution units of the sub-core. The 3D sampler 505A-505F can read texture or other 3D graphics related data into memory. The 3D sampler can read texture data differently based on a configured sample state and the texture format. The media sampler 506A-506F can perform similar read operations based on the type and format of the media data. In one embodiment, each graphics sub-core 501A-501F can alternately include a unified 3D and media sampler. Threads executing on the execution units within each of the sub-cores 501A-501F can make use of the shared local memory within each sub-core, to enable threads executing within a thread group to execute using a common pool of on-chip memory.

“Execution Units”

“FIGS. 6A-6B illustrate thread execution logic 600, including an array of processing elements employed in a graphics processor core according to the embodiments herein. Elements of FIGS. 6A-6B having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. FIG. 6A illustrates an overview of thread execution logic 600, which can include a variant of the hardware logic illustrated with each of the sub-cores 501A-501F of FIG. 5. FIG. 6B illustrates exemplary internal details of an execution unit.

“As illustrated in FIG. 6A, in some embodiments, thread execution logic 600 includes a shader processor 602, a thread dispatcher 604, an instruction cache 606, a scalable execution unit array including a plurality of execution units 608A-608N, a sampler 610, a data cache 612, and a data port 614. In one embodiment, the scalable execution unit array can dynamically scale by enabling or disabling one or more execution units (e.g., any of execution units 608A, 608B, 608C, 608D, through 608N-1 and 608N) based on the computational requirements of a workload. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. In some embodiments, thread execution logic 600 includes one or more connections to memory, such as system memory or cache memory, through one or more of instruction cache 606, data port 614, sampler 610, and execution units 608A-608N. In some embodiments, each execution unit (e.g., 608A) is a stand-alone programmable general-purpose computational unit capable of processing multiple data elements in parallel for multiple threads. In various embodiments, the array of execution units 608A-608N is scalable to include any number of individual execution units.

“In certain embodiments, execution units 608A-608N are used primarily to execute shader programs. A shader processor 602 can process the various shader programs and dispatch execution threads associated with the shader programs via a thread dispatcher 604. In one embodiment, the thread dispatcher includes logic to arbitrate thread initiation requests from the graphics and media pipelines and instantiate the requested threads on one or more execution units in the execution units 608A-608N. For example, a geometry pipeline can dispatch vertex or tessellation shaders to the thread execution logic for processing. In some embodiments, thread dispatcher 604 can also process runtime thread spawning requests from the executing shader programs.

“In certain embodiments, execution units 608A-608N support an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct 3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders). The execution units 608A-608N are capable of multi-issue single-instruction multiple-data (SIMD) execution, and multi-threaded operation enables an efficient execution environment in the face of higher-latency memory accesses. Each execution unit has a high-bandwidth register file and an associated independent thread state. Execution can be multi-issue per clock, including integer, single- and double-precision floating-point operations, and SIMD branch operations. While waiting for data from memory or one of the shared functions, dependency logic within the execution units 608A-608N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader or fragment shader.
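The dependency logic described above can be modeled as a small arbiter: a thread that issues a memory request is marked sleeping, and the unit's issue slot goes to any ready thread until the data returns. This C++ sketch is a behavioral illustration only; names and structure are assumptions.

    #include <vector>

    // Behavioral model only: one entry per hardware thread resident on an
    // execution unit.
    struct HwThread { int id; bool waiting_on_memory = false; };

    class ThreadArbiter {
        std::vector<HwThread> threads_;
    public:
        explicit ThreadArbiter(int n) {
            for (int i = 0; i < n; ++i) threads_.push_back({i});
        }
        void sleep(int id) { threads_[id].waiting_on_memory = true; }  // request issued
        void wake(int id)  { threads_[id].waiting_on_memory = false; } // data returned
        // The issue slot goes to any ready thread; -1 means all are stalled.
        int next_ready() const {
            for (const auto& t : threads_)
                if (!t.waiting_on_memory) return t.id;
            return -1;
        }
    };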

“Each execution unit of execution units 608A-608N operates on arrays of data elements. The number of data elements is the ‘execution size,’ or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical Arithmetic Logic Units (ALUs) or Floating Point Units (FPUs) for a particular graphics processing unit. Execution units 608A-608N may support both integer and floating-point data types in some embodiments.

“The execution unit instruction set includes SIMD instructions. The various data elements can be stored as a packed data type in a register, and the execution unit will process the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible.
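The packed-element views named above can be illustrated on the host with a 256-bit union, written as four QW, eight DW, sixteen W, or thirty-two B lanes. This is a layout illustration only; it assumes a little-endian host, and type-punning through a union is a common compiler extension in C++ rather than strictly portable.

    #include <cstdint>
    #include <cstdio>

    // A 256-bit register viewed at the packed-element granularities above.
    union Reg256 {
        uint64_t qw[4];  // four Quad-Word (QW) elements
        uint32_t dw[8];  // eight Double Word (DW) elements
        uint16_t w[16];  // sixteen Word (W) elements
        uint8_t  b[32];  // thirty-two byte (B) elements
    };

    int main() {
        Reg256 r{};
        for (int i = 0; i < 8; ++i) r.dw[i] = i;  // write as eight DW lanes
        // Re-read the same bits as sixteen W lanes (little-endian host).
        std::printf("w[2]=%u w[3]=%u\n", unsigned(r.w[2]), unsigned(r.w[3]));
    }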

“In one embodiment, one or more execution units can be combined into a fused execution unit 609A-609N having thread control logic (607A-607N) that is common to the fused EUs. Multiple EUs can be fused into an EU group. Each EU in the fused EU group can be configured to execute a separate SIMD hardware thread. The number of EUs in a fused EU group can vary according to embodiments. Additionally, various SIMD widths can be performed per-EU, including SIMD8, SIMD16, and SIMD32. Each fused graphics execution unit 609A-609N includes at least two execution units. For example, fused execution unit 609A includes a first EU 608A, a second EU 608B, and thread control logic 607A that is common to the first EU 608A and the second EU 608B. The thread control logic 607A controls threads executed on the fused graphics execution units 609A-609N, allowing each EU to execute using a common instruction pointer register.

“One or more internal instruction caches (e.g., 606) are included in the thread execution logic 600 to cache thread instructions for the execution units. In some embodiments, one or more data caches (e.g., 612) are included to cache thread data during thread execution. In some embodiments, a sampler 610 is included to provide texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 610 includes specialized texture or media sampling functionality to process texture or media data during the sampling process before providing the sampled data to an execution unit.

“During execution, the graphics and media pipelines send thread initiation requests to thread execution logic 600 via thread spawning and dispatch logic. Once a group of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within the shader processor 602 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, a pixel shader or fragment shader calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some cases, pixel processor logic within the shader processor 602 then executes an API-supplied pixel or fragment shader program. To execute the shader program, the shader processor 602 dispatches threads via thread dispatcher 604. In some embodiments, shader processor 602 uses texture sampling logic in the sampler 610 to access texture data stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.

“In some embodiments, the data port 614 provides a memory access mechanism for the thread execution logic 600 to output processed data to memory for further processing on a graphics processor output pipeline. In some embodiments, the data port 614 includes or couples to one or more cache memories (e.g., data cache 612) to cache data for memory access via the data port.

“As illustrated in FIG. 6B, a graphics execution unit 608 can include an instruction fetch unit 637, a general register file array (GRF) 624, an architectural register file array (ARF) 626, a thread arbiter 622, a send unit 630, and a set of SIMD floating point units (FPUs) 634. In one embodiment, a set of dedicated integer SIMD ALUs 635 is also included. The GRF 624 and ARF 626 include the general register files and architectural register files associated with each parallel hardware thread that may be active in the graphics execution unit 608. In one embodiment, per-thread architectural state is maintained in the ARF 626, while data used during thread execution is stored in the GRF 624. The execution state of each thread, including the instruction pointers for each thread, can be held in thread-specific registers in the ARF 626.

“In one embodiment, the graphics execution unit 608 uses a combination of Simultaneous Multi-Threading (SMT) and fine-grained Interleaved Multi-Threading (IMT). The architecture can be fine-tuned at design time based on a target number of concurrent threads and number of registers per execution unit, where execution unit resources are divided across the logic used to execute multiple concurrent threads.

“In one embodiment, the graphics execution unit 608 can co-issue multiple instructions, which may each be different instructions. The thread arbiter 622 of the graphics execution unit 608 can dispatch the instructions to one of the send unit 630, branch unit 632, or SIMD FPU(s) 634 for execution. Each execution thread can access 128 general-purpose registers within the GRF 624, where each register can store 32 bytes, accessible as a SIMD 8-element vector of 32-bit data elements. In one embodiment, each execution unit thread has access to 4 Kbytes within the GRF 624, although other embodiments may provide more or fewer register resources. In one embodiment, up to seven threads can execute simultaneously, although the number of threads per execution unit can vary according to embodiments. In an embodiment in which seven threads may access 4 Kbytes each, the GRF 624 can store a total of 28 Kbytes. Flexible addressing modes can permit registers to be addressed together to build effectively wider registers or to represent strided rectangular block data structures.
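The register-file arithmetic in the paragraph above can be checked as compile-time assertions in C++; the constants simply restate the figures given in the text.

    #include <cstddef>

    // 128 registers x 32 bytes = 4 Kbytes per thread;
    // seven threads x 4 Kbytes = 28 Kbytes of GRF in total.
    constexpr std::size_t kRegisters   = 128;
    constexpr std::size_t kBytesPerReg = 32;  // one SIMD8 vector of 32-bit elements
    constexpr std::size_t kThreads     = 7;

    constexpr std::size_t kBytesPerThread = kRegisters * kBytesPerReg;
    static_assert(kBytesPerThread == 4 * 1024, "4 Kbytes per thread");
    static_assert(kThreads * kBytesPerThread == 28 * 1024, "28 Kbytes of GRF");

    int main() {}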

“In one embodiment, sampler operations and memory operations are dispatched via ‘send’ instructions that are executed by the message passing send unit 630. In one embodiment, branch instructions are dispatched to a dedicated branch unit 632 to facilitate SIMD divergence and eventual convergence.

“One embodiment of the graphics execution unit 608 includes one or more SIMD floating point units (FPU(s)) 634 to perform floating-point operations. In one embodiment, the FPU(s) 634 also support integer computation. In one embodiment, the FPU(s) 634 can SIMD execute up to M number of 32-bit floating-point (or integer) operations, or SIMD execute up to 2M 16-bit integer or 16-bit floating-point operations. One embodiment provides extended math capability to support high-throughput transcendental math functions and double-precision 64-bit floating-point. In some embodiments, a set of 8-bit integer SIMD ALUs 635 is also present, and may be specifically optimized for machine learning operations.

“In one embodiment, arrays of multiple instances of the graphics execution unit 608 can be instantiated in a graphics sub-core grouping (e.g., a sub-slice). For scalability, product architects can choose the exact number of execution units per sub-core grouping. In one embodiment, the execution unit 608 can execute instructions across a plurality of execution channels. In a further embodiment, each thread executed on the graphics execution unit 608 is executed on a different channel.

“FIG. 7 is a block diagram illustrating a graphics processor instruction format 700 according to certain embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid-lined boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are only included in a subset of the instructions. The instruction format 700 described and illustrated comprises macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed.

“In some embodiments, the graphics processor execution units natively support instructions in a 128-bit instruction format 710. A 64-bit compacted instruction format 730 is available for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit instruction format 710 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 730. The native instructions available in the 64-bit format 730 vary by embodiment. In some embodiments, the instruction is compacted in part using a set of index values in an index field 713. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit instruction format 710.
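A minimal sketch of that index-based compaction follows, assuming an invented 2-bit index field and toy table contents, since the document does not specify the actual widths or table values.

    #include <array>
    #include <cstdint>

    struct Native128 { uint64_t hi, lo; };

    // Toy compaction table; real table contents and bit widths are not
    // specified in the document.
    static const std::array<Native128, 4> kCompactionTable = {{
        {0x1000000000000000ull, 0x0ull},
        {0x2000000000000000ull, 0x0ull},
        {0x3000000000000000ull, 0x0ull},
        {0x4000000000000000ull, 0x0ull},
    }};

    // Reconstruct a native 128-bit instruction from a compacted 64-bit one
    // whose low bits hold the index field (assumed 2 bits wide here).
    Native128 decompact(uint64_t compact64) {
        unsigned index = compact64 & 0x3u;           // analogue of index field 713
        Native128 native = kCompactionTable[index];  // table supplies fixed fields
        native.lo |= compact64 >> 2;                 // remaining bits pass through
        return native;
    }

    int main() { (void)decompact(0x5u); }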

“For each format, instruction opcode 712 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit performs each instruction across all data channels of the operands. In some embodiments, instruction control field 714 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For instructions in the 128-bit instruction format 710, an exec-size field 716 limits the number of data channels that will be executed in parallel. In some embodiments, the exec-size field 716 is not available for use in the 64-bit compact instruction format 730.

“Some execution unit instructions have up to three operands, including two source operands, src0 720 and src1 722, and one destination 718. In some embodiments, the execution units support dual destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 724), where the instruction opcode 712 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.

“In certain embodiments, the 128-bit instruction format 710 includes an access/address mode field 726 indicating, for example, whether direct or indirect register addressing mode is used. When direct register addressing mode is used, the register address of one or more operands is directly provided by bits in the instruction.

“In some embodiments, the 128-bit instruction format 710 includes an access/address mode field 726, which specifies an address mode and/or an access mode for the instruction. In one embodiment, the access mode is used to define a data access alignment for the instruction. Some embodiments support access modes including a 16-byte aligned access mode and a 1-byte aligned access mode, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction may use byte-aligned addressing for source and destination operands, and when in a second mode, the instruction may use 16-byte-aligned addressing for all source and destination operands.

“In one embodiment, the address mode portion of the access/address mode field 726 determines whether the instruction is to use direct or indirect addressing. When direct register addressing mode is used, bits in the instruction directly provide the register address of one or more operands. When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.
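The two addressing modes can be illustrated with a toy register file in C++; the field widths and the wrap-around behavior are assumptions for the sketch, not the hardware's semantics.

    #include <array>
    #include <cstdint>

    // Toy register file; field widths and wrap-around are assumptions.
    struct RegFile {
        std::array<uint32_t, 128> r{};
        uint32_t address_reg = 0;  // used by indirect mode
    };

    // Direct mode: bits in the instruction give the register number.
    uint32_t read_direct(const RegFile& rf, uint8_t reg_field) {
        return rf.r[reg_field % rf.r.size()];
    }

    // Indirect mode: register number = address register value plus the
    // instruction's address-immediate field (unsigned wrap for the toy).
    uint32_t read_indirect(const RegFile& rf, int8_t addr_immediate) {
        uint32_t reg = (rf.address_reg + addr_immediate) % rf.r.size();
        return rf.r[reg];
    }

    int main() {
        RegFile rf;
        rf.address_reg = 10;
        (void)read_direct(rf, 5);     // reads r[5]
        (void)read_indirect(rf, 2);   // reads r[12]
    }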

“In some embodiments, instructions are grouped based on opcode 712 bit fields to simplify opcode decode 740. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. In some embodiments, a move and logic opcode group 742 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, move and logic group 742 shares the five most significant bits (MSBs), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 744 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 746 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 748 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 748 performs the arithmetic operations in parallel across data channels. A vector math group 750 includes arithmetic instructions in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic such as dot product calculations on vector operands.
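
The grouping can be expressed as a small decode function. The C sketch below switches on bits 4 through 6 of the 8-bit opcode, following the encodings listed above; the enum names are invented for illustration.

    #include <stdint.h>

    typedef enum { GRP_MOVE, GRP_LOGIC, GRP_FLOW, GRP_MISC,
                   GRP_PARALLEL_MATH, GRP_VECTOR_MATH, GRP_UNKNOWN } opcode_group_t;

    /* Bits 4, 5, and 6 of the 8-bit opcode select the group. */
    static opcode_group_t opcode_group(uint8_t opcode)
    {
        switch ((opcode >> 4) & 0x7) {
        case 0x0: return GRP_MOVE;          /* 0000xxxxb: move (mov)         */
        case 0x1: return GRP_LOGIC;         /* 0001xxxxb: logic (cmp, ...)   */
        case 0x2: return GRP_FLOW;          /* 0010xxxxb: call, jmp (0x20)   */
        case 0x3: return GRP_MISC;          /* 0011xxxxb: wait, send (0x30)  */
        case 0x4: return GRP_PARALLEL_MATH; /* 0100xxxxb: add, mul (0x40)    */
        case 0x5: return GRP_VECTOR_MATH;   /* 0101xxxxb: dot product (0x50) */
        default:  return GRP_UNKNOWN;
        }
    }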

“Graphics Pipeline”

“FIG. 8 is a block diagram of another embodiment of a graphics processor 800. Elements of FIG. 8 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in a manner similar to that described elsewhere herein, but are not limited to such.

“In some embodiments, graphics processor 800 includes a geometry pipeline 820, a media pipeline 830, a display engine 840, thread execution logic 850, and a render output pipeline 870. In some embodiments, graphics processor 800 is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown), or via commands issued to graphics processor 800 over a ring interconnect 802. In some embodiments, ring interconnect 802 couples graphics processor 800 to other processing components, such as other graphics processors or general-purpose processors. Commands from ring interconnect 802 are interpreted by a command streamer 803, which supplies instructions to individual components of the geometry pipeline 820 or the media pipeline 830.

“In some embodiments, command streamer 803 directs the operation of a vertex fetcher 805, which reads vertex data from memory and executes vertex-processing commands provided by command streamer 803. In some embodiments, vertex fetcher 805 provides vertex data to a vertex shader 807, which performs coordinate-space transformation and lighting operations on each vertex. In some embodiments, vertex fetcher 805 and vertex shader 807 execute vertex-processing instructions by dispatching execution threads to execution units 852A-852B via a thread dispatcher 831.

“In some embodiments, execution units 852A-852B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 852A-852B have an attached L1 cache 851 that is specific to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.

“Some embodiments of geometry pipeline 820 include tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 811 configures the tessellation operations, and a programmable domain shader 817 provides back-end evaluation of the tessellation output. A tessellator 813 operates at the direction of hull shader 811 and contains special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to geometry pipeline 820. In some embodiments, if tessellation is not used, the tessellation components (e.g., hull shader 811, tessellator 813, and domain shader 817) can be bypassed.

“In certain embodiments, complete geometric objects can be processed by a geometry shader 819 via one or more threads dispatched to execution units 852A-852B, or can proceed directly to the clipper 829. In some embodiments, the geometry shader operates on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If the tessellation is disabled, the geometry shader 819 receives input from the vertex shader 807. In some embodiments, geometry shader 819 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.

“Before rasterization, a clipper 829 processes the vertex data. The clipper 829 may be a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. In some embodiments, a rasterizer and depth test component 873 in the render output pipeline 870 dispatches pixel shaders to convert the geometric objects into per-pixel representations. In some embodiments, pixel shader logic is included in thread execution logic 850. In some embodiments, an application can bypass the rasterizer and depth test component 873 and access un-rasterized vertex data via a stream-out unit 823.

“The graphics processor 800 has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the processor. In some embodiments, execution units 852A-852B and associated logic units (e.g., L1 cache 851, sampler 854, texture cache 858, etc.) interconnect via a data port 856 to perform memory access and to communicate with render output pipeline components of the processor. In some embodiments, sampler 854, caches 851 and 858, and execution units 852A-852B each have separate memory access paths. In one embodiment, the texture cache 858 can also be configured as a sampler cache.

“In certain embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g., bit-block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.

“In certain embodiments, the graphics processor media pipeline 830 includes a media engine 837 and a video front end 834. In some embodiments, video front end 834 receives pipeline commands from the command streamer 803. Some embodiments include a separate command streamer in the media pipeline 830. In some embodiments, video front end 834 processes media commands before sending them to the media engine 837. In some embodiments, media engine 837 includes thread-spawning functionality to spawn threads for dispatch to thread execution logic 850 via thread dispatcher 831.

“In some instances, graphics processor 800 also includes a display engine 840. In some embodiments, display engine 840 is external to processor 800 and couples with the graphics processor via the ring interconnect 802, or some other interconnect bus or fabric. In some embodiments, display engine 840 includes a 2D engine 841 and a display controller 843. In some embodiments, display engine 840 contains special-purpose logic capable of operating independently of the 3D pipeline. In some embodiments, display controller 843 couples with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.

“In some embodiments, the geometry pipeline 820 and media pipeline 830 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL), Open Computing Language (OpenCL), and/or the Vulkan graphics and compute API, all from the Khronos Group. In some embodiments, support may also be provided for the Direct3D library from the Microsoft Corporation. In some embodiments, a combination of these libraries may be supported. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

“Graphics Pipeline Programming”

“FIG. 9A is a block diagram illustrating a graphics processor command format 900 according to some embodiments, and FIG. 9B is a block diagram illustrating a graphics processor command sequence 910 according to an embodiment. The solid-lined boxes in FIG. 9A illustrate the components that are typically included in a graphics command, while the dashed lines indicate components that are optional or that are only included in a subset of the graphics commands. The exemplary graphics processor command format 900 of FIG. 9A includes data fields to identify a client 902, a command operation code (opcode) 904, and data 906 for the command. Some commands also include a sub-opcode 905 and a command size 908.

Client 902 in some embodiments specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition the further processing of the command and to route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 904 and, if present, the sub-opcode 905 to determine the operation to perform. The client unit performs the command using information in the data field 906. For some commands, an explicit command size 908 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned via multiples of a double word.
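
To make the field roles concrete, the C sketch below parses a command header into its client 902, opcode 904, sub-opcode 905, size 908, and data 906 fields. The bit positions are hypothetical; only the field roles and the double-word alignment come from the text.

    #include <stdint.h>

    typedef struct {
        uint32_t client;      /* client unit that processes the command (902) */
        uint32_t opcode;      /* operation to perform (904)                   */
        uint32_t sub_opcode;  /* optional refinement of the operation (905)   */
        uint32_t size_dw;     /* command size in double words (908)           */
        const uint32_t *data; /* command data (906)                           */
    } gfx_command_t;

    /* Parse one command from a double-word-aligned stream. */
    static gfx_command_t parse_command(const uint32_t *dw)
    {
        gfx_command_t c;
        c.client     = (dw[0] >> 29) & 0x7;  /* assumed field positions */
        c.opcode     = (dw[0] >> 24) & 0x1f;
        c.sub_opcode = (dw[0] >> 16) & 0xff;
        c.size_dw    =  dw[0]        & 0xff; /* explicit size, if present */
        c.data       = &dw[1];
        return c;
    }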

“The flow diagram in FIG. 9B illustrates an exemplary graphics processor command sequence 910. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the command sequence shown to set up, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for purposes of example only, as embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in at least partial concurrence.

“In some instances, the graphics processor command sequence 910 may begin with a pipeline flush command 912 to cause any active graphics pipeline to complete the currently pending commands for the pipeline. In some embodiments, the 3D pipeline 922 and the media pipeline 924 do not operate concurrently. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor will pause command processing until the active drawing engines complete pending operations. Optionally, any data in the render cache that is marked ‘dirty’ can be flushed to memory. In some embodiments, pipeline flush command 912 can be used for pipeline synchronization or before placing the graphics processor into a low-power state.

“In some embodiments, a pipeline select command 913 is used when a command sequence requires the graphics processor to explicitly switch between pipelines. In some embodiments, a pipeline select command 913 is required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 912 is required immediately before a pipeline switch via the pipeline select command 913.

“In certain embodiments, a pipeline control command 914 configures a graphics pipeline for operation and is used to program both the 3D pipeline 922 and the media pipeline 924. In some embodiments, pipeline control command 914 configures the pipeline state for the active pipeline. In one embodiment, the pipeline control command 914 is used for pipeline synchronization and to clear data from one or more cache memories within the active pipeline.

“In some embodiments, return buffer state commands 916 are used to configure a set of return buffers for the respective pipelines to write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and to perform cross-thread communication. In some embodiments, the return buffer state 916 includes selecting the size and number of return buffers to use for a set of pipeline operations.

“The remaining commands in the command sequence differ based on the active pipeline for operations. Based on a pipeline determination 920, the command sequence is tailored to the 3D pipeline 922 beginning with the 3D pipeline state 930, or to the media pipeline 924 beginning with the media pipeline state 940.

“The commands to configure the 3D pipeline state 930 include 3D state setting commands for vertex buffer state, vertex element state, depth buffer state, and other state variables that are to be configured before 3D primitive commands are processed. The values of these commands are determined at least in part based on the particular 3D API in use. In some embodiments, 3D pipeline state 930 commands are also able to selectively disable or bypass certain pipeline elements.

“In certain embodiments, a 3D primitive 932 command is used to submit 3D primitives to be processed by the 3D pipeline. Commands and associated parameters that are passed to the graphics processor via the 3D primitive 932 command are forwarded to the vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 932 command data to generate vertex data structures, which are stored in one or more return buffers. In some embodiments, the 3D primitive 932 command is used to perform vertex operations on 3D primitives via vertex shaders. To process vertex shaders, 3D pipeline 922 dispatches shader execution threads to graphics processor execution units.

“Some embodiments trigger 3D pipeline 922 via an execute 934 command or event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered via a ‘go’ or ‘kick’ command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline will perform geometry processing for the 3D primitives. Once operations are complete, the resulting geometric objects are rasterized and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.
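
Pulling the 3D path of FIG. 9B together, the C sketch below emits the commands in the order described above into a toy command buffer. The opcode values (which simply reuse the reference numbers) and helper names are hypothetical stand-ins; only the ordering follows the sequence described in the text.

    #include <stdint.h>
    #include <stddef.h>

    enum { CMD_PIPELINE_FLUSH = 912, CMD_PIPELINE_SELECT = 913,
           CMD_PIPELINE_CONTROL = 914, CMD_RETURN_BUFFER_STATE = 916,
           CMD_3D_STATE = 930, CMD_3D_PRIMITIVE = 932, CMD_EXECUTE = 934 };

    typedef struct { uint32_t dw[64]; size_t n; } cmdbuf_t;

    static void emit(cmdbuf_t *cb, uint32_t cmd) { cb->dw[cb->n++] = cmd; }

    /* Emit the 3D path of command sequence 910 in order. */
    static void build_3d_sequence(cmdbuf_t *cb)
    {
        emit(cb, CMD_PIPELINE_FLUSH);      /* 912: drain pending commands     */
        emit(cb, CMD_PIPELINE_SELECT);     /* 913: choose the 3D pipeline 922 */
        emit(cb, CMD_PIPELINE_CONTROL);    /* 914: configure pipeline state   */
        emit(cb, CMD_RETURN_BUFFER_STATE); /* 916: set up return buffers      */
        emit(cb, CMD_3D_STATE);            /* 930: vertex/depth buffer state  */
        emit(cb, CMD_3D_PRIMITIVE);        /* 932: submit primitives          */
        emit(cb, CMD_EXECUTE);             /* 934: 'go'/'kick' trigger        */
    }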

“In certain embodiments, the graphics processor command sequence 910 follows the media pipeline 924 path when performing media operations. In general, the specific use and manner of programming for the media pipeline 924 depends on the media or compute operations to be performed. Specific media decode operations may be offloaded to the media pipeline during media decode. In some embodiments, the media pipeline can also be bypassed and media decode can be performed in whole or in part using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processor unit (GPGPU) operations, where the graphics processor is used to perform SIMD vector operations using computational shader programs that are not explicitly related to the rendering of graphics primitives.

“Media pipeline 924 may be configured in a similar manner as the 3D pipeline 922 in some embodiments. A set of commands to configure the media pipeline state 940 is dispatched or placed into a command queue before the media object commands 942. In some embodiments, commands for the media pipeline state 940 include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as encode or decode format. In some embodiments, commands for the media pipeline state 940 also support the use of one or more pointers to ‘indirect’ state elements that contain a batch of state settings.

“Media object commands 942 supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing video data to be processed. In some embodiments, all media pipeline states must be valid before issuing a media object command 942. Once the pipeline state is configured and media object commands 942 are queued, the media pipeline 924 is triggered via an execute command 944 or an equivalent execute event (e.g., a register write). Output from media pipeline 924 may then be post-processed by operations provided by the 3D pipeline 922 or the media pipeline 924. In some embodiments, GPGPU operations are configured and executed in a similar manner as media operations.
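
A companion sketch for the media path of FIG. 9B, reusing the toy command buffer and emit helper from the 3D sketch above; the opcode values are again hypothetical stand-ins for the reference numbers.

    enum { CMD_MEDIA_PIPELINE_STATE = 940, CMD_MEDIA_OBJECT = 942,
           CMD_MEDIA_EXECUTE = 944 };

    /* Emit the media path of command sequence 910 in order. */
    static void build_media_sequence(cmdbuf_t *cb)
    {
        emit(cb, CMD_PIPELINE_FLUSH);       /* 912: complete pending work       */
        emit(cb, CMD_PIPELINE_SELECT);      /* 913: choose media pipeline 924   */
        emit(cb, CMD_MEDIA_PIPELINE_STATE); /* 940: decode/encode format, etc.  */
        emit(cb, CMD_MEDIA_OBJECT);         /* 942: pointers to video buffers   */
        emit(cb, CMD_MEDIA_EXECUTE);        /* 944: trigger (register write)    */
    }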

“Graphics Software Architecture”

“FIG. 10 illustrates an exemplary graphics software architecture for a data processing system according to some embodiments. In some embodiments, the software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor cores 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system.

“In some instances, 3D graphics application 1010 contains one or more shader programs including shader instructions 1012. The shader language instructions may be written in a high-level shader language, such as the High-Level Shader Language (HLSL) or the OpenGL Shader Language (GLSL). The application also includes executable instructions 1014 in a machine language suitable for execution by the general-purpose processor core 1034, as well as graphics objects 1016 defined by vertex data.


How to Search for Patents

A patent search is the first step toward obtaining a patent. You can perform a Google patent search or a USPTO search. “Patent pending” refers to a product covered by a filed patent application; you can search Public PAIR to find the patent application. After the patent office approves your application, you can do a patent number lookup to locate the issued patent, and your product is then patented. You can also use the USPTO search engine (see below for details), or get help from a patent lawyer. Patents in the United States are granted by the United States Patent and Trademark Office, which also reviews trademark applications.

Are you interested in similar patents? These are the steps to follow:

1. Brainstorm terms to describe your invention, based on its purpose, composition, or use.

Write down a brief but precise description of the invention. Don’t use generic terms such as “device,” “process,” or “system.” Consider synonyms for the terms you chose initially. Next, take note of important technical terms and keywords.

Use the questions below to help you identify keywords or concepts.

  • What is the purpose of the invention? Is it a utilitarian device or an ornamental design?
  • Is the invention a way to make something or perform a function, or is it a product?
  • What is the invention made of? What is its physical composition?
  • What is the invention used for?
  • What technical terms and keywords describe the invention’s nature? A technical dictionary can help you locate the right terms.

2. Use these terms to search for relevant Cooperative Patent Classifications using the Classification Search Tool. If you are unable to find the right classification for your invention, scan through the classification’s class schemes (class schedules) and try again. If you don’t get any results from the Classification Text Search, consider substituting synonyms for the words you used to describe your invention.

3. Check the CPC Classification Definition to confirm the relevance of the CPC classification you found. If the selected classification title has a blue box with a “D” at its left, the hyperlink will take you to a CPC classification definition. CPC classification definitions help you determine the applicable classification’s scope so that you can choose the most relevant one. These definitions may also include search tips or other suggestions that could be helpful for further research.

4. Use the Patents Full-Text and Image Database to retrieve patent documents that include the CPC classification. By focusing on the abstracts and representative drawings, you can narrow down your search to the most relevant patent publications.

5. Review this selection of patent publications in detail for any similarities to your invention, paying close attention to the claims and the specification. Refer to the references cited by the applicant and the patent examiner to find additional patents.

6. Retrieve published patent applications that match the CPC classification you chose in Step 3. Use the same search strategy as in Step 4 to narrow your results to the most relevant patent applications by reviewing the abstracts and representative drawings on each page. Next, examine the published patent applications carefully, paying special attention to the claims and the other drawings.

7. You can find additional US patent publications by keyword searching in the AppFT and PatFT databases, by classification searching of non-US patents as described below, and by using web search engines to find non-patent literature disclosures about inventions. Here are some examples:

  • Add keywords to your search. Keyword searches may turn up documents that were not well categorized or whose classifications were missed during Step 2. For example, US patent examiners often supplement their classification searches with keyword searches. Consider using technical engineering terminology rather than everyday words.
  • Search for foreign patents using the CPC classification. Re-run the search using international patent office search engines, such as Espacenet, the European Patent Office’s worldwide patent publication database of over 130 million patent publications. Other national databases are also available.
  • Search non-patent literature. Inventions can be disclosed in many non-patent publications. It is recommended that you search journals, books, websites, technical catalogs, conference proceedings, and other print and electronic publications.

To review your search, you can hire a registered patent attorney to assist. A preliminary search will help you better prepare to discuss your invention and related inventions with a patent attorney, and the attorney will not have to spend as much time or money on patenting basics.
