Invented by Joydeep Ray, Scott Janus, Varghese George, Subramaniam Maiyuran, Altug Koker, Abhishek Appu, Prasoonkumar Surti, Vasanth Ranganathan, Andrei Valentin, Ashutosh Garg, Yoav Harel, Arthur Hunter, Jr., Sungye Kim, Mike MacPherson, Elmoustapha Ould-Ahmed-Vall, William Sadler, Lakshminarayanan Striramassarma, Vikranth Vemulapalli, Intel Corp

The market for Sparse Optimizations for a Matrix Accelerator Architecture

In recent years, demand has surged for computing architectures that can handle large-scale data processing efficiently. One architecture that has attracted considerable attention is the Matrix Accelerator Architecture, which is designed to accelerate the matrix operations that underpin machine learning, data analytics, and scientific simulation. The architecture is highly efficient for dense matrix operations, where most elements are non-zero. For sparse matrices, however, where a significant portion of the elements are zero, its performance can be suboptimal. This is where the market for Sparse Optimizations for a Matrix Accelerator Architecture comes into play.

Sparse optimizations are techniques and algorithms that exploit the sparsity of matrices to improve the performance and efficiency of matrix operations. They reduce computational complexity and memory requirements by operating only on the non-zero elements of a matrix, which can significantly shorten execution time and reduce the energy consumed by matrix operations on the Matrix Accelerator Architecture.

The market for these optimizations is driven by growing demand for high-performance computing solutions that can handle large-scale sparse data efficiently. With the increasing availability of big data and the need for real-time analytics, organizations are looking for ways to optimize their computational resources and reduce the time and cost of processing sparse matrices.

One of the key players in this market is the software industry. Companies that develop software libraries and frameworks for matrix operations are investing in research and development to incorporate sparse optimizations into their products, giving developers the tools and algorithms needed to handle sparse matrices efficiently on the Matrix Accelerator Architecture.

The hardware industry is another important player. Manufacturers of Matrix Accelerator Architecture hardware continue to refine their designs to handle sparse matrices better, including specialized hardware units and memory architectures that can store and process sparse data efficiently. These advances are crucial for achieving optimal performance and energy efficiency in sparse matrix operations.

Research institutions and academia also drive innovation in this market. Researchers continue to explore new algorithms and techniques to further improve sparse optimizations for the Matrix Accelerator Architecture, and their findings, often published in academic journals and conferences, contribute to the market's overall growth.

In conclusion, the market for Sparse Optimizations for a Matrix Accelerator Architecture is growing rapidly due to the demand for efficient computing solutions that can handle large-scale sparse data. With continued advances in software, hardware, and research, the performance and efficiency of matrix operations on the Matrix Accelerator Architecture are expected to improve significantly, presenting lucrative opportunities for software developers, hardware manufacturers, and researchers alike.

The Intel Corp invention works as follows

Embodiments described herein include software, firmware, and hardware logic that provide techniques for performing arithmetic on sparse data using a systolic processing unit. The embodiments described herein provide techniques for skipping computations on matrices or submatrices that are zero-filled, techniques for maintaining data compression all the way up to the processing unit, and an architecture for a sparsity-aware logic unit.
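
The patent text stays at a high level, so the following C++ sketch is only a rough software analogy of the "skip zero-filled submatrices" idea, not Intel's actual hardware or firmware logic: a tiled matrix multiply checks each operand tile for all-zero content and bypasses the multiply-accumulate work when either tile is empty. All names, the tile size, and the dense row-major layout are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a software analogy of skipping work for zero-filled
// submatrices (tiles). The real embodiments describe logic in a systolic
// processing unit; this sketch just mirrors the control-flow idea.
constexpr std::size_t TILE = 4;

// Returns true if every element of the TILE x TILE tile starting at
// (row, col) of the row-major n x n matrix 'm' is zero.
bool tile_is_zero(const std::vector<float>& m, std::size_t n,
                  std::size_t row, std::size_t col) {
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t j = 0; j < TILE; ++j)
            if (m[(row + i) * n + (col + j)] != 0.0f) return false;
    return true;
}

// C += A * B for n x n matrices, with n divisible by TILE. Tiles of A or B
// that are entirely zero contribute nothing, so their work is skipped.
void tiled_matmul_skip_zero(const std::vector<float>& A,
                            const std::vector<float>& B,
                            std::vector<float>& C, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += TILE)
        for (std::size_t bj = 0; bj < n; bj += TILE)
            for (std::size_t bk = 0; bk < n; bk += TILE) {
                // The sparse optimization: bypass the inner loops when
                // either operand tile is zero-filled.
                if (tile_is_zero(A, n, bi, bk) || tile_is_zero(B, n, bk, bj))
                    continue;
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t j = 0; j < TILE; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < TILE; ++k)
                            acc += A[(bi + i) * n + (bk + k)] *
                                   B[(bk + k) * n + (bj + j)];
                        C[(bi + i) * n + (bj + j)] += acc;
                    }
            }
}
```

In hardware, the zero check would typically come from metadata carried with compressed data rather than from scanning each tile, but the payoff is the same: zero-filled blocks never reach the multiply-accumulate stage.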

Background for Sparse Optimizations for a Matrix Accelerator Architecture

Current parallel graphics data processing refers to systems and methods developed to perform specific operations on graphics data, such as linear interpolation, tessellation, rasterization, texture mapping, and depth testing. Traditionally, graphics processors used fixed-function computational units to process graphics data; more recently, however, portions of graphics processors have been made programmable, enabling them to support a wider variety of operations for processing vertex and fragment data.

To further improve performance, graphics processors implement processing techniques such as pipelining, which attempt to process as much graphics data in parallel as possible across the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. Shane Cook's CUDA Programming, Chapter 3 (2013), provides a general overview of the software and hardware used in SIMT architectures.

A graphics processing unit (GPU) is communicatively coupled to the host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis, and/or other general-purpose GPU (GPGPU) functions. The GPU can be coupled to the host processor cores over a bus (e.g., PCIe or NVLink) or other interconnect, or it can be integrated into the same package or chip as the cores and communicate with them over an internal processor bus or interconnect (i.e., internal to the package or chip). Regardless of how the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic to process these commands/instructions efficiently.
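
The host-to-GPU handoff described above can be pictured, very loosely, as the host enqueuing a batch of commands that the GPU's dedicated logic later drains. The C++ sketch below is a hypothetical model of that flow; the struct fields, queue, and function names are invented for illustration and are not the actual driver or hardware interface.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical model of a work descriptor: a sequence of commands the host
// cores hand to the GPU. Field names are illustrative, not a real interface.
struct Command {
    std::uint32_t opcode;                 // what operation to perform
    std::vector<std::uint32_t> operands;  // operation-specific arguments
};

struct WorkDescriptor {
    std::uint64_t id;               // lets the host track completion
    std::vector<Command> commands;  // sequence of commands/instructions
};

// Stand-in for the submission path (in a real system, e.g., a ring buffer
// reached over PCIe, NVLink, or an on-package interconnect).
std::queue<WorkDescriptor> submission_queue;

// Host side: a processor core allocates work to the GPU.
void host_submit(WorkDescriptor wd) {
    submission_queue.push(std::move(wd));
}

// GPU side: dedicated circuitry/logic would consume the descriptor and
// process its commands efficiently.
void gpu_drain_one() {
    if (submission_queue.empty()) return;
    WorkDescriptor wd = std::move(submission_queue.front());
    submission_queue.pop();
    for (const Command& c : wd.commands) {
        (void)c;  // dispatch to the appropriate execution unit here
    }
}
```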

The following description provides a detailed understanding of the subject. It will be obvious to anyone skilled in the art that embodiments described herein can be practiced without any of these details. Other instances are not described because they may obscure the details of the current embodiments.

System Overview

FIG. 1 is a block diagram of a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 that communicate via an interconnection path, which may include a memory hub 105. The memory hub 105 can be a separate component within a chipset or can be integrated into the processor(s) 102. The memory hub 105 connects to an I/O subsystem 111 through a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. The I/O hub 107 can also enable a display controller, which may be included in the processor(s) 102, to provide outputs to one or more display device(s) 110A. In one embodiment, the one or more display device(s) 110A coupled with the I/O hub 107 can include a local, internal, or embedded display device.

The processing subsystem 101, for example, includes one or more parallel processor(s) 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 may be one of any number of standards-based communication link technologies or protocols, including but not limited to PCI Express, or may be a vendor-specific communications interface or communications fabric. The one or more parallel processor(s) 112 may form a computationally focused parallel or vector processing system that can include a large number of processing cores and/or processing clusters, such as a many integrated core (MIC) processor. The one or more parallel processor(s) 112 can, for example, form a graphics subsystem that can output pixels via the I/O hub 107 to one of the one or more display device(s) 110A. The one or more parallel processor(s) 112 may also include a display controller and display interface (not shown) to enable a direct connection to one or more display device(s) 110B.

Within the I/O subsystem 111, a system storage unit 114 can connect to the I/O hub 107 to provide storage for the computing system 100. An I/O switch 116 can be used as an interface to connect the I/O hub 107 to other components, such as a network adapter 118 and/or a wireless network adapter 119 that may be integrated into the platform, and various other devices that can be added via one or more add-in device(s) 120. The add-in device(s) 120 can also include external graphics processors and/or computation accelerators. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.

The computing system 100 may include additional components that are not shown, such as USB or other port connections, optical storage drives, and video capture devices. The communication paths interconnecting the components of FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCI-Express), or any other bus or point-to-point communication interfaces and/or protocol(s), such as the NVLink high-speed interconnect or interconnect protocols known in the art.

The one or more parallel processor(s) 112 can incorporate circuitry optimized for graphics and video processing, including, for instance, video output circuitry, and constitute a graphics processing unit (GPU). The one or more parallel processor(s) 112 may instead include circuitry optimized for general-purpose processing while preserving the computational architecture described herein. Components of the computing system 100 can be integrated with other system elements on a single integrated circuit. For example, the one or more parallel processor(s) 112, the memory hub 105, and the processor(s) 102 can be integrated into a single system-on-chip (SoC) integrated circuit. Alternatively, the components of the computing system 100 can be integrated into a single package, forming a system-in-package (SIP) configuration. In one embodiment, at least a portion of the components of the computing system 100 can be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system.

It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112, may be modified as desired. For instance, system memory 104 can be connected to the processor(s) 102 directly rather than through a bridge, while other devices communicate with system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and memory hub 105 may be integrated into a single chip. It is also possible that two or more sets of processor(s) 102 are attached via multiple sockets, which can couple with two or more instances of the parallel processor(s) 112.

Some of the components shown herein are optional and may not be included in all implementations of the computing system 100. For example, some implementations may omit add-in cards or peripherals. Furthermore, some architectures use different terminology for components similar to those illustrated in FIG. 1. In some architectures, for example, the I/O hub 107 may be referred to as a Southbridge and the memory hub 105 may be referred to as a Northbridge.

FIG. 2A illustrates a parallel processor 200. As described herein, the parallel processor 200 can be a GPU or GPGPU. The parallel processor 200 can be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). The illustrated parallel processor 200 may be one or more of the parallel processor(s) 112 shown in FIG. 1.

The parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 can be directly connected to other devices, or it can connect to other devices via a hub or switch interface, such as the memory hub 105. The connections between the memory hub 105 and the I/O unit 204 form the communication link 113. Within the parallel processing unit 202, the I/O unit 204 connects with a host interface 206 and a memory crossbar 216.

When the host interface 206 receives commands via the I/O unit 204, it can direct the front end 208 to carry out those commands. In one embodiment, the front end 208 is coupled with a scheduler 210 that distributes commands or other work items to a processing cluster array 212. The scheduler 210 ensures that the processing cluster array 212 is properly configured and in a valid state before tasks are distributed to its clusters. The scheduler 210 can be implemented using firmware logic running on a microcontroller. The microcontroller-implemented scheduler 210 is configurable to perform complex work distribution and scheduling operations at fine and coarse granularity, enabling rapid context switching and preemption of threads running on the processing cluster array 212. The host software can submit workloads for scheduling on the processing cluster array 212 via one of multiple graphics processing doorbells, and the scheduler logic within the scheduler microcontroller can then automatically distribute the workloads across the processing cluster array 212.

The processing cluster array 212 can include up to 'N' processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can distribute work among the clusters 214A-214N using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. The scheduling can be handled dynamically by the scheduler 210, or it can be assisted in part by compiler logic during compilation of the program logic configured for execution by the processing cluster array 212. Different clusters 214A-214N of the processing cluster array 212 can also be allocated for processing different types of programs or for performing different types of computations.
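
As a hypothetical illustration of the kind of dynamic work distribution described above, the C++ sketch below assigns each incoming task to whichever cluster currently has the least queued work. The Task and Cluster types and the least-loaded policy are assumptions made for illustration, not the actual algorithm used by the scheduler 210.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical sketch of dynamic work distribution across N processing
// clusters (214A-214N). The real scheduler 210 may run as firmware on a
// microcontroller; the names and policy here are illustrative.
struct Task {
    std::size_t cost;  // estimated work (e.g., number of threads or tiles)
};

struct Cluster {
    std::queue<Task> pending;     // tasks awaiting execution on this cluster
    std::size_t queued_cost = 0;  // total estimated work in the queue
};

// Assign each task to the cluster with the least queued work, a simple
// dynamic policy; a compiler-assisted scheme could instead precompute the
// mapping when the program logic is compiled.
void distribute(std::vector<Cluster>& clusters, const std::vector<Task>& tasks) {
    for (const Task& t : tasks) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < clusters.size(); ++c)
            if (clusters[c].queued_cost < clusters[best].queued_cost)
                best = c;
        clusters[best].queued_cost += t.cost;
        clusters[best].pending.push(t);
    }
}
```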

The processing cluster array 212 can perform various types of parallel processing operations. For example, the processing cluster array 212 can be configured to perform general-purpose parallel compute operations, including logic to execute processing tasks such as filtering video and/or audio data, performing modeling and physics operations, and performing data transformations.

The processing cluster array 212 can also be configured to perform parallel graphics processing operations. In such embodiments, the processing cluster array 212 can include additional logic to support the execution of graphics processing operations, including, but not limited to, texture sampling logic to perform texture operations, as well as tessellation and other vertex processing logic. The processing cluster array 212 can additionally be configured to execute graphics-related shader programs, such as vertex shaders and other shaders. Data for processing can be transferred from system memory via the I/O unit 204; during processing, the transferred data can be stored in on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.

In embodiments where the parallel processing unit 202 is used for graphics processing, the scheduler 210 can be configured to split the workload into approximately equal-sized tasks, allowing the graphics processing to be distributed across the multiple clusters 214A-214N of the processing cluster array 212. In some embodiments, different portions of the processing cluster array 212 may be configured to perform different types of processing. For example, a first portion may be configured for vertex shading and topology generation, while a second portion may be configured for tessellation, geometry shading, or pixel shading. The intermediate data generated by the clusters 214A-214N can be stored in buffers, allowing the data to be transmitted between the clusters 214A-214N for further processing.
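
As a loose illustration of splitting a graphics workload into approximately equal-sized tasks, the hypothetical C++ sketch below cuts a render target into fixed-size tiles and deals them out round-robin across the clusters. The tile size and the round-robin assignment are illustrative choices made here, not details taken from the patent.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of dividing a graphics workload into roughly
// equal-sized tasks for clusters 214A-214N. Here the "workload" is a
// width x height render target cut into tiles.
struct Tile {
    std::size_t x, y, w, h;  // tile origin and dimensions in pixels
};

std::vector<std::vector<Tile>> split_into_equal_tasks(
        std::size_t width, std::size_t height,
        std::size_t tile_dim, std::size_t num_clusters) {
    std::vector<std::vector<Tile>> per_cluster(num_clusters);
    std::size_t index = 0;
    for (std::size_t y = 0; y < height; y += tile_dim)
        for (std::size_t x = 0; x < width; x += tile_dim) {
            Tile t{x, y,
                   std::min(tile_dim, width - x),    // clamp edge tiles
                   std::min(tile_dim, height - y)};
            // Round-robin keeps each cluster's share of tiles nearly equal.
            per_cluster[index % num_clusters].push_back(t);
            ++index;
        }
    return per_cluster;
}
```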

During operation, the processing cluster array 212 can receive processing tasks to be executed via the scheduler 210, which receives commands defining processing tasks from the front end 208. Processing tasks include data describing how the data is to be processed, such as indices of the data to be processed (e.g., surface (patch), primitive, vertex, and/or pixel data), along with state parameters and commands defining the processing (e.g., what program is to be executed). The scheduler 210 can be configured to fetch the indices corresponding to the tasks, or it may receive them from the front end 208. The front end 208 can also be configured to ensure that the processing cluster array 212 is in a valid state before the workload specified by incoming command buffers (e.g., batch buffers, push buffers, etc.) is initiated.
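
The hypothetical C++ sketch below renders the kind of information such a processing task might carry: the data indices, the state parameters, and a selector for the program to execute. The field names and helper are invented for illustration and do not reflect an actual command or descriptor format.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a processing task as described above: indices of
// the data to process, state parameters, and the program to run.
enum class DataKind : std::uint8_t { Surface, Primitive, Vertex, Pixel };

struct ProcessingTask {
    DataKind kind;                       // what the indices refer to
    std::vector<std::uint32_t> indices;  // e.g., vertex or pixel indices
    std::vector<std::uint32_t> state;    // state parameters for the pipeline
    std::uint32_t program_id;            // which program/shader to execute
};

// The scheduler may be handed the indices directly by the front end, or it
// may fetch them itself before dispatching the task to the cluster array.
ProcessingTask make_task(DataKind kind,
                         std::vector<std::uint32_t> indices,
                         std::vector<std::uint32_t> state,
                         std::uint32_t program_id) {
    return ProcessingTask{kind, std::move(indices), std::move(state), program_id};
}
```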
