NVIDIA and AMD have published most of the details of their latest gaming GPU architectures, but some items are kept under lock and key. Many of them are black boxes that will never be documented; others have not been disclosed, but they can be measured. What are the instruction latencies of the RX 6000 and the RTX 30, and how do they differ?
Instruction latencies are not something GPU manufacturers usually include in their public specifications, since this information mainly matters to game and application developers. That is why they are often unknown to the general public. Still, it is the kind of data that hardware enthusiasts are always keen to know.
Instruction latency on a GPU
The latency of an instruction on any processor is the time, measured in clock cycles, between the execution unit requesting the data it needs for an operation and receiving it. If the data is not already in the registers, the fetch mechanism has to walk the entire cache hierarchy until it finds it.
Because GPUs are made up of a huge number of cores compared to CPUs, the number of requests to VRAM is enormous. That is why their cores are built differently: they are designed to run several contexts or threads concurrently rather than to execute a single one serially, which lets them switch to another context while one thread is waiting for its data.
However, instruction latency still matters on a GPU, although less than on a CPU, since a GPU does not always have enough pending work to keep itself busy. In addition, there are stages of the 3D pipeline, such as pixel/fragment shading, where VRAM is accessed continuously and low latency is needed to resolve the bulk of the threads in flight.
AMD RX 6000 vs NVIDIA RTX 30 latencies
The Chips and Cheese website set out to measure the latencies of the latest gaming GPUs, the NVIDIA RTX 30 and AMD RX 6000, using a pointer-chasing test written in OpenCL. The test walks a chain of dependent reads through a buffer in VRAM: depending on the buffer's size, the data ends up resident in different cache levels of the GPU, so timing the accesses for each size reveals the latency of each level.
The test shows how AMD has reworked its cache hierarchy in the RDNA architectures, since this was one of the points where its previous graphics architecture, GCN, lagged well behind NVIDIA. Keep in mind that, leaving the Infinity Cache aside, the AMD architecture has three cache levels: an L0 inside each Compute Unit, an L1 shared per Shader Array, and then the L2. As for the Infinity Cache, it adds only about 20 additional nanoseconds of access time over the L2.
As for NVIDIA, its cache structure has barely evolved since Maxwell, the GTX 900 series. On the RTX 30, going from the cache inside the SM to the L2 cache takes 100 nanoseconds, while AMD makes the trip from its equivalent to the L2 in only 66 ns, despite having an extra level in between. One cause could be the sheer size of NVIDIA's GPUs compared to AMD's, which makes the cache hierarchy one of the improvements NVIDIA could apply in Lovelace.