A closer look at ARM’s new Cortex-A75 and Cortex-A55 CPUs
ARM recently unveiled its next-generation CPU cores, the Cortex-A75 and Cortex-A55, which are the first processors to support the company’s new DynamIQ multi-core technology. The A75 is the successor to ARM’s high-performance A73 and A72, while the new Cortex-A55 is a more power-efficient replacement for the popular Cortex-A53.
Starting with the Cortex-A75, this CPU is inspired by the Cortex-A73 rather than being a direct upgrade of it. ARM states that there have been far more micro-architecture changes this time around than with the introduction of the A73, or even the move from the A57 to the A72.
The result is a set of performance improvements across the board, with a typical 22 percent boost to single-threaded performance over the Cortex-A73 on the same process node and at the same frequency. More specifically, ARM quotes a 33 percent boost to floating point and NEON performance, while memory throughput sees a 16 percent boost.
In terms of clock speed, the Cortex-A75 is likely to top out at 3 GHz on 10 nm, but could be pushed a little higher on future 7 nm designs. ARM says that for the same workload, the A75 won’t consume any more power than the A73, but it can be pushed further if extra performance is required, at the expense of some extra energy consumption. In mobile implementations, though, we aren’t likely to see SoC manufacturers push power consumption any higher than they already do.
ARM has accomplished these improvements via a number of major micro-architecture changes. The Cortex-A75 moves to a 3-way superscalar design, up from 2-way in the Cortex-A73. This means that, given a suitable workload, the Cortex-A75 is able to execute up to 3 instructions in parallel per clock cycle, essentially increasing the core’s maximum throughput. The A75 boasts 7 execution units: two load/store units, two NEON/FPU units, a branch unit, and two integer units.
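To put the wider design in perspective, here is a minimal back-of-the-envelope sketch of the theoretical peak throughput of each core. The function name is our own, and the 3 GHz clock applied to both cores is purely an assumption to make the comparison like for like.

```python
def peak_instructions_per_second(issue_width: int, clock_hz: float) -> float:
    """Upper bound on instructions executed per second for a superscalar core."""
    return issue_width * clock_hz


CLOCK_HZ = 3.0e9  # assumption: both cores clocked at 3 GHz for comparison

a73_peak = peak_instructions_per_second(2, CLOCK_HZ)  # 2-way, per the article
a75_peak = peak_instructions_per_second(3, CLOCK_HZ)  # 3-way, per the article
peak_ratio = a75_peak / a73_peak

# Execution units listed in the article: load/store, NEON/FPU, branch, integer
execution_units = {"load_store": 2, "neon_fpu": 2, "branch": 1, "integer": 2}
total_units = sum(execution_units.values())

print(f"Peak ratio A75/A73: {peak_ratio:.2f}")
print(f"Execution units: {total_units}")
```

The peak ratio works out to 1.5, well above the typical 22 percent gain ARM quotes. That gap is expected: real code rarely offers three independent instructions every cycle, so the extra width raises the ceiling more than it raises the average.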
Speaking of NEON, ARM has also introduced a dedicated renaming engine for NEON/FPU instructions. There’s now support for FP16 half-precision processing, which offers double the throughput for workloads that tolerate limited precision, such as image processing. There’s also support for Int8 dot product operations, which offer a boost to a number of neural network algorithms.
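To see why image processing is a natural fit for FP16, the following sketch uses Python’s standard struct module (whose "e" format code is IEEE 754 half precision) to round values to FP16 and back. It runs on any machine and says nothing about how the A75 implements FP16; the helper name is our own.

```python
import struct


def to_fp16_and_back(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Every 8-bit pixel value (0..255) survives the round trip unchanged.
pixels_exact = all(to_fp16_and_back(float(p)) == float(p) for p in range(256))

# FP16 has an 11-bit significand, so integers above 2048 start to lose precision.
rounded_2049 = to_fp16_and_back(2049.0)

# Two FP16 values fit in the space of one FP32 value, hence double the
# elements per vector register.
fp16_per_fp32 = struct.calcsize("<f") // struct.calcsize("<e")

print(pixels_exact, rounded_2049, fp16_per_fp32)
```

Typical 8-bit pixel data fits comfortably, while values that need more than 11 bits of precision do not, which is the trade-off behind the doubled throughput.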
To help keep the processor’s out-of-order pipeline well fed, ARM has adopted 4-wide instruction fetching to grab four instructions per cycle. The processor is now also able to perform single cycle decode with instruction fusing and micro-ops too. The core’s branch predictor has also been given a tune-up to keep up with the wider out-of-order execution capabilities of the A75. However, it’s still based on the same 0-cycle design as the A73, which uses a large Branch Target Address Cache (BTAC) and Micro-BTAC.
Finally, the Cortex-A75 now features a private L2 cache, implementable as either 256KB or 512KB, with a shared L3 cache available when implementing a DynamIQ multi-core solution, and most of the data in these caches will be exclusive. This change results in a much lower latency for hitting the L2 cache, down from 20 cycles with the Cortex-A73 to just 11 cycles in the A75.
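As a rough illustration of what that cycle count means in wall-clock terms, the sketch below converts the quoted L2 latencies to nanoseconds. The 3 GHz clock is an assumption applied to both cores so the figures are comparable; real parts will differ.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz


CLOCK_GHZ = 3.0  # assumption: same clock for both cores

a73_l2_ns = cycles_to_ns(20, CLOCK_GHZ)  # 20 cycles, per the article
a75_l2_ns = cycles_to_ns(11, CLOCK_GHZ)  # 11 cycles, per the article
reduction = 1 - 11 / 20                  # fraction of L2 hit latency removed

print(f"A73 L2 hit: {a73_l2_ns:.2f} ns, A75 L2 hit: {a75_l2_ns:.2f} ns")
print(f"Latency reduction: {reduction:.0%}")
```

At the same clock, that is a 45 percent cut in L2 hit latency, from roughly 6.7 ns to roughly 3.7 ns.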
Put simply, all of this means that ARM is not only boosting the performance of the A75 by allowing for additional instructions to be executed in a single cycle, but has also designed a micro-architecture better capable of keeping the core fed with instructions. As we mentioned in our overview of DynamIQ, the Cortex-A75 also implements the new DynamIQ Shared Unit as part of its design. This introduces new cache stashing, low latency access to peripherals, and fine-grain power management options into the core as well.
The Cortex-A55 represents a notable but less drastic overhaul to ARM’s power-efficient processor design, with a number of important changes from last generation’s hugely popular Cortex-A53 core. Energy efficiency remains a top priority with this tier of ARM CPUs, and the A55 boasts a 15 percent improvement to power efficiency over the A53. At the same time, ARM has been able to boost performance twofold in certain memory-bound situations, with a typical 18 percent performance improvement over an A53 running at the same speed and on the same process node.
The range of configuration options present with the Cortex-A55 also makes this ARM’s most flexible core design yet. In total, the company estimates that there are over 3000 different possible configurations, due in part to the optional NEON/FPU, asynchronous bridges, and Crypto arrangements, plus the configurable L1, L2, and L3 cache sizes.
The A55 sticks with an in-order design and a short 8-stage pipeline, just like the A53. As such, processor frequencies are expected to be roughly similar to before on the same node, which offers a good balance of performance and efficiency. Most A55 solutions will likely run at 2.0 GHz on a 10 nm process, but extreme cases could see 2.6 GHz solutions. However, such a frequency boost would defeat the purpose of DynamIQ, which allows for more cost-effective implementations of a single big core where extra performance is required. In reality, we may actually see this LITTLE core run at lower speeds to save power when implemented in DynamIQ systems.
In terms of micro-architecture changes, the A55 now separates the load/store pipe allowing for the dual issue of loads and stores in parallel. The pipeline is also now able to more quickly forward ALU instructions to the AGU, reducing the latency by 1 cycle for common ALU operations. ARM has also made improvements to the prefetcher, which is now able to spot more complex cache patterns beyond existing step patterns and can prefetch to L1 or L3 caches.
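For context, the simple "step" pattern that prefetchers have long handled can be sketched in a few lines. This is a generic textbook-style stride detector written for illustration, not ARM’s design; the function name and the three-access window are our own assumptions.

```python
from typing import Optional, Sequence


def predict_next_address(history: Sequence[int]) -> Optional[int]:
    """Return the next address if the last three accesses share one
    constant, non-zero stride; otherwise return None (no prefetch)."""
    if len(history) < 3:
        return None
    first_stride = history[-2] - history[-3]
    second_stride = history[-1] - history[-2]
    if first_stride == second_stride and first_stride != 0:
        return history[-1] + second_stride
    return None


# A regular walk through memory, 64 bytes at a time, is easy to predict.
regular = predict_next_address([0x1000, 0x1040, 0x1080])

# An irregular sequence defeats this simple detector.
irregular = predict_next_address([0x1000, 0x1040, 0x2000])

print(hex(regular) if regular is not None else None, irregular)
```

A detector like this gives up as soon as the stride changes, which is the kind of limitation a more pattern-aware prefetcher is meant to address.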
Furthermore, the 0-cycle branch predictor boasts a fancy-sounding new “neural network” or conditional prediction algorithm. However, this is a more limited branch predictor than the one inside the Cortex-A75, as there’s little purpose in building a huge branch predictor for a small in-order pipeline core. Instead, ARM’s new design makes use of a main conditional predictor in conjunction with “micro-predictors” positioned where needed for accurate back-to-back predictions. The predictor has also been updated with loop termination prediction. This should help avoid mispredicting the end of loops, scavenging a little bit of extra performance.
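To show why loop termination prediction is worth having, here is a toy simulation comparing a classic 2-bit saturating counter with a predictor that remembers a loop’s trip count. Both models are simplified illustrations of well-known techniques and are assumptions on our part; ARM has not published the A55 predictor’s internals.

```python
from typing import Iterable, Iterator


def branch_outcomes(trip_count: int, repeats: int) -> Iterator[bool]:
    """Loop back-edge outcomes: taken trip_count - 1 times, then not taken."""
    for _ in range(repeats):
        for i in range(trip_count):
            yield i < trip_count - 1


def mispredicts_two_bit(outcomes: Iterable[bool]) -> int:
    """Count mispredictions of a 2-bit saturating counter (0..3)."""
    counter, misses = 2, 0  # 2 and 3 predict taken
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


def mispredicts_loop_aware(outcomes: Iterable[bool]) -> int:
    """Count mispredictions when the trip count of the previous run is reused."""
    learned = None  # taken back-edges seen in the last complete run
    run, misses = 0, 0
    for taken in outcomes:
        predict_taken = True if learned is None else run < learned
        if predict_taken != taken:
            misses += 1
        if taken:
            run += 1
        else:
            learned, run = run, 0
    return misses


two_bit = mispredicts_two_bit(branch_outcomes(8, 100))
loop_aware = mispredicts_loop_aware(branch_outcomes(8, 100))
print(two_bit, loop_aware)
```

For an 8-iteration loop run 100 times, the 2-bit counter mispredicts every loop exit (100 misses), while the trip-count model misses only the first exit it has not yet learned.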
ARM has made a number of more specific performance optimizations inside the Cortex-A55 as well. The extended 128-bit NEON pipeline is now able to handle eight 16-bit operations per cycle using FP16 instructions or four 32-bit operations per cycle when using dot product instructions. Fused multiply-add instruction latency has also been halved to just four cycles. In other words, a number of math operations can be executed more quickly on the A55 compared with the A53, which we can see from the 38 percent boost to floating point and NEON benchmarks.
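Those per-cycle figures fall straight out of the register width, as the sketch below shows. The dot product function is a plain Python emulation, loosely modeled on how a four-lane Int8 dot product behaves (each 32-bit lane accumulates four 8-bit products); it is an illustrative assumption, not a description of the A55’s hardware.

```python
from typing import List

VECTOR_BITS = 128  # NEON register width, per the article


def lanes(element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector."""
    return VECTOR_BITS // element_bits


def int8_dot_product(acc: List[int], a: List[int], b: List[int]) -> List[int]:
    """Emulate one 128-bit Int8 dot product step.

    a and b each hold 16 int8 values; every group of four products is
    summed into one of four 32-bit accumulator lanes.
    """
    assert len(a) == len(b) == lanes(8) and len(acc) == lanes(32)
    return [
        acc[lane] + sum(a[4 * lane + j] * b[4 * lane + j] for j in range(4))
        for lane in range(lanes(32))
    ]


fp16_lanes = lanes(16)   # eight 16-bit operations per vector
int32_lanes = lanes(32)  # four 32-bit results per vector
result = int8_dot_product([0, 0, 0, 0], list(range(16)), [1] * 16)

print(fp16_lanes, int32_lanes, result)
```

With the inputs above, the four accumulator lanes end up holding 6, 22, 38, and 54: sixteen multiplies folded into four 32-bit results in a single step, which is why neural network workloads benefit.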
Perhaps the most important performance boost for the Cortex-A55 comes from the major changes that ARM has made to its memory system. The use of a private L2 cache, configurable up to 256KB, again improves the cache miss capability of the core and lowers the latency for data intensive applications. ARM states that L2 latency has been reduced by 50 percent compared with a shared L2 configuration often used with an A53, down to just 6 cycles. The 4-way set associative L1 cache is also more configurable this time around, in either 16KB, 32KB, or 64KB sizes.
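For readers curious about what those L1 options mean structurally, this short sketch works out the number of sets for each size in a 4-way set associative cache. The 64-byte cache line is an assumption on our part, as the article does not state the line size.

```python
LINE_BYTES = 64  # assumption: cache line size is not given in the article
WAYS = 4         # 4-way set associative, per the article


def l1_geometry(size_kb: int) -> tuple:
    """Return (number of sets, index bits) for a given L1 size in KB."""
    sets = size_kb * 1024 // (WAYS * LINE_BYTES)
    index_bits = sets.bit_length() - 1  # log2, valid for powers of two
    return sets, index_bits


geometries = {size_kb: l1_geometry(size_kb) for size_kb in (16, 32, 64)}

for size_kb, (sets, index_bits) in geometries.items():
    print(f"{size_kb}KB L1: {sets} sets, {index_bits} index bits")
```

Under that assumption, each doubling of capacity doubles the number of sets (64, 128, then 256) while the associativity stays fixed at four ways.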
Combined with a shared L3 cache when used with DynamIQ and the new prefetcher, these latency-sensitive cores should be kept better fed with data, allowing better utilization of their peak performance. Not only that, but the lower latency of communication inside a DynamIQ cluster, compared with the higher latency of communicating between clusters, should bring further improvements to multi-core task management. Again, the emphasis of this redesign has been to keep the core better fed with data.
The Cortex-A55 also benefits from attributes of the new DynamIQ Shared Unit, including cache stashing, low latency access to peripherals, and fine-grain power management options.
On their own, both the Cortex-A75 and Cortex-A55 offer notable improvements over the company’s last-generation cores, both in terms of peak performance and energy efficiency. Even on current process nodes, we can expect better single-threaded performance and lower power drain for less demanding tasks than with today’s A73/A53 big.LITTLE processors.
Of course, both of these new chips also mark the introduction of ARM’s DynamIQ multi-core technology, which further optimizes the balancing of power and performance that is so essential for mobile products. Not only that, but DynamIQ brings much more flexibility to the design table, and will empower particularly mid-range SoCs to eke out extra performance with very few extra costs. Backed up by the individual improvements brought to the A75 and A55, this is looking like a potent combination for future smartphones.
We most likely won’t see any mobile products featuring these new CPU cores arrive on the market until early 2018, but we may see SoC announcements based around these products as early as the closing quarter of this year.