Cortex-M7 Takes Aim at the IoT High Ground
ARM’s Cortex-M3 and Cortex-M4 have been very successful, but there was a performance gap between the top-end Cortex-M4 and the Cortex-A series. The new Cortex-M7 fills out ARM’s microcontroller line. It is code compatible with the Cortex-M4 but offers improved performance and scalability, and it adds features like code and data caches (Fig. 1). Its combination of low power and high performance should fare well in application areas like the Internet of Things (IoT) and wearable technology.
The architecture employs a 6-stage superscalar pipeline with branch prediction support. This allows it to deliver 5.04 CoreMark/MHz using a 40-nm process (Fig. 2). Moving to the 28-nm node will allow performance to double.
The Cortex-M7 supports the ARMv7-M instruction set, including bit manipulation instructions. It includes single-cycle DSP extensions using a 16-/32-bit MAC, along with a single-cycle dual 16-bit MAC and 8-/16-bit SIMD instructions. It supports single- and double-precision floating point, although the latter is an option that is likely to be implemented only for select parts.
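To make the dual 16-bit MAC concrete, here is a minimal sketch in portable C that models what such an instruction computes: two 16-bit multiplies whose products are added to an accumulator in one step. The function names (pack_q15x2, dual_mac16_model, dot_q15) are hypothetical and chosen for illustration. On a real Cortex-M4 or Cortex-M7 the same operation would normally come from the compiler or from a CMSIS intrinsic such as __SMLAD rather than from hand-written C like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word, low half first. */
uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable model of a dual 16-bit MAC:
 * result = acc + lo(x) * lo(y) + hi(x) * hi(y), wrapping to 32 bits.
 * Hypothetical helper; it assumes two's-complement narrowing conversions. */
int32_t dual_mac16_model(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);

    /* Sum in 64 bits so the intermediate arithmetic cannot overflow. */
    int64_t sum = (int64_t)acc
                + (int64_t)x_lo * y_lo
                + (int64_t)x_hi * y_hi;

    return (int32_t)(uint32_t)sum; /* keep the low 32 bits */
}

/* Dot product of two 16-bit vectors, consuming two samples per MAC step. */
int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int32_t acc = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc = dual_mac16_model(pack_q15x2(a[i], a[i + 1]),
                               pack_q15x2(b[i], b[i + 1]), acc);
    }
    if (i < n) {
        /* Odd tail: pad the unused high halves with zero. */
        acc = dual_mac16_model(pack_q15x2(a[i], 0), pack_q15x2(b[i], 0), acc);
    }
    return acc;
}
```

A FIR filter or correlation loop built this way touches half as many accumulate steps as a one-sample-at-a-time loop, which is where the DSP throughput gain comes from.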
The instruction and data caches can each be up to 64 Kbytes. The instruction cache is 2-way associative while the data cache is 4-way. The system can also incorporate up to 16 Mbytes of tightly coupled memory (TCM). All of these support optional ECC. The number of memory protection regions has been doubled to 16 compared to the Cortex-M4. The 64-bit AMBA 4 AXI interconnect can be linked to an AHB peripheral port.
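The cache sizes and associativity above determine how addresses map onto cache sets. The sketch below works through that arithmetic. It assumes a 32-byte cache line, which the article does not state, and the type and function names (cache_geom_t, cache_num_sets, cache_set_index) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry. line_bytes is an assumption (32 is used in the
 * examples); the article only gives total size and associativity. */
typedef struct {
    uint32_t size_bytes;
    uint32_t ways;
    uint32_t line_bytes;
} cache_geom_t;

/* Number of sets = total size / (ways * line size). */
uint32_t cache_num_sets(cache_geom_t g)
{
    assert(g.ways > 0 && g.line_bytes > 0);
    assert(g.size_bytes % (g.ways * g.line_bytes) == 0);
    return g.size_bytes / (g.ways * g.line_bytes);
}

/* Set that an address falls into: line number modulo the set count. */
uint32_t cache_set_index(cache_geom_t g, uint32_t addr)
{
    return (addr / g.line_bytes) % cache_num_sets(g);
}
```

Under the 32-byte assumption, a 64-Kbyte 4-way data cache has 512 sets and a 64-Kbyte 2-way instruction cache has 1024. Addresses that are 16 Kbytes apart land in the same data cache set, so a loop that strides through more than four such buffers can evict its own data.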
Debug and trace support is optional. Debug access can be Serial Wire or JTAG. Trace support includes instruction and data trace through the Embedded Trace Macrocell (ETM), the Data Watchpoint and Trace (DWT) unit, and the Instrumentation Trace Macrocell (ITM).
Adoption of the Cortex-M7 has been swift, with many Cortex-M4 vendors already releasing their own Cortex-M7 parts. For example, STMicroelectronics’ STM32 series now includes the STM32 F7 (Fig. 3).
STMicroelectronics is enhancing performance by coupling the core with a bus matrix fabric. This allows concurrent access to memory and peripherals. The scattered RAM approach splits the 320 Kbytes of memory into independent regions that can be accessed simultaneously. There are 64 Kbytes of data TCM and 16 Kbytes of code TCM. There are also 4 Kbytes of battery-backed memory. The chips provide 0-wait-state performance by using ST’s Adaptive Real-Time (ART) Accelerator for internal flash memory and the L1 cache for internal and external memories. Versions are available with 512 Kbytes and 1 Mbyte of flash. The chips support off-chip memory, including Quad SPI (QSPI) devices via two QSPI ports.
Hardware graphics acceleration is provided by ST’s Chrom-ART Accelerator. This 2D accelerator handles chores like bitmap decoding, blending, and output support.
The 90-nm family runs at 200 MHz, which delivers performance of 1000 CoreMarks. That is twice the performance of the STM32 F4.
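The 1000-CoreMark figure is consistent with the per-MHz number quoted earlier for the core. A minimal sketch of the arithmetic follows. The function name coremark_estimate is hypothetical, and the calculation assumes the 5.04 CoreMark/MHz rating carries over unchanged to ST's 200-MHz part, which ignores any memory-system effects.

```c
#include <assert.h>

/* Estimated CoreMark score = CoreMark/MHz rating * clock frequency in MHz.
 * Assumes the per-MHz rating holds at the target clock (an idealization). */
double coremark_estimate(double coremark_per_mhz, double clock_mhz)
{
    assert(coremark_per_mhz > 0.0 && clock_mhz > 0.0);
    return coremark_per_mhz * clock_mhz;
}
```

Calling coremark_estimate(5.04, 200.0) gives roughly 1008, which rounds to the 1000 CoreMarks cited for the STM32 F7.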
ST will have a number of development tools and platforms available, such as the STM32 Discovery board (Fig. 4). ST was able to deliver this quickly because the STM32 F4 and F7 share package pinouts.
The Cortex-M7 expands the performance envelope, and vendors will use it to challenge many existing high-performance DSP solutions. The architecture provides room for growth as well as higher performance across the board. It provides a microcontroller solution that fills the gap between the Cortex-M and Cortex-A series. The Cortex-M7 is also well suited to the mobile and wearable device space, where both low power and high performance are needed.