Typical general-purpose symmetric multiprocessing
(SMP) multicore designs contain about eight cores. Specialized
architectures, on the other hand, push the number
of cores into the hundreds.
Tilera ups the ante for SMP with its 64-core/tile
Tile64 chip (see the figure). Its
iMesh interconnect incorporates five different
packet networks with five switches per
tile (see the table). Chips with 36 and 120
tiles are on the horizon.
Go With the Flow
The SMP nonuniform
memory access (NUMA) architecture
is similar to the HyperTransport system
used by AMD for its Opteron series. As
with AMD's approach, the location of peripherals
and memory is not important to the
application, except at a low level of the
operating system.
The big difference is that AMD uses the
same HyperTransport interface for all traffic,
while Tilera splits the traffic into different
networks. This enables memory transfers
to occur in parallel with other
transfers, such as peripheral data. Data
moves through non-blocking switches at
one cycle per hop.
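As a rough illustration of what one cycle per hop implies, the sketch below counts the hops between two tiles. It assumes the 64 tiles sit on an 8-by-8 grid, are numbered row by row, and that packets travel only along rows and columns. None of these details are given here, and the function names are made up for the example.

```python
# Hypothetical model: an 8-by-8 grid of tiles numbered row by row (an assumption).
GRID_WIDTH = 8


def tile_coords(tile_id, width=GRID_WIDTH):
    """Return the (row, column) position of a tile numbered row by row."""
    return divmod(tile_id, width)


def hop_count(src, dst, width=GRID_WIDTH):
    """Count hops when a packet moves only along rows and columns."""
    src_row, src_col = tile_coords(src, width)
    dst_row, dst_col = tile_coords(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def transit_cycles(src, dst, cycles_per_hop=1):
    """Switch transit time in cycles at the article's one cycle per hop."""
    return hop_count(src, dst) * cycles_per_hop
```

Under these assumptions, a transfer between opposite corners of the grid (tile 0 to tile 63) crosses 14 hops, so the switches add 14 cycles of transit time.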
By splitting the traffic, different types of
transfers can be optimized. For example,
memory and stream transfers tend to be
larger, while interrupts and UDP-style
(User Datagram Protocol) transfers are
usually smaller. High-level language support
permits socket-style communication
between nodes.
Communication can occur between any two
nodes. Each has a matrix address. Some
nodes, such as the memory controllers, feature
more than one address to provide higher
throughput. The source node determines
which address to use. Typically, the system
that initializes the operating systems on
each core will distribute the addresses to
prevent one from becoming a bottleneck.
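One simple way to spread tiles across a controller's addresses is a round-robin assignment at boot time. The sketch below is only an illustration of that idea, not Tilera's actual policy. The function name and the (row, column) form of the addresses are assumptions.

```python
from itertools import cycle


def assign_controller_addresses(tile_ids, controller_addresses):
    """Give each tile one controller address, rotating through the list.

    Hypothetical boot-time policy: the article says only that addresses are
    distributed so that no single address becomes a bottleneck.
    """
    if not controller_addresses:
        raise ValueError("need at least one controller address")
    rotation = cycle(controller_addresses)
    return {tile: next(rotation) for tile in tile_ids}


# Example: eight tiles share a controller reachable at two matrix addresses.
assignment = assign_controller_addresses(range(8), [(0, 0), (0, 1)])
```

Each address ends up serving the same number of tiles, which is the balancing effect described above. A real system might weight the choice by distance to the controller instead.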
My Cache, Your Cache
Each tile incorporates an L1 and
larger L2 cache. A core's L3 cache is the sum of the other cores'
L2 caches. The memory controllers keep track of where information is located in the L2 cache. Accesses
from a different node are provided with the
location so subsequent accesses can be
made via the remote L2 cache.
The response characteristics of this
approach are different from a conventional
SMP L3 cache. But in both speed and power, it is
far more efficient than accessing main memory.
Off-chip accesses require hundreds of
cycles and 500 pJ. An L3 access takes
20 to 30 cycles and consumes only about
3 pJ. Hardware handles cache operation
and virtual memory support. Its operation
is transparent to applications.
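The energy figures lend themselves to a quick back-of-the-envelope calculation. The sketch below averages the two quoted costs for accesses that miss the local caches. The hit fraction is a free parameter chosen for illustration, not a measured value, and the function name is made up.

```python
OFF_CHIP_PJ = 500.0  # energy per off-chip access, as quoted in the article
REMOTE_L2_PJ = 3.0   # approximate energy per L3 (remote L2) access, as quoted


def average_energy_pj(remote_hit_fraction):
    """Average energy per access that misses the local caches.

    remote_hit_fraction is the share of those accesses served by another
    tile's L2 cache; the rest go off chip. The value is an assumption.
    """
    if not 0.0 <= remote_hit_fraction <= 1.0:
        raise ValueError("remote_hit_fraction must be between 0 and 1")
    off_chip_fraction = 1.0 - remote_hit_fraction
    return remote_hit_fraction * REMOTE_L2_PJ + off_chip_fraction * OFF_CHIP_PJ
```

With a hypothetical 90% of those accesses served from a remote L2 cache, the average comes to about 52.7 pJ, roughly a tenth of the cost of sending every access off chip.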
Virtual Partitions
A bank of
64 cores can be handy, but multiple subsets
are often used instead. Tilera's Hardwall
technology logically partitions the system
into sets of tiles. Traffic can flow
through any region to memory controllers
and peripherals. However, Hardwall prevents
communication between cores in different
regions. Of course, the L3 caching will
be within a region too. Rectangular
regions are currently supported.
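The partitioning rule can be modeled in a few lines. The sketch below treats each region as a rectangle of tile coordinates and allows core-to-core traffic only inside one region. The class and function names are hypothetical, and treating a tile outside every region as isolated is an assumption, not something the hardware description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular group of tiles; all four bounds are inclusive."""
    top: int
    left: int
    bottom: int
    right: int

    def contains(self, row, col):
        return self.top <= row <= self.bottom and self.left <= col <= self.right


def region_of(tile, regions):
    """Return the first region holding the (row, col) tile, or None."""
    row, col = tile
    for region in regions:
        if region.contains(row, col):
            return region
    return None


def may_communicate(tile_a, tile_b, regions):
    """Allow core-to-core traffic only when both tiles share a region."""
    region_a = region_of(tile_a, regions)
    return region_a is not None and region_a == region_of(tile_b, regions)


# Example: split an assumed 8-by-8 grid into top and bottom halves.
halves = [Region(0, 0, 3, 7), Region(4, 0, 7, 7)]
```

With this split, tiles (0, 0) and (3, 7) can exchange messages, while (3, 0) and (4, 0) cannot, even though they are neighbors on the grid.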
A hypervisor runs on each core, providing
virtual-machine support. Access to peripherals
is still controlled at the software level.
However, this is relatively easy to handle at the
hypervisor level. Moreover, the hypervisor
has control over a tile's switches.
The Tile64 can support a range of operating
systems, but its initial flavor is Linux.
Support also includes the Eclipse-based
Multicore Development Environment
(MDE), including the GDB debugger. The
current mix of software includes open-source
tools as well as some proprietary
software, such as the C/C++ compiler.
Many Cores, Fewer Watts
Power management can be a significant
advantage in multicore environments. In
this case, it's possible to power down
individual cores while the switches continue
to operate. The design also makes
extensive use of clock gating, minimizing
power requirements for sections of the
system that are inactive.
Soft Tiles
Software support includes
tools specific to the Tile64, such as a high-level
and cycle-accurate simulator. A whole-application
model for collective debugging
can single-step multiple cores. Also, a runtime
library for socket-style streams provides
access to the tile-to-tile hardware
support mentioned earlier.
The architecture has had time to
mature. A similar system was developed
in 1994 at the Massachusetts Institute of
Technology, but it required a rack of hardware.
Meanwhile, external links between
Tile64 chips can be established using the
Ethernet or PCI Express interfaces. For
now, iMesh operates only within the chip.
The Tile64 should provide 40 times the
performance of dual-core DSPs and 10
times the performance of dual-core Xeon
processors while using less power. Of
course, these are 32-bit cores, not 64-bit
cores. Even so, applications that run on
an SMP platform should work well without
modification on the Tile64.
New designs can take advantage of
more intimate hardware support. But
gaining access to such a large number of
cores opens new possibilities for parallel
programming. And while the Tile64 targets
network and video applications, it
should equally suit other applications
amenable to parallel programming.
Tilera
www.tilera.com