[Design View / Design Solution]
Apply Virtualization To Storage I/O
Richard Solomon
ED Online ID #21154
May 21, 2009
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Virtualization is receiving lots
of attention these days. Behind
the buzz are some simple,
time-tested concepts. But the
movement of this technology from the
mainframe to the mainstream has brought
it into the limelight.
At its heart, virtualization is about making
something “look” like something else.
Typically, this means making an operating
system “think” it’s running alone on
a computer, when in fact that computer is
shared by several operating systems—each
referred to here as a system image (SI).
Since mainstream computers started
incorporating memory-management
units, this type of virtualization has been
possible, albeit not terribly popular due
to the extreme performance hit required
to emulate every device in the computer.
Recent hardware and software technology
has boosted emulation speed, but plenty
of room for improvement remains.
I/O virtualization (IOV) is simply the
ability to make one device look like
multiple devices, each assignable to a
unique SI. Moving the virtualization to
the device level can provide dramatic performance
increases by freeing the system
processor(s) from the cumbersome task of
emulating those devices.
The PCI-SIG has defined a mechanism
for providing these virtual device interfaces
on the PCI Express (PCIe) bus. Now that
these efforts are complete, some needed
standardization is finally available in mainstream
IOV design, enabling multi-vendor
silicon solutions to work on multiple platforms
under multiple operating systems.
Problem solved, right? Take a chip’s current
PCI Express front-end logic, slap on
this I/O virtualization stuff, and “poof!”
It’s virtualizing like mad. Well, yes and no.
Yes, it creates a chip assignable to multiple
SIs at once. However, chances are that
virtualizing the back end of the device is
actually the harder part of the problem.
Also, consider a storage controller—
clearly, one would want to partition the
connected storage so System Image X
(perhaps an online banking system) has
space allocated only to itself and isn’t
accessible by System Image Y (e.g., the
Web server for www.hackers-are-we.org).
PCI EXPRESS I/O VIRTUALIZATION
Let’s take a brief look at the system
view of IOV. The term “system image”
refers to a real or virtual system of
CPU(s), memory, operating systems, I/O,
etc. Multiple SIs may run on one or more
sets of actual hardware.
One example today might be a hypervisor
like VMware running Windows XP
and Linux simultaneously on a single-CPU
desktop PC. In that case, two SIs exist,
each sharing a single CPU, memory, disk
drive, etc. Another example would be a blade server running Windows XP on one
blade and Linux on another blade. There,
each SI isn’t sharing any of the CPU
blade’s hardware, though it could potentially
share hardware on an I/O blade.
Regardless of the physical assignments,
each SI needs to “see” its own PCI hierarchy.
Even if no end devices are actually
shared (e.g., two Fibre Channel controllers
on the I/O blade, one assigned to the Linux blade and one assigned to the XP blade),
some control over the hierarchy’s visibility
is required. If end devices are shared, each
SI must be restricted to seeing only its
“portion” of shared end devices.
The device needs to make its one physical
set of hardware appear to be multiple
virtual devices, which appear completely
independent to outside observers. Those
devices may:
• occupy different PCI memory ranges
• have different settings for various PCI
configuration registers
• potentially each be a particular PCI multifunction
device
Furthermore, the device needs to keep
cross-“device” traffic isolated internally
so no data spillover occurs between virtual
devices.
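As a rough illustration of the requirements above, the following Python sketch models one physical device that keeps a separate register set per virtual device. The class name, register names, and values are hypothetical and are not taken from any PCI-SIG specification; real hardware would implement this in silicon rather than software.

```python
class SharedDevice:
    """One physical device exposing several independent register sets (illustrative only)."""

    def __init__(self, virtual_count, defaults):
        # Copy the defaults per virtual device so a write to one never aliases another.
        self._config = [dict(defaults) for _ in range(virtual_count)]

    def config_write(self, vdev, register, value):
        if register not in self._config[vdev]:
            raise KeyError(f"unknown register {register!r}")
        self._config[vdev][register] = value

    def config_read(self, vdev, register):
        return self._config[vdev][register]


device = SharedDevice(virtual_count=2, defaults={"command": 0x0000, "bar0": 0x0})
device.config_write(0, "bar0", 0x1000_0000)   # virtual device 0 gets its own memory range
device.config_write(1, "bar0", 0x2000_0000)   # virtual device 1 gets a different one
device.config_write(1, "command", 0x0006)     # settings may differ per virtual device
```

Because each virtual device owns its own dictionary, a write through one virtual interface cannot change what another interface reads back, which is the isolation property the text calls for.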
As seen in the examples above, a clear
distinction can be drawn between systems
having a single point of attachment to the
PCI hierarchy and those with multiple
points of attachment. The traditional
single-CPU desktop computer and even
the traditional n-way multi-CPU server
previously had a “single” logical point of
attachment to the PCI hierarchy (Fig. 1).
By contrast, blade systems enable a new
hierarchy view where some upper-level
enhanced PCI Express switch could allow multiple root complexes to attach to the
total PCI hierarchy (Fig. 2). Here, some
new mechanisms are clearly required to
enable each root complex to access only its
assigned portion of that hierarchy.
Given the large separation between
these two types of systems, both from a
complexity and market segmentation perspective,
the PCI-SIG chose to break IOV
up into two separate specifications. Since
each root complex (Fig. 2, again) could
also be utilizing single-root IOV, the two
specifications will necessarily be interdependent.
Thus, the so-called “concentric
circles” model was adopted, whereby the
single-root specification builds on the
PCI Express base specification, and the
multi-root specification builds on both the
single-root specification and PCI Express
base specification.
SINGLE-ROOT I/O VIRTUALIZATION
Single-root I/O virtualization’s primary
target is existing PCI hierarchies, where
single-CPU and multi-CPU computers
have the traditional single point of
attachment to PCI (Fig. 1, again). One of
the significant constraining goals of the
single-root spec was to enable the use of
existing or absolutely minimally changed
root-complex (i.e., chipset) silicon. Likewise,
enabling existing or minimally
changed switch silicon was a constraint.
Given those requirements, there can still
only be a single memory address space
from the bus perspective. Partitioning and
allocation for the virtualized SIs is performed
at a level above the root-complex
attachment point. Some type of address
translation logic is generally presumed
to exist in or above the root complex to
enable a “virtualization intermediary”
(commonly referred to as a hypervisor) to
perform that mapping. New IOV endpoint devices will be required, of course, with
their associated non-trivial design and
support challenges.
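The partitioning role of the virtualization intermediary can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class, its methods, and the window sizes are invented for illustration, and a real hypervisor would rely on hardware address translation rather than a dictionary lookup.

```python
class VirtualizationIntermediary:
    """Carves one bus address space into per-SI windows (hypothetical model)."""

    def __init__(self, bus_base, bus_size):
        self._next_free = bus_base
        self._limit = bus_base + bus_size
        self._windows = {}  # SI name -> (bus_start, size)

    def allocate(self, si, size):
        if si in self._windows:
            raise ValueError(f"{si} already has a window")
        if self._next_free + size > self._limit:
            raise MemoryError("bus address space exhausted")
        self._windows[si] = (self._next_free, size)
        self._next_free += size

    def translate(self, si, si_address):
        """Map an SI-local address to the single shared bus address space."""
        start, size = self._windows[si]
        if not 0 <= si_address < size:
            raise PermissionError(f"{si} accessed outside its window")
        return start + si_address


vi = VirtualizationIntermediary(bus_base=0x8000_0000, bus_size=0x2000)
vi.allocate("SI-X", 0x1000)
vi.allocate("SI-Y", 0x1000)
print(hex(vi.translate("SI-X", 0x10)))
print(hex(vi.translate("SI-Y", 0x10)))
```

Both SIs use the same local offset, yet they land in different parts of the one bus address space. That bookkeeping is the software burden the single-root approach accepts in exchange for leaving the chipset alone.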
The “don’t change the chipset!” philosophy
opens the virtualization market to
significant numbers of existing or simply
derived systems (e.g., systems that might need a new
BIOS or chipset revision). However, it
shifts a substantial burden to software
performing the virtualization intermediary
function.
MULTI-ROOT I/O VIRTUALIZATION
The most obvious example implementation
of the multiple attachment point
hierarchy (Fig. 2, again) is a blade server
with a PCI Express backplane, though the
PCI Express Cable specification opens up
a number of other possibilities. This is a
new PCIe hierarchy construct—effectively
a (mini) fabric.
Here, the PCI-SIG target was “small”
systems with 16 to 32 root ports as likely
maximums, though the architecture allows
many more. (One of the workgroup’s sayings
was “Our yardstick is a yardstick,”
i.e., the typical implementation is expected
to be a system occupying not more than
about three feet cubed.)
Again, retaining the use of existing
or absolutely minimally changed root-complex
(i.e., chipset) silicon was a key
goal. Unlike single root, however, no
virtualization intermediary is assumed and
the complexity of partitioning the system
moves into a new enhanced type of PCI
Express switch (Fig. 2, again), which is
called “multi-root aware.”
The key difference in a multi-root
system is the partitioning of the PCI hierarchy
into multiple virtual hierarchies
all sharing the same physical hierarchy.
Where single-root systems are stuck with
a single memory address space being partitioned
among their SIs, multi-root systems
actually have a full 64-bit memory
address space for each virtual hierarchy.
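One way to picture the difference is to key every route on a virtual hierarchy identifier as well as an address. The Python sketch below is only a conceptual model: the class name, the integer identifiers, and the routing table layout are assumptions made for illustration and do not reflect how the multi-root specification encodes virtual hierarchies.

```python
ADDRESS_SPACE = 1 << 64  # each virtual hierarchy gets a full 64-bit space


class MultiRootAwareSwitch:
    """Routes by (virtual hierarchy, address) instead of address alone (illustrative)."""

    def __init__(self):
        self._routes = {}  # virtual hierarchy id -> list of (base, size, target)

    def map_range(self, vh, base, size, target):
        if base < 0 or size <= 0 or base + size > ADDRESS_SPACE:
            raise ValueError("range does not fit in a 64-bit address space")
        self._routes.setdefault(vh, []).append((base, size, target))

    def route(self, vh, address):
        for base, size, target in self._routes.get(vh, []):
            if base <= address < base + size:
                return target
        return None  # nothing is visible at this address in this hierarchy


switch = MultiRootAwareSwitch()
# The same address range is reused in two hierarchies without any conflict.
switch.map_range(vh=1, base=0xF000_0000, size=0x1000, target="controller view A")
switch.map_range(vh=2, base=0xF000_0000, size=0x1000, target="controller view B")
```

In a single-root system, the two ranges above would have to be placed at different addresses; here the hierarchy identifier keeps them apart.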
Configuration management software,
working in conjunction with the enhanced
switch(es) and IOV devices, programs
the hierarchy so each root complex from Figure 2 “sees” its portion of the entire
multi-root hierarchy as if it were a single-root
hierarchy as in Figure 1. Each of
those “views” of the hierarchy is called a
virtual hierarchy. Each virtual hierarchy of
a multi-root system can be independently
enabled for single root or not. Therefore,
endpoint devices in a multi-root system
face the challenge of layering both modes.
Every SI should see its own virtualized
copy of the configuration space and
address map for a given device being virtualized.
Effectively, the device needs “n”
sets of PCI configuration space to support
“n” of these virtual functions. The single-root
specification defines lightweight virtual
function definitions to reduce the gate
count impact, while the multi-root specification
relies on a full configuration space
for each virtual hierarchy in which the device is usable.
The various “flavors” of configuration
spaces are too detailed for this article,
which is focused on virtualization at a
high level. For the purposes of this discussion,
it’s sufficient to note that every SI
interacting with an IOV device will have
its own device address range and configuration
space. Thus, the IOV device can
associate work with a particular SI based
on which address space was accessed.
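The address-to-SI association can be sketched as a sorted range lookup. This Python example assumes non-overlapping address ranges and uses invented names and addresses; an actual device would perform the equivalent match in its PCIe front-end logic.

```python
import bisect


class SiLookup:
    """Finds which SI owns an accessed address (ranges assumed non-overlapping)."""

    def __init__(self, ranges):
        # ranges: iterable of (base, size, si_name) tuples
        self._ranges = sorted(ranges)
        self._bases = [base for base, _size, _si in self._ranges]

    def si_for(self, address):
        # Find the last range whose base is <= address, then check its upper bound.
        index = bisect.bisect_right(self._bases, address) - 1
        if index >= 0:
            base, size, si = self._ranges[index]
            if address < base + size:
                return si
        return None


lookup = SiLookup([
    (0x2000, 0x1000, "SI-Y"),
    (0x1000, 0x1000, "SI-X"),
])
```

With the SI identified at the moment of access, every later step in the device, including the storage-side filtering discussed next, can be made SI-specific.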
VIRTUALIZING THE STORAGE SIDE
At this point in our hypothetical development
process, an IOV device has been
enabled to respond as if it were multiple
devices and provided with a mechanism
to distinguish between two different SIs.
If the implementation were stopped at this
point, the model would look like Figure 3.
Note that the depictions of SIs don’t
attempt to distinguish whether they’re
single-root or multi-root. At this point,
there’s really only concern that they’re different
images. The precise means of connection
is unimportant.
Effectively, all SIs see all of the disks
connected to the IOV storage controller.
In some environments, this model might
actually be okay. If the SIs were cooperative,
they could divide up the pool of
storage themselves. Likewise, if there
were some software intermediary between
each SI and the storage controller, it could
divide up the pool of storage and allow an
SI to see only a portion of the pool.
Considering the example at the beginning
of this article, users could be
uncomfortable with their banking system
“cooperating” with the crew at www.hackers-are-we.org. While the software
intermediary idea would be okay, it would
eliminate a lot of the performance savings
of doing IOV in hardware, and it would be
a rather complex piece of software needing
intimate knowledge of each controller’s
hardware and device driver. Clearly, then,
for most environments, hardware virtualization
of the storage side is desirable.
SAS TO THE RESCUE
Therefore, it’s not a difficult stretch to
imagine that a creative IOV storage controller
designer could add a straightforward
table mechanism to filter out disk drives by
their ID and only let certain SIs “see” certain
disk drives. Such a system would look
like Figure 4, where each colored SI has
access to the same color disk drive(s).
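A minimal Python sketch of such a masking table follows. The class, the SI names, and the drive IDs are hypothetical; the point is only that a per-SI set of permitted drive IDs is enough to implement the filter described above.

```python
class DriveMask:
    """Per-SI table of drive IDs that the SI is allowed to see (illustrative)."""

    def __init__(self):
        self._visible = {}  # SI name -> set of permitted drive IDs

    def grant(self, si, drive_id):
        self._visible.setdefault(si, set()).add(drive_id)

    def visible_drives(self, si, attached):
        """Filter the list of attached drives down to what this SI may see."""
        allowed = self._visible.get(si, set())
        return [drive for drive in attached if drive in allowed]


attached = [0, 1, 2, 3]
mask = DriveMask()
mask.grant("SI-X", 0)
mask.grant("SI-X", 1)
mask.grant("SI-Y", 2)
```

An SI with no table entry sees nothing, so the default is to deny access. Drive 3 in this example is attached but not yet assigned to anyone.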
Historically, this could have been done
fairly easily in a SCSI environment,
where SCSI even provided facilities for
sub-dividing a single disk drive. Even a
SATA controller today could probably handle
this sort of per-disk drive “masking.”
Like the free-for-all model of Figure 3,
the per-drive-masking model of Figure
4 might be usable in certain controlled
environments. As long as the number
of disk drives connected is small (for
example, the 1 to 15 drives SCSI supported),
then this model is quite workable.
Once the system grows beyond the
bounds of directly connected disk drives,
however, the complexity of this mechanism
becomes cumbersome.
Furthermore, implementing the software
to support a proprietary mechanism for a
dozen or so disk drives is probably irritating
but not prohibitive. Extending that
software to tens, hundreds, or thousands
of disk drives is likely more than any sane
developer would take on.
Luckily, SAS provides a standard mechanism
for access control, called zoning,
which is nearly perfect for storage virtualization.
SAS zoning is closely analogous
to similarly named mechanisms in Fibre
Channel and other storage-area network
(SAN) technologies.
SAS, a point-to-point serial protocol
designed as the successor to parallel SCSI,
uses devices called expanders
to enable the connection of additional
devices. A typical SAS host adapter might
implement eight ports, allowing the direct
connection of eight disk drives. (Actually,
SAS disk drives may use multiple ports
to provide additional bandwidth, so those
eight ports could even be fully utilized by
having four higher-performance two-port
disk drives attached.)
To provide more connectivity, SAS
expanders would be used as shown in Figure 5, ignoring the colors for the
moment. Each of these expanders is logically
a switch, though without the high
dollar cost associated with Fibre Channel
switches. SAS expanders can optionally
support a zoning capability, providing a
means to limit access from specified hosts
to specified targets, such as disk drives.
In SAS zoning, access is controlled per
connection point on the expander (called
a “PHY” in SAS-speak). Each expander
maintains a table of which PHYs can
communicate with other specific PHYs
on that expander. By manipulating these
tables on its zoning expanders, a SAS system
can provide full access control.
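The per-PHY table can be modeled as a set of permitted PHY pairs. The Python sketch below follows the simplified description in this article; it does not reproduce the table layout defined by the SAS standard, and the class and method names are invented for illustration.

```python
class ZoningExpander:
    """Expander with a table of PHY pairs that may communicate (simplified model)."""

    def __init__(self, phy_count):
        self.phy_count = phy_count
        self._permitted = set()  # each entry is a frozenset of two PHY numbers

    def _check(self, phy):
        if not 0 <= phy < self.phy_count:
            raise ValueError(f"PHY {phy} does not exist on this expander")

    def permit(self, phy_a, phy_b):
        self._check(phy_a)
        self._check(phy_b)
        # A frozenset makes the permission symmetric: (a, b) equals (b, a).
        self._permitted.add(frozenset((phy_a, phy_b)))

    def can_connect(self, phy_a, phy_b):
        return frozenset((phy_a, phy_b)) in self._permitted


expander = ZoningExpander(phy_count=8)
expander.permit(0, 4)   # host on PHY 0 may reach the drive on PHY 4
expander.permit(1, 5)   # host on PHY 1 may reach the drive on PHY 5
```

Any pair not listed in the table is refused, so the host on PHY 0 cannot open a connection to the drive on PHY 5.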
SAS zoning is configured via special
SAS messages that extend the existing Serial Management Protocol (SMP) inherent in
SAS already. The protocol already comprehends
the idea of a protected “supervisor”
as the only agent allowed to reconfigure
the zones.
Because SAS zoning is done per connection
point, adding or removing devices
automatically triggers zone re-analysis
and potentially zone reconfiguration.
Thus, new disk drives may be added to a
zone without disrupting other zones—or
even alerting them that the system configuration
changed.
Several articles could be written about
SAS zoning alone. But for the purposes
of this article, suffice it to say that zoning
provides full host-to-disk isolation and
access control (Fig. 5, again), with colors
representing each zone.
Following these steps, it’s clear that
mapping SAS zones to the SIs of PCI
Express I/O virtualization provides a
full-featured implementation of storage
virtualization. Figure 6 shows the full
picture of a SAS IOV controller. The SAS
controller provides one or more logical
SAS expander(s) internally with slight
tweaks to map SIs as if they were PHYs.
Each SI then sees only a portion of the
total storage pool, without the need for a
software intermediary filter. Furthermore,
this has been accomplished using existing
standardized mechanisms.
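Putting the pieces together, the controller can be sketched as an internal logical expander on which each SI and each drive occupies a PHY. This Python model is an assumption-laden illustration of the idea in Figure 6: the class, its methods, and the SI and drive names are hypothetical, and the zoning table is the same simplified pair table described earlier rather than the standard's format.

```python
class SasIovController:
    """SAS IOV controller with SIs mapped as PHYs on an internal expander (illustrative)."""

    def __init__(self):
        self._si_phy = {}        # SI name -> PHY number on the logical expander
        self._drive_phy = {}     # drive name -> PHY number on the logical expander
        self._permitted = set()  # frozenset({si_phy, drive_phy}) pairs
        self._next_phy = 0

    def _new_phy(self):
        phy = self._next_phy
        self._next_phy += 1
        return phy

    def attach_si(self, si):
        self._si_phy[si] = self._new_phy()

    def attach_drive(self, drive):
        self._drive_phy[drive] = self._new_phy()

    def zone(self, si, drive):
        """Permit one SI to reach one drive."""
        self._permitted.add(frozenset((self._si_phy[si], self._drive_phy[drive])))

    def drives_for(self, si):
        si_phy = self._si_phy[si]
        return sorted(
            drive for drive, phy in self._drive_phy.items()
            if frozenset((si_phy, phy)) in self._permitted
        )


controller = SasIovController()
for si in ("bank", "web"):
    controller.attach_si(si)
for drive in ("disk0", "disk1", "disk2"):
    controller.attach_drive(drive)
controller.zone("bank", "disk0")
controller.zone("bank", "disk1")
controller.zone("web", "disk2")
```

Each SI's discovery of the storage pool returns only the drives zoned to it, with no software filter sitting between the SI and the controller.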
While this example used a plain SAS
controller, a SAS RAID controller could
be used as well. Such a RAID controller
would likely present its RAID sets as if
they were simple disk drives behind the
same type of internal logical SAS expander
as was used in the plain SAS controller.