Amateur Blogging: Quad-Core Opteron finally unveiled

After four years of tantalizing tidbits about AMD's K10 microarchitecture, its first implementation is here, in the form of quad-core CPUs for dual- and quad-socket workstations and servers. As expected, the first products AMD launched run at 2 GHz (standard-performance variant) and 1.9 GHz ("HE" energy-efficient variant), with a 2.5-GHz speed bin "coming in December".

AMD's aspirations for success versus Intel center on three simple words:

Front
Side
Bus

Some key points:

Each core is individually clock-controlled (AMD moniker: "Independent Dynamic Core Technology"). Portions of each core are selectively clock-gated for additional power savings when their functions are not needed ("CoolCore Technology"). Two distinct voltage planes feed the cores and memory controllers ("Dual Dynamic Power Management").
Each core mates to 128 kbytes of dedicated L1 cache (not shown) and 512 kbytes of L2 cache. The four cores share a pool of 2 Mbytes of L3 cache.
A fully populated crossbar switch gives the cores access to two 72-bit (i.e. parity-cognizant) DDR2-667 memory controllers, and to HyperTransport 2.0 links to other CPUs (along with other external system resources).
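To put the memory-controller point in numbers, here is a minimal back-of-the-envelope sketch of peak theoretical DRAM bandwidth per socket. It assumes that 64 of each controller's 72 bits carry data (the rest being check bits) and that DDR2-667 delivers a nominal 667 million transfers per second; the function name is my own, and real-world sustained bandwidth will be lower.

```python
# Peak theoretical DRAM bandwidth for one quad-core Opteron socket.
# Assumption (not from AMD's foils): each 72-bit controller moves
# 64 data bits (8 bytes) per transfer; the other 8 bits are check bits.

def peak_bandwidth_gb_s(mega_transfers_per_sec: float,
                        data_bytes: int,
                        channels: int = 1) -> float:
    """Return peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    return mega_transfers_per_sec * 1e6 * data_bytes * channels / 1e9

DDR2_667_MT_S = 667  # nominal DDR2-667 transfer rate

per_controller = peak_bandwidth_gb_s(DDR2_667_MT_S, data_bytes=8)
per_socket = peak_bandwidth_gb_s(DDR2_667_MT_S, data_bytes=8, channels=2)

print(f"per controller: {per_controller:.2f} GB/s")
print(f"per socket:     {per_socket:.2f} GB/s")
```

Under those assumptions, each controller peaks at roughly 5.3 GB/s and each socket at roughly 10.7 GB/s, and that figure scales with the number of populated sockets because every CPU brings its own controllers.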

My FSB (front side bus) focus will be clear, I think, when you compare AMD's architecture against Intel's two-CPU (Xeon 5000 series) and four-and-more CPU competitors. Intel's CPUs are not single-die, quad-core designs; instead, they bundle two dual-core dice within a single package. They also don't contain on-die memory controllers; instead, the DRAM controller resides within a standalone "Northbridge" IC in the core-logic chipset. And they don't contain dedicated inter-CPU links.

What this all means is that across the common Intel FSB, which connects die-to-die, CPU-to-CPU, and CPU-to-chipset, flows a variety of time-critical traffic, such as:

CPU-to-CPU communications
CPU accesses to cache information residing on another die or CPU's cache memory, and
CPU accesses to system memory via the chipset's Northbridge DRAM controller.

At first glance, this feature disparity would seem to leave Intel at a substantial competitive disadvantage. However, consider the following points:

Each dual-core die within a Xeon CPU contains two 64-kbyte L1 cache arrays (one per core) plus 4 Mbytes of core-shared L2 cache, and no L3 cache. Compare those cache amounts and types (keeping in mind that L1 cache is usually higher performing than L2, and L2 is faster than L3) against those of the "Barcelona" competitor, especially L1-versus-L1 and L2-versus-L2.
The Xeon 5300 series comes in both 1066- and 1333-MHz FSB options, with the Xeon 7300 series so far offered only with a 1066-MHz FSB.
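The two points above can be tallied in a few lines. The sketch below sums on-package cache for each quad-core product from the figures quoted in this post and computes peak FSB bandwidth. It assumes 1 Mbyte = 1024 kbytes, treats the quoted FSB speeds as millions of transfers per second, and assumes a 64-bit (8-byte) FSB data path, which is my assumption rather than something from either vendor's foils. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuPackage:
    name: str
    cores: int
    l1_kb_per_core: int
    l2_kb_total: int
    l3_kb_total: int

    def total_cache_kb(self) -> int:
        """Sum of all L1, L2 and L3 cache in the package, in kbytes."""
        return (self.cores * self.l1_kb_per_core
                + self.l2_kb_total + self.l3_kb_total)


# Figures as quoted in the post; 1 Mbyte assumed to be 1024 kbytes.
opteron = CpuPackage("Quad-core Opteron", cores=4, l1_kb_per_core=128,
                     l2_kb_total=4 * 512, l3_kb_total=2 * 1024)
xeon = CpuPackage("Quad-core Xeon (two dual-core dice)", cores=4,
                  l1_kb_per_core=64, l2_kb_total=2 * 4 * 1024,
                  l3_kb_total=0)


def fsb_peak_gb_s(mega_transfers_per_sec: int, bus_bytes: int = 8) -> float:
    """Peak FSB bandwidth in GB/s, assuming an 8-byte-wide data path."""
    return mega_transfers_per_sec * 1e6 * bus_bytes / 1e9


for pkg in (opteron, xeon):
    print(f"{pkg.name}: {pkg.total_cache_kb()} kbytes total, "
          f"{pkg.l2_kb_total // pkg.cores} kbytes of L2 per core")

for speed in (1066, 1333):
    print(f"{speed}-MHz FSB: {fsb_peak_gb_s(speed):.2f} GB/s peak")
```

On these numbers the Xeon package carries nearly twice the total cache, all of it in the faster L2 tier, while the Opteron has twice the L1 per core. Keep in mind that the FSB figure is a ceiling shared by all the traffic types listed earlier, not memory bandwidth alone.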

Before I showcase AMD's performance claims, it's important that you be aware of two additional feature enhancements in the company's latest quad-core Opteron line versus the dual-core K8 microarchitecture-based predecessor:

This time around, the chips' virtualization support broadens to encompass hardware acceleration of virtual-to-physical address translation, versus slower software-centric translation in the past.
AMD has substantially beefed up its per-core floating-point performance, in migrating from a single-issue 64-bit FPU to a dual-issue 128-bit FPU ("Wide Floating Point Accelerator").

While this may or may not reflect a fundamental chip or process issue, as I inferred might be the case last Thursday, it also reflects the fact that AMD intended for its quad-core follow-on to drop into existing system designs not only from a socket-pinout standpoint but also from power-consumption and thermal-emissions perspectives. Clock-speed disparity aside, keep in mind that the newer chip has both enhanced virtualization hardware "hooks" and twice as many CPU cores; a 79% virtualization performance boost isn't unexpected in such a scenario.

Now that AMD's competing against a much more power-efficient competitor in Intel by virtue of the latter's NetBurst-to-Core microarchitecture evolution, AMD is attempting to evolve the rules of the power game.

Remember that, in comparing AMD's CPUs' power draw against their Intel counterparts, Opterons embed DRAM controllers absent from Intel's Xeon chips. AMD also continues to rely on DDR2 SDRAM, whereas Intel employs a Rambus-reminiscent serial interface scheme called FB-DIMM. A number of AMD-versus-Intel power-consumption comparisons I've recently seen (here's another) reveal that FB-DIMMs burn much more power than DDR2 SDRAM modules when the systems containing them are at idle or lightly loaded. Conversely, FB-DIMM becomes far more attractive, in terms of both performance per watt and absolute power consumption, when the system containing it is heavily loaded.

More observations:

AMD's playing the same "rate" game that I previously pointed out to you with respect to Intel and its OEM partner, Apple. The "rate" versions of SPECint (integer) and SPECfp (floating point) run as many concurrent copies of the benchmark as there are CPU cores in the system. In the AMD-versus-AMD SPECint comparisons, the quad-core CPU in each socket runs at a 33% CPU clock deficit but also has twice as many cores as its dual-core alternative; a ~50% resultant performance boost is par for the course.
Note that AMD doesn't run SPECint comparisons between itself and Intel, focusing instead on SPECfp and on other benchmarks that are floating-point-centric. This emphasis is reflective of the fact that, as I've already noted, AMD focused lots of attention on its per-core FPU as part of the K8-to-K10 microarchitecture evolution.
Predictably, AMD also showcases benchmarks that make extensive use of core-to-core, CPU-to-CPU, and CPU-to-memory interactions, thereby attempting to swamp the shared FSB of competitor Intel's products. Whether or not such scenarios are reflective of the real-life workloads your application sees is up to you to determine.
In AMD's Sept. 4 foil set, the company benchmarks its current high-end, 2-GHz CPU against a 2.33-GHz Intel processor that, while at 80W TDP arguably matches up against AMD from a power-consumption standpoint, doesn't reflect the top end of Intel's Xeon 53xx series. Intel offers two 120W products, the 2.66-GHz Xeon 5355 (whose mention in AMD's Aug. 27 foil set was, I suspect, a typo) and the 3-GHz Xeon 5365. Keep in mind, too, that Intel's 65-nm-based 53xx series launched last November, and that performance and power-consumption specs on the follow-on (and coming in one month) 45-nm-based Intel Xeon CPUs are not yet published (but are rumored to include an up-to-1600-MHz FSB, along with further substantial increases in per-die L2 cache).
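The "rate" arithmetic in the first observation above is easy to check. Here is a rough sketch, assuming throughput scales linearly with both core count and clock speed, which ignores memory and cache bottlenecks. The function names are my own, and the 33% and ~50% inputs are simply the figures quoted above.

```python
def naive_rate_scaling(core_ratio: float, clock_deficit: float) -> float:
    """Expected throughput ratio if performance scaled linearly with
    core count and clock speed (a simplifying assumption)."""
    return core_ratio * (1.0 - clock_deficit)


def implied_per_core_per_clock_gain(observed_speedup: float,
                                    core_ratio: float,
                                    clock_deficit: float) -> float:
    """What is left of the observed speedup once the naive core-count
    and clock-speed effects have been divided out."""
    return observed_speedup / naive_rate_scaling(core_ratio, clock_deficit)


# Twice the cores, 33% lower clock, ~50% observed "rate" boost.
naive = naive_rate_scaling(core_ratio=2.0, clock_deficit=0.33)
implied = implied_per_core_per_clock_gain(1.5, core_ratio=2.0,
                                          clock_deficit=0.33)

print(f"naive scaling:            {naive:.2f}x")
print(f"implied per-core/clock:   {implied:.2f}x")
```

Doubling the cores at a 33% lower clock predicts about a 1.34x gain on its own, so a ~50% boost leaves only a modest residual, on the order of 12%, attributable to everything else that changed between the two chips.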

A few lucky folks received quad-core Opteron-based systems that AMD shipped last Friday evening, and their early reviews are beginning to hit the Internet.

The fact that all four cores on a single Opteron die garner local access to each other's L2 cache resources and the common L3 cache pool, versus requiring FSB traversal in some scenarios in the dual-die Intel Xeon case, doesn't (so far at least) seem to be manifesting as a performance advantage in the testing. This is due in part to the lengthy AMD L3 cache latency. Intel's overall cache bandwidth also seems to exceed that of AMD's three-tier cache (AMD marketing moniker: "Balanced Smart Cache") approach.
Intel seems to be retaining an integer-performance lead, whereas AMD's beefed-up FPU's potential doesn't appear to be translating into the floating-point-performance advantage over Xeon that I and others expected would manifest.
AMD's non-reliance on a shared FSB, in combination with its crossbar-switch-fed and integrated DRAM controllers, and its continued reliance on low-latency (compared to FB-DIMM) DDR2 SDRAM, does produce a performance advantage for Opteron in "stream"-type benchmarks that extensively access system main memory.

AMD's questionable quad-core Opteron performance lead, combined with very aggressive launch pricing, forms the crux of my continued fundamental concern with the company's long-term fiscal health.

AMD has been able to hold onto a moderate amount of market share in the face of the Intel Core onslaught by slashing prices. Ideally, the launch of a new architecture would be mated to premium pricing, reflective both of the new chip's perceived added value and of the inevitable low yields early in product and process life. Unfortunately, AMD hasn't been able to pull off a price increment over Intel's nine-month-old CPUs. Will quad-core Opteron still be profitable?

What remains to be seen from AMD in the coming months:

How quickly will AMD be able to fill its various sales channels with 2-GHz product?
How quickly will AMD be able to ramp clock rates (with solid and sustainable yields) to 2.5 GHz and beyond?
What will more in-depth testing, by myself and others, in the weeks ahead reveal?
And how quickly, and with what specifications, will AMD launch the consumer-tailored "Phenom" spin of the K10 microarchitecture?

Hang on for the exciting ride to come, because the next few months are going to be unpredictable, occasionally frustrating, and, ultimately, exhilarating.
