PhysX, GPUs, and the future of supercomputing

By Jon Stokes | Published: 12 March, 2007 - 08h25PM CST

Asking questions

What if I told you that most of what the tech press thinks it knows about Ageia and the PhysX PPU is completely wrong?  What if I told you that there's more to PhysX than physics?  And what if I said that there exists a grand unified theory of stream computing and the high-performance computing (HPC) market, one that's simple and perhaps even a bit obvious, but that makes sense of all the stream-computing-related press releases from NVIDIA, AMD/ATI, Ageia, Peakstream, and others that have been coming down the wire in the past year?

Maybe you'd think I'm crazy, but you should hear me out first.

Sieve provides a window into PhysX

A company called Codeplay showed up at this week's GDC to talk about their new auto-parallelizing compiler, called Sieve.  Sieve takes in single-threaded C/C++ code, examines it for dependencies and parallelization opportunities, and turns it into multithreaded code for use on multicore processors.

The programmers in the audience are going to be immediately skeptical that this actually works as advertised, because multithreading an application is hard enough for humans to do right now, and in many previous articles on the topic I've talked about why this is the case.  But I'm going to skip over all that because I don't really want to focus on Codeplay or Sieve.  Instead, I want to talk about what Codeplay can tell us about one of the most mysterious and misunderstood chips currently on the market: Ageia's PhysX PPU.

Ageia hasn't released any architectural details about PhysX, so the company has only itself to blame for the current state of confusion surrounding its product.  But Codeplay's technology reveals quite a bit more about PhysX, and about Ageia's plans for it, than Ageia itself has ever let on.

Codeplay's Sieve compiler targets three platforms: multicore x86, Ageia's PhysX, and IBM's Cell.  If PhysX were what most people think it is (i.e., a physics toaster that's designed specifically to accelerate a number of popular physics algorithms), then it would certainly be the odd one out in this list.  After all, the other two chips feature multiple identical cores (I'm talking about Cell's SPEs, and not its PPE), with each of Cell's cores having access to its own local storage.  So multicore x86 and Cell are fairly generalized multicore designs that can host pretty much any kind of data-parallel, multithreaded, kernel-based algorithm that you want to throw at them.  At this point, some of you have already figured out where I'm going with this, but let me bring up one other thing before I spell it out.

According to one of Codeplay's white-papers, the ideal Sieve platform is a multicore system where each core has its own separate memory space.  Sieve divides a block of code up into kernels, and it puts each kernel into a core's local storage.  Of course, the compiler can and does make do with a unified pool of memory, as is the case with multicore x86, but non-uniform memory access (NUMA) architectures are where multicore, data-parallel computing is clearly headed, and they're what Sieve does best with.  So Cell is ideal for Sieve... and so is PhysX.
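
To make the kernel-per-core idea a bit more concrete, here is a minimal sketch in Python.  It is not Sieve's syntax or its generated code; it only illustrates the general pattern of splitting a data-parallel job into chunks, handing each "core" a private copy of its chunk, and gathering the results.  All of the names (split_into_chunks, scale_kernel, run_on_cores) are hypothetical, and the worker threads merely stand in for cores with local storage, so don't expect a real speedup from this version.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_cores):
    """Partition data into num_cores contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), num_cores)
    chunks, start = [], 0
    for core in range(num_cores):
        # The first `extra` cores take one additional element each.
        size = base + (1 if core < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def scale_kernel(local_store, factor):
    """Hypothetical kernel: it reads only its own private copy of the data."""
    return [x * factor for x in local_store]


def run_on_cores(data, num_cores, factor):
    """Scatter chunks to workers, run the kernel on each, gather in order."""
    chunks = split_into_chunks(data, num_cores)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # list(chunk) is an explicit private copy, standing in for a
        # transfer from main memory into a core's local storage.
        futures = [pool.submit(scale_kernel, list(chunk), factor)
                   for chunk in chunks]
        partial_results = [f.result() for f in futures]
    # Concatenate the per-core results back into one list, preserving order.
    return [x for part in partial_results for x in part]
```

The point of the sketch is that the kernel never touches shared state: everything it needs arrives in its private copy, which is the property that makes a per-core-memory design a natural fit for this style of code.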

Based on things I've heard and on my reading of Codeplay's white-papers, I'm certain that the PhysX PPU consists of multiple identical cores (maybe with one or two unique cores, like Cell's PPE, but probably not), and that each of those cores has its own local storage.  In other words, PhysX almost certainly looks a lot like Cell, but with more cores, and probably more per-core local storage.  Game developers can use this type of architecture to implement any number of physics models, but that's just one application for the chip.

Because the PhysX architecture is a fairly non-physics-specific multicore NUMA design, it's my belief that Ageia really has its eye on the HPC market with this part.  The company probably started out pitching PhysX as a physics accelerator for games, because the gaming market is where the volume is.  The idea is that once PhysX gets traction in the game market, the volume goes up and the per-part price goes down, and when the price gets low enough the company can make a real play for a coprocessor spot in the growing number of HPC clusters that are coming online in sectors as disparate as academia, oil and gas, finance, and medicine.

You might be skeptical of Ageia's "games first, HPC later" business plan, and rightly so.  Just because I think this is their plan doesn't mean I think it'll work.  Getting enough volume out of the gaming market to price yourself into the HPC cluster market is hard, even when your particular HPC chip is picked as the brains of Sony's PlayStation 3 (*ahem* IBM).

Alternatively, you may be skeptical that I've caught on to Ageia's plan at all.  Why would they think a play like this would work, or why would they risk it?  Why not just pitch to HPC directly, and forget about all this gaming stuff?  An answer to the former question would involve more speculation and mind-reading than I want to get into here, but the answer to the latter question becomes clear when you understand how the HPC market works.

A grand unified theory of the HPC market

A glance at the recent evolution of the Top 500 Supercomputers list nicely illustrates the fact that the supercomputing market is becoming increasingly dominated by systems built from commercial off-the-shelf (COTS) components and connected together in a cluster configuration; the only specialty hardware in many of these clusters is the system interconnect.  The rise of the COTS cluster has made owning a supercomputer much less costly, and as supercomputer ownership gets cheaper it gets more widespread.  And as supercomputers get more widespread, they get cheaper still, and thus the virtuous circle called "economies of scale" works its magic and causes supercomputers to proliferate.

If you're a silicon vendor who wants a spot in the growing number of HPC clusters, the price of entry is a COTS part where the emphasis is on "C" for commodity.

So let's say that you're a company with a great design for a multicore coprocessor chip that would offer the HPC market remarkable speedups on data-parallel workloads vs. today's clusters of general-purpose x86 processors.  The problem you face is that you can't just produce a chip specifically for the HPC market, because the chip has to be available in enough volume to keep the price down, and the HPC market just isn't big enough for that yet.  The HPC market is growing, though, and you want to be in on it because nothing helps stock prices like the prospect of growth (conversely, nothing depresses a stock price like the perception that you've run out of room to grow).

If you're IBM or Ageia, then the way to get your multicore coprocessor chip into COTS clusters is clear: sell the chip in the gaming market first, because the gaming market has a unique combination of high volume and an insatiable appetite for parallelism.  Then, when the chip is selling widely enough to be profitable (fabs are expensive, and you have to keep them busy or you lose money), you can break into the small but profitable and growing market for HPC clusters.  This is why IBM got Sony to help them foot the bill for designing and fabbing Cell, and it's why Ageia wants gamers to buy a physics acceleration board.

From the opposite end of the spectrum, if you're NVIDIA or AMD/ATI and you're already selling a large number of parallel coprocessors in the gaming market, then you have the hardest part—the challenge of ramping up volume—licked.  Now all you need in order to get into HPC clusters is a developer toolset and a PR campaign.

Challenges for the current players

Ultimately, IBM and Ageia still face that volume-related pricing hurdle.  They're still hoping that gamers will latch onto their chips and pull them far enough up the volume curve to make their HPC play really profitable.  Because its Cell processor occupies the prime gaming real-estate of Sony's PlayStation 3 console, IBM is much further up this hill than Ageia is.  The latter company is struggling to gain traction in a gaming market that eyes with skepticism the prospect of spending a few hundred dollars and a PCI slot on a physics accelerator.

NVIDIA and AMD/ATI, on the other hand, face a different set of challenges altogether.  For all their talk of stream processing and generalized data-parallel computation, the GeForce 8800 and the R600 are still graphics processing units, with plenty of graphics-specific logic and with microarchitectures that are designed for ultra-fast real-time 3D rendering.  Not only are all of a GPU's architectural decisions made in favor of fast 3D rendering, but power efficiency hasn't really entered the GPU picture yet.  The vast majority of these chips are sold to gamers who don't factor their electric bill into their hardware purchases; the GPU market is about raw performance, and is relatively insensitive to performance/watt considerations.

Because NVIDIA and AMD/ATI still live and die by 3D gaming benchmarks, and because their sales volume is tied to a customer base that doesn't care about power consumption, neither NVIDIA nor AMD/ATI can afford to sacrifice any 3D rendering performance in favor of improving their performance/watt ratio on more general kinds of data-parallel workloads.  Unfortunately, HPC cluster buyers who have to power a room full of systems do care about power consumption, so per-chip performance/watt matters for them in a way that it does not for the Opposable Thumbs crowd.

Aside from performance/watt considerations, GPUs face another major challenge in the HPC cluster market: a rapid pace of microarchitectural evolution that can spoil investments in hand-optimization.  Like GPUs' large die sizes and voracious appetites for wattage, the high degree to which GPU microarchitectures can change from one generation to the next is a direct result of the fact that their sales are tied to their dominance in 3D gaming benchmarks.  A GPU maker will change almost anything about a design if it means a boost in frames-per-second on top games.  These changes mean that if you get really invested in low-level hand optimizations, then you'd better plan to stick with that exact same GPU for the life of your cluster.  Of course, companies like Peakstream and Rapidmind have middleware layers aimed at alleviating the effects of this kind of GPU product churn, so for HPC customers that use such tools, maintaining software compatibility across multiple hardware upgrade cycles will be less of a challenge.

To finish off where we started—with the PhysX PPU—Ageia faces a steep uphill battle in the HPC coprocessor game.  Not only does PhysX not have volume at the moment, but if its architecture is as I've described it, then it also faces direct competition from both established vendors like IBM (and probably Intel, before long) and other startups like Clearspeed.  In this kind of competitive environment, Ageia had better post some killer performance/watt numbers if their product is going to have any sort of chance against GPUs on the one hand and other massively multicore coprocessors on the other.
