State of the Cell
The Cell Broadband Engine Architecture (which we shall refer to simply as the Cell architecture) was designed as a compromise between the general-purpose but slower CPU and the special-purpose but faster GPU. It is a heterogeneous architecture: it contains processing units that specialize in different tasks. However, critics (and even some fans) of the Cell architecture claim that it is incredibly difficult to produce good, fast code on it. Having spent the last quarter working with the Cell architecture, we agree with this sentiment. But why?
Our paper, ‘Lack of Abstraction Considered Harmful’, provides a review of existing work on Cell, a description of the work we have done on Cell, and an analysis of this issue. This web page serves as a summary of that paper.
Introduction
To test how easy it is to write fast code on the Cell architecture, we implement two different algorithms: the Hadamard product and matrix multiplication. We implement these algorithms on our Cell system at Fred Chong’s Architecture Lab (the Archlab). Our Cell system is a PlayStation 3 running Yellow Dog Linux 5 and the Cell SDK 2.0.
Algorithms and Technologies Used
Yellow Dog Linux 5 and Cell SDK 2.0
Detailed instructions exist for installing Yellow Dog Linux 5 and Cell SDK 2.0 on the PlayStation 3, as well as for Fedora Core 7 and Cell SDK 3.0 on the IBM Blade Servers, but none exist for getting the new Cell SDK 3.0 on the PlayStation 3 (see the update at the end of this page). The company that supports Yellow Dog Linux, Terra Soft, offers the SDK with support, but at $4000 it is far outside our budget. Furthermore, although the SDK can be downloaded for free and used on other Linux boxes as a cross-compiling system, we were unable to get it to work on Gentoo Linux or Debian Linux. It seems to specifically require Fedora Core, which we did not have access to. The optimized IBM XL C Compiler (IBM's cross-compiler for C) ships only with these newer SDKs, so we used the SDK 2.0 versions of gcc for the PPE and SPEs instead.
RapidMind
RapidMind is a metaprogramming framework for C++ that allows programmers to take their existing C++ code and port it to the Cell architecture (or the GPU architecture) with seemingly little work involved. The programmer annotates the parts of the code where parallelization can be performed, and the RapidMind framework converts this code to “optimized” C++ code. The speed of the resulting code varies greatly with the skill of the programmer, as we will show.
Cell Broadband Engine Architecture
We defer a detailed analysis of the Cell architecture to its Wikipedia article and to the presentation we gave at UC Santa Barbara on contributions to the IBM cross compiler, as the novel architecture has been discussed extensively in the various papers we have reviewed this quarter.
Hadamard product and Matrix Multiplication Algorithms
Both of these algorithms are relatively simple as far as computer science goes. Although matrix multiplication will be familiar to computer scientists, the Hadamard product may not be. It takes two matrices of the same dimensions and produces a new matrix of those dimensions, in which each element is the product of the two input elements at the same position.
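To make the difference between the two operations concrete, here is a minimal sketch of both in plain C. It is not the code from our experiments: the function names, the use of float, and the row-major flat-array layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Hadamard (element-wise) product: out[i][j] = a[i][j] * b[i][j].
   All three matrices are rows x cols, stored row-major in flat arrays.
   (Names and layout are illustrative assumptions.) */
void hadamard(const float *a, const float *b, float *out,
              size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows * cols; i++)
        out[i] = a[i] * b[i];
}

/* Naive multiplication of two n x n matrices:
   out[i][j] = sum over k of a[i][k] * b[k][j]. */
void matmul(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            out[i * n + j] = sum;
        }
    }
}
```

Note that the Hadamard product touches each element exactly once and has no dependencies between elements, while matrix multiplication reads an entire row and an entire column for every output element. That is why the former is the easier of the two to split across processing units.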
Design Process
For our comparisons we use three renditions of the Hadamard product algorithm and three renditions of the matrix multiplication algorithm. For the Hadamard product, we have constructed one version that runs only on the PPE, a version that uses the PPE and four SPEs, and a final version that uses RapidMind. We use only four of the six SPEs because the matrices we are multiplying are square, and a square matrix is easier to split into four blocks than into six. For matrix multiplication, we use one version that runs only on the PPE, a naive RapidMind version that we wrote ourselves, and the optimized RapidMind version distributed by RapidMind. The code for the optimized RapidMind version can be found at their website, while the versions we have implemented are available here.
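The four-way split described above can be sketched as follows. This is an illustrative sketch in portable C, not our actual PPE / SPE code: block_t, quadrant, and hadamard_block are hypothetical names, and a plain loop over worker ids stands in for handing each block to an SPE.

```c
#include <stddef.h>

/* Half-open row and column ranges of one quadrant of an n x n matrix.
   (Hypothetical type, for illustration only.) */
typedef struct {
    size_t row_begin, row_end;
    size_t col_begin, col_end;
} block_t;

/* Quadrant for worker id 0..3:
   0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
   For odd n, the extra row/column goes to the bottom/right quadrants. */
block_t quadrant(size_t n, unsigned id)
{
    size_t half = n / 2;
    block_t blk;
    blk.row_begin = (id / 2 == 0) ? 0 : half;
    blk.row_end   = (id / 2 == 0) ? half : n;
    blk.col_begin = (id % 2 == 0) ? 0 : half;
    blk.col_end   = (id % 2 == 0) ? half : n;
    return blk;
}

/* Hadamard product restricted to one block of row-major n x n matrices.
   Blocks do not overlap, so the four calls are independent of each other. */
void hadamard_block(const float *a, const float *b, float *out,
                    size_t n, block_t blk)
{
    for (size_t i = blk.row_begin; i < blk.row_end; i++)
        for (size_t j = blk.col_begin; j < blk.col_end; j++)
            out[i * n + j] = a[i * n + j] * b[i * n + j];
}
```

Calling hadamard_block once for each id from 0 to 3 covers the whole matrix. Splitting into six blocks would need an uneven 2 x 3 or 3 x 2 grid, which is the bookkeeping we avoided by leaving two SPEs idle.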
Results
For the six implementations we have run on Cell, we see the following performance:
We see that in both cases our PPE-only code runs faster than our naive RapidMind implementation by three orders of magnitude (a factor of 1000). This gap scales up exponentially with the number of rows in the matrices (note that the y-axis is on a log scale), and we also see that the optimized RapidMind and our PPE / 4 SPE programs scale much better. Unfortunately, both compilers / frameworks were difficult to work with (although in different ways). We were unable to send more than 16 KB of data to the SPEs, so we have no results for the Hadamard product on inputs larger than 16 KB. Next, we discuss the difficulties and successes we had with Cell’s gcc / spu-gcc and RapidMind.
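On the 16 KB limit mentioned above: 16 KB is the maximum size of a single DMA transfer on Cell, so one plausible workaround (assuming the limit we hit was this per-transfer limit) is to split a large buffer into pieces of at most 16 KB. The sketch below shows only the chunking arithmetic in portable C. transfer_chunk is a hypothetical stand-in that uses memcpy where a real SPE program would issue a DMA command, and the sketch ignores the alignment requirements of real DMA transfers.

```c
#include <stddef.h>
#include <string.h>

/* Per-transfer size limit: 16 KB. */
#define MAX_TRANSFER_BYTES ((size_t)16384)

/* Hypothetical stand-in for one DMA transfer. A real SPE program would
   issue a DMA command here; memcpy keeps this sketch portable. */
static void transfer_chunk(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Copy `bytes` bytes from src to dst in pieces of at most
   MAX_TRANSFER_BYTES each. Returns the number of transfers issued. */
size_t transfer_in_chunks(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t transfers = 0;

    while (bytes > 0) {
        size_t chunk = bytes < MAX_TRANSFER_BYTES ? bytes : MAX_TRANSFER_BYTES;
        transfer_chunk(d, s, chunk);
        d += chunk;
        s += chunk;
        bytes -= chunk;
        transfers++;
    }
    return transfers;
}
```

Even with chunking, the data still has to fit in the SPE's local store alongside the program itself, so large matrices would also need to be processed a piece at a time rather than transferred whole.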
Pros and Cons of Cell’s gcc / spu-gcc
Pros:
Cons:
Pros and Cons of RapidMind
Pros:
Cons:
Conclusion
We have seen that on Cell it is currently difficult to write fast code, and difficult to write any code at all. This goes double for developing with gcc and spu-gcc. In order for Cell to be a viable platform, tools will need to be developed that abstract away the various parts of the hardware we’ve encountered difficulties with (memory alignment, sending data to SPEs, etc.). Doing so would not have been out of reach for the Cell developers, yet it appears they wanted to preserve the potential to get the maximum performance out of the hardware. Unfortunately, this decision makes it brutally difficult for the average programmer to get any real programming done. It also makes programs dependent on the specifics of this version of the architecture. To illustrate this point: what do you have to change in your algorithm if a new Cell chip is designed with 10 SPEs and 512 KB of local store instead of the current layout (8 SPEs and 256 KB of local store)? You could probably keep the same performance as before easily, but to get the most out of the architecture, you need to refactor your entire program. Why not let the compiler / scheduler / whatever do this for you? Computers were meant to make our lives easier, remember? I guess I finally see what the big deal about Ruby is, with Matsumoto’s “crazy” idea:
“Often people, especially computer engineers, focus on the machines. They think, ‘By doing this, the machine will run faster. By doing this, the machine will run more effectively. By doing this, the machine will something something something.’ They are focusing on machines. But in fact we need to focus on humans, on how humans care about doing programming or operating the application of the machines. We are the masters. They are the slaves.”
Hope is not lost for Cell. In fact, it’s quite the opposite. The potential for Cell is astounding. We just need to start harnessing that power the right way. Software complexity is already increasing on its own. We shouldn’t be making it even more complex. Here’s a more concise article that covers the same issue.
Chris Bunch
Update 12/22/07: So it turns out there are a couple of guides for installing Fedora Core 7 on the PS3. I’m gonna try them out next week and see how they go.