Byzantine Reality

Searching for Byzantine failures in the world around us

State of the Cell

Originally posted at http://cs.ucsb.edu/~cgb/stateOfTheCell.html, and thus looks much better there.

The Cell Broadband Engine Architecture (which we shall refer to as simply the Cell architecture) was designed as a compromise between the general-purpose but slower CPU and the specific-purpose and faster GPU. It is a heterogeneous architecture: it contains processing units that specialize in different tasks. However, critics (and even some fans) of the Cell architecture claim that it is incredibly difficult to produce good, fast code on it. Having spent the last quarter working with the Cell architecture, we agree with this sentiment. But why?

Our paper, ‘Lack of Abstraction Considered Harmful’, provides a review of prior work on Cell, the work we have done on Cell ourselves, and an analysis of this issue. This web page serves as a summary of that paper.

Page Layout

  • Introduction
  • Algorithms and Technologies Used
  • Design Process
  • Results
  • Pros and Cons of Cell

    Introduction

    In order to test how easy it is to program fast code on the Cell architecture, we implement two different algorithms: the Hadamard product and matrix multiplication. We implement these algorithms on our Cell system at Fred Chong’s Architecture Lab (the Archlab). Our Cell system is a PlayStation 3 with Yellow Dog Linux 5 and the Cell SDK 2.0.

    Algorithms and Technologies Used

    Yellow Dog Linux 5 and Cell SDK 2.0

    Detailed instructions exist for installing Yellow Dog Linux 5 and Cell SDK 2.0 on the PlayStation 3, as well as for Fedora Core 7 and Cell SDK 3.0 on the IBM Blade Servers, but none exist for getting the new Cell SDK 3.0 onto the PlayStation 3 (see the update below). The company that supports Yellow Dog Linux, Terra Soft, offers the SDK with support, but at $4000 it is far outside our budget. Furthermore, although the SDK can be downloaded for free and used on other Linux boxes as a cross-compiling system, we were unable to get it to work on Gentoo Linux or Debian Linux; it seems to specifically require Fedora Core, which we did not have access to. Since only these newer SDKs contain the optimized IBM XL C compiler (IBM's cross-compiler for C), we used the versions of gcc that target the PPE and SPEs (gcc and spu-gcc) instead.

    RapidMind

    RapidMind is a metaprogramming framework for C++ that allows the programmer to take their existing C++ code and port it to the Cell architecture (or the GPU) with seemingly little work involved. The programmer annotates the parts of their code where parallelization can be performed, and the RapidMind framework converts this code to “optimized” C++ code. The speed of the resulting code varies vastly with the skill of the programmer, as we will show.

    Cell Broadband Engine Architecture

    We defer a detailed analysis of the Cell architecture to its Wikipedia article and to the presentation we gave at UC Santa Barbara on contributions to the IBM cross compiler, as the novel architecture has been discussed extensively in the various papers we have reviewed this quarter.

    Hadamard product and Matrix Multiplication Algorithms

    Both of these algorithms are relatively simple as far as computer science goes. Although matrix multiplication will be familiar to computer scientists, the Hadamard product may not be. It takes two matrices of the same dimensions and produces a new matrix, also of the same dimensions, in which each element is the product of the corresponding elements of the two input matrices.
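
    For reference, here is a minimal, unoptimized C version of both operations for square n-by-n matrices stored row-major in flat arrays. The function names and storage layout are our own choices for illustration, not the project's code.

        /* Hadamard (element-wise) product: c[i][j] = a[i][j] * b[i][j].
           Matrices are n-by-n, stored row-major in flat arrays. */
        void hadamard(const float *a, const float *b, float *c, int n)
        {
            for (int i = 0; i < n * n; i++)
                c[i] = a[i] * b[i];
        }

        /* Ordinary matrix multiplication: c[i][j] = sum over k of a[i][k] * b[k][j]. */
        void matmul(const float *a, const float *b, float *c, int n)
        {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < n; k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
        }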

    Design Process

    For our comparisons we use three renditions of the Hadamard product algorithm and three renditions of the matrix multiplication algorithm. For the Hadamard product, we have one version that runs only on the PPE, one that uses the PPE and four SPEs, and one that uses RapidMind. We use only four of the six available SPEs because the matrices we operate on are square, so splitting them into four blocks is easier than splitting them into six (a sketch of such a split follows this paragraph). For matrix multiplication, we use one version that runs only on the PPE, a naive RapidMind version that we wrote ourselves, and the optimized RapidMind version distributed by RapidMind. The code for the optimized RapidMind version can be found at their website, while the versions we have implemented are available here.
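
    The exact split we used is not spelled out above, so purely as an illustration (not our actual code), here is a small C sketch that assumes each of the four SPEs is handed one quadrant of an n-by-n matrix (n even). For the Hadamard product any four equal blocks would do, since each output element depends only on the matching input elements.

        #include <stdio.h>

        /* Hypothetical illustration only: divide an n-by-n matrix (n even)
           into four equal quadrants and assign one quadrant per SPE. */
        int main(void)
        {
            const int n = 8;                  /* example matrix dimension */
            const int half = n / 2;

            for (int spe = 0; spe < 4; spe++) {
                int row0 = (spe / 2) * half;  /* top or bottom half */
                int col0 = (spe % 2) * half;  /* left or right half */
                printf("SPE %d: rows %d..%d, cols %d..%d\n",
                       spe, row0, row0 + half - 1, col0, col0 + half - 1);
            }
            return 0;
        }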

    Results

    For the six implementations we have run on Cell, we see the following performance:

    [Figure hadamard.png: Hadamard product performance]

    [Figure matrixmultiply.png: matrix multiplication performance]

    We see that in both cases, our PPE-only code runs faster than our naive RapidMind implementation by three orders of magnitude (roughly a factor of 1000). This gap scales up exponentially with the number of rows in the matrices (note that the y-axis is on a log scale), and we also see that the optimized RapidMind version and our PPE / 4 SPE program scale much better. Unfortunately, both compilers / frameworks were difficult to work with, although in different ways. We were unable to send more than 16 KB of data to the SPEs, so we have no data for the Hadamard product beyond that size. Next, we discuss the difficulties and successes we had with Cell's gcc / spu-gcc and with RapidMind.

    Pros and Cons of Cell’s gcc / spu-gcc

    Pros:

  • Super-fine-grained control over the architecture lets you get the best performance out of it. Some of the papers we've seen this quarter have shown the amazing speedups Cell can achieve. If you don't mind spending the extra time to really work out your algorithm, its speed will certainly end up first-class.
  • You can program in assembly or C as needed. Most of you will certainly prefer C, but to get the fastest programs, the masochist in you will want to write them in assembly. Take my advice, though: don't do it. We've argued that Cell needs more abstraction, so don't go with less by programming in assembly. But if you really wanted to, at least you could.
  • Being able to use I/O (esp. printf) inside the SPEs kicks ass. It's much slower, but it gives the bad debugger some extra data to work with (a minimal SPE printf sketch follows this list). Contrast this with CUDA (see Multigrid on CUDA), where you are not allowed to hit the screen's I/O at all and have to save outputs in variables that the CPU can print out later.
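
    Here is roughly what that looks like: a trivial SPE-side program that prints from inside the SPE. This is a minimal sketch rather than code from our project; the three-argument main signature is the conventional entry point for an SPE program launched from the PPE, and the printf call is serviced by the PPE behind the scenes, which is part of why it is slow.

        /* spe_hello.c (built with spu-gcc).  A minimal sketch, not project
           code: it shows that ordinary stdio works from inside an SPE. */
        #include <stdio.h>

        int main(unsigned long long speid,
                 unsigned long long argp,
                 unsigned long long envp)
        {
            /* printf on the SPE is handled by the PPE behind the scenes,
               which is why it is slow, but at least it works. */
            printf("hello from SPE 0x%llx (argp=0x%llx)\n", speid, argp);
            (void)envp;
            return 0;
        }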

    Cons:

  • Having to manage mandatory factors such as memory alignment and data transfer is obscenely difficult. It literally should be a crime to make the programmer do this. There's no reason the compiler can't, especially given that every other compiler seems to handle this for you (a sketch of what this boilerplate looks like follows this list).
  • Having to manage optional factors such as branch prediction is ridiculous. Just as before, there's no reason the compiler can't look at my loops and guess what's needed next most of the time. This goes double considering that branches are ALWAYS predicted not-taken. Come on! Programmers write code expecting it to get used! We don't write branches knowing they'll never be taken! Even always guessing taken would be better than always guessing not-taken.
  • Inconsistency between data types on the PPE and SPEs. I know, they have a different instruction set and are optimized for different things, blah blah blah. But I should be able to define a vector of a certain type and have that code work on the PPE or SPEs, even if it gets translated differently or performs differently. I shouldn’t have to memorize the different types of vectors for the PPE and SPEs just to get some use out of them.
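
    To make the first two complaints concrete, here is a rough sketch of the kind of boilerplate an SPE program needs just to pull a block of floats into its local store and to hint a branch by hand. This is an illustrative fragment, not code from our project: the buffer size, tag number, and function names are made up, but mfc_get, mfc_write_tag_mask, and mfc_read_tag_status_all are the standard MFC intrinsics from spu_mfcio.h, and __builtin_expect is the usual gcc builtin for telling the compiler which way a branch normally goes.

        /* SPE-side fragment, compiled with spu-gcc (illustrative only). */
        #include <spu_mfcio.h>

        #define CHUNK 4096   /* bytes per DMA; a single DMA tops out at 16 KB */

        /* DMA buffers must be at least 16-byte aligned; 128 bytes is the
           usual recommendation for best performance. */
        static float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

        /* Pull CHUNK bytes from main memory (effective address ea) into
           the local store, then wait for the transfer to finish. */
        void fetch_chunk(unsigned long long ea)
        {
            const unsigned int tag = 1;      /* any tag id in 0..31 */

            mfc_get(buf, ea, CHUNK, tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();
        }

        /* Manual branch hint: tell the compiler the test is usually true,
           since otherwise the hardware assumes branches are not taken. */
        void scale_positive(int n, float factor)
        {
            for (int i = 0; i < n; i++)
                if (__builtin_expect(buf[i] > 0.0f, 1))
                    buf[i] *= factor;
        }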

    Pros and Cons of RapidMind

    Pros:

  • Since RapidMind is built on top of C++, it is very simple to reuse small programs with the framework.
  • RapidMind abstracts the difficult-to-deal-with hardware away from the programmer. This is surely RapidMind’s greatest strength, as the lack of abstraction hurts the programmer’s productivity more than anything else we saw.
  • RapidMind is free for academic use! It’s easy to set up and has good documentation, so if you’re an academic type and have a PS3, go check it out!

    Cons:

  • Resolving compiler issues is virtually impossible. For an example, see the screenshot from using RapidMind on the GPU in the Multigrid Methods on CUDA project.
  • You need to be an expert at RapidMind to write fast RapidMind code. We constructed our program by following the recommendations at the developer’s site, but our final code for matrix multiplication was far different from the RapidMind team’s. They give many ways to solve problems but no “best practices”, leaving us with an algorithm that is substantially slower than theirs.
  • Some RapidMind variable types do not work as expected. RapidMind's boolean type cannot be used on its own in conditional statements, and refactoring code around that costs a lot of time and hurts the program's readability.

    Conclusion

    We have seen that on Cell it is currently difficult to write fast code, and difficult to write any code at all. This goes double for developing with gcc and spu-gcc. In order for Cell to be a viable platform, tools will need to be developed that abstract away the parts of the hardware we've had difficulties with (memory alignment, sending data to the SPEs, and so on). Doing so would not have been out of reach for the Cell developers, yet it appears they wanted to preserve the potential to get the maximum performance out of the hardware. Unfortunately, this decision makes it brutally difficult for the average programmer to get any real programming done. It also ties programs to the specifics of this version of the architecture. To illustrate the point: what do you have to change in your algorithm if a new Cell chip ships with 10 SPEs and 512 KB of local store instead of the current layout (8 SPEs and 256 KB of local store)? You could probably keep the same performance as before easily enough, but to get the most out of the new architecture you would need to refactor your entire program. Why not let the compiler / scheduler / whatever do this for you? Computers were meant to make our lives easier, remember? I guess I finally see what the big deal about Ruby is, with Matsumoto's "crazy" idea:

    “Often people, especially computer engineers, focus on the machines. They think, ‘By doing this, the machine will run faster. By doing this, the machine will run more effectively. By doing this, the machine will something something something.’ They are focusing on machines. But in fact we need to focus on humans, on how humans care about doing programming or operating the application of the machines. We are the masters. They are the slaves.”

    Hope is not lost for Cell. In fact, it's quite the opposite: the potential of Cell is astounding. We just need to start harnessing that power the right way. Software complexity is already increasing on its own; we shouldn't be making it even more complex. Here's a more concise article that covers the same issue.

    Chris Bunch

    Update 12/22/07: So it turns out there are a couple guides for installing Fedora Core 7 on the PS3. I’m gonna try them out next week and see how they go.