State of the Cell
Originally posted at http://cs.ucsb.edu/~cgb/stateOfTheCell.html, and thus, looks much better there.
The Cell Broadband Engine Architecture (which we shall refer to as simply the Cell architecture) was designed as a compromise between the general-purpose but slower CPU and the specific-purpose and faster GPU. It is a heterogeneous architecture: it contains processing units that specialize in different tasks. However, critics (and even some fans) of the Cell architecture claim that it is incredibly difficult to produce good, fast code on it. Having spent the last quarter working with the Cell architecture, we agree with this sentiment. But why?
Our paper, ‘Lack of Abstraction Considered Harmful’, provides a review of work on Cell as well, work we have done on Cell, and an analysis of this issue. This web page serves as a summary of that paper.
Page Layout
Introduction
In order to test how easy it is to program fast code on the Cell architecture, we implement two different algorithms: the Hadamard product and matrix multiplication. We implement these algorithms on our Cell system at Fred Chong’s Architecture Lab (the Archlab). Our Cell system is a PlayStation 3 with Yellow Dog Linux 5 and the Cell SDK 2.0.
Algorithms and Technologies Used
Yellow Dog Linux 5 and Cell SDK 2.0
Detailed instructions exist for installing Yellow Dog Linux 5 and Cell SDK 2.0 on the PlayStation 3, as well as for Fedora Core 7 and Cell SDK 3.0 on the IBM Blade Servers, but none exist for getting the new Cell SDK 3.0 on the PlayStation 3 (Update). The company that supports Yellow Dog Linux, Terra Soft, offers the SDK with support, but for $4000, it is far out of our availability. Furthermore, although the SDK can be downloaded for free and used on other Linux boxes as a cross-compiling system, we were unable to get it to work on Gentoo Linux or Debian Linux. They seem to specifically require Fedora Core, which we did not have access to. Since these newer SDKs contain the optimized IBM XL C Compiler (IBM cross-compiler for C), we used the versions of gcc for the PPE and SPEs.
RapidMind
RapidMind is a metaprogramming language for C++ that allows the programmer to take their existing C++ code and port it to the Cell architecture (or the GPU architecture) with seemingly little work involved. The programmer annotates the parts of their code where parallelization can be performed, and the RapidMind framework converts this code to “optimized” C++ code. The speed of the resulting code varies vastly on the skill of the programmer, as we will show.
Cell Broadband Engine Architecture
We defer a detailed analysis of the Cell architecture to its Wikipedia article and to the presentation we gave at UC Santa Barbara on contributions to the IBM cross compiler, as the novel architecture has been discussed extensively in the various papers we have reviewed this quarter.
Hadamard product and Matrix Multiplication Algorithms
Both of these algorithms are relatively simple as far as computer science goes. Although matrix multiplication will be familiar to computer scientists, the Hadamard product may not be. It takes two matrices and produces a new matrix of the same dimensions, with each element being the product of the two elements given as input.
Design Process
For our comparisons we use three renditions of the Hadamard product algorithm and three renditions of the matrix multiplication algorithm. For the Hadamard product, we have constructed one version that only runs on the PPE, a version that uses the PPE and four SPEs, and a final version that uses RapidMind. We only use four of the six SPEs since the matrices we are multiplying are square and thus it is easier to split up into four blocks instead of six. For the matrix multiplication algorithm, we use one version that only runs on the PPE. We also wrote a version that uses RapidMind, but is a more naive version compared to the optimized RapidMind version distributed by RapidMind. The code for the optimized RapidMind version can be found at their website, while the versions we have implemented are available here.
Results
For the six algorithms we have run on Cell, we see the following performance:
We see that in both cases, our PPE only code runs faster than our naive RapidMind implementation by three orders of magnitude (a factor of 1000 difference). This scales up exponentially with the number of rows in the matrices (since the y-axis is on a log scale), and we also see that the optimized RapidMind and our PPE / 4 SPE programs scale up much better. Unfortunately, both compilers / frameworks were difficult to work with (although in different ways). We were unable to send more than 16 KB of data to the SPEs, so as a result we have no data for the Hadamard product with more than 16 KB of data. Next, we discuss the difficulties and successes we have with Cell’s gcc / spu-gcc and RapidMind.
Pros and Cons of Cell’s gcc / spu-gcc
Pros:
Cons:
Pros and Cons of RapidMind
Pros:
Cons:
Conclusion
We have seen on Cell that it is currently difficult to write fast code, and difficult to write any code at all. This goes double for developing with gcc and spu-gcc. In order for Cell to be a viable language, tools will need to be developed that abstract away the various parts of the hardware we’ve encountered difficulties with (memory alignment, sending data to SPEs, etc). Doing so would not have been out of reach for the Cell developers, yet it appears they wanted the potential to get the maximum performance out of the hardware. Unfortunately, this decision makes it brutually difficult for the average programmer to get any real programming done. It also makes the programs dependent on the specifics of this version of the architecture. To illustrate this point: what do you have to change in your algorithm if a new Cell chip is designed with 10 SPEs and 512 KB local store instead of the current layout (8 SPEs and 256 KB local store). You could probably keep the same performance as before easily, but to get the most out of the architecture, you need to refactor your entire program. Why not let the compiler / scheduler / whatever do this for you? Computers were meant to make our lives easier, remember? I guess I finally see what the big deal about Ruby is, with Matsumoto’s “crazy” idea:
“Often people, especially computer engineers, focus on the machines. They think, ‘By doing this, the machine will run faster. By doing this, the machine will run more effectively. By doing this, the machine will something something something.’ They are focusing on machines. But in fact we need to focus on humans, on how humans care about doing programming or operating the application of the machines. We are the masters. They are the slaves.”
Hope is not lost for Cell. In fact, its quite the opposite. The potential for Cell is astounding. We just need to start harnessing that power the right way. Software complexity is already increasing on its own. We shouldn’t be making it even more complex. Here’s a more concise article that covers the same issue.
Chris Bunch
Update 12/22/07: So it turns out there are a couple guides for installing Fedora Core 7 on the PS3. I’m gonna try them out next week and see how they go.
Multigrid Methods with CUDA
Originally posted at http://cs.ucsb.edu/~cgb/multigridCUDA.html, and thus, looks much better there.
Over the last quarter I have been investigating how to implement a multigrid algorithm with CUDA, a new language aimed at making General Purpose Computation on GPUs (GPGPU) easier. This web page documents this experience.
Introduction
It is not always possible to solve a partial differential equation (PDE). In fact, exact solutions are not known for most PDEs. With the advent of computers, however, we can obtain good approximations to these solutions. One class of methods that are used to solve PDEs are the Multigrid Methods. Compared to other types of methods used to solve PDEs, they require less iterations and thus are generally faster. For this project, we have elected to solve a common PDE, the heat equation in three-dimensions, given by the following equation:
Implementing multigrid methods is not a new problem. Therefore, we choose to implement them on a new platform, the Graphics Processor Unit (GPU). This has become possible only recently with the advent of GPGPU languages such as NVIDIA’s CUDA (Compute Unified Device Architecture) and ATI’s CTM (Close To Metal), amongst others. Since we already have an NVIDIA GeForce 8800 in our lab, which is CUDA-capable, we used CUDA to implement this project.
Algorithms and Technologies Used
CUDA
CUDA is a C-like language that allows programmers to “easily” port their code for use on the GPU. The language is mostly a super-set of the C language, allowing the programmer to flag which methods are run on the GPU and where variables are stored on the GPU. Although it is presented as a high-level language, low-level parts of the architecture are exposed to the user. As is the case with C, the programmer can allocate memory and de-allocate it (via cudaMalloc and cudaFree), but the programmer can choose specifically on the GPU where they want to put it. The following figure is from the CUDA documentation, showing the various memory locations available to the user:
Here, we can see that we can allocate memory in the graphics card’s shared memory, constant cache, texture cache, and device memory. Although the device memory is the slowest to access, it has the most space and allows us to not have to worry about cache coherence in the shared memory. Thus, we use the device memory. A critique of the CUDA language is given in the “Pros and Cons of CUDA” section.
Multigrid Methods
Typical methods used to solve PDEs represent the simulated area as a three-dimensional grid (or matrix). Therefore, it is easy to represent on a computer. Furthermore, we can represent the heat equation as solving a linear system of equations. The general and specific representations are as follows:
For our implementation, k is 1, A is the Laplace operator, x is u, and b is Ut. Three simple methods exist for solving this system of equations: The Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. Our explanation for choosing the Red-Black Gauss-Seidel Method is given later, since this section focuses on Multigrid Methods in general.
As was previously stated, regular methods use one grid and solve it using any of the aforementioned methods. Multigrid methods differ in the sense that they solve many smaller grids and use those as approximations to the solution of the full size grid. This is necessary because solving only one large grid does not give the fine-grained accuracy that solving the smaller grids does. Furthermore, solving many small grids once and the large grid a small number of times is computationally faster than solving the large grid many times. The methods needed to implement this will be described in the design section.
Red-Black Gauss-Seidel Method
The simple methods used to solve a linear system of equations are the Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. As shown in A Multigrid Tutorial, the Red-Black Gauss-Seidel method converges the fastest out of these three methods (requires the least number of iterations). An added benefit specific to CUDA is that this method is easily parallelizable. In the standard Gauss-Seidel method, grid points must be updated one at a time. Although this works fine for serial computation, it is a poor choice for the parallel case. For Red-Black Gauss-Seidel, the odd points are all updated in parallel, and then the even points are all updated in parallel. The computation used to update each point is given by the average of that point’s neighbors.
Design Process
This being my first academic project since reading Code Complete, I decided to develop a prototype first. My C++ prototype simulates a 2D grid situation. It was mainly used to get a feel for the methods needed for the CUDA version, and as a result, is incomplete. It ended up mapping over to CUDA well, as follows:
| C++ Method Name | CUDA Method Name | Purpose |
| Constructor | Initialize / MG_CUDA_AllocateMem | Creates a grid. The C++ version needs a grid size; the CUDA version creates all the subgrids from the given size to a preset limit. Each subgrid is half the size of the previous grid. |
| Destructor | Close / MG_CUDA_Free | Frees the memory a grid has allocated. The CUDA version uses cudaFree to release all the memory it has been given via cudaMalloc. |
| setupInitialConditions | HeatToggle | The grid is initialized to 0 degrees at all points by the constructor. The C++ version of this method sets all points in the first row to be 100 degrees, and the CUDA version sets the innermost area of the cube to be 200 degrees. This is done so that we will be able to see the heat flow throughout the geometry. |
| restrictGrid | MG_CUDA_RestrictGrid / MG_CUDA_RKernel | Creates a new grid half the size of the original grid. The C++ version copies down the points from the bigger grid to the smaller one, while the CUDA version sets each point to be the weighted average of the points around it. |
| interpolateGrid | MG_CUDA_InterpolateGrid / MG_CUDA_IKernel | Creates a new grid twice the size of the original grid. Both versions sets each point to be the weighted average of the points around it. |
| solveGrid | MG_CUDA_SolveGrid / MG_CUDA_SGKernel | Both versions solve the given grid by using the Red-Black Gauss-Seidel Method. The odd numbered points are updated to be a weighted average of the points around them, and the process is repeated for the even numbered points. The C++ version does this until 50 iterations are done, while the CUDA version does 1 iteration. |
| computeResidual | MG_CUDA_ComputeResidual / MG_CUDA_CRKernel | Takes the difference between each point in two grids and returns a new grid containing this difference. |
| applyCorrection | MG_CUDA_ApplyCorrection / MG_CUDA_ACKernel | Adds in each point in a grid to the given grid and returns a new grid. |
| printGrid | {none} | Used for debugging purposes only while the visualization scheme was being developed. |
| main | Simulate | Performs one iteration of our multigrid solver, which consists of restricting the grids with data from the previous iteration, solving the smaller grids, and solving the largest grid. |
The code for the multigrid prototype can be downloaded here.
Results
For the sake of timing purposes, we consider each execution of the ‘Simulate’ method to be one step, and we time how many steps can be performed per second. We test this for both the CUDA and C multigrid implementations for grids of size 16 x 16 x 16 through 256 x 256 x 256. We attempted to go larger (to 512), but the CPU’s memory allocation function (malloc) failed and could not allocate us the 512 MB of memory we needed for each of the four arrays we manipulate on the CPU (note the 512 MB matching up with the grid size 512 is coincidental). That being said, the speedup we achieve is given as follows:
For the purposes of this experiment, the % difference is also the speedup between the CUDA and C versions of the multigrid method.

What is interesting to note is that the CPU version is actually faster on the grid of dimension 64, and that the GPU is not notably faster as we have seen in other CUDA GPU applications. The reason for this is two-fold. First, we were unable to resolve issues we were having getting the Red-Black Gauss-Seidel method to work on our largest grid, so it had to be done on the CPU, and the integer modulo operator we use to get the even and odd points for Red-Black Gauss-Seidel works very fast on the CPU and very slow on the GPU. Furthermore, the grid was stored as a one dimensional array in memory, and to recover the three dimensional array structure, each point had to perform a swizzling method, which also involved using integer modulo operations. Add in the fact that we do this 16 million times per step (once per point on the grid, so 16 million is for the 256 grid) and you see why our speedup was not as good as it could have been. A low resolution video of our CUDA multigrid implementation has been recorded along with a high-resolution screen capture. Both are provided here in that order:
Unfortunately, the video recording software we used does not record the high-frame rates of our program very well and thus lags much more than the actual version does. Furthermore, our version is more brilliant in terms of color as the screenshot on the right reveals. We use the a simple 3D rotation technique based on which axes the user is manipulating for rotating the cube, and we provide mouse combinations to allow the user to zoom in and out of the cube and translate themselves in space about the cube. These features are all documented in the above video. As we have refactored Brent Oster’s C code to produce our CUDA version, we have kept the interaction mechanisms he has implemented intact. We have added functionality to switch rendering on and off to be able to accurately measure how many steps are computed per second, as the rendering negatively impacts it.
On December 14, 2007, this material was presented to our ‘GPGPU and 3D User Interfaces’ class. A PDF of the presentation can be found here. I’m currently working on putting the code up. The code used for this project consists of three files: NC_Multigrid.cpp, NC_Multigrid.h, NanoCAD2.cpp, and the CUDA source code, Multigridcu.cu. They can be found at the location “Projects/NanoCAD2/Modules/”, with the exception of the driver file, which is at “Projects/NanoCAD2/NanoCAD2.cpp”.
Pros and Cons of CUDA
CUDA provides a novel way for the programmer to use the graphics hardware to its fullest. It is both a high and low level language, but because it is a relatively new language, it is not without its downsides.
Pros:
Cons:

This wouldn’t be so bad except for the fact that this happens almost every time something goes wrong, not just with the blocks. In the last four hours of programming in CUDA, this has happened to me FIVE times. Future implementations of CUDA NEED to fix this for CUDA to become even close to viable as a good programming language. My computer shouldn’t die just because I used memory that I allocated and set to zero but forgot to copy the host’s version of to. That leads into the next issue…


Trying to recompile the code gives no errors and the code runs fine. So what’s the deal? I can’t even get the blue screen of death consistently! Sometimes Visual Studio gives me a cudaError_enum, sometimes the computer locks up, and sometimes it gives me the blue screen. Why can’t it just be consistent, and consistently the first result?
Conclusion
Whew! What a long review! It’s been a great quarter full of new opportunities, and thanks to all for letting me use what I’ve needed to get this done. Special thanks go out to Tobias Hollerer and Brent Oster for use of the hardware and software, respectively, that was used or refactored in some way for this project. CUDA has the potential to go far for sure, but there’s definitely some huge kinks it needs to work out before it’s ready for the general public.
My Favorite Error Message Ever
While running a Poisson Equation Finite Difference approximation algorithm (whew, a mouthful there!) today on Cell, I got the greatest error message ever: [cgb@cell pfd]$ ./withRapidMind What the hell indeed. Much thanks to the RapidMind guys for an awesome function called “what” and not supporting the double type (although I’m sure there’s a long explanation about that I haven’t found yet).
terminate called after throwing an instance of ‘rapidmind::Exception’
what(): The type ‘double’ is not supported by the Cell backend.
Aborted
Setting up Linux on the Cell (Part 2)
So…Yellow Dog Linux doesn’t like SDK 2.1 very much. Turns out I must have done something wrong while trying to install it because it only wants to work on Fedora and I fucked up my ability to compile code for the SPEs and PPE while doing so. So I reinstalled Yellow Dog and this time, followed the very simple explanation at the RapidMind website which said just to install the libspe2 rpm and not mess around with anything else. It worked great, but let’s step back a second here. So reinstalling Yellow Dog and RapidMind by their instructions works fine. But I do have a couple gripes which may just be standard Linux issues that I never realized until now for some reason: I’m sure there are other things, and I’ll put them up as I remember them. Asides from those little gripes, the environment is actually very nice. Here’s a nice how-to that helped me out a bit: CellPerformance.Com. I’d link directly to the RapidMind page too, but you have to login for that (although it is free to sign up). Anywho, we’re gonna benchmark the Cell versus a slightly-faster-than-average PC and do some other cool stuff with that. Check back later for how that turns out.
SDK 2.1 (and I presume the new SDK 3.0) install a Cell Simulator on your machine. What the hell is the point of that? Why is it installing a Cell Simulator on my Cell? Yes, this works fine if you’re on a Fedora box that you want to develop code for Cell on that isn’t a Cell box. But if I want to develop on my Cell with the new SDK, don’t make me go get another box that isn’t a Cell to go do it! To quote what I’ve been hearing a lot lately, “What were they thinking?” The IBM official instructions say I can do this on a Cell box, but what the hell is the problem here!
/rant
If I sudo up to root, then I should be able to do anything that root could do if root logged in! But I can’t! I can’t boot into the ps3 game OS, I can’t even add groups! What the hell! Completely ridiculous.
(Not a Linux problem) X/Gnome looks like shit on my monitor. Holy crap I never realized how bad the analog connections on the PS3 look until I fired up X. If you have the choice and can get a digital connection and a monitor that suports HDCP, buy it. I’ll put up a screenshot of how miserable this looks sooner or later.
Setting up Linux on the Cell (Part 1)
So we got a PS3 for research in the lab the other day and we got to set up Linux on it. It’s actually pretty painless, although if you can, get a monitor that supports DVI and supports HDCP so you can get the digital connection. We only have analog in the lab so it looks like crap. Seriously though, it looks like feces and I have to use SSH to get anything done. That being said, here’s how to do it:
This actually isn’t a how-to or anything. We just followed the directions at the IBM website and it was all set up. I actually wanted to set up RapidMind to fiddle with parallelizing applications in an “easier” fashion (check back later to see if it actually was easier) and needed to set up libspe2. Libspe1 comes with Cell, but to set up libspe2, you’ll need to install the Cell SDK 2.1 (2.0 comes with Cell and the IBM link above, use this link for 2.1). I’m installing it right now, but if it doesn’t work, I’ll let you know, and if it gets really bad, it will become a how-to. Either way, I’m looking at using Cell for multigrid method solving and visualization, or maybe something unrelated, but I’ll let you know.



