CloudComp 2010 Field Report
This week I am at CloudComp 2010 in Barcelona, Spain to present our group’s recent paper on Active Cloud DB. So I figured that while I was here, I had might as well report on the keynote talks given as well as the best of the papers presented there. Enjoy!
Day One
Keynote Talk One:
Ignacio Llorente, project lead over at OpenNebula, gave the first talk of the conference, discussing the various features that the OpenNebula cloud infrastructure offers its users. One of the more interesting features he discussed involved hybrid cloud support – specifically, OpenNebula can act as a broker between clouds, providing a single, unified interface between any number of private clouds you may happen to have and public clouds. Unfortunately, he didn’t get into many technical details and really just listed all of OpenNebula’s features, which I guess is ok since it was a keynote talk, but for an hour long keynote I was expecting any kind of depth and not just breadth.
Top Paper from Day One:
Alexander Reinefeld gave an interesting talk named Data Management in Clouds, which (1) is a really generic paper title and inevitably (2) IMHO was quite a bit of a misnomer, since it was really aboutXtreemFS, a file system targeted at the grid computing world. He said we needed a cloud file system, but didn’t really make the case that this was “cloudy” in any way – there didn’t seem to be any discussions about elasticity or anything else we seem to hear about in the cloud world. Otherwise the talk seemed pretty solid – it doesn’t really seem to be a research project as much as being a production-level-ish piece of software.
Keynote Talk Two:
Alvaro Arenas gave the second day’s keynote talk, discussing the XtreemOS project. He went for the opposite approach of the previous keynote speaker – lots of depth on one part of XtreemOS (the security layer) and very little on the rest of the system. Unfortunately, it didn’t seem to be really talking about cloud computing at all – the project is firmly rooted in the grid computing world and doesn’t really talk about challenges or benefits in the cloud world. Specifically, I was looking for him to say anything about how an operating system works when the number of nodes in the system changes (a common cloud use case) or anything about SLAs (also extremely common to the cloud) but didn’t hear anything about either.
Top Paper from Day Two:
(Asides from my paper, which was also on day two, of course!)
Guillaume Pierre talked about CloudTPS, their way to get ACID-transactions in NoSQL databases. It basically puts a transaction manager in the system which handles this on behalf of the database. Of course, this component is distributed (many nodes) and implements the well-known two-phase commit protocol. The presentation was very interesting but there were a few things I would have liked to seen that weren’t there:
- How the evaluation is performed was a bit vague – he was a bit low on time at this point and would have liked any explanation of the graphs more than “yay it scaled!”
- We sharply disagreed on what “consistency” was – they claim that Amazon SimpleDB was eventually consistent even with consistent reads and that NoSQL datastores favor availability over consistency – which just isn’t so (see HBase, Cassandra with consistent reads/writes, MemcacheDB, and so on). After a bit of discussion I believe he was trying to say that consistency to him meant ACID – but clearly it is not the case that these datastores favor availability over consistency just because they don’t have ACID transactions.
Regardless the talk was very interesting and the general idea was sound.
Keynote Talk Three:
Kate Keahey, project lead at the Nimbus Project, gave the last keynote of the conference, about the pros and cons of using cloud computing for running scientific experiments. She gave a breakdown of how Nimbus is implemented as well as the tools out there that sit on top of it.
I found her talk to be the most straightforward and the best talk of the conference (again, outside of mine of course). I liked her approach of talking in some depth about many topics to be far preferable to the other keynotes’ styles of no depth and all breadth (keynote 1) or all depth and no breadth (keynote 2).
She also talked about how Nimbus is used in real science, with a number of cool use cases and a good but brief discussion of how they run their open-source world. It was fairly simple – a few core committers on the project and a few more on github, but since it usually isn’t talked about too much in these settings, I found it to be insightful.
Top Paper from Day Three:
Burkhard Neidecker-Lutz, one of the conference’s program chairs, stepped in and gave a talk on a paper that the authors were unable to present themselves, on a framework for information and billing in the cloud. I found this paper to be unique not because of the actual paper itself – the slides in fact were blobs of text and somewhat inpenetrable, but what was interesting was how Burkhard was able to take a different group’s paper and not-so-great slides and really turn it around. He was able to use examples from other papers seen at the conference to really save that paper and make for a very interesting discussion, at the least. So my thanks go out to him to show that it can be done – others who took up presenting papers that weren’t theirs just read it real fast and ran for the hills, so this was good proof that it can be done the right way.
Wrapping Things Up
Unfortunately I couldn’t find a list of the papers online anywhere, but it looks like most of the papers can be found via our friend the Google. While mulling things over at the conference and in this hotel room, I also have a number of new interesting cloudy ideas, so stay tuned!
Large Scale Data Analysis Talk
As part of the seminar I’m in on Large Scale Data Analysis, I gave a talk on the continuing battle in the MapReduce world between DeWitt and Stonebraker on the side of parallel databases versus Dean and Ghemawat on the side of MapReduce. For those of you not interested in reading these long articles, it basically boils down to this: DeWitt and Stonebraker originally claimed that MapReduce allowed for fast data movement but was slow for actual computation, so you should use parallel databases instead (they suggested Vertica and “DBMS-X”, a mystery database).
They now say that for “quick and dirty” one-off jobs you should use MapReduce due to the fast data movement but in all other cases you should use Vertica. Dean and Ghemawat responded by saying that all the faults DeWitt and Stonebraker accused MapReduce of having are really faults of Hadoop’s MapReduce implementation and not MapReduce as an algorithm (that is, Google MapReduce doesn’t suffer from these problems). Specifically, DeWitt and Stonebraker’s MapReduce numbers turned out to be really slow because they stored the data as strings and parsing the data out was extremely expensive (often more expensive than the actual computation involved). To remedy this problem the data can simply be stored as Protocol Buffers, which DeWitt and Stonebraker were unable to do since Hadoop MapReduce doesn’t support it (although Google MapReduce was). There is an open ticket for this feature in Hadoop MapReduce but it appears to be orphaned long ago. If we actually get this feature in it will make the comparison extremely interesting.
The second half of my talk covered two recent papers on virtual machine migration, which is really handy if you need to reboot the physical machine for upgrades and maintenance or migrate the virtual machines for load balancing or power management, but as far as my research goes, none of those really help me out. Sysadmins will love these features, but the rest of us are really more concerned with reacting to VM failures and not so much to proactive VM failures.
Either way, I uploaded the slides as usual for your enjoyment. Hope you find them useful!
Major Area Exam
Just in case you were interested, I gave a talk last week at UCSB covering everything having to do with cloud computing. Unfortunately, I was only able to tape the first ten minutes of it, so I pretty much just cover virtualization and some basic introduction stuff as well. Here’s the slides if you want to follow along at home or see what I talked about after the video ended. Enjoy!
Smashing the Stack
While reading new books for my security class, I stumbled upon a very well-thoughtout, well-explained article on how buffer overflows / stack overflows work. It’s got cool nerdy pictures and code along with it. And the best part is…it’s got assembly code! Yay! How long has it been since you’ve seen assembly code? Have you actually seen assembly code? Smashing the Stack for Fun and Profit by Aleph One Back in my undergrad days they taught us about the fun times of Pep/7, a language similar to assembly used to teach…assembly. It turns out to be surprisingly similar to x86, and once you know one you end up with a great handle on the other. But it seems there are not any links to it from Google, so you’re on your own if you want to dig it up and use it.
libnet and libpcap
In having to research libnet and libpcap, I’ve found that there aren’t really a whole lot of great resources on them. But for those who, like me a week ago, have no idea what these things are, some clarification is helpful. Libnet is the standard packet construction library for C, which can be used to make TCP and UDP packets, amongst others. Libpcap does the opposite: it captures packets off the network. And now that you know the tools I need to pick up in a short amount of time, let’s talk about the resources I eventually found. Libpcap doesn’t have a whole lot of tutorials that can be easily reached through Google, but I think the best is directly linked from libpcap’s homepage itself. Appropriately titled ‘Programming with pcap‘, this tutorial gives a hands-on approach that shows how to implement a packet sniffer and gives you the source code to the whole thing along with it. Libnet unfortunately didn’t have great material on its homepage nor easily through Google. Also appropriately titled, the Libnet 1.1 Tutorial for Beginners is exactly what the name would imply. It’s broken up into multiple, discrete sections and provides tons of code and examples. Check out these great links and, as always, drop a line if you’ve found better links about these libraries. Also, check out the wtfpl (wtf public license); it’s full of fun and begs to be used.
Upcoming Topics for Fall 2008
Hi all! I start up my next quarter of grad classes today and will likely be taking these classes: So as normally happens, I’ll be slanting that way as far as my blog posts go for the next two or three months. However, since I’m on my Ruby craze right now, I’ll do my best to mix-in Ruby with these topics.
First There Was Java...
For the lion’s share of my undergraduate education (80%-90%) we got to program in Java. We did half a semester in C# (close enough), a semester in C, a semester in assembly, and everything else was Java. Although I picked up other languages before I got to undergrad life, Java was the first language that I really learned well, and a lot of how Java does things seeped into my head about how to do things in every other language. It’s not a bad thing, but it made my head spin when I ended up seeing Ruby for sure. This article has been in the works in my head ever since I read The Perils of JavaSchools, since I went to one. Early on in the article Joel pretty much captures the problem with only teaching Java: Java is not, generally, a hard enough programming language that it can be used to discriminate between great programmers and mediocre programmers. This is right on. And that’s not to say Java shouldn’t be taught in schools. But I don’t know if it should be the first language to teach students (I’ll get back to you on that) and it certainly shouldn’t be the only language that students end up really knowing. We can debate what it means to “really know” a language, but as far as I’m concerned, it’s knowing the language’s standard library. You can easily “program” in a language like Perl even if you don’t know Perl but know another imperative language (e.g., C or Java), but you don’t really know Perl unless you know all those special metacharacters and Perl’s regex syntax and blah blah blah. I think what I’m trying to get at is that I really wish I got to learn two languages from different paradigms. Something like Java and Lisp, or C and ML, but to a degree where if a problem comes up, I can take the extra five minutes to think about which language solves the problem best, rather than having to do it in Java because I’m just not productive enough in anything else. Part of this problem is relieved by me trying to learn Ruby well and insisting that I do everything in Ruby until I get a good grip on it (or unless Java is such a blatantly better choice). I think the problem with having Java as the only programming language is that, until I really got a good look at Ruby, I looked at the basic way we were taught to get input from the user and thought it just had to be that complicated: As I’ve said, Java is not my only language. I know it’s a lot less work to do this in C or Perl. But since I don’t really have the same grip on C or Perl that I have on Java, it’s not something that really ever hit me. Yet now that I’m looking to know as much about Ruby as I do about Java (hopefully more), all these little things are raising red flags. Like in Ruby, to get a line of input from the user, it’s: Which is closer to how it’s done in Perl or C and much shorter than Java. And of course I could go on about how Ruby is way less verbose than Java (which I will do next time), but Ruby’s not the focus of today. Java is. Java has been a great language for me over the last how-many-years. But I think we can come to a consensus that, no matter how you feel about Java, it can’t be the only tool in your toolbox. I also think that the only thing worse than not knowing a particular language is knowing it not that well. I spent about a year maintaining O’Caml applications, and I still can’t say that I know O’Caml. I can subconsciously look at O’Caml code and say what it’s doing, but I don’t know the O’Caml standard library. Google and the O’Caml docs are my best friend when I touch O’Caml, which just isn’t something that happens when I’m using Java. Sure, I still use Google, but I don’t need to go on message boards and forums to find out what a particularly weird error I’m having means. But this is getting off-topic. What I mean to say is that I don’t really know all the amazing features behind O’Caml. I know some subset of it that corresponds to the style of the coders whose code I maintained, and that’s about it. I’ve decided for Ruby that that’s totally unacceptable. It shows a lot of promise and I when I talk about Ruby, I want to at least appear like I know what I’m talking about. I want to approach a Ruby conversation with the same gusto that I have when I’m talking Java. And I wish that I had that knowledge with another language from my undergrad days. But there’s no time like the present, so off we go on Ruby!BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String response = in.readLine();foo = gets()
State of the Cell
Originally posted at http://cs.ucsb.edu/~cgb/stateOfTheCell.html, and thus, looks much better there.
The Cell Broadband Engine Architecture (which we shall refer to as simply the Cell architecture) was designed as a compromise between the general-purpose but slower CPU and the specific-purpose and faster GPU. It is a heterogeneous architecture: it contains processing units that specialize in different tasks. However, critics (and even some fans) of the Cell architecture claim that it is incredibly difficult to produce good, fast code on it. Having spent the last quarter working with the Cell architecture, we agree with this sentiment. But why?
Our paper, ‘Lack of Abstraction Considered Harmful’, provides a review of work on Cell as well, work we have done on Cell, and an analysis of this issue. This web page serves as a summary of that paper.
Page Layout
Introduction
In order to test how easy it is to program fast code on the Cell architecture, we implement two different algorithms: the Hadamard product and matrix multiplication. We implement these algorithms on our Cell system at Fred Chong’s Architecture Lab (the Archlab). Our Cell system is a PlayStation 3 with Yellow Dog Linux 5 and the Cell SDK 2.0.
Algorithms and Technologies Used
Yellow Dog Linux 5 and Cell SDK 2.0
Detailed instructions exist for installing Yellow Dog Linux 5 and Cell SDK 2.0 on the PlayStation 3, as well as for Fedora Core 7 and Cell SDK 3.0 on the IBM Blade Servers, but none exist for getting the new Cell SDK 3.0 on the PlayStation 3 (Update). The company that supports Yellow Dog Linux, Terra Soft, offers the SDK with support, but for $4000, it is far out of our availability. Furthermore, although the SDK can be downloaded for free and used on other Linux boxes as a cross-compiling system, we were unable to get it to work on Gentoo Linux or Debian Linux. They seem to specifically require Fedora Core, which we did not have access to. Since these newer SDKs contain the optimized IBM XL C Compiler (IBM cross-compiler for C), we used the versions of gcc for the PPE and SPEs.
RapidMind
RapidMind is a metaprogramming language for C++ that allows the programmer to take their existing C++ code and port it to the Cell architecture (or the GPU architecture) with seemingly little work involved. The programmer annotates the parts of their code where parallelization can be performed, and the RapidMind framework converts this code to “optimized” C++ code. The speed of the resulting code varies vastly on the skill of the programmer, as we will show.
Cell Broadband Engine Architecture
We defer a detailed analysis of the Cell architecture to its Wikipedia article and to the presentation we gave at UC Santa Barbara on contributions to the IBM cross compiler, as the novel architecture has been discussed extensively in the various papers we have reviewed this quarter.
Hadamard product and Matrix Multiplication Algorithms
Both of these algorithms are relatively simple as far as computer science goes. Although matrix multiplication will be familiar to computer scientists, the Hadamard product may not be. It takes two matrices and produces a new matrix of the same dimensions, with each element being the product of the two elements given as input.
Design Process
For our comparisons we use three renditions of the Hadamard product algorithm and three renditions of the matrix multiplication algorithm. For the Hadamard product, we have constructed one version that only runs on the PPE, a version that uses the PPE and four SPEs, and a final version that uses RapidMind. We only use four of the six SPEs since the matrices we are multiplying are square and thus it is easier to split up into four blocks instead of six. For the matrix multiplication algorithm, we use one version that only runs on the PPE. We also wrote a version that uses RapidMind, but is a more naive version compared to the optimized RapidMind version distributed by RapidMind. The code for the optimized RapidMind version can be found at their website, while the versions we have implemented are available here.
Results
For the six algorithms we have run on Cell, we see the following performance:
We see that in both cases, our PPE only code runs faster than our naive RapidMind implementation by three orders of magnitude (a factor of 1000 difference). This scales up exponentially with the number of rows in the matrices (since the y-axis is on a log scale), and we also see that the optimized RapidMind and our PPE / 4 SPE programs scale up much better. Unfortunately, both compilers / frameworks were difficult to work with (although in different ways). We were unable to send more than 16 KB of data to the SPEs, so as a result we have no data for the Hadamard product with more than 16 KB of data. Next, we discuss the difficulties and successes we have with Cell’s gcc / spu-gcc and RapidMind.
Pros and Cons of Cell’s gcc / spu-gcc
Pros:
Cons:
Pros and Cons of RapidMind
Pros:
Cons:
Conclusion
We have seen on Cell that it is currently difficult to write fast code, and difficult to write any code at all. This goes double for developing with gcc and spu-gcc. In order for Cell to be a viable language, tools will need to be developed that abstract away the various parts of the hardware we’ve encountered difficulties with (memory alignment, sending data to SPEs, etc). Doing so would not have been out of reach for the Cell developers, yet it appears they wanted the potential to get the maximum performance out of the hardware. Unfortunately, this decision makes it brutually difficult for the average programmer to get any real programming done. It also makes the programs dependent on the specifics of this version of the architecture. To illustrate this point: what do you have to change in your algorithm if a new Cell chip is designed with 10 SPEs and 512 KB local store instead of the current layout (8 SPEs and 256 KB local store). You could probably keep the same performance as before easily, but to get the most out of the architecture, you need to refactor your entire program. Why not let the compiler / scheduler / whatever do this for you? Computers were meant to make our lives easier, remember? I guess I finally see what the big deal about Ruby is, with Matsumoto’s “crazy” idea:
“Often people, especially computer engineers, focus on the machines. They think, ‘By doing this, the machine will run faster. By doing this, the machine will run more effectively. By doing this, the machine will something something something.’ They are focusing on machines. But in fact we need to focus on humans, on how humans care about doing programming or operating the application of the machines. We are the masters. They are the slaves.”
Hope is not lost for Cell. In fact, its quite the opposite. The potential for Cell is astounding. We just need to start harnessing that power the right way. Software complexity is already increasing on its own. We shouldn’t be making it even more complex. Here’s a more concise article that covers the same issue.
Chris Bunch
Update 12/22/07: So it turns out there are a couple guides for installing Fedora Core 7 on the PS3. I’m gonna try them out next week and see how they go.
Multigrid Methods with CUDA
Originally posted at http://cs.ucsb.edu/~cgb/multigridCUDA.html, and thus, looks much better there.
Over the last quarter I have been investigating how to implement a multigrid algorithm with CUDA, a new language aimed at making General Purpose Computation on GPUs (GPGPU) easier. This web page documents this experience.
Introduction
It is not always possible to solve a partial differential equation (PDE). In fact, exact solutions are not known for most PDEs. With the advent of computers, however, we can obtain good approximations to these solutions. One class of methods that are used to solve PDEs are the Multigrid Methods. Compared to other types of methods used to solve PDEs, they require less iterations and thus are generally faster. For this project, we have elected to solve a common PDE, the heat equation in three-dimensions, given by the following equation:
Implementing multigrid methods is not a new problem. Therefore, we choose to implement them on a new platform, the Graphics Processor Unit (GPU). This has become possible only recently with the advent of GPGPU languages such as NVIDIA’s CUDA (Compute Unified Device Architecture) and ATI’s CTM (Close To Metal), amongst others. Since we already have an NVIDIA GeForce 8800 in our lab, which is CUDA-capable, we used CUDA to implement this project.
Algorithms and Technologies Used
CUDA
CUDA is a C-like language that allows programmers to “easily” port their code for use on the GPU. The language is mostly a super-set of the C language, allowing the programmer to flag which methods are run on the GPU and where variables are stored on the GPU. Although it is presented as a high-level language, low-level parts of the architecture are exposed to the user. As is the case with C, the programmer can allocate memory and de-allocate it (via cudaMalloc and cudaFree), but the programmer can choose specifically on the GPU where they want to put it. The following figure is from the CUDA documentation, showing the various memory locations available to the user:
Here, we can see that we can allocate memory in the graphics card’s shared memory, constant cache, texture cache, and device memory. Although the device memory is the slowest to access, it has the most space and allows us to not have to worry about cache coherence in the shared memory. Thus, we use the device memory. A critique of the CUDA language is given in the “Pros and Cons of CUDA” section.
Multigrid Methods
Typical methods used to solve PDEs represent the simulated area as a three-dimensional grid (or matrix). Therefore, it is easy to represent on a computer. Furthermore, we can represent the heat equation as solving a linear system of equations. The general and specific representations are as follows:
For our implementation, k is 1, A is the Laplace operator, x is u, and b is Ut. Three simple methods exist for solving this system of equations: The Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. Our explanation for choosing the Red-Black Gauss-Seidel Method is given later, since this section focuses on Multigrid Methods in general.
As was previously stated, regular methods use one grid and solve it using any of the aforementioned methods. Multigrid methods differ in the sense that they solve many smaller grids and use those as approximations to the solution of the full size grid. This is necessary because solving only one large grid does not give the fine-grained accuracy that solving the smaller grids does. Furthermore, solving many small grids once and the large grid a small number of times is computationally faster than solving the large grid many times. The methods needed to implement this will be described in the design section.
Red-Black Gauss-Seidel Method
The simple methods used to solve a linear system of equations are the Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. As shown in A Multigrid Tutorial, the Red-Black Gauss-Seidel method converges the fastest out of these three methods (requires the least number of iterations). An added benefit specific to CUDA is that this method is easily parallelizable. In the standard Gauss-Seidel method, grid points must be updated one at a time. Although this works fine for serial computation, it is a poor choice for the parallel case. For Red-Black Gauss-Seidel, the odd points are all updated in parallel, and then the even points are all updated in parallel. The computation used to update each point is given by the average of that point’s neighbors.
Design Process
This being my first academic project since reading Code Complete, I decided to develop a prototype first. My C++ prototype simulates a 2D grid situation. It was mainly used to get a feel for the methods needed for the CUDA version, and as a result, is incomplete. It ended up mapping over to CUDA well, as follows:
| C++ Method Name | CUDA Method Name | Purpose |
| Constructor | Initialize / MG_CUDA_AllocateMem | Creates a grid. The C++ version needs a grid size; the CUDA version creates all the subgrids from the given size to a preset limit. Each subgrid is half the size of the previous grid. |
| Destructor | Close / MG_CUDA_Free | Frees the memory a grid has allocated. The CUDA version uses cudaFree to release all the memory it has been given via cudaMalloc. |
| setupInitialConditions | HeatToggle | The grid is initialized to 0 degrees at all points by the constructor. The C++ version of this method sets all points in the first row to be 100 degrees, and the CUDA version sets the innermost area of the cube to be 200 degrees. This is done so that we will be able to see the heat flow throughout the geometry. |
| restrictGrid | MG_CUDA_RestrictGrid / MG_CUDA_RKernel | Creates a new grid half the size of the original grid. The C++ version copies down the points from the bigger grid to the smaller one, while the CUDA version sets each point to be the weighted average of the points around it. |
| interpolateGrid | MG_CUDA_InterpolateGrid / MG_CUDA_IKernel | Creates a new grid twice the size of the original grid. Both versions sets each point to be the weighted average of the points around it. |
| solveGrid | MG_CUDA_SolveGrid / MG_CUDA_SGKernel | Both versions solve the given grid by using the Red-Black Gauss-Seidel Method. The odd numbered points are updated to be a weighted average of the points around them, and the process is repeated for the even numbered points. The C++ version does this until 50 iterations are done, while the CUDA version does 1 iteration. |
| computeResidual | MG_CUDA_ComputeResidual / MG_CUDA_CRKernel | Takes the difference between each point in two grids and returns a new grid containing this difference. |
| applyCorrection | MG_CUDA_ApplyCorrection / MG_CUDA_ACKernel | Adds in each point in a grid to the given grid and returns a new grid. |
| printGrid | {none} | Used for debugging purposes only while the visualization scheme was being developed. |
| main | Simulate | Performs one iteration of our multigrid solver, which consists of restricting the grids with data from the previous iteration, solving the smaller grids, and solving the largest grid. |
The code for the multigrid prototype can be downloaded here.
Results
For the sake of timing purposes, we consider each execution of the ‘Simulate’ method to be one step, and we time how many steps can be performed per second. We test this for both the CUDA and C multigrid implementations for grids of size 16 x 16 x 16 through 256 x 256 x 256. We attempted to go larger (to 512), but the CPU’s memory allocation function (malloc) failed and could not allocate us the 512 MB of memory we needed for each of the four arrays we manipulate on the CPU (note the 512 MB matching up with the grid size 512 is coincidental). That being said, the speedup we achieve is given as follows:
For the purposes of this experiment, the % difference is also the speedup between the CUDA and C versions of the multigrid method.

What is interesting to note is that the CPU version is actually faster on the grid of dimension 64, and that the GPU is not notably faster as we have seen in other CUDA GPU applications. The reason for this is two-fold. First, we were unable to resolve issues we were having getting the Red-Black Gauss-Seidel method to work on our largest grid, so it had to be done on the CPU, and the integer modulo operator we use to get the even and odd points for Red-Black Gauss-Seidel works very fast on the CPU and very slow on the GPU. Furthermore, the grid was stored as a one dimensional array in memory, and to recover the three dimensional array structure, each point had to perform a swizzling method, which also involved using integer modulo operations. Add in the fact that we do this 16 million times per step (once per point on the grid, so 16 million is for the 256 grid) and you see why our speedup was not as good as it could have been. A low resolution video of our CUDA multigrid implementation has been recorded along with a high-resolution screen capture. Both are provided here in that order:
Unfortunately, the video recording software we used does not record the high-frame rates of our program very well and thus lags much more than the actual version does. Furthermore, our version is more brilliant in terms of color as the screenshot on the right reveals. We use the a simple 3D rotation technique based on which axes the user is manipulating for rotating the cube, and we provide mouse combinations to allow the user to zoom in and out of the cube and translate themselves in space about the cube. These features are all documented in the above video. As we have refactored Brent Oster’s C code to produce our CUDA version, we have kept the interaction mechanisms he has implemented intact. We have added functionality to switch rendering on and off to be able to accurately measure how many steps are computed per second, as the rendering negatively impacts it.
On December 14, 2007, this material was presented to our ‘GPGPU and 3D User Interfaces’ class. A PDF of the presentation can be found here. I’m currently working on putting the code up. The code used for this project consists of three files: NC_Multigrid.cpp, NC_Multigrid.h, NanoCAD2.cpp, and the CUDA source code, Multigridcu.cu. They can be found at the location “Projects/NanoCAD2/Modules/”, with the exception of the driver file, which is at “Projects/NanoCAD2/NanoCAD2.cpp”.
Pros and Cons of CUDA
CUDA provides a novel way for the programmer to use the graphics hardware to its fullest. It is both a high and low level language, but because it is a relatively new language, it is not without its downsides.
Pros:
Cons:

This wouldn’t be so bad except for the fact that this happens almost every time something goes wrong, not just with the blocks. In the last four hours of programming in CUDA, this has happened to me FIVE times. Future implementations of CUDA NEED to fix this for CUDA to become even close to viable as a good programming language. My computer shouldn’t die just because I used memory that I allocated and set to zero but forgot to copy the host’s version of to. That leads into the next issue…


Trying to recompile the code gives no errors and the code runs fine. So what’s the deal? I can’t even get the blue screen of death consistently! Sometimes Visual Studio gives me a cudaError_enum, sometimes the computer locks up, and sometimes it gives me the blue screen. Why can’t it just be consistent, and consistently the first result?
Conclusion
Whew! What a long review! It’s been a great quarter full of new opportunities, and thanks to all for letting me use what I’ve needed to get this done. Special thanks go out to Tobias Hollerer and Brent Oster for use of the hardware and software, respectively, that was used or refactored in some way for this project. CUDA has the potential to go far for sure, but there’s definitely some huge kinks it needs to work out before it’s ready for the general public.
Yay!
Grades are in! GPGPU and 3D User Interfaces: A Parallel Architectures: A+ Directed Research: S Cognitive Science Seminar: S I don’t know what an S means (satisfactory?), but the first quarter rocked! I’ll put up versions of my work here soon!



