August 19th, 2008
There has been some buzz about Microsoft’s Windows HPC offering. Now, I’m not at all a fan of Windows. I have rarely been happy with its performance on my desktop, and I usually laugh at the trials people running Windows servers have to go through. Still, maybe Microsoft can make a difference with this product.
Where I work, people generally buy into really large clusters. Their nodes are added into the queueing system and they run jobs. Sometimes if people have very special needs, they get their own cluster. The big idea behind my group making big clusters is that researchers can share the resources they’ve bought and the big pie in the sky can be much larger.
I think aggregating resources is a great thing, but scaling every little piece of the infrastructure to handle a thousand node cluster does not seem to be easy. Network gear that can provide great performance to that many nodes is expensive. Cooling that much gear is difficult in old data centers. Powering that much gear is just a headache. Even making sure the NTP server can handle the load needs to make a quick stop on the neuron trail in the minds of admins.
Scaling clusters to large numbers is expensive. That’s why I wonder if Windows HPC can make a small dent. It can bridge that gap between having fast, cheap hardware and no one to run it. Many people “use Ubuntu daily,” but it seems few undergrad/grad students really know how it all goes. Sure, maybe a professor can find a knowledgeable student to make a Linux cluster go. But students leave, and students with Linux knowledge to that depth are rare.
I wonder whether, for the crowd that needs a “small” number of cores or prefers dedicated access to their resources, Microsoft may actually come through and shape the future. Or, they may have just unleashed the next generation of botnet…
Tags: hpc, windows
Posted in Purdue | No Comments »
July 21st, 2008
Scientists who write simulation software seem to always take the quick and dirty approach rather than hiring a Computer Scientist and some coders. I recently ran into this when compiling a wave propagation program on Purdue’s SiCortex. The dependencies were a mile long and there were cross-dependencies in the list. Most of these packages were smallish and compiled without any trouble at all. Others had absolute paths encoded in header files and incomplete or very old documentation. One library was actually just a library of other libraries and was just not wanting to be sprung into life.
After corresponding with the author a bit, I got the propagation code to start compiling without giving me too much grief. At about this time, warnings began to fly. It’s almost as if someone took a copy of this program from C and ported it to C++ without much thought about what was going on. Most of the warnings aren’t that scary, so we can overlook them for now. However, the compile kept crashing, sometimes without much warning and sometimes with spectacular errors from the compiler itself (“this isn’t right, send my output to my authors for assistance…”).
After I spoke a little bit with the Pathscale guys, they pointed out that the compiler was using /tmp to keep its state while compiling large files and that the filesystem was becoming full. Doh! Who knew it would take more than 2GB of storage in /tmp to keep the state of a single compile job?
A node within a SiCortex more or less “netboots” and mounts the root file system read-only from a network block device. Anything that is writable goes into a tmpfs, and this includes /tmp. Every node has only ~8GB, so one can certainly understand why the writable file systems are kept a little small.
In the end, I set TMPDIR to point to an NFS-mounted file system. The compile took ages, but it did eventually complete successfully.
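For anyone hitting the same wall, here is a minimal sketch of the workaround: check how much room the scratch directories have and point TMPDIR at the first one that is big enough before launching the build. The function names, the directory list and the size threshold are all hypothetical, and the sketch assumes the compiler honors TMPDIR the way the Pathscale one did here.

```python
import os
import shutil
import subprocess


def pick_tmpdir(candidates, min_free_bytes):
    """Return the first existing directory with at least min_free_bytes free."""
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= min_free_bytes:
            return path
    raise RuntimeError("no scratch directory has enough free space")


def build_env(tmpdir):
    """Copy the current environment and point TMPDIR at the chosen directory."""
    env = dict(os.environ)
    env["TMPDIR"] = tmpdir
    return env


def run_build(command, candidates, min_free_bytes):
    """Run the build command with TMPDIR set to a directory that has room."""
    tmpdir = pick_tmpdir(candidates, min_free_bytes)
    return subprocess.run(command, env=build_env(tmpdir), check=True)
```

A call might look like `run_build(["make"], ["/tmp", "/nfs/scratch"], 4 * 1024**3)`, where `/nfs/scratch` stands in for whatever NFS mount is available. Listing the small tmpfs first keeps quick builds local and only falls back to the slow network file system when it has to.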
Tags: gentoo, netbooting, Sicortex
Posted in Purdue | No Comments »
July 20th, 2008
In a move that may be quite unwise, I obtained a Sun Enterprise 6500 system and put it in my basement. I’m not exactly sure why I made such a decision, but the hardware has been interesting to play with.
The system was billed as the ultimate Enterprise system of the time, featuring massive expansion capabilities and maximum uptimes. The system was engineered to be hot swappable so one could repair or upgrade without trouble. Normally, the system came in a rack by itself with a little bit of room to add near-line storage to it. Purdue used several of these systems in various departments, including the mail group where my machine came from. My system does not sit in the large, purple-ish colored rack, though. Instead, it was pulled out and now sits on a shelf humming away.
The rack required a 30 amp, 208 V plug, with the system rated to draw up to 24 amps of that. On the advice of a coworker, my system is powered from a 15 amp, 120 V circuit. Of course, most of the chassis is bare. My system is configured with the standard I/O board, so I can boot from either a wide SCSI disk or a CD-ROM, plus four 400MHz CPU boards. The system contains 14GB of memory. The chassis can hold 16 boards, 15 of which can hold CPUs and memory, so the system can contain up to 30 487MHz UltraSPARC II processors and 60GB of memory.
Thankfully, it doesn’t appear the computer is consuming too much power (as the lights don’t yet dim when I turn it on…)
I just got Debian Etch installed and serving up a 7GB ram disk over NFS to my iMac. Sadly, the system only comes with 100Mbps Ethernet, with only two expensive options for 1Gbps: a rare SBus adapter or the PCI I/O board. The system backplane is capable of 2.8GBps of traffic, so hopefully one day I can obtain the PCI board.
Since I’ve been playing with my employer’s SiCortex machine, I decided to run my PI estimation program against this beast. On a per-core basis, the code runs at about the same speed on both systems. Though, I’m guessing my system is using a few more watts than the equivalent 8 cores in a SiCortex…
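The PI estimation program itself is not shown in this post. As a rough illustration of the kind of per-core work such a benchmark involves, here is a minimal sketch that assumes a midpoint-rule integration of 4/(1+x^2) over [0, 1], split into strided shares the way MPI ranks typically divide a loop. The function names are hypothetical, and no MPI library is used, so the "ranks" are simply evaluated one after another.

```python
def partial_pi(rank, size, intervals):
    """Midpoint-rule share of the integral of 4/(1+x^2) on [0, 1] for one worker."""
    width = 1.0 / intervals
    total = 0.0
    # Worker `rank` takes every `size`-th interval, starting at its own index.
    for i in range(rank, intervals, size):
        x = (i + 0.5) * width
        total += 4.0 / (1.0 + x * x)
    return total * width


def estimate_pi(size, intervals=100_000):
    """Sum the shares of all workers; with real MPI this would be a reduction."""
    return sum(partial_pi(rank, size, intervals) for rank in range(size))
```

Because every worker does the same amount of arithmetic and nothing is shared until the final sum, a program shaped like this is a reasonable way to compare per-core speed between two machines.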
Tags: e6500, sun
Posted in Tinkergeek | No Comments »
July 7th, 2008
My last few posts have been about this strange machine sitting on the floor at Purdue. It is quite an interesting machine for me. I have a thing for weird-ish machines (I was initially excited about Purdue getting an SGI Altix). Unlike the Altix, the SiCortex has provided several unique opportunities for fun and adventure.
The Altix system at Purdue has gone into production use. This means that tinkering inside it is pretty much not happening. Also, the system seems kinda fragile. Not only that, but it runs SuSE, which is one of the most frustrating Linux distributions I’ve ever used. I hear the people running on the Altix seem to love its large amount of memory. Although, now one can get a Sun x4600 with a similar amount of RAM and Opteron processors for much less money (Purdue has some of these now too). So, the Altix is fun but limited.
The SiCortex runs many instances of Linux on “real” nodes. It is trivial to get a node to yourself (or fifty). I’ve started to implement fairly trivial things in Python using pyMPI, which is my first real use of MPI. It is interesting to see how programs scale as you go from 10 to 20, 50, 100, or 2000 processors. Over at BigNcomputing.org, Matt Reilly posts a tarball of code about every week for people to try. As usual, the code does exactly what Matt says it does on a SiCortex. Whenever I try to find compute resources to run it on a “normal” cluster, I constantly have trouble. Getting 60 processors at any one moment is harder than finding the end of a rainbow unless one pulls out the Big Hammer and takes job scheduling into his own hands. Putting on the Root Hat to run one of these codes is just more trouble than it’s worth, so I generally just give up on the normal clusters unless there’s an obvious lull in activity someplace.
I’ll probably post some code and performance numbers here once I clean things up. Hopefully someone out there wants to see it. (So, you know, the world gets more use out of my burning through compute cycles faster than 4th of July fireworks.)
Tags: pympi, python, Sicortex
Posted in Purdue | No Comments »
June 25th, 2008
Matt Reilly over at Big N Computing had an interesting idea about why students going through the ranks today do not know or seem much interested in high performance computing. In his blog post he tells of the time when he had to use card punches and wait for his programs to execute on the big machine in the secret room. This seems to be a popular story among the experienced CS people I hear talk. They all talk about having to laboriously write their code and then wait for it to run.
Matt later brought up an idea to us here at Purdue when SiCortex was installing our SC5832. It was that we should carve out a piece of our machine every now and again to allow undergrads access to interactive sessions on a cluster. A SiCortex machine packs a lot of processors into a small, power-efficient box, so letting a single user tie up a hundred processors is not all that big of a deal. Allowing several users 100 node sessions would still leave a ton of nodes and processors available in our SC5832 for regular jobs to run.
After Matt brought up his idea with people at Purdue, there started to be chatter about it and about the best way to go about doing this. The Rosen Center has in the past run machines that were almost exclusively used for classroom computing. We even got a grant from Sun Microsystems many years ago for a “High Performance Classroom” concept, originally spearheaded by David Moffett. The machine dedicated to that concept had about a hundred UltraSparc processors in it, to be shared using batch scheduling between a classroom of people. This makes sense since it was used a lot for training people on how to use other resources at Purdue.
After spending a year in Purdue’s CS program, it appears a lot of my peers have a fair amount of programming experience. Almost all of it is in web programming of some sort. The most unfortunate are completely .Net centric, but a fair number do PHP stuff on the side. I wonder whether students would really start to pay attention to clusters if they were as easy to access as a web server running PHP. (Besides just building a Beowulf of five year old desktops “because it’d be cool!”)
Then, there are always the technologies used to program large clusters. Most of these tools were designed long ago or are only hopeful wishes in the minds of programmers. How can you debug parallel codes? The answer seems to be “printf’s”. No one likes digging through post-mortems of their code, fixing a bug, and then waiting for an hour before more output comes back. Also, at least at Purdue, students are taught to despise the C programming language for a year or so. This makes getting people interested in “normal” HPC programming difficult (C+MPI seems quite popular, and let’s not even touch on Fortran).
If “the management” allows undergrad students to have access to this valuable resource, what do we expect them to do with it? I’m not sure yet…
Posted in Purdue | No Comments »
June 7th, 2008
I came home from visiting the folks to find a link on Reddit about various Wordpress blogs redirecting people coming from search engines to an ad-supported site. It appears that in a fairly simple piece of the Wordpress code that reads in the configuration file, someone inserted an if statement and a base64 encoded string, which made finding the problem harder. It appears to have bitten a few people and caused several sites’ traffic to dry up a lot.
Then, let us not forget that terrible bug in Debian’s release of their OpenSSL package where the maintainer commented out a couple of very important lines. He even asked on the OpenSSL mailing list if that would cause a problem…
I attempted to find evidence of whether the Wordpress bug was sneaked into their code base or was added by third party plugins that the affected people installed. And, after reading different write-ups about the Debian bug, it appears there are several groups who could be faulted for the failure. Because of various simple mistakes, it appears that today’s software can quickly be turned from wonderful into disastrous. While I have heard this several times in classes and generally accepted it to be true, there is nothing like having to dig through someone else’s coding mess or spending hours regenerating encryption keys to drive it home.
At least these bugs were found and posted about out on the wide Internet. I prefer not to think about the potentially nasty bugs in the commercial, closed source software that I use…
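The base64 trick mentioned above is what makes this kind of injection hard to spot by eye, but it is easy to hunt for mechanically. The following is a minimal sketch, not a description of the actual Wordpress exploit (the injected code is not shown here): it assumes the payload sits in an ordinary quoted string of at least 24 base64 characters, and the function name and threshold are hypothetical.

```python
import base64
import binascii
import re

# Quoted runs of base64 characters long enough to be worth a second look.
B64_LITERAL = re.compile(r"""['"]([A-Za-z0-9+/]{24,}={0,2})['"]""")


def decode_suspicious_strings(source):
    """Return (literal, decoded_text) pairs for base64-looking string literals."""
    found = []
    for match in B64_LITERAL.finditer(source):
        literal = match.group(1)
        try:
            decoded = base64.b64decode(literal, validate=True)
        except (binascii.Error, ValueError):
            continue  # Looked like base64 but does not decode cleanly.
        found.append((literal, decoded.decode("utf-8", errors="replace")))
    return found
```

Running this over each source file of an install and eyeballing the decoded text will not catch every obfuscation, but it turns an unreadable blob into something a person can judge at a glance.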
Posted in Tinkergeek | No Comments »
June 4th, 2008
Today I’m observing the SiCortex install team at Purdue as they and a team of our systems engineers make the big beasty run. Yesterday, the hardware was unboxed, pulled from anti-static bags, and installed. Much of the day was spent with the SiCortex team running special diagnostic software to ensure the system was working properly before software configuration began. Right away, the team found a bad stick of memory, which was promptly replaced with a spare from what appeared to be a Crucial memory box.
Now, sitting in our make-shift conference room, I find the team huddled around laptops and monitors as the software configuration begins. As can be found on their website, the system runs Gentoo on both the HP server in the bottom and on the nodes. There is a copy of the root file system, containing the 64-bit MIPS version of Gentoo, that gets mounted on all the nodes. Most of the time was spent configuring authentication (LDAP) and networking bits. Unlike previous installations of the 5832, we have a set of 10Gbps Ethernet links that allows this big beast to talk with the world and with our special storage systems.
The thing I’m finding most interesting is how the software stack was built using custom shell scripts and Gentoo. Nothing really out of the ordinary. The nodes are really just regular net-boot compute nodes. (True, they don’t have Ethernet but a wacky interconnect, and they have six processors per node, not a power of two.) This is much more familiar than something like a BlueGene that boots a “Linux”-y kernel and only runs a specific application. It is also more doable in terms of people being able to manage it and wrap their minds around what is happening.
It appears we are still just a system daemon away from being ready for our applications folk to begin their beta testing and porting efforts of various codes to get our researchers using this machine. Since it’s Linux (and looks close to what we already know) it should all happen pretty quickly. Now the big question: is it worth the money? I hope so at least…
Edit: I originally said “and have six processors per node !2^n for real n”, but it has been pointed out that I really meant to say: “!2^n for natural number n” because real numbers can turn out the number six as a solution. I guess I was just shunning my Foundations of Computer Science course there.
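To make the correction concrete, here is a small sketch (the function names are my own) showing both halves of it: six fails the usual bit-trick test for being a power of two with a natural number exponent, yet a real exponent that produces six does exist.

```python
import math


def is_power_of_two(n):
    """True when n equals 2**k for some natural number k (0 included)."""
    return n > 0 and (n & (n - 1)) == 0


def real_exponent(n):
    """The real number x with 2**x == n, which exists for every positive n."""
    return math.log2(n)
```

So `is_power_of_two(6)` is false while `real_exponent(6)` is roughly 2.585, which is exactly why “for real n” was the wrong qualifier.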
Posted in Purdue | No Comments »
June 3rd, 2008
This is not so much an interesting tidbit of technical talk, but just really a picture:

Purdue has one now. We’re cooler than IU.
Posted in Purdue | No Comments »
June 2nd, 2008
Recently, a coworker mentioned that he had started the transition of his mail back into Gmail after moving onto our department’s mail server. It sounded like a decent idea because Google offers up a lot of mail storage, a nice filtering system and that Google Powered Search that we all love. So, I decided to see about also moving my mail into the Googles’ data centers.
Because I like keeping consistent user names everywhere, I chose to set up the Google hosting stuff. After filling out a form, creating a dummy sub-domain, and creating a bunch of MX and CNAME records, I had my Google platform all set up. Then, I flicked all my Purdue accounts (of which there are waaay too many) to forward into Tinkergeek, where mail gets dropped into a dummy account and also sent on to the Googles.
After a day of getting mail at the new account and setting up the relevant filtering, it looked like all indicators were a go for launch. I decided to start importing old mail. That coworker I mentioned above found a script to read in an mbox file and spit messages into Gmail. It worked brilliantly, except it did not preserve the date and time of the original message. Doh. That’s only so useful, you know? So, I imported just my current inbox and resigned myself to having only new-ish mail.
Wait! Today, there was a golden ray of sunshine. Another coworker suggested that since Gmail offers up a nifty IMAP interface (one’s labels turn into IMAP folders), one could easily import old mail into the Googles! It was a great idea and I set to work at once seeing if it’d work. Lo and behold, I found this posting which suggests that it does indeed work. Yay! This brought exciting news to my email loving heart. I started at once to move my mail. Things were going smoothly: I got through a folder with 3k messages and moved on to one with about 3.4k messages. After another 1k got moved in, the Googles failed me. It started to time out after about 900 messages at a time.
This new development made me less excited, but I’m still pushing forward and importing my old mail. After all, what else am I going to do with all that 6GB of free mail storage?!
I have been on the Gmail pill for about a business week at this point and am feeling pretty good about the time sink it has become. If anything big comes up, I’m sure I’ll post an update.
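For the curious, the IMAP route can be sketched with nothing but the Python standard library. This is a minimal sketch and not the script or the posting mentioned above: it assumes password login to `imap.gmail.com` works for the account, reads the whole mbox into memory, and reconnects for every batch as a guess at how to dodge the timeouts. The function names and the batch size are hypothetical.

```python
import imaplib
import mailbox
from datetime import timezone
from email.utils import parsedate_to_datetime


def internal_date(message):
    """IMAP internal date taken from the Date header, or None if unusable."""
    header = message.get("Date")
    if not header:
        return None
    try:
        when = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Headers without a usable zone are treated as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    return imaplib.Time2Internaldate(when)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_mbox(path, user, password, folder, batch_size=500):
    """Append every message in an mbox file to an IMAP folder, keeping dates."""
    messages = list(mailbox.mbox(path))
    for chunk in batches(messages, batch_size):
        conn = imaplib.IMAP4_SSL("imap.gmail.com")
        try:
            conn.login(user, password)
            for msg in chunk:
                # Passing the original date is what the first script left out.
                conn.append(folder, None, internal_date(msg), msg.as_bytes())
        finally:
            conn.logout()
```

The important detail is the third argument to `append`: when it is left empty the server stamps the message with the upload time, which is the behavior that made the first import so disappointing.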
Posted in Purdue, Tinkergeek | No Comments »
June 1st, 2008
As we can see from this forum posting, a decently big data center in Houston, TX, went offline because its transformer vault went critical and started to deconstruct the building around it. The reaction on Slashdot was pretty funny. There was a mix of people: those wanting pictures, those angry their servers were down, and those happy they had planned ahead with a disaster recovery plan.
It is funny how many people I talk with who have zero idea about how to make a recovery plan or even why one is necessary. True, I mostly hear this from random person Q or small business Z. But, it is still surprising (especially being in a place where tornados happen every year) that people don’t think about such issues. The chance of fire or flood or other disaster seems to be a constant, major risk. Even after a disaster has struck, people seem to forget about it several years later. Sigh.
Oh well, next time we go out and buy that extra 1TB hard disk, hopefully we’ll all think about how we’re going to back up our new, large amount of data in case a disaster strikes.
Posted in Tinkergeek | No Comments »