Comment Re:Abusive Relationship with Dr. Chu? (Score 1) 113
If your point is that the NRC were abusive *toward* Dr. Chu and his appointees, then I'll agree with you. But that's not generally what the phrase "abusive relationship with" means.
I'm not sure where the idea came from that there was any sort of relationship at all with Dr. Chu himself described in this article.
All he did was gather up some experts in the field and facilitate their advice to the Japanese. That's exactly what the Secretary of Energy should do.
And yes, some of their suggestions were radical. That's what "brainstorming" means: coming up with all sorts of ideas and determining as a group which are good and which are bad. Has no one ever seen an episode of House before?
And Dr. Chu, as far as I can tell, was not himself directly involved in the "Chu group," the at-best-misleading-at-worst-inaccurate term used in the article. So to say anyone had an "abusive relationship" with Dr. Chu is just silly.
If a US Diplomat gets into a shouting match with a foreign minister, do we accuse Hillary Clinton of being abusive?
buy your own hardware and take advantage of the services provided by your institution.
Your points are great; something I didn't think to mention in my long post a few minutes ago. If this is a university project, and since the budget is so small, the grad student (I'm assuming) could look into building and sharing a cluster with another similarly sized research group.
Then something is wrong in the configuration, either hardware or software. Virtualization by itself should not reduce performance by ~66%; the hit should be more like 5-10%. If you're taking a huge hit, it's most likely because you're sharing resources. Don't blame virtualization for that.
To be honest, £4000 isn't going to buy a lot of processing power. Does that amount also cover operational costs such as power? I'd ask about bandwidth, but at the scale possible with this budget, colocation of the servers doesn't make sense. Have you considered BOINC? Are you 100% certain OpenCL and GPGPU won't help?

Atom, while cheap, is probably a bad solution even on a small budget. Remember that the CPU is only a fraction of the total cost of a node. Increasing the cost per node by 10-25% to get a node that's 400-800% faster makes perfect sense, and the fewer nodes she has to run, the cheaper the network will be. Unless Bulldozer brings incredible performance, Sandy Bridge-based CPUs will provide the best bang for the buck if she's buying new. Clock for clock, they're the fastest available cheaply, and their energy consumption is excellent. I suggest looking at the i5-2300/2400/2500 or the Xeon E3-1220; which one mostly depends on whether she wants ECC. She may have enough budget for 6-10 nodes using these CPUs.

Reduce the complexity of a node to motherboard, CPU, RAM, and power supply. Go with quality PSUs, but remember there's no need to go overboard on wattage for machines that won't be running a GPU (250-300 watts is optimal if you can find quality units in that size). Also, DDR3 is dirt cheap right now, so if there's any chance 8GB will make a difference over 2GB or 4GB at some point during the life of the nodes, it makes sense to just start with that much.
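To make that budget math concrete, here's a rough back-of-the-envelope sketch in Python; every price in it is an assumed placeholder for illustration, not a quote:

# Rough node-count estimate for a ~4000 GBP budget.
# All prices below are assumed placeholder figures (GBP), not real quotes.
budget = 4000

node_parts = {
    "cpu": 180,           # e.g. an i5-2400-class quad core (assumed price)
    "motherboard": 80,
    "ram_8gb_ddr3": 50,
    "psu_300w": 40,
}

node_cost = sum(node_parts.values())
head_node_extra = 600     # assumed extra for head-node storage, a switch, and cables

compute_nodes = (budget - head_node_extra) // node_cost
print(f"Per-node cost: {node_cost} GBP; compute nodes affordable: {compute_nodes}")

Run the same numbers with an Atom board and a Sandy Bridge quad and you'll see the point: the CPU is only one line item, so the faster chip barely moves the per-node cost while multiplying the throughput.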
PXE boot from a head node that contains all the storage... which, by the way, you (serviscope_minor) didn't address: how much raw storage she's going to need, which will eat a good portion of a small budget. You also didn't mention how demanding her problem is on the network. Is simple gigabit enough? The closest serviscope_minor came to describing the problem was to use the term "CPU bound" somewhat ambiguously.
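On the "is simple gigabit enough" question, this is the kind of estimate I'd want to see first; the dataset size, node count, and efficiency figure below are assumptions for illustration only:

# Back-of-envelope: time for N nodes to pull a dataset from a single head node
# over one shared gigabit link. All figures are assumptions for illustration.
nodes = 8
dataset_gb = 50              # data each node reads per job (assumed)
link_gbit_per_s = 1.0        # one shared GigE uplink on the head node
efficiency = 0.8             # realistic protocol efficiency over NFS/TCP (assumed)

total_bits = nodes * dataset_gb * 8e9
seconds = total_bits / (link_gbit_per_s * 1e9 * efficiency)
print(f"~{seconds / 60:.0f} minutes just to stage the data to {nodes} nodes")

If each job then runs for many hours, gigabit is plenty; if the code streams or reshuffles data constantly, it isn't, and that changes the whole design.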
Again, I would bring up BOINC. She would accept hardware donations, right? How about just asking for them on a worldwide scale? If this is a non-profit venture (her degree doesn't count as profit if this is a university project, as many have assumed) and the work isn't especially time-sensitive, you'd be surprised how many people will freely contribute processing power.
The first thing you need to do is realize you all are in over your heads. If you're desperate enough to post to Slashdot for help, you're already there.
The second thing you need to do is look for a consultant to help you out until you can hire permanent help to fill your vacant positions. I can strongly recommend R Systems (http://www.rsystemsinc.com/). It's run by former NCSA HPC gurus. I've worked with them many times; they have the know-how you need to salvage this mess in short order. You can't call them quickly enough; trust me on that.
Third, to answer some questions. The IB vs. 10GbE debate has been pretty well covered, but just to emphasize: if you need low latency (for tightly-coupled massively parallel processing), you *need* IB. Preferably QDR or FDR. For your core switches, go for a blade-style chassis whose backplane can handle FDR even if you opt for QDR for now. If it can handle EDR, even better, but I'm not sure those are shipping yet. FDR IB runs at 56Gb/s with sub-microsecond latency. Ethernet can't touch that yet.
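To see why the latency matters more than raw bandwidth for tightly-coupled codes, here's a crude model; the latency and bandwidth figures are ballpark assumptions, not measurements:

# Crude model of message exchange time: latency + size / bandwidth.
# The latency and bandwidth figures are ballpark assumptions for illustration.
def msg_time_us(size_bytes, latency_us, bandwidth_gbit_s):
    bits_per_us = bandwidth_gbit_s * 1e3
    return latency_us + (size_bytes * 8) / bits_per_us

for size in (64, 4096, 1_000_000):
    ib = msg_time_us(size, latency_us=1.0, bandwidth_gbit_s=56)     # FDR IB (assumed)
    tge = msg_time_us(size, latency_us=10.0, bandwidth_gbit_s=10)   # 10GbE (assumed)
    print(f"{size:>9} B   IB: {ib:7.1f} us   10GbE: {tge:7.1f} us")

For the small messages that dominate tightly-coupled MPI codes, the fixed latency is essentially the entire cost, which is why the interconnect choice matters so much more than the headline bandwidth.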
All the scientists working with GPUs here are using nVidia. We've got 2050s and 2070s, so the 2090s are probably the right choice at the moment.
For management, xCat is by far the most scalable solution available right now, though we're working on an alternative. ROCKS does not scale well, largely due to its stateful nature. I'd caution you against using Scyld ClusterWare; it's based on BProc AFAIK, and as one of my friends is the former BProc maintainer, I can tell you that even *he* won't touch it with a ten-foot pole any more. It's too hairy and error-prone; it's also almost impossible to debug. Use something stateless and powerful but still relatively easy to maintain. Most of the large-scale shops (national labs and large academic sites) I know of use xCat or Perceus. Here at LBNL we use both xCat and Perceus with great success.
For the Linux distribution, use RHEL or a clone. I'd recommend Scientific Linux 6 at this point. It's the best-run and most professionally maintained of all the clones.
HTH. Good luck, and condolences on your recent loss(es).
there is a reason they do not update them as regularly as you might like after all.
Most of my textbooks were updated more often than necessary. No new info added, no better explanations; in many editions the only thing that changed were the numbers in the end-of-chapter problems. That way they could make the old editions, which were available used for more reasonable prices, obsolete. I have an EE (circuits) textbook that cost over US$200, but the class was taught entirely with a chalkboard and notes; the book was only used for assigned homework, less than a dozen times in all. I essentially paid $5-10 per page actually used. My copy of that text is an 8th edition, so they run the scam pretty regularly.
I have other books that I never should have purchased at all because they were never even opened outside of class. Those classes were taught entirely, and I should add poorly, with PowerPoint slides and assignments e-mailed in pdf/doc format.
Classic. I'm really wishing I had mod points right now.
The Rocks approach is nice for quickly regenerating a failed node. And it's CentOS under the covers, as noted, so it's RHEL in disguise. If you're running 16 boxes with dual quad-cores, you'll lose the occasional disk drive. If you run 64 cheap desktops with single-socket dual-cores, you'll lose a disk drive every week or two.
Of course, if you're using a modern (read: stateless) provisioning system, "regenerating a failed node" simply requires a power-cycle. And you lose far fewer disk drives since they're not used for the OS. And replacing a dead node with a new one is a single command and a power button.
Systems like ROCKS only seem great if you haven't used anything else.
I'll preface this by saying that I'm an HPC admin for a major national lab, and I've also contributed to and been part of numerous HPC-related software development projects. I've even created and managed a distribution a time or two.
There are two important questions that should determine what you run. The first is: what software applications/programs do you expect the cluster to run? While some software is written to be portable across platforms and distributions, scientists tend to want to focus more on science than on code portability, so not all code works on all distributions or OS flavors. Small clusters like yours often focus on a few particular pieces of scientific code. If that's the case for you, figure out what the scientists who wrote it use, and lean strongly toward using that.
The second question is: who will run it? Many small, one-off clusters are run by grad students and postdocs who work for their respective PI(s) for some number of years and then leave. In this scenario, it's important to make sure things are as well-documented and industry-standard as possible to ease the transition from one set of student admins to the next. (And yes, PI-owned clusters have a surprisingly long lifespan: usually no less than 5 years, often longer.) To that end, I strongly recommend Red Hat or Scientific Linux.
We, and most large-scale computational systems groups, use one of two things: RHEL and derivatives, or vendor-provided (e.g., AIX, Cray). We run CentOS but are moving away from it ASAP. The Tri-Labs (Livermore, Sandia, and Los Alamos) use TOSS, which is based on CHAOS (https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fcomputing.llnl.gov%2Flinux%2Fprojects.html), which is based on RHEL. Many other sites use Scientific or CentOS. Older versions of Scientific deviated more from upstream, which caused sites like us to use CentOS instead. That's no longer true with SL6, and since CentOS 6 doesn't even exist yet (and RHEL6.1 is already out!), there are strong incentives to move to SL6.
Let me address some other points while I'm at it:
Why RHEL? If you can run RHEL itself, do so. RHEL isn't built with the same compilers it ships with; the binaries are highly optimized. Back when we were working on Caos Linux, we did some benchmarks that showed RHEL (and Caos, FWIW) to be as much as twice as fast as CentOS running the exact same code. So if performance is a consideration, and you can afford a few licenses, it's definitely worth considering. The support can be handy as well, particularly if this is a student-run cluster.
Why Scientific Linux? If you need a free alternative to RHEL or are running at a scale that makes RHEL licensing prohibitive, SL is the way to go, without a doubt. It's maintained professionally by a team at Fermilab whose full-time job is to do exactly that. They know their stuff, and they're paid for it by the DOE. Other rebuild projects suffer from staffing problems, personality problems, and lack-of-time problems that SL simply doesn't have.
Why not Fedora? Stability and reliability are critically important. Fedora is essentially a continuous beta of RHEL. It lacks both the life-cycle and life-span of a long-term, production-quality product.
Why not Gentoo? Pretty much the same answer. The target audience for Gentoo is not the enterprise/production server customer. Source-based distributions do not provide the consistency or reproducibility required for a scale-out computational platform. You'll also have a hard time getting scientific code targeted at Gentoo or other 2nd-tier distributions.
Why not Ubuntu or Debian? Ubuntu is a desktop platform, not a server platform. Again, it boils down to their target market. There's really no value-add in the server space with Ubuntu, so why not just run Debian? If Debian's what your admins know best, it's worth considering, but keep in mind that very, very few computational resources run Debian, so you may have to do a lot more fending for yourself if you go that route.
Why not SLES? Mostly a personal choice, but with its uncertain future, I'd be hard-pressed to say it's a safe option. If you have a support contract from, e.g., IBM, that's different. But judging by your cluster size, I'm going to wager that's not the case.
Why not ROCKS? Anyone who runs large systems will tell you that stateful provisioning is antiquated at best, largely because it simply doesn't scale well. ROCKS is firmly locked into the stateful model, and rather than rethinking the design, its developers are trying to find ways to make it faster. You can only say, "It's just a flesh wound!" so many times before the King calls it a draw and gallops on past you.
As for the question about user-friendliness, it depends on the people for whom you wish it to be friendly. If you want friendliness for the admin, what I've seen of Bright Cluster Manager looks promising. (I don't know if Scyld still uses BProc, but what I know about it has thoroughly convinced me never to touch the stuff.) IBM also has its Management Suite for Cloud that looked quite friendly at SC10.
For the users, there are a number of portal options you could try, including one from Adaptive (makers of Moab) that greatly simplifies job submission. But the truth is, it's just Not That Hard(tm) to write up a template qsub script and hand it off to your users. You're better off spending your time managing the resource efficiently and competently and making sure you maximize performance and stability. That's what will get the most science done in the least amount of time... and isn't that really the point?
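For what it's worth, here's a minimal sketch of the kind of template I mean, written as a small Python helper that writes out a PBS/Torque submission script; the queue name, resource request, and application command are placeholders your users would edit for their own jobs:

# Minimal sketch: write out a template PBS/Torque submission script for users.
# The queue name, resource request, and application command are placeholders.
template = """#!/bin/bash
#PBS -N my_job
#PBS -q batch
#PBS -l nodes=2:ppn=8
#PBS -l walltime=04:00:00
#PBS -j oe

# Run from the directory qsub was invoked in, then launch the application.
cd $PBS_O_WORKDIR
mpirun ./my_application
"""

with open("template.qsub", "w") as f:
    f.write(template)

print("Wrote template.qsub; edit the placeholders and submit with: qsub template.qsub")

Users change the job name, node count, walltime, and the application line, and that's it; everything else about running the cluster well is the admin's problem, which is where your time should go.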