How To Build Your Own Beowulf GPU Cluster
Author: Austen C. Duffy, Florida State University
This is a guide to building your own low-cost Beowulf GPU cluster, complete with the tips and lessons I learned from building my own. Planning is key. How big do you want your cluster to be, and how much do you want to spend? These questions are important, but also keep in mind where you will be keeping the cluster: depending on its size, it can get quite hot and draw a lot of electricity, so you probably would not want a large GPU cluster in your bedroom. What will you be using the cluster for? I run highly scalable GPU-accelerated CFD flow solvers, so many CPU cores are desirable; but if you run strictly on the GPU, you likely don't care about extra CPU cores and can get away with cheaper dual-core processors, while perhaps opting instead for a more expensive setup that fits multiple GPUs on a single motherboard (e.g., Nvidia SLI). The first thing we will talk about to get you started is the most important: the proper design.
1. Beowulf GPU Cluster Design

We will consider the design in terms of nodes: head nodes and compute nodes. Your head node is the machine you will actually be working on (writing code, writing papers, visualizing output, etc.), while the compute nodes are for running code. The first thing to consider is whether or not your head node will also serve as a compute node. A dual-purpose head node is the most cost-effective route, and if you don't plan on doing much interactive work on the head node, i.e., you are just using the cluster to run code, then this is a viable choice. If, however, you are like me and plan on doing a substantial amount of work on your head node, you will not want to use it as a compute node.
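One concrete place this choice shows up is in your MPI machine file: a dual-purpose head node is listed alongside the compute nodes, while a dedicated head node is simply left out. A minimal sketch, assuming Open MPI's hostfile syntax (the hostnames and slot counts here are hypothetical examples, not requirements):

```
# Open MPI hostfile: compute nodes only (dedicated head node omitted)
# slots = number of CPU cores to use on each node
node01  slots=4
node02  slots=4

# For a dual-purpose head node, list it as well:
# head  slots=4
```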
The next thing to consider is how many GPUs per compute node you will be running. Of course, we would all love to have 4 GPUs per node, but this is not practical unless you have a lot of money to spend and a really good power setup. Keep in mind that a modern graphics card can draw in excess of 200 watts at peak, so if you are running on a standard 15 or 20 amp breaker, the 80% continuous-load rule limits you to 1440 or 1920 watts, respectively. In general, the more GPUs per node, the more your motherboards and power supplies will cost, not to mention the additional heat. If you use low-cost graphics cards like I do, you will not have ECC memory support and should concern yourself with the real possibility of bit-flip errors, which become more likely when many hot GPUs sit in close proximity. Note that if you decide to use the head node as a compute node AND plan on having a monitor attached, kernels on the GPU driving the display can typically only run for a few seconds at a time before the operating system's watchdog timer kills them; the exact limit depends on vendor and driver.
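The breaker arithmetic above is worth automating when you plan a build. A minimal sketch of the calculation, applying the 80% continuous-load rule to a 120 V circuit (the per-node wattage used below is an illustrative guess, not a measurement):

```python
# Power budget for a GPU cluster sharing a single household breaker.

def breaker_budget_watts(amps, volts=120, derate=0.80):
    """Usable continuous watts under the 80% rule."""
    return amps * volts * derate

def max_nodes(amps, watts_per_node):
    """How many identical nodes fit on one breaker."""
    return int(breaker_budget_watts(amps) // watts_per_node)

# A 15 A breaker gives 15 * 120 * 0.8 = 1440 W; a 20 A breaker gives 1920 W.
# Assume ~350 W peak for a hypothetical single-GPU node: one ~200 W GPU
# plus CPU, motherboard, drives and fans.
print(breaker_budget_watts(15))   # 1440.0
print(breaker_budget_watts(20))   # 1920.0
print(max_nodes(20, 350))         # 5
```

Remember that the breaker budget must also cover your switch, monitor and anything else on the same circuit, so leave headroom below these ceilings.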
Some Recommended Setups for Standard 15-20 amp breakers:
1 head/compute node - 1 GPU, 1 compute node - 1 GPU.
Benefits: Very practical; low cost; low energy consumption (< 800 watts peak); does not require a switch, since the two nodes can be connected directly; won't produce excessive heat; won't take up much space.
Drawbacks: Not very powerful on the GPU side; could be considered a 'compute only' machine, since it lacks a true head node.
1 head node, 2-4 compute nodes - 1 GPU each.
Benefits: A better 'work' cluster with a true head node; 1 GPU per node keeps costs down; can run up to 4 compute nodes on a 20 amp breaker with a 5-port gigabit switch.
Drawbacks: Takes up more space; produces more heat.
1 head/compute node - 4 GPUs, 1 compute node - 4 GPUs.
Benefits: Very powerful; does not require a switch, since the two nodes can be connected directly; won't take up much space.
Drawbacks: Expensive; could be considered a 'compute only' machine, since it lacks a true head node; will consume a lot of electricity, so other components on a 20 amp breaker should be kept to an electrical minimum.
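For the two-node setups above that skip the switch, the nodes are joined by a single Ethernet cable with static addresses on a private subnet (modern gigabit NICs negotiate this directly without a crossover cable). A minimal sketch of the matching /etc/hosts entries on both machines (the hostnames and 192.168.0.x addresses are arbitrary example choices, not requirements):

```
# /etc/hosts on both nodes of a directly connected two-node cluster
192.168.0.1   head
192.168.0.2   node01
```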
Have a lot of money and a better electric supply? Then you are likely better off just buying a prebuilt GPU cluster, but the brave who wish to save can continue reading.