Message boards : Number crunching : Can I run 1 WU across multiple GPUs?
I know people run multiple WUs on 1 GPU, but can it be done the other way round? Primegrid produces huge tasks that take days on a GPU. Can they be spread across more than one GPU, as a CPU workunit can go across 2 CPUs on a server MB? If the GPUs are connected with SLI/crossfire, will Boinc treat them as one?
I know people run multiple WUs on 1 GPU, but can it be done the other way round? Primegrid produces huge tasks that take days on a GPU. Can they be spread across more than one GPU, as a CPU workunit can go across 2 CPUs on a server MB? If the GPUs are connected with SLI/crossfire, will Boinc treat them as one?
No, you can not; Boinc is not set up for that, and the developers haven't publicly said they were interested in doing it either. You might send a note to Richard Haselgrove asking him if the developers are even thinking about adding it, as he's the head of the group. I don't know if he's on here at PrimeGrid or not, though.
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 1,216
I know people run multiple WUs on 1 GPU, but can it be done the other way round? Primegrid produces huge tasks that take days on a GPU. Can they be spread across more than one GPU, as a CPU workunit can go across 2 CPUs on a server MB? If the GPUs are connected with SLI/crossfire, will Boinc treat them as one?
Theoretically, yes.
But the app would have to be written specifically for it, and I’m not aware of any app anywhere that does this. “Why not?”, you might ask. It’s because there’s absolutely no benefit to doing this, while it makes the apps far more complex, and more costly to develop and more costly to maintain. If a task takes an hour on one GPU and half an hour on two GPUs, it doesn’t matter if you run one task (on both GPUs) or two tasks (each on one GPU); either way you’re doing two tasks per hour. There’s no benefit here, and your apps are more complex.
If you’re playing a video game, SLI makes sense. Doubling the frame rate can make a huge difference in playability. But that’s not the case with crunching.
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
____________
My lucky number is 75898^524288+1
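To make the throughput arithmetic above concrete, here is a minimal Python sketch; the 72-hour task time and the 90% scaling figure are made-up illustrative numbers, not PrimeGrid measurements.

# Throughput comparison: two independent single-GPU tasks vs. one task split
# across both GPUs. "efficiency" models how well the split task scales; even a
# perfect 2x speedup (efficiency = 1.0) only ties the independent case.

def tasks_per_day(hours_per_task_1gpu, n_gpus=2, split=False, efficiency=0.9):
    if split:
        # One task at a time, sped up by (n_gpus * efficiency).
        hours = hours_per_task_1gpu / (n_gpus * efficiency)
        return 24.0 / hours
    # n_gpus tasks run side by side, each at single-GPU speed.
    return n_gpus * (24.0 / hours_per_task_1gpu)

if __name__ == "__main__":
    t = 72.0  # hypothetical hours for one task on one GPU
    print("independent tasks:   ", tasks_per_day(t, split=False))                    # 0.67/day
    print("split, 90% scaling:  ", tasks_per_day(t, split=True))                     # 0.60/day
    print("split, perfect scale:", tasks_per_day(t, split=True, efficiency=1.0))     # 0.67/day

Even with perfect scaling the split run only ties the two independent runs; anything less than perfect scaling loses throughput, which is the point made above. The only thing splitting buys you is a shorter wall-clock time for each individual task.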
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 644 ID: 164101 Credit: 305,010,093 RAC: 0
I know people run multiple WUs on 1 GPU, but can it be done the other way round? Primegrid produces huge tasks that take days on a GPU. Can they be spread across more than one GPU, as a CPU workunit can go across 2 CPUs on a server MB?
The main problem is the memory, which is used intensively.
With CPUs, multithreading is efficient if all threads run on a single processor and the shared memory is the processor's L3 cache. If the threads are executed on two processors, the shared area is the main memory and multithreading is not efficient.
Two GPUs would have to exchange data continuously, and the bandwidth of PCIe is very low compared to the GPU's internal memory bus.
If the GPUs are connected with SLI/crossfire, will Boinc treat them as one?
I would have said yes, as SLI/crossfire is a fast memory bus. But I never tested this configuration, and it depends on how OpenCL detects the GPUs (as a single one with twice the number of cores, or as two).
But two tasks, each running on its own GPU without SLI/crossfire, will have better throughput.
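For anyone who wants to see how the compute runtime actually enumerates their cards (with SLI/crossfire on or off), here is a minimal sketch; it assumes the third-party pyopencl package is installed, and tools like clinfo report the same information.

# List every GPU the OpenCL runtime exposes. Two physical cards should show up
# as two devices; if SLI hides them from the compute stack, they simply won't
# appear here at all.
import pyopencl as cl

for platform in cl.get_platforms():
    try:
        gpus = platform.get_devices(device_type=cl.device_type.GPU)
    except cl.LogicError:
        gpus = []  # this platform has no GPU devices
    print(platform.name, "-", len(gpus), "GPU device(s)")
    for dev in gpus:
        print("  ", dev.name,
              "| compute units:", dev.max_compute_units,
              "| memory:", dev.global_mem_size // (1024 ** 2), "MB")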
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 1,216
If the GPUs are connected with SLI/crossfire, will Boinc treat them as one?
I would have said yes, as SLI/crossfire is a fast memory bus. But I never tested this configuration, and it depends on how OpenCL detects the GPUs (as a single one with twice the number of cores, or as two).
But two tasks, each running on its own GPU without SLI/crossfire, will have better throughput.
Once you turn on SLI, CUDA (and presumably OpenCL) can't see the GPUs at all.
____________
My lucky number is 75898^524288+1
Moo! runs a single task across multiple GPUs. They do RC5-72 for distributed.net, in a BOINC wrapper.
____________
Reno, NV
No, you can not; Boinc is not set up for that, and the developers haven't publicly said they were interested in doing it either. You might send a note to Richard Haselgrove asking him if the developers are even thinking about adding it, as he's the head of the group. I don't know if he's on here at PrimeGrid or not, though.
Richard has already replied with just the word "no" in this thread: https://boinc.berkeley.edu/forum_thread.php?id=14075&postid=102014#102014
Theoretically, yes.
But the app would have to be written specifically for it, and I’m not aware of any app anywhere that does this. “Why not?”, you might ask. It’s because there’s absolutely no benefit to doing this, while it makes the apps far more complex, and more costly to develop and more costly to maintain. If a task takes an hour on one GPU and half an hour on two GPUs, it doesn’t matter if you run one task (on both GPUs) or two tasks (each on one GPU); either way you’re doing two tasks per hour. There’s no benefit here, and your apps are more complex.
If you’re playing a video game, SLI makes sense. Doubling the frame rate can make a huge difference in playability. But that’s not the case with crunching.
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
Ah, I assumed when you enabled SLI, that the game thought you had a GPU with more cores in it. I didn't realise the game had to do anything special to make use of it.
The main problem is the memory, which is used intensively.
With CPUs, multithreading is efficient if all threads run on a single processor and the shared memory is the processor's L3 cache. If the threads are executed on two processors, the shared area is the main memory and multithreading is not efficient.
Two GPUs would have to exchange data continuously, and the bandwidth of PCIe is very low compared to the GPU's internal memory bus.
So am I running my 24-core work units slower than I should on my dual 12-core Xeons? Would it even be clever enough to run two 12-core units on separate CPUs?
Once you turn on SLI, CUDA (and presumably OpenCL) can't see the GPUs at all.
Don't games need to use CUDA?
Moo! runs a single task across multiple GPUs. They do RC5-72 for distributed.net, in a BOINC wrapper.
Has this been specifically written to do so then? It seems Boinc doesn't have this ability built in. What do you see in Boinc Manager / Boinctasks when you do this?
Moo! runs a single task across multiple GPUs. They do RC5-72 for distributed.net, in a BOINC wrapper.
Has this been specifically written to do so then? It seems Boinc doesn't have this ability built in. What do you see in Boinc Manager / Boinctasks when you do this?
Standard BOINC client. Not sure what was done inside the wrapper of the task. Here are a couple of screen shots, from both boincmgr and boinctasks.
____________
Reno, NV
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 1,216
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
Ah, I assumed when you enabled SLI, that the game thought you had a GPU with more cores in it. I didn't realise the game had to do anything special to make use of it.
You misunderstand.
Games are different than crunching apps.
SLI is meant for games. It lets the two GPUs work together so you get a better gaming experience.
However, games and crunching apps use completely different interfaces/APIs to communicate with the GPU. The game's interface is for drawing stuff on the screen, while our apps use an interface that doesn't draw on the screen. It's only used for doing calculations.
The interface we use gets turned off when SLI is enabled, so it's literally impossible to use SLI for crunching.
An app can still use more than one GPU, but it can't use SLI, so the GPUs can't share data directly. This limits its usefulness for some/many/most applications. And as I said originally, you would have to write the app specifically to use multiple GPUs.
I'm not a video game programmer, so I don't know if you have to write games specifically to make use of SLI or if it just "works".
____________
My lucky number is 75898^524288+1
FWIW, here is the stderr output from a Moo! task running on two GPUs. Not sure how long it will stick around, so look quickly.
https://moowrap.net/result.php?resultid=124376635
____________
Reno, NV
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
Ah, I assumed when you enabled SLI, that the game thought you had a GPU with more cores in it. I didn't realise the game had to do anything special to make use of it.
You misunderstand.
Games are different than crunching apps.
SLI is meant for games. It lets the two GPUs work together so you get a better gaming experience.
However, games and crunching apps use completely different interfaces/APIs to communicate with the GPU. The game's interface is for drawing stuff on the screen, while our apps use an interface that doesn't draw on the screen. It's only used for doing calculations.
The interface we use gets turned off when SLI is enabled, so it's literally impossible to use SLI for crunching.
An app can still use more than one GPU, but it can't use SLI, so the GPUs can't share data directly. This limits its usefulness for some/many/most applications. And as I said originally, you would have to write the app specifically to use multiple GPUs.
I'm not a video game programmer, so I don't know if you have to write games specifically to make use of SLI or if it just "works".
I assumed that at some point the game does some physics, to calculate your character jumping and falling, the trajectory of a bullet, etc. And that uses the same stuff as Boinc? Or does the game access each GPU individually for that part?
Moo! runs a single task across multiple GPUs. They do RC5-72 for distributed.net, in a BOINC wrapper.
Has this been specifically written to do so then? It seems Boinc doesn't have this ability built in. What do you see in Boinc Manager / Boinctasks when you do this?
Standard BOINC client. Not sure what was done inside the wrapper of the task. Here are a couple of screen shots, from both boincmgr and boinctasks.
Ah, I've never seen "+2NV" in Boinctasks before. I guess the WU is just written that way, the same as some tasks are written to use more than 1 CPU core. I assume the CPU controls everything, and just passes half the calculations to each GPU. Sounds like it should be something easy to implement in any project. Not necessary for something like Einstein (who have said they have no plans to do it), since their work units last about half an hour, but the genefer extremes could get done a lot quicker for people with multiple graphics cards.
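As an aside, the "+2NV" label is only BOINC's scheduling view of the task. The same kind of reservation can be sketched on the user side in an app_config.xml placed in the project directory; this is a sketch only, and the app name "dnetc" is a guess at what Moo! calls its application (the real name is in client_state.xml). BOINC merely reserves the devices; whether the science app or wrapper actually spreads work across both of them is entirely up to that app.

<app_config>
  <app>
    <name>dnetc</name>          <!-- hypothetical app name; check client_state.xml for the real one -->
    <gpu_versions>
      <gpu_usage>2</gpu_usage>  <!-- reserve 2 GPUs per task for scheduling -->
      <cpu_usage>1</cpu_usage>  <!-- and 1 CPU core to feed them -->
    </gpu_versions>
  </app>
</app_config>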
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 1,216
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
Ah, I assumed when you enabled SLI, that the game thought you had a GPU with more cores in it. I didn't realise the game had to do anything special to make use of it.
You misunderstand.
Games are different than crunching apps.
SLI is meant for games. It lets the two GPUs work together so you get a better gaming experience.
However, games and crunching apps use completely different interfaces/APIs to communicate with the GPU. The game's interface is for drawing stuff on the screen, while our apps use an interface that doesn't draw on the screen. It's only used for doing calculations.
The interface we use gets turned off when SLI is enabled, so it's literally impossible to use SLI for crunching.
An app can still use more than one GPU, but it can't use SLI, so the GPUs can't share data directly. This limits its usefulness for some/many/most applications. And as I said originally, you would have to write the app specifically to use multiple GPUs.
I'm not a video game programmer, so I don't know if you have to write games specifically to make use of SLI or if it just "works".
I assumed that at some point the game does some physics, to calculate your character jumping and falling, the trajectory of a bullet, etc. And that uses the same stuff as Boinc? Or does the game access each GPU individually for that part?
What part of "I'm not a video game programmer" isn't clear?
____________
My lucky number is 75898^524288+1
Not only would using SLI not be helpful, but you literally can not use it for crunching. If you enable SLI, the apps can’t see the GPUs at all and can’t run.
Ah, I assumed when you enabled SLI, that the game thought you had a GPU with more cores in it. I didn't realise the game had to do anything special to make use of it.
You misunderstand.
Games are different than crunching apps.
SLI is meant for games. It lets the two GPUs work together so you get a better gaming experience.
However, games and crunching apps use completely different interfaces/APIs to communicate with the GPU. The game's interface is for drawing stuff on the screen, while our apps use an interface that doesn't draw on the screen. It's only used for doing calculations.
The interface we use gets turned off when SLI is enabled, so it's literally impossible to use SLI for crunching.
An app can still use more than one GPU, but it can't use SLI, so the GPUs can't share data directly. This limits its usefulness for some/many/most applications. And as I said originally, you would have to write the app specifically to use multiple GPUs.
I'm not a video game programmer, so I don't know if you have to write games specifically to make use of SLI or if it just "works".
I assumed that at some point the game does some physics, to calculate your character jumping and falling, the trajectory of a bullet, etc. And that uses the same stuff as Boinc? Or does the game access each GPU individually for that part?
What part of "I'm not a video game programmer" isn't clear?
Well, you can let someone else answer that part; this is a public forum :-P
Anyway, time for me to go play Fallout 4. And no, I haven't even programmed a mod for it. Sorry, but one of the genefer extremes will be halted for the rest of the evening.
So am I running my 24-core work units slower than I should on my dual 12-core Xeons? Would it even be clever enough to run two 12-core units on separate CPUs?
Yes, although even 12-core units may not be the fastest; you need to test it.
I've seen people running tasks on a Threadripper using 24 cores that end up slower than my 3700X running on 8.
Vato Volunteer tester
Joined: 2 Feb 08 Posts: 785 ID: 18447 Credit: 263,436,450 RAC: 1,421
Moo!Wrapper makes use of the distributed.net client within the BOINC wrapper.
The dnetc client can divide the multiple packets of work between multiple GPUs, just like the dnetc client can divide the work between multiple CPUs - and can in fact mix GPUs and CPUs, but Moo!Wrapper doesn't make use of this (quite sensibly IMHO).
The BOINC client only knows that it uses multiple GPUs for scheduling purposes - it handles the task as a single process.
In summary, the dnetc client is a complete and fully featured client, rather than just a wrapper.
So expecting BOINC to manage this generically is probably not realistic.
And expecting other project wrappers to implement this is also not realistic IMHO.
Especially since it would be an almost identical outcome to just running 2x single-GPU tasks, and BOINC and the project apps can already do this.
A lot of effort and complexity for virtually no gain.
____________
To be clear, I was not advocating that PG should do this. Only that it was technically possible using BOINC.
____________
Reno, NV
Vato Volunteer tester
Joined: 2 Feb 08 Posts: 785 ID: 18447 Credit: 263,436,450 RAC: 1,421
Understood :-)
____________
stream Volunteer moderator Project administrator Volunteer developer Volunteer tester
Joined: 1 Mar 14 Posts: 834 ID: 301928 Credit: 488,476,972 RAC: 1
FWIW, here is the stderr output from a Moo! task running on two GPUs. Not sure how long it will stick around, so look quickly.
https://moowrap.net/result.php?resultid=124376635
Moo! sends you a packet containing a few real distributed.net client tasks.
The distributed.net client can use multiple GPUs in a single instance, it's not difficult - each GPU is handled by its own thread/subprocess. But each GPU is running its own, independent task from the input buffer.
Note that at the end of the input buffer the client has to wait for the slowest GPU to finish its task, while the other GPU is idle. So this is not the most efficient way to run things.
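A rough sketch of that pattern in Python: one worker process per GPU, each pulling independent work items from a shared queue. The two GPU indices and the crunch() stub are placeholders, and CUDA_VISIBLE_DEVICES only applies to NVIDIA/CUDA apps, so treat this as an illustration of the idea rather than how dnetc is actually written.

# Each GPU gets its own worker process; every worker crunches independent work
# items, so no data is exchanged between the cards. The tail effect described
# above shows up at the end: everything waits for the slowest worker.
import multiprocessing as mp
import os
import time

def crunch(gpu_index, item):
    # Placeholder for a real per-GPU computation. A real app would bind to the
    # device here (e.g. by setting CUDA_VISIBLE_DEVICES before initialising).
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    time.sleep(0.1)  # pretend to work
    return f"item {item} done on GPU {gpu_index}"

def worker(gpu_index, work_queue, results):
    while True:
        item = work_queue.get()
        if item is None:          # stop sentinel: no more work
            break
        results.put(crunch(gpu_index, item))

if __name__ == "__main__":
    work, results = mp.Queue(), mp.Queue()
    items = list(range(8))        # 8 independent work items in the "packet"
    for item in items:
        work.put(item)
    gpus = [0, 1]                 # assumed two-GPU host
    for _ in gpus:
        work.put(None)            # one stop sentinel per worker
    procs = [mp.Process(target=worker, args=(g, work, results)) for g in gpus]
    for p in procs:
        p.start()
    for _ in items:               # the last get() is the wait for the slowest GPU
        print(results.get())
    for p in procs:
        p.join()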
To be clear, I was not advocating that PG should do this. Only that it was technically possible using BOINC.
Are both GPUs the same in the machine?
To be clear, I was not advocating that PG should do this. Only that it was technically possible using BOINC.
Are both GPUs the same in the machine?
In the one I sent the link to? Yes. Both 1660 Tis. But they don't have to be the same at Moo!
____________
Reno, NV
Moo!Wrapper makes use of the distributed.net client within the BOINC wrapper.
The dnetc client can divide the multiple packets of work between multiple GPUs, just like the dnetc client can divide the work between multiple CPUs - and can in fact mix GPUs and CPUs, but Moo!Wrapper doesn't make use of this (quite sensibly IMHO).
The BOINC client only knows that it uses multiple GPUs for scheduling purposes - it handles the task as a single process.
In summary, the dnetc client is a complete and fully featured client, rather than just a wrapper.
So expecting BOINC to manage this generically is probably not realistic.
And expecting other project wrappers to implement this is also not realistic IMHO.
Especially since it would be an almost identical outcome to just running 2x single-GPU tasks, and BOINC and the project apps can already do this.
A lot of effort and complexity for virtually no gain.
The only gain I was thinking of was making the huge genefer extreme tasks finish in a more reasonable time. But I guess they're not in a hurry like some Biology projects are.
FWIW, here is the stderr output from a Moo! task running on two GPUs. Not sure how long it will stick around, so look quickly.
https://moowrap.net/result.php?resultid=124376635
Moo! sends you a packet containing a few real distributed.net client tasks.
The distributed.net client can use multiple GPUs in a single instance, it's not difficult - each GPU is handled by its own thread/subprocess. But each GPU is running its own, independent task from the input buffer.
Note that at the end of the input buffer the client has to wait for the slowest GPU to finish its task, while the other GPU is idle. So this is not the most efficient way to run things.
Only if the next pair of calculations depend on the previous ones. If the packet contains 8 pieces of work that don't need the answer from any others, then one GPU can get on with the next one.
Bur Volunteer tester
Joined: 25 Feb 20 Posts: 332 ID: 1241833 Credit: 22,611,276 RAC: 0
The only gain I was thinking of was making the huge genefer extreme tasks finish in a more reasonable time. But I guess they're not in a hurry like some Biology projects are.
Everything that increases total throughput is certainly welcome on PG, I guess.
But in the GPU case it might run one task quicker; unless it takes less than 50% of the time a single-GPU task takes, it will not increase throughput. If you finish a GFN extreme task on one GPU in 3 days, you finish 2 tasks every three days on your 2-GPU system. If running one task on two GPUs simultaneously makes it finish in 1.5 days, you still finish 2 tasks every three days.
And from the answers here I assume it will take more than 1.5 days, thus effectively decreasing throughput.
The only reason why many LLR2 tasks here are run multithreaded is due to limited L3 cache as far as I know. Otherwise single-threading is almost always faster in terms of total throughput.
____________
Primes: 1281979 & 12+8+1979 & 1+2+8+1+9+7+9 & 1^2+2^2+8^2+1^2+9^2+7^2+9^2 & 12*8+19*79 & 12^8-1979 & 1281979 + 4 (cousin prime)
The only gain I was thinking of was making the huge genefer extreme tasks finish in a more reasonable time. But I guess they're not in a hurry like some Biology projects are.
Everything that increases total throughput is certainly welcome on PG, I guess.
But in the GPU case it might run one task quicker; unless it takes less than 50% of the time a single-GPU task takes, it will not increase throughput. If you finish a GFN extreme task on one GPU in 3 days, you finish 2 tasks every three days on your 2-GPU system. If running one task on two GPUs simultaneously makes it finish in 1.5 days, you still finish 2 tasks every three days.
True, although it might reduce the number of aborted tasks when people think they're taking too long. I've looked at some of my genefer extreme and genefer 22 tasks to see if they're still waiting to be verified, and found they've been handed out up to 10 times!
The only reason why many LLR2 tasks here are run multithreaded is due to limited L3 cache as far as I know. Otherwise single-threading is almost always faster in terms of total throughput.
I was going to ask about that. I can't be sure whether I've done one or not, but there are some subprojects listed in preferences which "support MT but don't recommend it" - so why aren't they handed out as single-threaded tasks? It's not as though I as a user can choose to run those and only those as single threads; I can only change the global setting for how many threads, unless I fiddle in app_config.
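For reference, the app_config fiddling mentioned above looks roughly like the sketch below. The app name and plan class are guesses (the real values are listed in client_state.xml), and the -t switch is how the LLR-based apps are commonly given a thread count, so treat it as a sketch rather than a recipe for any particular subproject.

<app_config>
  <app_version>
    <app_name>llrSGS</app_name>    <!-- hypothetical PrimeGrid app name -->
    <plan_class>mt</plan_class>    <!-- hypothetical plan class -->
    <avg_ncpus>1</avg_ncpus>       <!-- budget 1 core per task for scheduling... -->
    <cmdline>-t 1</cmdline>        <!-- ...and tell the app to use 1 thread -->
  </app_version>
</app_config>

Raising avg_ncpus and the -t value together goes the other way, i.e. fewer but wider tasks, without touching the global preference.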
Vato Volunteer tester
Joined: 2 Feb 08 Posts: 785 ID: 18447 Credit: 263,436,450 RAC: 1,421
the "don't recommend" aspect is good general advice, but not necessarily cast in stone.
e.g. SGS tasks are usually not worth multi-threading - massively reduced throughput.
whereas, i have one 4 core machine with 3MB cache that actually works best doing 2x 2-thread tasks.
But although it's in the minority, it doesn't need to be prohibited.
____________
To be clear, I was not advocating that PG should do this. Only that it was technically possible using BOINC.
Are both GPUs the same in the machine?
In the one I sent the link to? Yes. Both 1660 Tis. But they don't have to be the same at Moo!
Thanks!!
Bur Volunteer tester
Joined: 25 Feb 20 Posts: 332 ID: 1241833 Credit: 22,611,276 RAC: 0
True, although it might reduce the number of aborted tasks when people think they're taking too long. I've looked at some of my genefer extreme and genefer 22 tasks to see if they're still waiting to be verified, and found they've been handed out up to 10 times
That's true, but the number of people with two GPUs is very small, and only a small percentage of these (I guess) will be so impatient that they sacrifice throughput. So coding an MT version of GPU apps just for a handful of people - if any - doesn't make sense.
For your other question, I only ever run one subproject per computer, so I don't mind. Also, the number of cases where it might matter is small, I guess. Only when running SGS along with an MT subproject will you run into this. And in that case, why not just run the MT project on all cores?
____________
Primes: 1281979 & 12+8+1979 & 1+2+8+1+9+7+9 & 1^2+2^2+8^2+1^2+9^2+7^2+9^2 & 12*8+19*79 & 12^8-1979 & 1281979 + 4 (cousin prime)
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 1,216
This thread has run its course.
For PrimeGrid apps, this isn't possible, so the definitive, authoritative answer is, "no."
____________
My lucky number is 75898^524288+1