LLR Version 3.8.20 released
Jean Penne kindly asks for an example of a Proth number that shows the slowdown in PRP mode (or can any Proth number be used)?
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie!
GIMPS have found primes in double check.
FTR... Never happened.
That'll teach me not to post late at night without some checking. I over-assumed, based on them finding Mersenne primes out of order.
axn (Volunteer developer)
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
Jean Penne kindly asks for an example of a Proth number that shows the slowdown in PRP mode (or can any Proth number be used)?
Any Proth number will do, but here's what I used for testing:
146220094124808:P:0:2:1
171 2097152
In Proth mode it was taking 0.7 ms/iter, while in PRP mode it was taking about 1.0 ms/iter.
Same FFT and everything, so that's not it. Possibly due to comparing with -1 after every iteration in PRP mode (SWAG).
EDIT: 64-bit LLR v3.8.17 on Win7
Any proth number will do, but here's what I used for testing:
...
Thanks! Very useful info!
GREAT NEWS
The bug was found, and in the updated version PRP will be faster (as it should have been in the first place).
Waiting for the new version and for a little testing.
A patched version of LLR is out (but officially it will take some time, because they are now in the middle of implementing multicore support in LLR).
So far:
corrected version (PRP forced):
Starting probable prime test of 171*2^2097152+1
Using all-complex AVX FFT length 128K, Pass1=128, Pass2=1K, a = 3
171*2^2097152+1, bit: 290000 / 2097160 [13.82%]. Time per bit: 0.660 ms.
uncorrected version (PRP forced):
Starting probable prime test of 171*2^2097152+1
Using all-complex AVX FFT length 128K, Pass1=128, Pass2=1K, a = 3
171*2^2097152+1, bit: 300000 / 2097160 [14.30%]. Time per bit: 0.819 ms.
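As a quick sanity check of those two logs (a hypothetical back-of-envelope script, not part of LLR), the per-bit timings imply roughly a 24% speedup:

```python
# Per-bit timings for 171*2^2097152+1, taken from the two log excerpts above.
bits = 2097160          # total iterations reported by LLR
fixed_ms = 0.660        # corrected build, PRP forced
buggy_ms = 0.819        # uncorrected build, PRP forced

speedup = buggy_ms / fixed_ms
print(f"speedup: {speedup:.2f}x")
print(f"corrected full test:   {bits * fixed_ms / 1000:.0f} s")
print(f"uncorrected full test: {bits * buggy_ms / 1000:.0f} s")
```

That is, about a 1.24x improvement from the fix alone, at the same FFT length.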
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
A patched version of LLR is out (but officially it will take some time, because they are now in the middle of implementing multicore support in LLR).
...
Is this version available anywhere, and is the discussion with Jean on the Mersenne forums somewhere?
____________
My lucky number is 75898^524288+1
Yes, you can download it from his personal page (links below):
http://jpenne.free.fr/llr3/cllr38hc.zip
http://jpenne.free.fr/llr3/llr38hclinux64.zip
And yes, you can read about it here:
http://www.mersenneforum.org/showthread.php?p=451890#post451890
http://www.mersenneforum.org/showthread.php?p=451890#post451890
I only see discussion of the multi-threading and Serge's changes for testing Generalized Uniques there. Clearly it is implemented in the version you have tried, though. We should probably wait for a stable/official release (unless that will be very much later), since I guess he will also be using a gwnum version upgraded from what we are currently running and have tested, so some level of organised testing will be needed.
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime!
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
I only see discussion of the multi-threading and Serge's changes for testing Generalized Uniques there. ...
- Iain
Is there a new gwnum library? I'm pretty sure both changes (Serge's and ours) are changes just to LLR, not gwnum.
I copied and pasted the PRP problem and sent it to Jean Penne in a message via the Mersenne forums. He answered me and told me that the problem is corrected and that links are available on his page. But he also told me that the next official release will come when he implements multi-core support.
Is there a new gwnum library? I'm pretty sure both changes (Serge's and ours) are changes just to LLR, not gwnum.
Yes, those changes are only to the LLR code. However, we are currently running LLR 3.8.17, which is based on gwnum 28.8. There are 28.9 and 28.10 discussed on Mersenneforum; I'm not exactly sure of their status, but there are some changes there which would affect us, e.g. changes to how FFT length is selected, and also some small perf tweaks/fixes to some of the transforms.
In any case, my only point was that until we see the source code it's not possible to say exactly what has changed, or how much testing is required on our side.
- Iain
I don't think much more testing will be needed when the new LLR is released, because the patch has already been tested and checked many times over the last year :)
It was tested on a specific Phi function, but it works perfectly. So we will have to wait; and we don't need the new LLR, we can use the PRP-corrected version for SoB.
The only change in the corrected LLR is:
I found the bug: I defaulted the global variable "strong" to "TRUE", so, by default, a strong Fermat test is done, which is costly! For double checking numbers that are unlikely to be prime, it was really useless, a simple Fermat test being sufficient!
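To illustrate the difference Jean describes (a toy sketch in Python, not LLR's actual large-number code): a simple Fermat test is a single modular exponentiation, while a strong Fermat (Miller-Rabin style) test additionally tracks the chain of squarings through N-1 = 2^s * d. The strong test is stricter, catching Fermat pseudoprimes such as the Carmichael number 561, but that extra strictness is wasted when double-checking numbers that are expected to be composite anyway:

```python
def fermat_prp(n, a=2):
    """Simple Fermat test: is a^(n-1) == 1 (mod n)?"""
    return pow(a, n - 1, n) == 1

def strong_prp(n, a=2):
    """Strong Fermat (Miller-Rabin) test to base a."""
    d, s = n - 1, 0
    while d % 2 == 0:       # write n-1 = 2^s * d with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x in (1, n - 1):
        return True
    for _ in range(s - 1):  # chain of squarings
        x = x * x % n
        if x == n - 1:
            return True
    return False

print(fermat_prp(65537), strong_prp(65537))  # True True  (2^16+1 is prime)
print(fermat_prp(561), strong_prp(561))      # True False (561 = 3*11*17)
```

For huge candidates, the per-iteration bookkeeping of the strong variant is what made the PRP path slower than necessary.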
I suspect we're going to wait for the new LLR before starting the double check, so we might be starting the double check somewhat later than expected. I'll let you know as events develop, but there's no dates I can give you at this time.
And the new LLR is out!
http://www.mersenneforum.org/showthread.php?p=452382#post452382
The PRP test is fixed.
We finally got multi-core support.
We finally got multi-core support
This part, I think, is great news and will make bigger tasks easier to swallow, with the caution that there are times when it should, or shouldn't, be used for maximum throughput. I don't want to hijack this thread as it is more general than SoB, but I will help out with organised checking that it works when the call is made. It could be "interesting" how BOINC might handle one task using more than one CPU core...
We finally got multi-core support
...
I just hope it isn't "interesting" as in "interesting times".
Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1893 ID: 352 Credit: 3,142,312,174 RAC: 0
Diverted from the original thread: SoB Double Checking starting soon
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186
I just hope it isn't "interesting" as in "interesting times".
It is BOINC...
Is there somewhere I can get the current maximum FFT sizes for each LLR-based project? I think it was posted before, but it may be out of date even if I could find it again. I think I have a good enough model to predict where going multi-threaded could show gains, based on past testing with Prime95, but obviously I'd like to verify it with representative manual LLR tests.
Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1893 ID: 352 Credit: 3,142,312,174 RAC: 0
It scales differently on different CPUs.
Note that these are run times, not CPU time (x86 version on Win x64).
But as in other cases, it is better to run several instances.
At higher FFT sizes it may scale better (due to CPU cache).
i7-3820, no HT, 1-4 threads:
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 53.023 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 38.400 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 29.273 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 25.233 sec.
i5-6600, 1-4 threads:
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 37.021 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 25.890 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 20.523 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 19.279 sec.
Older Intel Xeon E5630 (2.5 GHz), 2 CPUs, no HT, 1-8 threads.
Note that the best performance is with 5 threads.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 168.229 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 92.560 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 88.651 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 75.826 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 66.724 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 69.302 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 76.635 sec.
52206462075*2^333333-1 is prime! (100354 decimal digits) Time : 87.253 sec.
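Turning the 1-4 thread run times into speedup and parallel efficiency (a quick script over the numbers quoted above; the thread-count-per-line mapping is my reading of the post):

```python
# Run times in seconds for 52206462075*2^333333-1, 1-4 threads,
# copied from the output above.
timings = {
    "i7-3820": [53.023, 38.400, 29.273, 25.233],
    "i5-6600": [37.021, 25.890, 20.523, 19.279],
}
for cpu, times in timings.items():
    for threads, t in enumerate(times, start=1):
        speedup = times[0] / t
        print(f"{cpu}: {threads} thread(s) -> {speedup:.2f}x speedup, "
              f"{speedup / threads:.0%} efficiency")
```

At four threads both CPUs sit near 50% efficiency, which supports Honza's advice that separate instances give better total throughput at this small FFT size.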
Rafael (Volunteer tester)
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
Is there somewhere I can get the current maximum FFT sizes for each LLR based project? ...
This. I remember a post about it, but when I tried to find it a couple of days ago, I just couldn't. At. All.
Anyhow, here's a quick table I put together from some random testing, based on recent WUs I completed on my machines (so this is probably incomplete and/or has outdated sizes):
PPSE: 120
SGS: 128
SR5: 480, 512, 576
321: 768
TRP: 768, 800, 864, 896
ESP: 1000
PSP: 1152, 1280
CUL: 1600
WOO: 1680, 1728
SOB: 2880
*PPS and PPS-Mega: ?
All numbers are FFT lengths in K.
It scales out differently on different CPUs. ...
I can't write much at the moment, but I wouldn't think SGS is a good use case for this :) There are two ways to look at it: shortest possible time for a single test, vs. best overall throughput. Maybe the shortest possible time is of interest for verification, but I would expect the main optimisation to be for best overall throughput.
Based on using Prime95 previously, the optimum point is when the single task with multiple threads substantially fills the processor cache. You will get better throughput than running separate tasks individually. If the processor cache is exceeded, you run back into RAM bandwidth limitations. If the tasks are particularly small, presumed overheads will lead to reduced throughput. I need to work out what type of tasks may benefit most and will focus my testing on those. I think the maximum benefit point is up to around 768K for an i5, and 1024K for an i7.
Even if there is no throughput benefit, I would prefer to run one task 4x faster rather than 4 parallel tasks as now (assuming 4 cores).
*PPS and PPS-Mega: ?
All numbers in K
Fortunately, I ran some of these recently.
MEGA seems to be 256.
PPS seems to be 160 and 192.
Rafael (Volunteer tester)
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
Can't write much at the moment, but I wouldn't think SGS is a good use case for this :) ...
This kinda got me thinking: maybe now HT can actually be useful for LLR. Previously we would just disable it, since HT didn't provide any sort of boost. Considering a desktop i7, we would run 4 tasks on 4 cores and 4 threads. But with multicore support, I wonder if running each task on 2 threads (i.e. a physical core plus its HT sibling) wouldn't be faster.
Here's the new strategy: 4 tasks using 2 threads each on a 4-core, 8-thread machine.
I'd be happy to be proven wrong, but if you're not bandwidth limited, you're execution unit limited. HT doesn't help with the heavy lifting. Maybe there is a minor benefit to be obtained elsewhere.
I just did a thought exercise based on the earlier FFT sizes, and some testing I did previously; see the 2nd chart in the post below:
http://www.primegrid.com/forum_thread.php?id=6545&nowrap=true#92525
The take-away point is that, for an 8 MB cache quad core, the multi-thread throughput benefit zone is where the blue line is above the orange line. At small FFT sizes, running one unit per core is still most efficient.
At a point, going multi-threaded gives a boost: the cores can collectively work on a shared data set which fits in the processor cache, where it doesn't fit if they work on separate tasks.
Keep going up in unit size and the cache will eventually be filled, so RAM limitations come in once again. While throughput isn't improved, you could still run the same number of units quickly in serial sequence rather than slowly in parallel.
So for a typical i7, which is quad core with 8 MB cache, I predict there will be a speed benefit from running SR5 or bigger, with the optimal zone around the 321, TRP and ESP projects.
This should scale down for smaller caches too: an i5 with 6 MB of cache would also benefit from SR5 and bigger tasks, but the sweet spot is reduced to the SR5 and 321 projects.
I would need to give more thought to how i3s behave, but their smaller cache sizes might mean they never really give a significant speed increase. They may still benefit from running two large tasks sequentially in the same time as two in parallel. Similarly, I don't know if it will scale well to more cores, or how it might work with multiple sockets on the same task.
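The cache argument can be sketched numerically. Assuming an FFT of length L keeps roughly L * 8 bytes of double-precision data live (a simplification that ignores gwnum's auxiliary tables), a single multi-threaded task fits in cache when that working set is below the cache size. Using the FFT lengths quoted earlier in the thread:

```python
# Largest common FFT lengths (in K) per project, from the table above.
projects = {"SR5": 512, "321": 768, "TRP": 896, "ESP": 1000, "SOB": 2880}

def fits_in_cache(cache_mib):
    """Projects whose approximate working set (L * 8 bytes) fits in cache."""
    return [name for name, k in projects.items()
            if k * 1024 * 8 <= cache_mib * 1024 ** 2]

print("8 MB cache (typical i7):", fits_in_cache(8))  # SR5, 321, TRP, ESP
print("6 MB cache (typical i5):", fits_in_cache(6))  # SR5, 321
```

Even this crude model reproduces the prediction: the i7 sweet spot covers 321, TRP and ESP, while the i5 drops TRP and ESP, and SOB exceeds both caches.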
Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1893 ID: 352 Credit: 3,142,312,174 RAC: 0
Thanks for the link to the charts, well done.
I also wonder how it would work on multiple sockets; I may try again later with larger FFT size(s).
Sure, in general, running the same number of units quickly in serial sequence rather than slowly in parallel is also a win for (any) project: faster turn-around, fewer tasks in progress, etc.
*IF* the BOINC scheduler is ready for it; it may get messy if single-threaded and multi-threaded tasks are combined, the number of available cores changes, etc.
I don't know how BOINC works in enough useful detail, so the following is just speculation. I'd imagine a task could say it needs N cores to run, and BOINC would then handle it as such. A user setting might be "use up to N cores per task", set per project.
My understanding of the gwnum multicore handling is that it breaks the work into chunks and fires them out until it all gets done. So it isn't dependent on having all nominated cores available at once, but it would run slower if, for example, one core was busy elsewhere. Presumably the breaking up of work still needs regular recombination, so this isn't generalised enough to allow a huge task to be done on multiple machines, for example.
My concern about multiple sockets is how they might interact in terms of the work, per-CPU cache, and relative bandwidth between sockets and also RAM. A safe option would be to run each socket on a single task each, to ensure cache access is local to the socket. I don't know if the combination of LLR/gwnum, BOINC and/or OS schedulers is smart enough to do that. On the Intel side at least, the inter-socket connection isn't that high in bandwidth terms, less than local memory, so I don't think it would end well to rely on it.
Roger (Volunteer developer, Volunteer tester)
Joined: 27 Nov 11 Posts: 1137 ID: 120786 Credit: 267,535,355 RAC: 0
Batalov confirmed that multi-core support covers all types of tests.
rebirther has compiled the LLR 3.8.18 64-bit Windows version.
Now on Jean's site:
http://jpenne.free.fr/
rebirther can't compile aprcl.exe, though. The new Wieferich prime search feature seems to need it. Maybe someone with Linux wants to try it out.
First create the file weiferich_test_range.abcd:
ABC$a$b$c
1 5000 2
Without the companion executable I get:
>cllr64.exe weiferich_test_range.abcd -d
Starting Wieferich prime search base 2 from n = 1 to 5000
Error 2 while trying to create new process
Wieferich prime searching not available...
There are three companion executables altogether, which are compiled under the cygwin system and launched by the llr program as child processes. They are "llrwfsrch.exe", "tw.exe" and "aprcl.exe"; the "cygwin1.dll" dynamic library is also needed.
I don't know how BOINC works in enough useful detail, so following is just speculation. I'd imagine a task could say it needs N cores to run, and would then handle it as such. A user setting might be "use up to N cores per task", set per project.
...
Or, if set in PG prefs rather than computing prefs, it could even have different settings for different sub-projects.
The wrinkle is going to be when people run a multi-core app using an older client... does the scheduler refuse to let them download, do they get a task that over-commits their processors, or does the project figure out a way to enforce cores=1 on an older client? Decisions to be made, and I am not promoting any of the above.
There will also be a difference in how Intel and AMD are affected by this. As I understand it, Intel HT is execution unit based: either one vcore is running or the other one is. I believe AMD is different. They have just one float processor for every two integer/addressing processors. So each virtual core gets its own int unit, but shares its float ops with the virtual core next door. AMD were so proud of that design; pity this brilliant idea was let down by the overall slowness elsewhere in bulldozy (sorry, Bulldozer).
That means if both tasks want to use float, they have to take it in turns like Intel, but if one is using float the other can still be getting on with integer arithmetic, including calculating element addresses in an array, or pointer arithmetic. (Neither of those is useful during vector ops, of course.)
So my semi-informed guess is that AMD would benefit slightly more, but not a lot more, than Intel by going over to multi-threading on a pair of virtual cores. At least until we see Ryzen, it won't be enough to put AMD back in the lead.
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, i.e. >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes)
Or if set in PG prefs rather than computing prefs, could even have different settings for different sub-projects.
I think it is best per-project. The optimum value will differ depending on unit size and CPU type, and it would be more intuitive to set it on the project selection page than to try to match up different venue settings in compute preferences.
The wrinkle is going to be when people run a multi core app using an older client.... does the scheduler refuse to let them download, or do they get a task that over commits their processors, or does the project figure out a way to enforce cores=1 on an older client?
How old is such an older client? A safe way around this would be for the project to default to 1 core per task when implemented, so basically the same as now, leaving it up to the user to change that manually as desired.
I'd also be interested in the behaviour if you manually try to select a greater number of threads than available cores.
As I understand it, Intel HT is execution unit based, either one vcore is running of the other one is.
I'm not sure that is the case, as there would be no benefit from HT without some level of executing more in the same time. The thing about LLR is that it is execution unit limited, so there is no significant benefit from attempting to use both. I have been unable to show a benefit from running different loads on the pair (e.g. one LLR, one sieve), suggesting that once LLR limits, it really limits.
So my semi-informed guess is that AMD would benefit slightly more, but not a lot more, than Intel by going over to multi-threading on a pair of virtual cores. At least till we see Ryzen, it won't be enough to put AMD back in the lead.
This would require micro-management by the user, so I think it's too advanced a case to be directly supported. The weakness remains the poor FPU, so they will remain slow in LLR throughput, but they may still gain some benefit from running one task fast rather than many slowly. LLR isn't a good use of current AMD processors.
Ryzen is still an unknown in this area, but I am not optimistic they will catch up in AVX IPC. I suspect AMD have wanted to move FP work to the GPU, so they're not concentrating as much on implementing it in the CPU. They may offset that by providing more cores for the money.
Some timings from my testing of LLR & Prime95 (2 cores per work unit).
I did a few tests and took average values.
Candidate length 200K: LLR 3130 s, Prime95 1950 s
Candidate length 240K: LLR 2300 s, Prime95 2400 s
Candidate length 320K: LLR 3850 s, Prime95 3900 s
Candidate length 400K: LLR 6800 s, Prime95 7200 s
Candidate length 480K (FMA3): LLR 6300 s, Prime95 6850 s
Candidate length 576K (FMA3): LLR 12100 s, Prime95 12800 s
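A quick ratio over those averages (a hypothetical helper script; a ratio above 1 means Prime95 took longer than LLR at that size):

```python
# Averaged run times in seconds per candidate, from the list above.
timings = {          # length K: (LLR, Prime95)
    200: (3130, 1950),
    240: (2300, 2400),
    320: (3850, 3900),
    400: (6800, 7200),
    480: (6300, 6850),
    576: (12100, 12800),
}
for k, (llr, p95) in timings.items():
    print(f"{k}K: Prime95/LLR time ratio = {p95 / llr:.2f}")
```

Except for the 200K outlier, the two are within roughly 10% of each other, with LLR slightly ahead.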
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
Just a heads-up: it is not likely that we will implement support for multi-core LLR anytime in the foreseeable future. BOINC's support for multi-core apps is rudimentary and wouldn't be able to do what we'd need it to do.(*)
At the present time, we *think* you could run LLR multithreaded by using app_info.xml to supply a custom llr.ini file. (And, of course, you need app_info.xml to run the new LLR app instead of the stock app.) app_config.xml, which is much simpler to use, won't work at the present time.
With either, you would still need to micro-manage the core utilization on your CPU. BOINC won't help with that.
(*) By "rudimentary" I mean that BOINC doesn't provide a reasonable mechanism by which we could pass a user-specified thread count to the BOINC client. We could certainly tell the app how many cores to use, but the BOINC client wouldn't get that information and wouldn't know how many cores are being used, so it would be unable to schedule tasks correctly.
The other problem is that multi-core apps don't play well with single-core apps. BOINC tends not to run any multi-core apps when a single-core app is running.
The bottom line is that currently we can't offer a "click here and run multi-core" solution, nor do we expect this to ever happen. I hope that at some point you'll be able to use app_config.xml to specify multi-core via the command line.
Michael, would you have any objection to the use of multi-threads via app_info.xml?
I'm conscious that with past software updates there has been verification testing before deployment. Maybe it would be wise to carry out a similar (manual) exercise for confidence that it is working as expected, before use on live tasks. I don't know how much testing it had before release.
At the moment I'm doing performance testing using an SR5 prime to confirm the expected speed gains are present, which as a side effect will also provide some data that the tests get the expected result. Shortly I'll also attempt a test with a TRP unit, which I assume isn't prime; I just extracted a test which was run recently.
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
Michael, would you have any objection to the use of multi-threads via app_info.xml?
...
I'm perfectly happy if you wish to volunteer to be a guinea pig. :)
My only request is that you let us know the results, especially if you start getting inconclusive results, but more generally how efficient it is and how much trouble you encounter getting BOINC to behave.
It's the same gwnum library as the previous LLR, so we really don't expect any calculation errors. As updates go, there's relatively little that has changed in this release. Multi-threading might seem like a huge change, but Prime95 and gwnum have supported it for a very long time. I understand the coding change to LLR was actually very minor.
And I'm sure Jean (and rebirther) are knowledgeable enough to know they have to link with the multi-threaded versions of the run-time libraries. :)
I'm perfectly happy if you wish to volunteer to be a guinea pig. :)
Ok, I'll be guinea pig duck fish, when I have some time to work out how app_info.xml actually works, as I've managed to avoid it until now.
For now I'm still doing some manual performance testing.
I started some old 2600 (non-K) systems earlier on my SR5 prime. I won't know the results until tomorrow. I did get some numbers based on reported iteration times: two threads was 1.9x one thread, and four threads was 3.4x. There is a complication in the comparison here, which is that the standard turbo was active on the CPU, so clocks would be higher when fewer cores are active.
At the moment I'm running a TRP unit in some scenarios. It is actually this unit I'm re-running manually. That unit took 7.4 hours of CPU time on a 6700K at 4.2 GHz, with 3200 dual-channel, dual-rank RAM. Other units were running at the same time.
I ran it with 4 threads on another 6700K system, also at 4.2 GHz. RAM is irrelevant there, as the task will fit in cache. That took 1.73 hours, or 4.3x faster. Similarly a 6600K at 4.2 GHz, with 3000 dual-channel, dual-rank RAM (its cache being too small), took 1.86 hours, or 3.98x faster. I'm not surprised the 6600K is slower, as the task is a little too big to fit in its processor cache, so some RAM dependency reduces its performance a bit. At some point I need to run the unit by itself to see what an unconstrained one-thread time is, and I have a two-thread test of the same also running on another system. The two results that did finish gave the same residue: 1AD8112E39BC2241
Ok, I had a go at app_info.xml, and failed. I really don't know what I'm doing; this is a combination of random bits I found elsewhere in the forums, but I've no idea how up to date any of it is, or whether anything I changed is relevant.
This is what I have so far:
<app_info>
  <app>
    <name>llrSR5</name>
    <user_friendly_name>SR5 (LLR)</user_friendly_name>
  </app>
  <file_info>
    <name>primegrid_llr_wrapper_7.06_windows_x86_64.exe</name>
    <executable/>
  </file_info>
  <file_info>
    <name>primegrid_cllr.exe</name>
    <executable/>
  </file_info>
  <file_info>
    <name>llr.ini</name>
  </file_info>
  <app_version>
    <app_name>llrSR5</app_name>
    <version_num>706</version_num>
    <api_version>6.10.25</api_version>
    <avg_ncpus>4</avg_ncpus>
    <max_ncpus>4</max_ncpus>
    <file_ref>
      <file_name>primegrid_llr_wrapper_7.06_windows_x86_64.exe</file_name>
      <main_program/>
    </file_ref>
    <file_ref>
      <file_name>primegrid_cllr.exe</file_name>
      <open_name>primegrid_cllr.exe.orig</open_name>
    </file_ref>
    <file_ref>
      <file_name>llr.ini</file_name>
      <open_name>llr.ini</open_name>
    </file_ref>
  </app_version>
</app_info>
primegrid_cllr.exe is cllr64.exe from the 3.8.18 package, renamed.
llr.ini was llr.ini.6.07 from the project folder, renamed, with a "ThreadsPerTest=4" line added.
The test host has SR5 selected only. BOINC client seems to grab work, but it errors almost immediately. Example at link:
http://www.primegrid.com/result.php?resultid=775185864
I tried some random things with file names but I'm guessing at this point. Is it something to do with the wrapper? Are my versions correct? | |
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 915 ID: 3110 Credit: 183,164,814 RAC: 0
                        
|
If you get it working, please let me know how. At some point I'd like to try llrCUDA on 321, which I think is the only project where it will work.
____________
| |
|
|
Hi, this works for me:
<app_info>
<app>
<name>llrSR5</name>
<user_friendly_name>Sierpinski/Riesel Base 5 Problem (LLR)</user_friendly_name>
<fraction_done_exact/>
</app>
<file_info>
<name>primegrid_llr_wrapper_7.06_windows_x86_64.exe</name>
<status>1</status>
<executable/>
</file_info>
<file_info>
<name>cllr64.3.8.18.exe</name>
<status>1</status>
<executable/>
</file_info>
<file_info>
<name>llr.ini.6.07</name>
<status>1</status>
</file_info>
<app_version>
<app_name>llrSR5</app_name>
<version_num>705</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>1.000000</avg_ncpus>
<max_ncpus>1.000000</max_ncpus>
<api_version>6.11.7</api_version>
<file_ref>
<file_name>primegrid_llr_wrapper_7.06_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>cllr64.3.8.18.exe</file_name>
<open_name>primegrid_cllr.exe</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>llr.ini.6.07</file_name>
<open_name>llr.ini</open_name>
<copy_file/>
</file_ref>
</app_version>
</app_info> | |
|
|
Hi This works for me
Thanks, I can't wait to try it out later. Comparing it with my attempt, I can see possible reasons why mine failed; I may have based mine on obsolete information. | |
|
|
I have SR5 running on two systems now, one i7-6700k and one i5-6600k. SR5 tasks only need about 4.6 MB of memory, so they should fit equally well in both CPU caches. Both CPUs are also clocked at 4.2 GHz, so I'd expect them to finish in the same time. They don't.
It is early days, and I only have two units from each, checked to be 576k FFT. The i5 is about 4.4% slower than the i7. That feels like a bit more than simple measurement variation, and I'll continue to watch as more units come through. I have a hunch: could these be limited by L3 cache speed now? The i5's cache happens to run at 3.9 GHz, the i7's at 4.1 GHz. The difference is 5.1%, so in the same ballpark... if this holds up with more units, I'll try adjusting the cache speeds and see if the timings follow.
On the i5, I do have some units that passed through it previously. The average was only 1.2% slower for the same FFT size; I'd kinda hoped to see a bigger difference here. This is close enough to 4 threads taking 1/4 the time.
I'm running some units normally on another i7, to check some other recent units which were also much faster than on the i5. I've not run a 4-thread task on this system yet, but if it comes out similar to the other i7, 4 threads would only be 3.66x faster than one of four tasks running at the same time, so lower throughput. | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
I have a hunch: could these be limited by L3 cache speed now? The i5's cache happens to run at 3.9 GHz, the i7's at 4.1 GHz. The difference is 5.1%, so in the same ballpark... if this holds up with more units, I'll try adjusting the cache speeds and see if the timings follow.
Make sure all RAM timings and frequencies are the same as well. This could just be the difference; when I was playing with tertiary timings, I found that I could get a little bit of a boost by tweaking them rather than leaving on auto. | |
|
|
RAM timings shouldn't come into it: with only one task, it should fit in the L3 cache. Also, the RAM in the i7 is slower than the RAM in the i5! (2x 2666 2R vs. 4x 3000 1R)
Note that where multithread times are mentioned below, they are the reported sum of the 4 threads, so divide by 4 for elapsed time.
I've got averages over more units now, and the gap has narrowed. The i5 running 4 threads averaged 11781 s over 10 tasks at 576k FFT. The i7 running 4 threads averaged 11475 s over 11 tasks at 576k FFT, 2.6% faster on average.
I'm wondering whether there is a difference in run times between S and R units. I'm not sure, but the odd unit may be a little faster than others at the same FFT size. The difference between the median values remains similar, at 3.1% faster on the i7.
To confuse matters, I've also been running SR5 units normally on another i7, that is, 4 cores, 4 tasks. This is now confirmed as faster in throughput: the average time for 12 units was 10688 s, or 6.9% more throughput than the i7 running 4 threads. I wasn't expecting this; I thought the multithread version would take the lead here.
I need to try two things now: switch the 4-thread i7 back to normal, and switch the normal i7 to 4 threads. Depending on how that goes, it should point towards either the software or the hardware. | |
|
|
Ram timings shouldn't come into it. With only one task it should fit in the L3 cache. Also, the ram in the i7 is slower than the ram in the i5! (2x 2666 2R vs. 4x 3000 1R)
Agreed, they shouldn't come into it.
Still, if it were me, I would do the test and check that that really is so.
By 2x and 4x do you mean dual-channel architecture vs quad? If so, which mobo has the quad channel? I have only seen that on Haswell boards so far, but maybe I just haven't been looking hard enough.
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
It doesn't make sense to assume RAM is influencing the multithread results here, as the faster system has the slower RAM. The number I prefixed with is the total number of modules. I had previously found that two ranks per channel gave significantly better results than one, so 4x1R should be near enough equivalent to 2x2R in that respect. The number of channels is implied by the CPU models stated.
Changing RAM settings to normalise them is not trivial. The only way I could guarantee it is to swap modules, and that is physically impossible, or it would require running in a significantly degraded state, which I have no intention of doing.
My goal is to do the other testing I mentioned on my other suspiciously fast i7 system, and after that I'll start taking baseline timings on TRP with the two i7 systems while I explore cache speed settings on the i5. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
I'm wondering whether there is a difference in run times between S and R units. I'm not sure, but the odd unit may be a little faster than others at the same FFT size. The difference between the median values remains similar, at 3.1% faster on the i7.
tldr: yes
Sermon version:
Yes, and there should also be a difference in timings between different k's. There are also significant differences as n varies, even within the same FFT size, so getting old tasks mixed in with new tasks is all it takes to send you on a wild goose chase.
I've said often that it's better to run benchmarks as a controlled test. If you use live data, you will spend your time chasing irregularities in the test data that are due to input variables you can't control.
There are exceptions, of course. Sieves have relatively stable run times, as does SGS. Anything else is a crap-shoot; conjecture projects are the absolute worst.
Want to do benchmarks? Stop BOINC, pick a test case, and run it under the different conditions you want to test. If you're running Windows, boot to safe mode first.
Using live data is fine if you're looking for ballpark estimates. "Yes, this Kaby Lake is a lot faster than a Core2" works fine. But if you're looking at differences of less than 10%, your answers will be lost in the noise if you're using live tasks.
____________
My lucky number is 75898^524288+1 | |
|
|
I think that's not the first time I've had that advice :)
With enough units done I hope to average it out. I don't have the time to micro-manage manual testing at the moment. Maybe later. | |
|
|
It doesn't make sense to assume ram is influencing multithread results here as the faster system has the slower ram.
Quite so, but I obviously failed to be clear in what I was saying.
I was suggesting that it sometimes turns out to be worth testing daft assumptions, just to ensure there is nothing really surprising going on.
However, you go on to say
Changing ram settings to normalise them is not trivial. The only way I could guarantee it is to swap modules, and that is physically impossible, or would require running in a significantly degraded state, which I have no intention of doing.
I had not appreciated this detail. My previous suggestion is only valid if it is a reasonably easy test to make, and clearly that is not so.
Thanks for explaining about the number of modules.
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
I've said often that it's better to run benchmarks in a controlled test. If you use live data you will spend your time chasing irregularities in the test data that are due to variables in the inputs that you can't control.
[...]
Want to do benchmarks? Stop BOINC, pick a test case, and run it under the different conditions you want to test. If you're running Windows, boot to safe mode first.
The antithesis to this argument is that by tuning for one test case, you only know that the tuning is right for that test case. Given the variations arising from different FFT sizes and so on, there is at best a vague hope that the best settings for your one test case will be best across the board.
Combining the two sides of the argument: if you want to do it properly, pick several test cases, not just one.
I would suggest taking something like a day's crunching for short jobs, or proportionately more for longer jobs, and looking at the variation in run times for those jobs, all running on the same settings. Do not take much notice of changes you find that are less than double this initial variation.
Then use that same ensemble to test each of the other settings.
Unfortunately doing this would take A LOT longer.
But if you shorten the process, be aware that you will fall into one or other of the two traps: either, as Michael says, you will be misled by natural variation, or, as I suggest, you risk tuning for a task that is actually not a good representative of the average. Pick the risk that least frightens you.
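To make that concrete, here's a minimal sketch of the variation check (input times are the 576k FFT averages quoted earlier in this thread, used only as sample data; "spread" here is simply (max-min)/mean):

```shell
# Feed run times (seconds) for identical settings; report mean, spread,
# and the threshold below which a tuning change should be ignored.
printf '%s\n' 11781 11475 10688 | awk '
    { n++; sum += $1
      if (n == 1 || $1 < min) min = $1
      if ($1 > max) max = $1 }
    END {
      mean = sum / n
      spread = (max - min) / mean
      printf "mean %.0f s, spread %.1f%%, ignore changes < %.1f%%\n",
             mean, 100 * spread, 200 * spread
    }'
```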
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
On the i7, I had a bad unit. The system is slightly overclocked to 4.2 GHz (4.0 stock) and doesn't have a history of bad units, apart from one occasion when cooling failed. So as of right now, that's 32 units completed using 4 threads: 25 valid, 1 invalid, 5 pending, 1 inconclusive. The inconclusive wingman doesn't have any invalids in its current history. Temperatures look ok though, not even reaching 60C.
I'll let it continue as is for now but will keep a close eye on it. If that turns out badly I'll revert the OC. Speculation: by not being RAM limited, is it pushing the compute units harder than it has before and entering some borderline unstable state?
Edit: the i5 also has an inconclusive unit, whose wingman doesn't have any current history of errors. | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
Speculation: by not being ram limited, is it pushing the compute units harder than it has before and entering some borderline unstable state?
Edit: the i5 also has an inconclusive unit, where the wingman doesn't have any current history of errors.
Speculation: maybe hardware degrades over time? Every couple of months, after going on a streak of *rock solid* results, I start getting bad results over and over again on my OCed machine. I originally thought 4.3 GHz was fine after a lot of testing, then had to back off to 4275 MHz some months later. After another streak of solidness, a couple of months passed and I backed down to 4263 MHz. Then 4250. Now I'm at 4225. It seems like every once in a while I'm forced to turn the BCLK down a little. | |
|
|
Speculation: maybe hardware degrades over time? Every couple of months, after going on a streak of *rock solid* results, I start getting bad results over and over again on my OCed machine. I originally thought 4.3 GHz was fine after a lot of testing, then had to back off to 4275 MHz some months later. After another streak of solidness, a couple of months passed and I backed down to 4263 MHz. Then 4250. Now I'm at 4225. It seems like every once in a while I'm forced to turn the BCLK down a little.
Yeah, it's something they call "electromigration". It happens to CPUs, especially OC'd ones. Eventually the unit will short internally and cease to function. I think they all do it, but overclocking speeds up the process dramatically.
____________
| |
|
|
While that remains a possibility, I need to let the inconclusive units run their course and see where we stand. Since I added a 3rd similar system, it would get really interesting if they all show similar signs, as they were all bought at different times; for them all to suffer at the same time would be unlikely. But I'm getting ahead of myself here and will let the crunching keep going. | |
|
|
Speculation: maybe hardware degrades over time? Every once in a couple months, after going on a streak of *rock solid* results, I start getting bad results over and over again on my OCed machine.
OC is bad in three ways, two minor and one major.
At constant temperature and voltage, the lifetime of components in digital circuits is tied more to the number of cycles than to seconds of runtime. On its own this would mean that OC reduces the years of life, but not the total work achievable in that shortened lifetime.
At constant temperature, the risk of damage in any one cycle is more nearly proportional to the square of the voltage than to the voltage itself. As you tend to overvolt as part of overclocking, this causes the second minor effect.
The major point is that more damage is done per cycle the hotter the component gets. For the same cooling setup the rig runs hotter when OC'd, so this accelerates the above effects. The best advice is to overcool so that the device runs significantly cooler than when running with the standard clock and stock cooler; even achieving the same temperature as standard is not so good.
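As a rough illustration only, a back-of-envelope wear sketch combining the points above. The V² scaling is as stated; the doubling per 10 C is a common rule of thumb, not from this thread; and the voltage and temperature numbers are invented for the example.

```shell
# Relative per-cycle wear vs stock; all numbers are illustrative only.
awk 'BEGIN {
    V0 = 1.20; T0 = 60    # assumed stock voltage (V) and temperature (C)
    V  = 1.35; T  = 75    # assumed overclocked voltage and temperature
    wear = (V / V0) ^ 2 * 2 ^ ((T - T0) / 10)
    printf "per-cycle wear vs stock: about %.1fx\n", wear
}'
```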
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
I've had two bad SR5 units each when running 4 threads, on both the initial test i5 and i7 systems. To me that points to overclock instability under the new load. Solutions are either to lower clocks or to increase voltages. I'll do the former initially, as I'll have limited time to weigh voltages against power and heat in the short term.
I'll also aim to expand the testing to other quad cores over the weekend. | |
|
Artist Volunteer tester Send message
Joined: 29 Sep 08 Posts: 86 ID: 29825 Credit: 261,903,911 RAC: 0
                       
|
2 invalid, 1 inconclusive WU on my Skylake, no problems on the Haswell so far.
I hope you have the time to run 3.8.18 on a Haswell.
I'll also aim to expand the testing to other quad cores over the weekend.
____________
144052*5^2018290+1 is Prime! | |
|
Dave  Send message
Joined: 13 Feb 12 Posts: 2829 ID: 130544 Credit: 954,793,678 RAC: 0
                     
|
Yeah it's something they call "electromigration" It happens to CPU's and especially with OC'd ones. Eventually the unit will short internally and cease to function. I think they all do it but overclocking speeds up the process dramatically.
I for one have had to tone down my SB over the years. It was originally rated at 4.6, but after an early BIOS reset, and subsequently operating on non-loaded-optimal-defaults for a few months, it couldn't stay stable. I daren't go for 4.5; it's currently at 4.3 I think. I will reduce it to 4.0 tonight. I always said I wanted it to last 10 years (starting 2011). | |
|
|
It would be interesting to know how many people are trying this, on what type of system, and with or without any problems.
So far I'm running it on 3 Skylake systems, and the first two are having bad units. The 3rd, similar system was set up later, so we'll see if that follows.
The two systems I'll try adding are an i5-5675C and an i5-4570S. The Broadwell is OC'd, so I might have to rethink that before I do so. | |
|
|
Speculation: by not being ram limited, is it pushing the compute units harder than it has before and entering some borderline unstable state?
Both of those effects are certainly happening, or so it seems to me.
The only speculation is whether the harder workout is causing the problem, exposing a latent issue that was there all along, or (less likely) is totally unconnected with the instability...
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
It would be interesting to know how many are trying this, on what type of system, with or without any problems.
So far I'm running it on 3 Skylake systems, and the first two are having bad units. The 3rd similar system was set up later so we'll see if that follows.
The two systems I'll try adding are an i5-5675C and i5-4570S. The Broadwell is OC'd so I might have to rethink that before I do so.
It would be VERY useful to know whether the bad results are repeatable or not. If they are, we have a larger problem. I can supply you with the candidate and the original bad residue from one of your tasks if you give me the result ID number.
____________
My lucky number is 75898^524288+1 | |
|
|
http://www.primegrid.com/result.php?resultid=775826749
Above is one of the invalid results I had.
Edit: at this point, I'd still assume it was overclocking related, but as the test will only take about an hour I can re-run it quickly. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
http://www.primegrid.com/result.php?resultid=775826749
Above is one of the invalid results I had.
Edit: at this point, I'd still assume it was overclocking related, but as the test will only take about an hour I can re-run it quickly.
This is the (invalid) result returned by your computer:
64598*5^2318694-1 is not prime. RES64: 4692831FEF0D2834. OLD64: 585DF74D64965C94 Time : 2540.820 sec
Thanks!
____________
My lucky number is 75898^524288+1 | |
|
|
From the original computer giving the error (i5):
64598*5^2318694-1 is not prime. RES64: 56D39BCA085E5378. OLD64: 047AD35E191AFA65 Time : 2710.984 sec.
From the i7 which gave errors on different units:
64598*5^2318694-1 is not prime. RES64: 3C7B9361E288F512. OLD64: B572BA25A79ADF33 Time : 2491.017 sec.
The two above are obviously different from each other, and also from the previous incorrect result. Both systems have had their clocks lowered from the earlier 4.2 GHz to 4.0 GHz, without touching voltage. The i5 is still overclocked; the i7 is essentially at stock, though the voltage may still be lower than standard. Is either one correct?
I've got the same unit running on a Haswell too, but that'll take a bit longer. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
From the original computer giving the error (i5):
64598*5^2318694-1 is not prime. RES64: 56D39BCA085E5378. OLD64: 047AD35E191AFA65 Time : 2710.984 sec.
From the i7 which gave errors on different units:
64598*5^2318694-1 is not prime. RES64: 3C7B9361E288F512. OLD64: B572BA25A79ADF33 Time : 2491.017 sec.
Above two are obviously different from each other, and also the previous incorrect result. Both systems have been clock lowered from 4.2 GHz previously, to 4.0 GHz now, without touching voltage. The i5 is still overclocked, and the i7 is essentially at stock, but the voltage may still be lower than standard. Is either one correct?
I've got the same unit running on a Haswell too, but that'll take a bit longer.
Both are wrong. It's GOOD -- really, really good -- to know that it's not a systemic problem resulting in reproducible erroneous results. That would have been a very bad problem.
The correct result is this:
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 31893.685 sec.
You can use that as a sanity check to see if your machines are operating correctly.
I suspect that you're using this particular project because it's in a sweet spot where running MT gives you better cache performance, resulting in the CPU running faster. It's not surprising that this is pushing the CPU harder, since that is essentially the purpose of using MT.
____________
My lucky number is 75898^524288+1 | |
|
|
I'm not sure we're at the end of the story yet...
64598*5^2318694-1 is not prime. RES64: 7D73797E64F86BC3. OLD64: FD00DA68C6582741 Time : 3431.898 sec.
This is on a Haswell Xeon. The CPU is not overclocked (it can't be), and although it does run some tight-timing RAM, the work should fit in cache. I don't recall this system ever putting out a bad unit before. I'm running it again and will see if it gives the same residue, as on the earlier run I stopped part way through to adjust the number of threads, not that I think that should make a difference.
Just to remove it from the equation, I'm going to restore both systems to stock CPU clocks, and their RAM too. To answer the other question of why I picked SR5: it wasn't a performance sweet spot but a convenience sweet spot. I think TRP would see better benefits, but I want to walk before I run, and shorter SR5 units mean less potential loss in case of problems.
Edit: the 6600k is now back to stock, so an all-core turbo of 3.6 GHz, and the auto voltage is about the same as where I was for 4.2 GHz. RAM dropped from 3000 to 2133. The 6700k clock is largely unchanged at 4.0 GHz with all cores active, but the auto voltage has gone up 0.1 V or so. Temps still look well within the safe zone. The RAM I left alone, as it ran Kingston 2666 without needing XMP. Both are running the same test unit once again. | |
|
Artist Volunteer tester Send message
Joined: 29 Sep 08 Posts: 86 ID: 29825 Credit: 261,903,911 RAC: 0
                       
|
I'm not sure we're at the end of the story yet...
64598*5^2318694-1 is not prime. RES64: 7D73797E64F86BC3. OLD64: FD00DA68C6582741 Time : 3431.898 sec.
Not the end. My results are:
Haswell:
64598*5^2318694-1 is not prime. RES64: C2FBBEFA317CD966. OLD64: CD99AADC2BE5702A Time : 3204.941 sec.
Skylake:
64598*5^2318694-1 is not prime. RES64: CAE67DDE583A6E33. OLD64: E559E788A01E2E91 Time : 3522.986 sec.
mackerel, I don't expect it's your computers that are inaccurate.
Maybe we can find some other users to run this WU and post their results.
____________
144052*5^2018290+1 is Prime! | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
Not the end, My results are
Haswell:
64598*5^2318694-1 is not prime. RES64: C2FBBEFA317CD966. OLD64: CD99AADC2BE5702A Time : 3204.941 sec.
Skylake:
64598*5^2318694-1 is not prime. RES64: CAE67DDE583A6E33. OLD64: E559E788A01E2E91 Time : 3522.986 sec.
mackerel, I don't expect your computers to be inaccurate.
Maybe we find some other users running this WU and posting their results.
Give us a test suite with the necessary command line (I tried to get manual LLR working once and failed miserably) and I'd gladly do it. Among others, I have a Skylake and a Haswell machine to try to replicate the results. | |
|
|
cllr64 -d -t4 -q"64598*5^2318694-1"
Change the number after -t to set the number of threads to run.
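If it helps, here's a quick sketch of a pass/fail check against the known-good residue Michael posted (47DC8FF7EF9583A8); the pasted result line is from my Haswell Xeon run above, and you'd substitute your own:

```shell
# Extract RES64 from an LLR result line and compare to the known-good value.
GOOD=47DC8FF7EF9583A8
line='64598*5^2318694-1 is not prime. RES64: 7D73797E64F86BC3. OLD64: FD00DA68C6582741 Time : 3431.898 sec.'
res=$(printf '%s\n' "$line" | sed -n 's/.*RES64: \([0-9A-F]*\)\..*/\1/p')
if [ "$res" = "$GOOD" ]; then
    echo "residue OK"
else
    echo "BAD residue: $res"
fi
```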
Random thought: I wonder if there's any difference between the 64- and 32-bit builds...
I think I've seen enough that I'll discontinue running it on live tasks until we have a better understanding on what's happening here. | |
|
Artist Volunteer tester Send message
Joined: 29 Sep 08 Posts: 86 ID: 29825 Credit: 261,903,911 RAC: 0
                       
|
Give us a test suit with the necessary command line (tried to get manual LLR working once and failed miserably) and I'd gladly do it. Among others, I do have a Skylake and a Haswell machine to try and replicate results.
1. Get the software from http://jpenne.free.fr (we are talking about version 3.8.18)
2. Run sllr64 -t4 -d -q"64598*5^2318694-1"
Note 1: sllr64 is the name of the Linux binary; use the name in the zip file for Windows
Note 2: -t4 is the magic key for running 4 threads
____________
144052*5^2018290+1 is Prime! | |
|
|
64598*5^2318694-1 is not prime. RES64: 63392344EFABDF39. OLD64: 32F845A9FDE1659E Time : 2537.613 sec.
Same i7 as before but now at stock with more voltage. Still not matching, and different res again.
64598*5^2318694-1 is not prime. RES64: EEF0CC6BA6D485DD. OLD64: 5178D3308BEC758F Time : 2893.044 sec.
Same i5 as before but now at stock. Doesn't match.
I think this is enough for me to say that this specific test doesn't like these Skylakes. I haven't run it on my main system yet, but I do have it running on a Haswell i3 and a Skylake i3. Those will be overnight jobs for me and I'll check in again in the morning. On that note, I'll try single thread on those systems.
The re-run on the Haswell Xeon did give the expected residue. The difference between the latest run and the past run was that in the past one I stopped part way to change the number of threads from 14 to 8; the second run went straight to 8. | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
Uh... Houston, we have a problem. After 52min of crunching:
64598*5^2318694-1 is not prime. RES64: 18D8C020651165E2. OLD64: 53D71C3C5E11F999 Time : 3170.125 sec.
I ran this on my i5 4590. It is undervolted and has highly OCed RAM, but this machine hasn't had a problem in quite a while, so make of that what you will. At any rate, I'm re-running the test with only 3 threads on the 4590, as well as running it with 4 on my 6600k and 2 on my Pentium E2180. Let's see where this leads us... | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
Hey, guys, I made an assumption about a zillion posts ago, but maybe it's not a valid assumption.
Could everyone who built these apps (Jean, Rebirther, and Iain) please verify that the apps were indeed built with the multi-threaded versions of the libraries?
This is exactly the kind of problem you might see when running multiple threads against libraries that are not thread-safe. That doesn't mean this is indeed the problem, but it's a possibility.
____________
My lucky number is 75898^524288+1 | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
More bad news:
6600k, 4t:
64598*5^2318694-1 is not prime. RES64: 696BECF84453AE85. OLD64: 3C43C6E8CCFB0B8C Time : 3301.371 sec.
4590, 3t:
64598*5^2318694-1 is not prime. RES64: 9140EE5FAD76D521. OLD64: B3C2CB1F08647F60 Time : 4149.088 sec.
2 more fails, 2 new residues. To me, it seems clear that, at the very least, there's a problem with FMA3 CPUs. I'll be running the test overnight on my Pentium E2180 (SSE2) and i3 530 (SSE4.2) to see if those are affected as well. | |
|
|
Different bad news?
3930k 4t
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AVX FFT length 512K, Pass1=512, Pass2=1K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 4998.610 sec.
____________
Eating more cheese on Thursdays. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
We know this problem happens on lots of CPUs.
I know it's happening with the Windows 64-bit build. Is anyone seeing a problem with any of the other builds? Win32, Linux, or Mac?
____________
My lucky number is 75898^524288+1 | |
|
|
i7-4790K, HT off, non-OCed, Windows 8.1
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: EF360BA3B795C277. OLD64: 524890D8BE302B5D Time : 2587.020 sec.
Tried again in safe mode
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 900E14D83350E1F4. OLD64: B9771A63C8D06DCF Time : 2679.875 sec.
Hmm...?
Does this mean my computer is unstable? | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
                  
|
i7-4790K, HT off, non-OCed, Windows 8.1
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: EF360BA3B795C277. OLD64: 524890D8BE302B5D Time : 2587.020 sec.
Tried again in safe mode
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 900E14D83350E1F4. OLD64: B9771A63C8D06DCF Time : 2679.875 sec.
Hmm...?
Does this mean my computer is unstable?
From the looks of it, it's probably not your computer. Given that lots of users are having problems on lots of different machines with different sorts of hardware, it's safe to say that the problem is on the software side of things, wherever it might be. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
                           
|
Different bad news?
3930k 4t
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AVX FFT length 512K, Pass1=512, Pass2=1K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 4998.610 sec.
That's the correct result.
What build was that?
____________
My lucky number is 75898^524288+1 | |
|
|
Different bad news?
3930k 4t
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AVX FFT length 512K, Pass1=512, Pass2=1K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 4998.610 sec.
That's the correct result.
What build was that?
Indeed, hence the "different". Nothing like a correct result to muck up the problem parade. It's the Win64 build.
____________
Eating more cheese on Thursdays. | |
|
|
For the following, do not read anything into the runtimes, since other activities were also running. All were using the 64-bit app on Windows.
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 7114.683 sec.
E5-2650 stock, 8 threads (Sandy Bridge)
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 8928.284 sec.
i7-6700k stock, single thread
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 11461.348 sec.
i5-6600k stock, single thread
64598*5^2318694-1 is not prime. RES64: B0EF9F206AB6B9DD. OLD64: 12CEDD6140242D94 Time : 7391.210 sec.
i3-6100 stock, 2 threads
64598*5^2318694-1 is not prime. RES64: 80985D7DC09AF970. OLD64: 066F8666D93FD048 Time : 9067.840 sec.
i3-4360 stock, 2 threads
| |
|
|
note 1: sllr64 is the name for the linux binary,...
Actually there are two downloads for Linux 64-bit, with the archive names differing by one letter, depending on whether you downloaded the statically linked build or the dynamically linked one.
sllr64 is the statically linked binary
llr64 is the dynamically linked one
I will run both; if the dynamic one works, that would suggest Michael was on the right track with his question about the libraries.
Question: is it likely to spoil the test if GFNocl is also running? -- I will turn off BOINC CPU tasks but would prefer not to lose the GPUs
R~~
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
Michael Goetz Volunteer moderator Project administrator
Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
|
At this point I would recommend that people NOT use llr 3.8.18, even in single threaded mode, for real tasks.
We do not understand the root cause of this problem, and it's unclear which tests, on which computers, will be affected.
Although it doesn't appear that there's any danger of two bad results validating against each other (although we can't be 100% certain of that), every bad test means lost computing time, lost credits, and potential lost primes for the person running the test.
We have an excellent test case with that SR5 candidate, and I've posted on the Mersenne forums so Jean Penne can take a look at it. It doesn't appear to be a build problem (unless the same mistake was made in all of the builds), so it's possible it's a coding error either in LLR or gwnum.
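For readers wanting to see in miniature what a "residue" is, here is a toy sketch of a Fermat PRP check on a small k*b^n+c candidate, reporting the low 64 bits of the result in the same style as the RES64 values quoted in this thread. This is illustrative only: LLR itself runs an N+1/Proth-style test with FFT multiplication, and the function name here is invented for the example.

```python
def prp_res64(k, b, n, c, a=3):
    """Fermat PRP check of N = k*b^n + c with base a.

    Returns (is_prp, res64_hex). A toy sketch only -- LLR uses an
    N+1-style test and FFT-based arithmetic, not Python's pow().
    """
    N = k * b**n + c
    r = pow(a, N - 1, N)           # modular exponentiation, the core of the test
    return r == 1, f"{r & 0xFFFFFFFFFFFFFFFF:016X}"  # low 64 bits, like RES64
```

For a composite candidate the residue is effectively arbitrary, which is why two machines disagreeing on RES64 for the same number is a sure sign at least one of them computed incorrectly.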
____________
My lucky number is 75898524288+1 | |
|
Rafael Volunteer tester
Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
|
At this point I would recommend that people NOT use llr 3.8.18, even in single threaded mode, for real tasks.
We do not understand the root cause of this problem, and it's unclear which tests, on which computers, will be affected.
Although it doesn't appear that there's any danger of two bad results validating against each other (although we can't be 100% certain of that), every bad test means lost computing time, lost credits, and potential lost primes for the person running the test.
We have an excellent test case with that SR5 candidate, and I've posted on the Mersenne forums so Jean Penne can take a look at it. It doesn't appear to be a build problem (unless the same mistake was made in all of the builds), so it's possible it's a coding error either in LLR or gwnum.
Just to put out some more data, both my Pentium E2180 and my i3 530 were able to get the correct result. And so did Grebuloner's 3930k. And after re-running my 6600k machine, it got the following (all failures):
-Stock CPU speeds, stock voltages, XMP (3000MHz) enabled, 4T test: 64598*5^2318694-1 is not prime. RES64: 15C49341B73D1E98. OLD64: C5F427B2BD263FC0 Time : 2860.512 sec.
-Stock CPU speeds, high overvoltage (the one I use to crunch at 4225MHz), XMP (3000MHz) enabled, 4T test: 64598*5^2318694-1 is not prime. RES64: 0244653FFF3445AF. OLD64: 8B739DAD950BB505 Time : 2886.468 sec.
-Stock CPU speeds, high overvoltage, XMP disabled: 64598*5^2318694-1 is not prime. RES64: 3F9458F3C0849F27. OLD64: C809E6B6706BA568 Time : 2906.842 sec.
We've seen both Windows and Linux. We've seen i3, i5 and i7 fail, with and without HT enabled. We've seen 2~4 threads fail. And we've seen Skylake and Haswell fall flat. With and without CPU (and RAM) OC. There is one common factor that applies to every failure, though: FMA3 + Multithread = bad residue. | |
|
Michael Goetz Volunteer moderator Project administrator
Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
|
There is one common factor that applies to every failure, though: FMA3 + Multithread = bad residue.
Indeed, those are the observations, but don't assume that this means it's safe to run multi-threaded on non-FMA3 CPUs. Many multi-tasking problems are related to timing, and minor changes in conditions can cause major changes in results (such as the difference between working and failing). Since there are many examples of MT+FMA3 tasks working correctly, it's reasonable to assume that timing might be involved here as well. If that's true, it means we haven't seen a non-FMA3 CPU fail an MT task yet. It doesn't mean it's not going to happen. Until we know the cause, we can't be sure what's really working.
____________
My lucky number is 75898524288+1 | |
|
Ken_g6 Volunteer developer
Send message
Joined: 4 Jul 06 Posts: 915 ID: 3110 Credit: 183,164,814 RAC: 0
|
I'd forgotten, but I think that's why I didn't do anything with llrCUDA: Residues there didn't match either. | |
|
|
It seems to me that the error affects only systems with FMA/AVX. This is the result from an older Athlon system with 64 cores:
./sllr64 -t64 -d -q"64598*5^2318694-1"
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AMD K10 FFT length 512K, Pass1=512, Pass2=1K, 64 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 27005.523 sec.
...but it doesn't use all 64 cores (threads). "top" shows only 6-8, so I will repeat the test with -t8
____________
DeleteNull | |
|
|
### Correct result from a 10-year-old processor. Here is a summary of the info I have about it:
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority dtherm
time ./sllr64 -t4 -d -q64598*5^2318694-1
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FFT length 512K, Pass1=512, Pass2=1K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 17455.329 sec.
real 290m56.203s
user 1100m6.148s
sys 8m23.520s
Let me know if you want me to run it again on this CPU to see if it is consistently OK, perhaps with a different number of threads.
EDIT corrected list of flags which had been truncated previously
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
...but it don't use 64 cores (threads). "top" shows only 6-8, so I will repeat the test with -t8
I suspect there is something in it that makes it struggle to spread over many threads. When testing a 14 core Xeon with 14 of 28 threads, I observed total usage around 40%, not the 50% ideal case. If I looked at the threads individually, I could see each using about 80% of maximum. Thing is, when I dropped to 8 threads, I didn't see that ratio go up. I might revisit this in more detail once the current problem is resolved. | |
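As a quick sanity check, the utilisation figures quoted above are mutually consistent: 14 worker threads each keeping a hardware thread about 80% busy on a 28-thread CPU works out to the observed ~40% total.

```python
# Figures taken from the post above: 14 LLR threads on a 14-core/28-thread
# Xeon, each thread ~80% busy according to per-thread monitoring.
worker_threads = 14
hw_threads = 28
per_thread_busy = 0.80

total_utilisation = worker_threads * per_thread_busy / hw_threads
print(f"{total_utilisation:.0%}")   # matches the ~40% total usage observed
```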
|
|
And on a Kaby Lake, the TL;DR is that both the dynamically and the statically linked versions fail when multithreaded.
Kaby Lake i7-7700, BOINC running on just one core: multi-thread fails, residues not repeatable
model name : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
llr - dynamically linked version
./llr64 -t3 -d -q"64598*5^2318694-1"
64598*5^2318694-1 is not prime. RES64: C8D9CFDB87384909. OLD64: DF33DD802D17BF13 Time : 3298.826 sec
real 54m59.186s
user 158m43.544s
sys 1m27.772s
./llr64 -t3 -d -q"64598*5^2318694-1"
64598*5^2318694-1 is not prime. RES64: A5A8D021246DCBDA. OLD64: F0FA70636D49638B Time : 3310.157 sec.
real 55m10.508s
user 158m48.472s
sys 1m30.240s
###Residues different even when repeated immediately with same arguments
./llr64 -t1 -d -q"64598*5^2318694-1" | tee llr.out
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 8380.461 sec.
real 139m40.810s
user 139m36.872s
sys 0m0.368s
###Correct when single threaded
### now testing with the statically linked version
time ./sllr64 -t3 -d -q"64598*5^2318694-1" | tee -a sllr-test
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 3 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 2B0FAF5974E1EA95. OLD64: 8A7BE9E78D8387B2 Time : 3328.013 sec.
real 55m28.347s
user 159m5.232s
sys 1m37.008s
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes) | |
|
|
For those who don't also visit mersenneforum, the following is the latest info from George Woltman:
Looking at it now. I've got an assert in some debug code after several thousand iterations. Preliminary evidence suggests the bug occurs when one thread completes ALL its work before one of the other threads even starts its work. Obviously, this is more likely using faster FMA3 hardware. Also, more likely running more threads and smaller FFT sizes.
IMO further testing isn't needed at this point.
I suppose on the positive side, when I have time I can re-overclock my systems :) Now I have to get ready for a dark o'clock flight for work... | |
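George's description — one thread completing all of its work before another thread even starts — is a classic missing-synchronisation bug. The toy Python sketch below shows the general fix pattern: a barrier that stops any worker from running ahead of the others. This is illustrative only; the real gwnum code coordinates FFT passes, not a simple sum, and the function here is invented for the example.

```python
import threading

def parallel_sum(data, nthreads):
    """Sum `data` across worker threads, with a start-up barrier.

    The barrier guarantees no worker begins its chunk until every worker
    thread exists and has reached the same point -- the kind of start-up
    coordination whose absence George describes above.
    """
    chunks = [data[i::nthreads] for i in range(nthreads)]
    results = [0] * nthreads
    barrier = threading.Barrier(nthreads)

    def worker(idx):
        barrier.wait()                      # no thread runs ahead of the others
        results[idx] = sum(chunks[idx])     # each worker writes its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # wait for every partial result
    return sum(results)
```

In code that shares intermediate state between passes (as an FFT does), skipping that coordination lets a fast thread read data a slow thread has not produced yet, which is consistent with the nondeterministic residues reported in this thread.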
|
|
With 8 threads it's much faster than with 64 threads (64 core PC)
./sllr64 -t8 -d -q"64598*5^2318694-1"
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AMD K10 FFT length 512K, Pass1=512, Pass2=1K, 8 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 10211.895 sec.
____________
DeleteNull | |
|
|
With 8 threads it's much faster than with 64 threads (64 core PC)
./sllr64 -t8 -d -q"64598*5^2318694-1"
Base prime factor(s) taken : 5
Starting N+1 prime test of 64598*5^2318694-1
Using AMD K10 FFT length 512K, Pass1=512, Pass2=1K, 8 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 10211.895 sec.
Presumably it's using one socket ;-) What about 16 threads? That would still be one socket? | |
|
|
Presumably it's using one socket ;-) What about 16 threads? That would still be one socket?
I will test this tomorrow (have to sleep now). At the moment there are 8 SR5 WUs running, each with 8 threads. It was difficult to create an app_info.xml that works.
____________
DeleteNull | |
|
|
3.8.19 is out: and it looks like that bug for multi-core support is fixed!
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie! | |
|
Rafael Volunteer tester
Send message
Joined: 22 Oct 14 Posts: 885 ID: 370496 Credit: 334,085,845 RAC: 0
|
3.8.19 is out: and it looks like that bug for multi-core support is fixed!
Just ran it on my 4590 in 4 and 3 thread mode, and results are now matching. Seems like the bug was indeed fixed. | |
|
Artist Volunteer tester Send message
Joined: 29 Sep 08 Posts: 86 ID: 29825 Credit: 261,903,911 RAC: 0
|
3.8.19 is out: and it looks like that bug for multi-core support is fixed!
Just ran it on my 4590 in 4 and 3 thread mode, and results are now matching. Seems like the bug was indeed fixed.
Valid results on my i7-4770 and i5-6600T.
____________
144052*5^2018290+1 is Prime! | |
|
|
cllr64.3.8.19 -d -t4 -q"64598*5^2318694-1"
Starting N+1 prime test of 64598*5^2318694-1
Using FMA3 FFT length 512K, Pass1=256, Pass2=2K, 4 threads, a = 3
64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0 Time : 2653.428 sec.
Valid result on my i7-4790K. | |
|
|
Also valid result running with 4 threads on my Xeon E5-1650 v2 Mac.
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Roger Volunteer developer Volunteer tester
Send message
Joined: 27 Nov 11 Posts: 1137 ID: 120786 Credit: 267,535,355 RAC: 0
|
3.8.20 is out. Yum yum!
There was another bit of start-up code that also needed a fix while running an FMA3 PRP test with 5 or 7 threads. | |
|
|
As you can see here it's running well.
____________
DeleteNull | |
|
|
Just running it on one system as a test for now, and of 31 SR5 units, 15 validated, with 16 pending. Shall keep monitoring before wider deployment. | |
|
JimB Honorary cruncher Send message
Joined: 4 Aug 11 Posts: 916 ID: 107307 Credit: 974,514,092 RAC: 0
|
I'm running it here where it's going through PSP candidates pretty quickly (running each on four cores). A number of them have already validated.
Any inconclusives you see on that host are for candidates where someone returned a prime result that wasn't believable. My run came out composite and we're waiting for another result to break the tie. | |
|
Michael Goetz Volunteer moderator Project administrator
Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
|
A lot of people have discovered that running multi-threaded seems to offer a decent overall increase in throughput. That's great news, and somewhat unexpected.
But what about hyperthreading? There hasn't been a lot of information about whether running multiple llr threads on hyperthreads helped or hurt.
As part of the llr 3.8.20 validation testing, among other things we've been testing running specific tests with different -t settings. This is to make sure it works rather than checking speed, but it does give us some data.
I'm currently running tests on an old and obsoleted laptop. It's a first generation Core-i3, which has two physical cores, plus hyperthreading. First, the raw data:
387*2^3322763+1 is prime! (1000254 decimal digits) Time : 11833.816 sec.
387*2^3322763+1 is prime! (1000254 decimal digits) Time : 9472.417 sec.
387*2^3322763+1 is prime! (1000254 decimal digits) Time : 10306.515 sec.
387*2^3322763+1 is prime! (1000254 decimal digits) Time : 10666.731 sec.
That's a PPS-MEGA prime, and is the result from using the -t1 through -t4 parameters, in sequence.
There's a noticeable 20% drop in runtime going from 1 thread to 2 threads, but the runtimes go back up for 3 and 4 threads. Using the hyperthreads not only fails to increase throughput, it actually makes the computation slower.
On this one computer, therefore, clearly there's no advantage to using hyperthreads when running LLR multi-threaded.
I would not, however, assume that these results are applicable to other computers. It's a laptop, with mobile components, and many important details (memory, MB chipset) are built for low power consumption rather than speed. Also, this is an OLD laptop, and even though it's configured for maximum performance, it may be throttling the CPU to keep temperatures in check. (There's some evidence to support that notion.)
But at least it's one data point, and it's clear that on this computer avoiding hyperthreads is still the way to go.
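Turning those four runtimes into speedup and per-thread efficiency makes the picture concrete (times copied from the runs above); -t2 comes out around 1.25x, i.e. roughly 62% efficiency per thread, while -t3 and -t4 give less speedup than -t2 despite using more threads.

```python
# Wall-clock times (seconds) for -t1 .. -t4 from the Core i3 laptop above.
times = {1: 11833.816, 2: 9472.417, 3: 10306.515, 4: 10666.731}

base = times[1]
for t in sorted(times):
    speedup = base / times[t]        # relative to the single-threaded run
    efficiency = speedup / t         # speedup per thread used
    print(f"-t{t}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```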
____________
My lucky number is 75898524288+1 | |
|
|
That's a PPS-MEGA prime, and is the result from using the -t1 through -t4 parameters, in sequence.
When the new software is moved into production, how will the end user handle the -t1 to -t4 parameters? Will the software do some initial testing on its own and proceed accordingly, or will there be some mechanism (one-time or not) for all subsequent runs? | |
|
Michael Goetz Volunteer moderator Project administrator
Send message
Joined: 21 Jan 10 Posts: 13513 ID: 53948 Credit: 237,712,514 RAC: 0
|
That's a PPS-MEGA prime, and is the result from using the -t1 through -t4 parameters, in sequence.
When the new software is moved into production, how will the end user handle the -t1 to -t4 parameters? Will the software do some initial testing on its own and proceed accordingly, or will there be some mechanism (one-time or not) for all subsequent runs?
You will have complete control over how LLR operates. You'll be able to use app_config (or app_info) to directly specify the "-t" parameter and select how many threads you wish to run.
Barring such intervention by the user, the new LLR will run single threaded, as it does today.
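For BOINC, that kind of override looks something like the app_config.xml sketch below. The app name here is a placeholder, and the plan class and tag values are assumptions based on this thread — check the project's own documentation for the exact names before relying on this.

```xml
<app_config>
  <!-- app_name is hypothetical; substitute the LLR app name your project reports -->
  <app_version>
    <app_name>llrSR5</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>   <!-- tell BOINC the task occupies 4 CPUs -->
    <cmdline>-t 4</cmdline>    <!-- pass -t 4 through to LLR -->
  </app_version>
</app_config>
```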
____________
My lucky number is 75898524288+1 | |
|
|
That's a PPS-MEGA prime, and is the result from using the -t1 through -t4 parameters, in sequence.
When the new software is moved into production, how will the end user handle the -t1 to -t4 parameters? Will the software do some initial testing on its own and proceed accordingly, or will there be some mechanism (one-time or not) for all subsequent runs?
You will have complete control over how LLR operates. You'll be able to use app_config (or app_info) to directly specify the "-t" parameter and select how many threads you wish to run.
Barring such intervention by the user, the new LLR will run single threaded, as it does today.
Thanks! | |
|
Message boards :
Number crunching :
LLR Version 3.8.20 released |