I just noticed: progress advances by about 0.3% and then goes back, as if it were double-checking?
60.543% - 60.871%, and then back to 60.543%.
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie!
I don't think that DC is needed for the sieve. If the sieve eliminates a candidate, I think it will also return the factor it found, which can be instantly double-checked by the PrimeGrid server.
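To make that concrete: checking a reported factor only takes one modular exponentiation, which is why a server can do it instantly. A minimal sketch, assuming a candidate of the form k*b^n+c and a 64-bit factor p (the numbers in main() are tiny and purely illustrative, not real sieve output):

```cpp
#include <cstdint>
#include <iostream>

// Modular multiplication via 128-bit intermediates (GCC/Clang extension).
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((__uint128_t)a * b % m);
}

// Modular exponentiation by repeated squaring: base^exp mod m.
static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}

// True if p divides k*b^n + c (c is typically +1 or -1), i.e. the factor checks out.
static bool factor_is_valid(uint64_t k, uint64_t b, uint64_t n, long long c, uint64_t p) {
    uint64_t residue = mulmod(k % p, powmod(b, n, p), p);          // k*b^n mod p
    long long cm = c % (long long)p;
    uint64_t c_mod = (uint64_t)(cm < 0 ? cm + (long long)p : cm);  // c mod p, made non-negative
    return ((__uint128_t)residue + c_mod) % p == 0;
}

int main() {
    // Toy example: 3 divides 5*2^4 + 1 = 81, so the check succeeds.
    std::cout << (factor_is_valid(5, 2, 4, +1, 3) ? "factor confirmed" : "not a factor") << "\n";
}
```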
____________
Then it is a mystery what I saw :)
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie!
Still no answer to my question?
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie!
> I just noticed: progress advances by about 0.3% and then goes back, as if it were double-checking?
> 60.543% - 60.871%, and then back to 60.543%.
The percentage BOINC displays is just an estimate; I guess BOINC re-calculated the percentage at those points...
Michael Goetz (Volunteer moderator, Project administrator)
> Still no answer to my question?
Probably not. Usually when you don't get an answer it's because nobody has one.
But it's definitely not double checking of any sort.
BOINC should **NEVER** go backwards like that. The BOINC programming guidelines specifically prohibit having the progress meter go backwards.
Something incorrect happened. Other than a bug in the software, the only scenario I can think of that could cause this would be shutting down and restarting the WU, in which case the WU would restart from the most recent checkpoint. The checkpoint file would not be as far along as when the WU shut down, so after the restart progress could be lower than it was prior to the restart.
If there was no restart involved then I don't know.
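In code, the restart-from-checkpoint pattern looks roughly like this; a minimal sketch using the standard BOINC API calls (boinc_init, boinc_fraction_done, boinc_time_to_checkpoint, boinc_checkpoint_completed, boinc_finish), with a placeholder workload and checkpoint file rather than PrimeGrid's actual sieve code:

```cpp
#include <cstdio>
#include "boinc_api.h"   // BOINC application API (link against the BOINC libraries)

static const long TOTAL_ITERATIONS = 1000000;   // placeholder workload size

// Read the iteration saved by the last checkpoint; 0 if no checkpoint exists yet.
static long load_checkpoint() {
    long iter = 0;
    if (FILE* f = std::fopen("checkpoint.txt", "r")) {
        std::fscanf(f, "%ld", &iter);
        std::fclose(f);
    }
    return iter;
}

static void save_checkpoint(long iter) {
    if (FILE* f = std::fopen("checkpoint.txt", "w")) {
        std::fprintf(f, "%ld\n", iter);
        std::fclose(f);
    }
}

int main() {
    boinc_init();
    // After a shutdown the task resumes HERE, from the last checkpoint -- which
    // is why the reported progress can be lower than it was before the restart.
    for (long iter = load_checkpoint(); iter < TOTAL_ITERATIONS; iter++) {
        // ... one slice of real work would go here ...

        boinc_fraction_done((double)iter / TOTAL_ITERATIONS);
        if (boinc_time_to_checkpoint()) {    // the client says now is a good moment
            save_checkpoint(iter);
            boinc_checkpoint_completed();
        }
    }
    boinc_finish(0);                         // report completion to the client
}
```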
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
> I just noticed: progress advances by about 0.3% and then goes back, as if it were double-checking?
> 60.543% - 60.871%, and then back to 60.543%.
> The percentage BOINC displays is just an estimate; I guess BOINC re-calculated the percentage at those points...
The percentage done isn't an estimate; it's the time remaining that BOINC estimates. At least with PrimeGrid's apps, the percentage done should be pretty reliable. It's the estimate of the time remaining that's often incredibly inaccurate.
It's not BOINC that computes the percentage done, it's the application itself. The application reports the percentage done to BOINC. It's conceivable that the sieve app is reporting percentages that fluctuate, but if it is, it's not supposed to be doing that.
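For the time-remaining figure, the client essentially extrapolates from the elapsed time and the fraction the application reported. The sketch below is a simplification of what the client does (the real scheduler also mixes in a-priori size estimates and correction factors), but it shows why a perfectly accurate percentage can still produce a wildly wrong remaining-time estimate when the task's speed varies:

```cpp
#include <iostream>

// Rough sketch of turning a reported fraction-done into a remaining-time guess.
// Not the BOINC client's actual code -- just the basic extrapolation idea.
double estimate_remaining_seconds(double elapsed_seconds, double fraction_done) {
    if (fraction_done <= 0.0) return -1.0;   // nothing to extrapolate from yet
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done;
}

int main() {
    // 60.543% done after 10 hours of run time:
    double remaining = estimate_remaining_seconds(10 * 3600.0, 0.60543);
    std::cout << remaining / 3600.0 << " hours remaining (extrapolated)\n";
}
```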
____________
My lucky number is 75898^524288+1
axn (Volunteer developer)
> It's conceivable that the sieve app is reporting percentages that fluctuate, but if it is, it's not supposed to be doing that.
Since it is going back to the same % (60.543), the checkpoint explanation is the most likely one. Could it be that some file permission issue, disk space issue, or some such problem is causing the checkpoint writing itself to get stuck?
Michael Goetz (Volunteer moderator, Project administrator)
> It's conceivable that the sieve app is reporting percentages that fluctuate, but if it is, it's not supposed to be doing that.
> Since it is going back to the same % (60.543), the checkpoint explanation is the most likely one. Could it be that some file permission issue, disk space issue, or some such problem is causing the checkpoint writing itself to get stuck?
Anything's possible, but it's kind of hard to imagine a checkpoint operation failing and doing something other than either crashing the app, or just continuing processing without doing the checkpoint. I'm not even sure how you would accidentally write it so that the app backs up. That would take... talent. :)
____________
My lucky number is 75898^524288+1
axn (Volunteer developer)
> Anything's possible, but it's kind of hard to imagine a checkpoint operation failing and doing something other than either crashing the app, or just continuing processing without doing the checkpoint. I'm not even sure how you would accidentally write it so that the app backs up. That would take... talent. :)
I'm thinking something like... "App resumes from checkpoint. After some time, it tries to write a new checkpoint, but is unable to (for whatever reason). Crashes. BOINC sees the app is dead, restarts it, and voila... back to square one."
I dunno enough about BOINC or the app to know if this is realistic :(
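One detail that would fit that scenario: if the checkpoint is written atomically (write a temporary file, then rename it over the old one), a failed write leaves the previous checkpoint intact, so every restart would come back to exactly the same 60.543%. A sketch of that pattern, with placeholder file names rather than the sieve's real checkpoint format:

```cpp
#include <cstdio>
#include <stdexcept>

// Write the checkpoint atomically: if anything fails mid-write, the old
// "checkpoint.txt" is left untouched, so a restarted task resumes from the
// same older iteration every time.  File names here are placeholders.
void save_checkpoint_atomic(long iteration) {
    FILE* f = std::fopen("checkpoint.tmp", "w");
    if (!f) throw std::runtime_error("cannot open temporary checkpoint file");
    if (std::fprintf(f, "%ld\n", iteration) < 0 || std::fflush(f) != 0) {
        std::fclose(f);
        std::remove("checkpoint.tmp");
        throw std::runtime_error("checkpoint write failed (disk full? permissions?)");
    }
    std::fclose(f);
    // Atomic replace on POSIX; on Windows the old file must be removed first.
    if (std::rename("checkpoint.tmp", "checkpoint.txt") != 0)
        throw std::runtime_error("could not replace the old checkpoint");
}

int main() {
    try {
        save_checkpoint_atomic(123456);
    } catch (const std::runtime_error& e) {
        std::fprintf(stderr, "%s\n", e.what());
        return 1;   // the app dies; the BOINC client would restart it later
    }
}
```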
In any case, with that behavior my WU takes about 80% more time than it should. But on the other hand, I looked at the same CPUs from other users and they have very similar times to mine, so it must be happening to them also...
____________
92*10^1439761-1 REPDIGIT PRIME :) :) :)
314187728^131072+1 GENERALIZED FERMAT
31*332^367560+1 CRUS PRIME
Proud member of team Aggie The Pew. Go Aggie!
Michael Goetz (Volunteer moderator, Project administrator)
I thought of another possibility. This is pure speculation on my part, so it might be totally wrong...
You know how GeneferCUDA, when overclocked, errors out with that "maxErr exceeded" error? MaxErr wasn't really designed to catch hardware errors, although it fills that role admirably. It was designed to catch inevitable arithmetic overflows that are the result of the way programs like this do their math. In Genefer, it's what happens when you go beyond the usable 'b' limit.
LLR is more sophisticated and more flexible than Genefer in that regard. When it detects that maxErr is too large, it restarts the calculation from the beginning with a larger (and therefore slower) FFT size. Presumably, as long as the maxErr wasn't caused by a hardware problem, the calculation will then succeed with the larger FFT.
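A minimal sketch of that retry logic (this is not LLR's source; the threshold and the round-off values below are made up for illustration): when the largest observed round-off error gets too close to 0.5, rounding to the nearest integer can no longer be trusted, so the whole test is thrown away and rerun at the next FFT length.

```cpp
#include <cstdio>

// Stand-in for one full run of the test at a given FFT length.  A real
// program would perform the actual iterations and track the largest
// round-off error seen; here we just pretend larger FFTs round better.
static double run_test(long fft_length) {
    return 2.0e4 / (double)fft_length;    // fake "maxErr" that shrinks with FFT size
}

int main() {
    const double MAX_ERR_LIMIT = 0.45;    // illustrative threshold, not LLR's exact value
    long fft_length = 32768;

    // Keep restarting FROM THE BEGINNING with a larger (and slower) FFT
    // until the round-off error stays within bounds.
    for (;;) {
        double max_err = run_test(fft_length);
        std::printf("FFT %ld: maxErr %.3f\n", fft_length, max_err);
        if (max_err <= MAX_ERR_LIMIT) break;   // result is now trustworthy
        fft_length *= 2;                       // retry with the next FFT length
    }
}
```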
Under BOINC rules, when this restart happens, the progress meter should NOT go back to zero. It should keep going forward.
However, LLR is not a native BOINC program. It's run inside a BOINC wrapper, and it's the wrapper that handles all the BOINC stuff, including setting the progress meter. The wrapper reads the status lines which are output from LLR, converts that into a percentage completed value, and sends that to BOINC. Therefore, it probably can't handle the FFT restart correctly, so the meter will go backwards.
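The core of such a wrapper is only a few lines: parse whatever progress line the child program prints, turn it into a fraction, and hand it to the client. The "Iteration: X / Y" format below is invented for illustration (the real LLR output and the real PrimeGrid wrapper differ), but it shows why the wrapper cannot tell a deliberate FFT restart from ordinary progress; it just relays whatever number it sees.

```cpp
#include <cstdio>
#include <string>
#include "boinc_api.h"

// Parse a progress line of the (hypothetical) form "Iteration: 12345 / 67890"
// and forward the fraction to the BOINC client.  If the child starts over
// with a larger FFT, the relayed fraction drops right along with it.
void relay_progress(const std::string& line) {
    long done = 0, total = 0;
    if (std::sscanf(line.c_str(), "Iteration: %ld / %ld", &done, &total) == 2
        && total > 0) {
        boinc_fraction_done((double)done / (double)total);
    }
}

int main() {
    boinc_init();
    char buf[256];
    while (std::fgets(buf, sizeof buf, stdin)) {   // stand-in for the child's output pipe
        relay_progress(buf);
    }
    boinc_finish(0);
}
```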
There's just one problem with my scenario: If that happens, the progress meter should go all the way back to 0. It wouldn't go back by just a small amount. Also, these FFT restarts should be pretty rare events.
____________
My lucky number is 75898^524288+1
Michael,
Under your scenario it is feasible that the error occurred on the current number in the sieve file it is working on and it went back to the beginning of that number. I am not sure if sieving would generate a "maxErr exceeded" error, but your explanation has some merit.
Michael Goetz (Volunteer moderator, Project administrator)
> Michael,
> Under your scenario it is feasible that the error occurred on the current number in the sieve file it is working on and it went back to the beginning of that number. I am not sure if sieving would generate a "maxErr exceeded" error, but your explanation has some merit.
This isn't applicable to sieving, which doesn't use FFTs (and thus doesn't have the concept of maxErr).
____________
My lucky number is 75898^524288+1