Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
[B^S] Paul@home Send message Joined: 18 Sep 05 Posts: 34 Credit: 393,096 RAC: 0 |
Hi Phil,
You have been looking at this in more detail than I have and, admittedly, it was 2am when I was trawling the code last night, so I do accept what you are saying. I just can't find it in the code (yet!). The only place I can see the DCF used is in calculating the client's estimated time to completion (what you see in BOINC Manager). This figure does not appear to have any relationship to the max allowed CPU time for a work unit.
It certainly would! I believe it is quite difficult for them to get a reasonable estimate of the number of fpops in a given WU type, but if they could manage that somehow, it might fix the problem. Perhaps, as David B. suggests, they could run a few of each WU type through a test server to determine an accurate run time. If this were a public server they would not even need to do the work themselves: just set a high fpops_bound in the WU and let them out. The bound value could then be increased or reduced accordingly. Cheers and have a good weekend (I'm not back at a computer till Monday!) Paul. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Hi Phil, After looking at this a little further I think I know what is happening. There are really two clocks running down toward the completion of a WU. There is the flop counter that internally determines when a WU has exceeded the maximum time it will be allowed to run, and there is the time-to-completion clock that is presented to the user through the interface. I am assuming here that everyone would agree that if you have a certain number of flops available, and the system runs at a certain speed, then combining the two yields, in effect, a clock. As the WU progresses both of these clocks count down.
While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled. Clearly the DCF directly adjusts the completion clock. In my testing I have been able to extend the projected run time of a WU by manually adjusting the DCF. This also appears to actually provide additional run time for the WU. But this will only go so far. I have determined that there is an absolute maximum time beyond which you cannot force the system to continue working on a particular WU. Interestingly, this limit is almost exactly the same across WU types. Obviously that limit is the flops clock timing out.
It now looks as though the flops clock can be set to a value that is higher than the calculated completion clock. This makes sense, as they are completely separate. It may be that when the completion clock hits zero or drops below zero, BOINC stops the process. By adjusting the DCF, in effect resetting the completion clock to a higher value, you can increase the length of time it takes BOINC to count the completion clock down to zero, thus allowing the process to run longer. But the absolute maximum time is still the flops clock.
So once the DCF is set to a value higher than the time the flops counter will allow (based on system speed), the WU will fail on max time when it hits the flops limit. While this theory matches the observed behavior of the system, it would take some looking at the BOINC and R@H code to determine what actually happens if the completion time becomes zero or less than zero. Regards Phil |
Mistral Send message Joined: 29 Sep 05 Posts: 1 Credit: 3,568 RAC: 0 |
"While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled."
Hi Phil, Just to add some complexity to your thoughts: as far as Predictor@Home is concerned, the WU's percentage of completion will climb to 103%, then go back to 97%, and the WU completes in a few seconds. While the percentage of completion is between 100% and 103%, the "Remaining time" column shows only "---" (i.e. the completion clock has run out); it shows a few minutes again when the percentage goes back to 97%. But this is normal behaviour for P@H. Hope this helps you translate the Rosetta stone :-) Regards Pierre |
Darren Send message Joined: 6 Oct 05 Posts: 27 Credit: 43,535 RAC: 0 |
"While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled."
The complexity becomes even greater when you look at how a SETI WU handles the completion clock. SETI units keep running for anywhere from a few minutes (normal SETI app) up to 45 minutes (the new enhanced beta SETI app) after the completion clock reaches 100%. For SETI units, the clock does not go over 100% as it does for the Predictor units. Instead, it simply stops at 100% and the remaining time stays at "---" while the work unit continues to run to actual completion. |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
As I understand it, the DCF is a scaling factor that is used (effectively) to tweak the benchmarks, for example when estimating the run times of WUs to test for EDF mode, etc. If it is the tweaked benchmark that is used to set the max run time from the max number of ops, then on any one machine the actual max applied will be proportional to the DCF. If it is the raw benchmark that is applied, then it won't. Hope that helps. Hope it's right ;-) |
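River~~'s two cases can be put side by side in a small sketch. This is only an illustration: the function names, the benchmark figure and the bound are made-up values, and the sketch does not claim to show which variant BOINC actually uses.

```python
def max_cpu_time_raw(fpops_bound: float, benchmark_fpops: float) -> float:
    """Limit in seconds if the raw benchmark is used (the DCF plays no part)."""
    return fpops_bound / benchmark_fpops


def max_cpu_time_tweaked(fpops_bound: float, benchmark_fpops: float, dcf: float) -> float:
    """Limit in seconds if the benchmark is first corrected by the DCF."""
    # Assumption: a DCF above 1 means the host is slower than its benchmark suggests.
    corrected_speed = benchmark_fpops / dcf
    return fpops_bound / corrected_speed


if __name__ == "__main__":
    bound = 9.0e13   # hypothetical fpops bound carried by the WU
    bench = 1.5e9    # hypothetical benchmark, floating point ops per second
    for dcf in (1.0, 2.0, 4.0):
        print(dcf, max_cpu_time_raw(bound, bench), max_cpu_time_tweaked(bound, bench, dcf))
```

In the first variant the limit stays the same whatever the DCF is; in the second it scales linearly with the DCF, which is the proportionality described above.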
kevint Send message Joined: 8 Oct 05 Posts: 84 Credit: 2,530,451 RAC: 0 |
"This work unit processed for about 5 hours - when Boinc did its automatic switch to work on another project this WU aborted."
Ok, this seems to work for now, but now I have a different problem that appears to be nearly the same: when Boinc does its automatic benchmark I get this problem on a couple of my machines. It does not happen all the time, but it does happen.
1/20/2006 1:56:22 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/20/2006 1:56:51 PM||Suspending computation and network activity - running CPU benchmarks
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 (removed from memory)
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 (removed from memory)
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM||request_reschedule_cpus: process exited
1/20/2006 1:56:52 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 finished
1/20/2006 1:56:52 PM||Running CPU benchmarks
1/20/2006 1:56:53 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 finished
SETI.USA |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"This work unit processed for about 5 hours - when Boinc did its automatic switch to work on another project this WU aborted."
This is related to the keep-in-memory issue. When the system benchmarks, it removes the application from memory (as shown in your messages). The fact that the system is benchmarking does not matter. What does matter is that the app was removed from memory. This has the same effect as an application switch with "keep in memory" set to no. So of course the WUs abort. Not a good thing, but that is how it happens. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
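A side note on the log entries quoted above: the two numbers in "exit code -1073741819 (0xc0000005)" are the same 32-bit value, printed once as a signed integer and once as unsigned hex. On Windows, 0xC0000005 is the status code for an access violation. A minimal sketch of the conversion follows; the helper name is made up for illustration.

```python
def as_unsigned32_hex(exit_code: int) -> str:
    """Render a signed 32-bit exit code the way the BOINC log shows it in hex."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement value as unsigned.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    print(as_unsigned32_hex(-1073741819))
```

This only decodes the number. It does not show why the application faulted after being removed from memory.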
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Hi Phil, Pierre & Darren, I see the same behavior you describe on those applications. However, the percent complete is used to calculate the running completion clock, and if the percent exceeds 100% the clock will be nulled out. This is different from the completion time actually running out before the percent reaches 100. It would be possible to have a lot of nasty things like zero divides going on if the completion clock runs down to a value below zero. Now I have to assume that the BOINC programmers are smart enough to prevent problems like that under normal conditions, but I can imagine a situation where they might figure the completion time would never hit zero because it is calculated from the percent complete. Once the WU hits 100% the completion clock is no longer incremented. If you watch closely on P@H, the clock never hits zero until the precise moment that the WU hits or exceeds 100% complete.
With R@H we have a situation where all kinds of screwy things are going on with the completion clock. The time actually rises between changes in percent complete, and then it jumps all at once to a new value based in part on percent complete. It is possible for a R@H WU to run out of time on the completion clock before the WU actually completes. So the question is, what happens if the time to completion runs out at, say, 95% complete? Could this abort a WU? Since I have never seen this kind of clock behavior on any other project, I have nothing to go on. But based on the behavior of all the clocks and percent counters (which are kept by BOINC), we can generally assume that BOINC was not designed to handle whatever R@H is doing as it runs. Good thoughts though. Keep thinking, we will eventually figure this thing out. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
BOINC isn't doing a count-down on the time. It increases the current-time variable. It periodically compares the current time to the max time, and if the current time is greater, the WU is aborted. The max time is calculated from the benchmark score and the limit specified by the WU. The DCF is not used in this calculation in the official version of BOINC. The time-remaining variable is never incremented or decremented. It is periodically recalculated using the original estimated time, the current time, and the corrected speed of the machine. The DCF and the benchmark score are combined to get the corrected speed. |
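The logic described in this post can be sketched roughly as follows. This is a simplified model, not BOINC source code: the class, the field names and the exact time-remaining formula are assumptions made for illustration, since the post only says which inputs go into each calculation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    fpops_est: float       # project's estimate of the work in the WU (assumed name)
    fpops_bound: float     # project's upper limit on the work (assumed name)
    cpu_time: float = 0.0  # seconds of CPU used so far; this only ever increases


def max_cpu_time(task: Task, benchmark_fpops: float) -> float:
    """Abort limit in seconds: benchmark score and WU limit only, no DCF."""
    return task.fpops_bound / benchmark_fpops


def should_abort(task: Task, benchmark_fpops: float) -> bool:
    """Periodic check: abort once the current CPU time exceeds the max time."""
    return task.cpu_time > max_cpu_time(task, benchmark_fpops)


def time_remaining(task: Task, benchmark_fpops: float, dcf: float) -> float:
    """Recomputed from scratch on every call; nothing is counted down."""
    # Assumption: the DCF divides the benchmark to give the corrected speed.
    corrected_speed = benchmark_fpops / dcf
    estimated_total = task.fpops_est / corrected_speed
    return max(0.0, estimated_total - task.cpu_time)


if __name__ == "__main__":
    task = Task(fpops_est=3.0e13, fpops_bound=9.0e13, cpu_time=7200.0)
    print(should_abort(task, 1.5e9), time_remaining(task, 1.5e9, dcf=2.0))
```

In this model, changing the DCF moves the displayed time remaining but leaves the abort limit untouched, which is the distinction the post draws.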
KaptainBlazzed Send message Joined: 30 Dec 05 Posts: 3 Credit: 969,393 RAC: 0 |
I got this error: Unrecoverable error for result PRODUCTION_ABINITIO_2chf__250_1035_0 (Maximum CPU time exceeded). The same goes for PRODUCTION_ABINITIO_2acy__250_1035_0. In total I lost 12 hours of CPU time on these 2 WUs. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"BOINC isn't doing a count-down on the time. It increases the current-time variable. It periodically compares the current time to the max time, and if the current time is greater, the WU is aborted. The max time is calculated from the benchmark score and the limit specified by the WU. The DCF is not used in this calculation in the official version of BOINC."
What you have described makes no sense. Clearly the time to completion DOES decrement. It does this on all of the projects. The CPU time rises as processing moves along. Moreover, the completion time decrements in proportion to the percent complete. That is why you can see it rising on R@H WUs as they process: the CPU time is rising but the percent complete is not, so the "To completion" time also rises until the percent complete finally changes. While I would agree that the DCF is used to determine the "to completion" time, I would disagree that BOINC is not making use of these numbers. The absolute time for a WU to complete is set by a value stored in the WU. But that is an absolute value. Since the project could not possibly have any idea how much time the slowest machine might take, there must be a system to make adjustments. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
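One simple model that reproduces the rising "To completion" figure described here is a plain extrapolation from the fraction done. The formula below is an assumption chosen for illustration, not something taken from the BOINC code, and the function name is made up.

```python
from typing import Optional


def remaining_from_fraction(cpu_time: float, fraction_done: float) -> Optional[float]:
    """Extrapolate the remaining seconds from CPU time used and fraction done."""
    if fraction_done <= 0.0:
        return None  # nothing to extrapolate from yet; also avoids dividing by zero
    if fraction_done >= 1.0:
        return 0.0
    return cpu_time * (1.0 - fraction_done) / fraction_done


if __name__ == "__main__":
    # The fraction stays at 25% while the CPU time keeps growing.
    for cpu_seconds in (3600.0, 5400.0, 7200.0):
        print(cpu_seconds, remaining_from_fraction(cpu_seconds, 0.25))
    # When the fraction finally jumps, the estimate drops all at once.
    print(7200.0, remaining_from_fraction(7200.0, 0.50))
```

With the fraction held at 25%, the estimate grows from 10800 to 21600 seconds as the CPU time doubles, and it falls to 7200 seconds the moment the fraction moves to 50%. Note that an estimate computed this way is recalculated each time rather than counted down, so the model fits the observed display without settling the disagreement above.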
KaptainBlazzed Send message Joined: 30 Dec 05 Posts: 3 Credit: 969,393 RAC: 0 |
Now this one too. I am probably going to abort ALL PRODUCTION_ABINITIO_xxxxxxx WUs. I cannot waste my CPU time like this!! Unrecoverable error for result PRODUCTION_ABINITIO_1acf__250_338_0 (Maximum CPU time exceeded) |
KaptainBlazzed Send message Joined: 30 Dec 05 Posts: 3 Credit: 969,393 RAC: 0 |
I aborted this one after 4 1/2 hours, when it was only 1% done: Unrecoverable error for result NO_VARY_OMEGA_2reb_253_1552_0 (aborted via GUI RPC) |
Viking69 Send message Joined: 3 Oct 05 Posts: 20 Credit: 6,815,776 RAC: 2,618 |
1/21/2006 2:42:03 PM|rosetta@home|Unrecoverable error for result DEFAULT_2reb_219_913_1 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:42:03 PM||request_reschedule_cpus: process exited
1/21/2006 2:42:03 PM|rosetta@home|Computation for result DEFAULT_2reb_219_913_1 finished
This one stopped after 50 minutes. Hi all you enthusiastic crunchers..... |
SarahCorreia Send message Joined: 11 Dec 05 Posts: 1 Credit: 3,351 RAC: 0 |
1/22/2006 6:43:06 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_13924_0 ( - exit code -1073741819 (0xc0000005))
1/22/2006 6:43:06 AM||request_reschedule_cpus: process exited
1/22/2006 6:43:06 AM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_13924_0 finished
1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
1/21/2006 2:47:31 PM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:47:33 PM||request_reschedule_cpus: process exited
1/21/2006 7:24:04 AM|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 (removed from memory)
1/21/2006 7:24:07 AM|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 7:24:08 AM||request_reschedule_cpus: process exited
1/21/2006 7:24:08 AM|rosetta@home|Computation for result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 finished
1/21/2006 12:53:22 AM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 (removed from memory)
1/21/2006 12:53:23 AM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 12:53:24 AM||request_reschedule_cpus: process exited
1/21/2006 12:53:24 AM|rosetta@home|Computation for result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 finished
1/20/2006 2:32:51 AM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 (removed from memory)
1/20/2006 2:32:53 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 2:32:53 AM||request_reschedule_cpus: process exited
1/20/2006 2:32:53 AM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 finished
1/19/2006 12:40:55 PM|rosetta@home|Pausing result PRODUCTION_ABINITIO_1a32__250_1520_0 (removed from memory)
1/19/2006 12:40:56 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1a32__250_1520_0 ( - exit code -1073741819 (0xc0000005))
1/19/2006 12:41:00 PM||request_reschedule_cpus: process exited
1/19/2006 12:41:01 PM|rosetta@home|Computation for result PRODUCTION_ABINITIO_1a32__250_1520_0 finished
1/19/2006 4:57:06 AM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1a19A_250_1520_0 ( - exit code -1073741819 (0xc0000005))
1/19/2006 4:57:06 AM||request_reschedule_cpus: process exited
1/19/2006 4:57:06 AM|rosetta@home|Computation for result PRODUCTION_ABINITIO_1a19A_250_1520_0 finished |
Darren Send message Joined: 6 Oct 05 Posts: 27 Credit: 43,535 RAC: 0 |
1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
You need to change your preferences to leave the app in memory. If you go into your online account, you'll find the option under your general preferences. Just change the "leave applications in memory while preempted" setting to "yes". |
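For anyone with a long message log, the pattern Darren points at (a result paused with "removed from memory", then an unrecoverable error for the same result) can be picked out with a short script. This is a rough sketch that assumes the pipe-separated line format seen in this thread and lines in chronological order; the function name is made up for illustration.

```python
import re

# Patterns follow the message wording quoted in this thread.
PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) ")


def failed_after_unload(log_lines):
    """Return results that errored after being removed from memory."""
    unloaded = set()
    hits = []
    for line in log_lines:
        # Each line looks like: timestamp|project|message
        message = line.split("|", 2)[-1]
        paused = PAUSED.search(message)
        if paused:
            unloaded.add(paused.group(1))
            continue
        failed = FAILED.search(message)
        if failed and failed.group(1) in unloaded:
            hits.append(failed.group(1))
    return hits


if __name__ == "__main__":
    sample = [
        "1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)",
        "1/21/2006 2:47:31 PM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 ( - exit code -1073741819 (0xc0000005))",
        "1/21/2006 2:47:33 PM||request_reschedule_cpus: process exited",
    ]
    print(failed_after_unload(sample))
```

Results that fail for other reasons, such as "Maximum CPU time exceeded", are not reported, because they were never seen being removed from memory first.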
Viking69 Send message Joined: 3 Oct 05 Posts: 20 Credit: 6,815,776 RAC: 2,618 |
1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
Yes, I used to do that, but then my PCs would be using 200% of the available RAM (512MB). Even when I increased the swap file to allow this, the performance of the PC suffered. I believe that Rosetta is the only BOINC system that requires this to be enabled, but it affects all the BOINC systems I am running. Hi all you enthusiastic crunchers..... |
Trog Dog Send message Joined: 25 Nov 05 Posts: 129 Credit: 57,345 RAC: 0 |
"I believe that Rosetta is the only BOINC system that requires this to be enabled, but it affects all the BOINC systems I am running."
As far as I can work out, World Community Grid also requires this setting on Windows machines; it uses an earlier version of the Rosetta app. I'm pretty sure that Climate Prediction wants you to leave the results in memory too. I'm not prepared to do this on my systems, so I don't run CPDN and only run Rosetta and WCG on my Linux boxes. |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
Hi, I've had a couple of weird WUs recently:
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113121 This failed with a 'Maximum CPU time exceeded' and it certainly wasted enough CPU cycles in the process, having run for 6.5 hrs.
https://boinc.bakerlab.org/rosetta/result.php?resultid=7442399 This one is really strange. It had been sitting at 70% complete for over an hour with the CPU time reading about 5.5 hrs. I was beginning to worry that it was stuck but thought I'd leave it and hope for the best. Soon afterwards I needed to reboot the machine, and when I next checked its progress the CPU time had dropped to half an hour but it was still at 70% progress! I decided at this point to abort the WU.
Several others have failed with an exit status of '1073741819 (0xc0000005)' and they all happen to be similar types (PRODUCTION_ABINITIO_xxxxxxx):
https://boinc.bakerlab.org/rosetta/result.php?resultid=7449596
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113161
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113121
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113092
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113091 |
XS_Duc Send message Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0 |
I have two to report for the moment. I just aborted this one, stuck at 1% after more than 7 hours: https://boinc.bakerlab.org/rosetta/result.php?resultid=7670806 The other one gave a 'Max CPU time exceeded' error after crunching for more than 14 hours: https://boinc.bakerlab.org/rosetta/result.php?resultid=7165518 The weak shall perish... |
©2024 University of Washington
https://www.bakerlab.org