Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=6869420 Never had it before. Just a computer which couldn't flush (installed R@H last week) because of ZA and I've changed that so it could flush. Now all of the WU's are like the above. And couldn't upload any WU's this afternoon. Reason can be ???? |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=6869420 Update : Just checked and the downloading seems to have been OK now. Don't know about the other problem though. |
Tibor Futo Send message Joined: 13 Jan 06 Posts: 1 Credit: 472 RAC: 0 |
1/14/2006 10:08:04 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1dcj_240_4586_0 ( - exit code -1073741819 (0xc0000005)) 1/14/2006 11:24:37 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_MORE_FRAGS_1hz6_241_4827_0 ( - exit code -1073741819 (0xc0000005)) Check also: https://boinc.bakerlab.org/rosetta/results.php?userid=50111 And I think these errors are related to project switching. I noticed one time that Rosetta was working fine, then BOINC switched projects, and when it reloaded Rosetta next time, it tried to start then gave the error. Tibor |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 15 |
And I think these errors are related to project switching. I noticed one time that Rosetta was working fine, then BOINC switched projects, and when it reloaded Rosetta next time, it tried to start then gave the error. These look like the "application not left in memory" bug; if you have "leave applications in memory when preempted" set to "no", you'll need to set it to "yes" until this bug is exterminated... |
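The two forms of the exit code in these log lines are the same 32-bit value: BOINC prints it as a signed integer, and the hex form 0xC0000005 is the Windows status code for an access violation. A minimal sketch of the conversion (the helper name `to_unsigned32` is made up for illustration):

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# Exit code as printed in the BOINC messages quoted above.
signed_code = -1073741819
print(hex(to_unsigned32(signed_code)))  # 0xc0000005
```

This only decodes the number; it says nothing about why the application touched invalid memory.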
godpiou Send message Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
Hi! Sorry, but another error... |rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005)) And I support the hypothesis that project switching causes this type of error. Look at this part of my log: 14-01-06 22:39:34|SETI@home|Restarting result 05oc03ab.24335.11026.429814.1.28_1 using setiathome version 418 14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory) 14-01-06 22:39:35|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005)) 14-01-06 22:39:35||request_reschedule_cpus: process exited 14-01-06 22:39:35|rosetta@home|Computation for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 finished And... again... hope this helps! Godpiou |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 15 |
14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory) This is DEFINITELY the known bug. You _MUST_ set "leave applications in memory when preempted" to _YES_!!! |
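The pattern called out here ("removed from memory" followed by an unrecoverable error for the same result) can be looked for in a saved message log. A rough sketch, assuming the `date time|project|message` layout shown in the posts above; the function name and the regular expressions are illustrative and are not part of BOINC:

```python
import re

# Message fragments as they appear in the log lines quoted in this thread.
PAUSE_RE = re.compile(r"Pausing result (\S+) \(removed from memory\)")
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+)")


def crashed_after_removal(log_lines):
    """Return results that errored out after being removed from memory."""
    removed = set()
    crashed = []
    for line in log_lines:
        # Keep only the message part after "timestamp|project|".
        message = line.split("|", 2)[-1]
        paused = PAUSE_RE.search(message)
        if paused:
            removed.add(paused.group(1))
            continue
        failed = ERROR_RE.search(message)
        if failed and failed.group(1) in removed:
            crashed.append(failed.group(1))
    return crashed


sample = [
    "14-01-06 22:39:34|rosetta@home|Pausing result "
    "NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory)",
    "14-01-06 22:39:35|rosetta@home|Unrecoverable error for result "
    "NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005))",
]
print(crashed_after_removal(sample))
```

A hit only shows that the two events occurred in that order for one result; it does not prove the removal caused the crash.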
godpiou Send message Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory) Hi! Sorry Bill, the correction has been made. Thanks a lot for this information, which I should have seen... sorry again. Godpiou |
DonutDon Send message Joined: 23 Sep 05 Posts: 2 Credit: 545,377 RAC: 0 |
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1who__239_654_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1tul__239_2411_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1ubi__239_2340_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1who__239_646_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1gvp__239_2415_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2vik__239_650_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 01/15/2006 11:37:20|rosetta@home|Deferring communication with project for 3 minutes and 44 seconds It had temporarily backed-off 
downloading the .exe, but then when the WU files finished downloading, Boinc tried to run them before it finished downloading the .exe. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 15 |
It had temporarily backed-off downloading the .exe, but then when the WU files finished downloading, Boinc tried to run them before it finished downloading the .exe. Yes, this has happened to me before, and it's been reported. It's an annoying BOINC bug. I _thought_ it had been fixed somewhere in the 5.2.x series though... |
DonutDon Send message Joined: 23 Sep 05 Posts: 2 Credit: 545,377 RAC: 0 |
Yes, this has happened to me before, and it's been reported. It's an annoying BOINC bug. I _thought_ it had been fixed somewhere in the 5.2.x series though... It may well have been fixed: I'm still running Boinc 4.45. |
Marky-UK Send message Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
|
Divide Overflow Send message Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0 |
I just noticed two WU's that ran for just over 9 hours before aborting with the Maximum CPU time exceeded: 1/17/2006 12:20:28 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_2chf__250_242_0: exceeded CPU time limit 32474.092756 1/17/2006 12:20:28 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2chf__250_242_0 (Maximum CPU time exceeded) 1/17/2006 3:58:41 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_2vik__250_261_0: exceeded CPU time limit 32474.092756 1/17/2006 3:58:41 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2vik__250_261_0 (Maximum CPU time exceeded) Are there more bad batches of WU's out there again? |
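As a quick check on "just over 9 hours": the CPU time limit in the abort messages is given in seconds. A small sketch of the conversion (the helper name `split_hms` is made up for illustration):

```python
def split_hms(total_seconds: float):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), seconds


# CPU time limit reported in the "Aborting result" lines above.
h, m, s = split_hms(32474.092756)
print(f"{h} h {m} min {s:.0f} s")  # 9 h 1 min 14 s
```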
Darren Send message Joined: 6 Oct 05 Posts: 27 Credit: 43,535 RAC: 0 |
|
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 84 |
Here's another ABINITIO WU that exceeded the maximum CPU time. Yup, I just had one do the same thing with over 10 hours on it. I was watching it because I didn't think it would make it. The first ABINITIO took under 3 hours to do and it reset the time to completion to under 5 hours. Then the second ABINITIO stumbled its way to over 10 hours and was only showing 75% completion. I had a feeling it wouldn't make it ... |
jpashton Send message Joined: 4 Oct 05 Posts: 1 Credit: 559,238 RAC: 0 |
Have been getting a lot of these the past few days: 1/18/2006 11:25:43 AM|rosetta@home|Unrecoverable error for result BARCODE_FRAG_30_1ogw_234_9512_0 ( - exit code -1073741819 (0xc0000005)) 1/18/2006 11:25:43 AM|rosetta@home|Unrecoverable error for result BARCODE_FRAG_30_2reb_234_9512_0 ( - exit code -1073741819 (0xc0000005)) Usual CPU time is between 1.5 - 2 hours. I haven't run into any that sit at 1% for hours though, just a lot of computation errors. My two cents for those that want/need to know... |
Divide Overflow Send message Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0 |
Yet another ABINITIO that exceeded the maximum CPU time... 1/18/2006 3:16:11 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_1fkb__250_452_0: exceeded CPU time limit 32474.092756 1/18/2006 3:16:11 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1fkb__250_452_0 (Maximum CPU time exceeded) |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
The fact is a lot of folks are seeing a lot of "Max Time" errors. I get at least 6-7 a day between both my machines. So many that reporting them seems like a waste of time. Since they all fail at 80% to 90% complete, this represents nominally 35 to 40 hours of time lost to the project every day for just my two machines. This is NOT a BOINC problem, this is a R@H WU problem. I have not once EVER seen this error on any of the other projects I am running. Even SETI WUs used to take longer than the R@H Max time failed WUs I am seeing. The old E@H application used to take 7 hours and 20 min almost to the second and not one WU failed for Max time. Only a few months ago I got a few R@H WUs that ran over 30 hours and completed ok. NOW if a WU runs longer than 5 or 6 hours on R@H it fails. All that has changed on my systems is the BOINC version and the updated R@H application with the 1% stall patch. I have not seen one single computation error of any kind on any of the other projects I am working on, for Max time or anything else, so forgive me if I don't see this as a BOINC problem. The BOINC system is not designed to accommodate a 900%-1000% variation in WU size. It is as simple as that. I have NEVER seen the R@H DCF corrected to allow more time, always less. Eventually this leads to longer WUs failing. Also there seems to be an absolute limit to the range of CPU times the system will allow for a particular machine. In other words, there is an absolute maximum for the CPU time difference between the shortest and the longest WUs allowed by the system, and anything outside the top of that range will fail. The system simply cannot be forced to process beyond that limit; I have tried. In practical terms this means that there is an absolute limit to the longest WU any particular system can complete successfully, based on the shortest one it has seen. This is why these errors occur on a particular WU on one system but not another. 
The limit of this range is unique to each system set up. If all you see is long or mid length WUs and the DCF is set to allow that, then the system will work ok. If all you see is short ones and then you get a long one, forget it. There seems to be a very limited window of systems that can handle almost all of the WUs that they get somewhere in the middle of processing speed. The only solution I can see is for the project to finally recognize that the fix is to limit the maximum difference between the largest WU and the smallest to something more like 100% to 200%. If that means larger WUs for everyone fine, if it means smaller WUs for everyone fine, but that is the short term fix. Anything else is going to require recoding the application. Now perhaps if you take out the fix for the 1% stalls this limit will go away. It seems to me that when that 1% stall fix was installed the problems on Max time began. As an impact to the progress of the project in terms of lost time the Max time failures far and away exceed the 1% hang issue, and the 20 second WU failures pale by comparison to either of these. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
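The "35 to 40 hours" of lost time per day quoted in this post can be roughly sanity-checked. A minimal sketch, assuming each failed WU had used between 5 and 6 CPU hours when it aborted; that per-failure range is an assumption drawn from the remark that WUs running longer than 5 or 6 hours fail, not measured data:

```python
def lost_hours_range(failures_per_day, hours_per_failure):
    """Return (low, high) bounds on CPU hours lost per day.

    Both arguments are (low, high) pairs.
    """
    low = failures_per_day[0] * hours_per_failure[0]
    high = failures_per_day[1] * hours_per_failure[1]
    return low, high


# 6-7 failures a day (from the post); 5-6 hours each (assumed).
print(lost_hours_range((6, 7), (5.0, 6.0)))  # (30.0, 42.0)
```

Under those assumptions the quoted 35 to 40 hours falls inside the estimated range.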
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This is NOT a BOINC problem, this is a R@H WU problem. I have not once EVER seen this error on any of the other projects I am running. It is an unexpected interaction between BOINC and R@h. The R@h software runs fine on its own; as you correctly say, the BOINC software runs fine with every other project (in this regard at least).
Yes, various parts of the BOINC system assume a repeatable mapping from the estimated time to the actual time for a result. These Rosetta WUs break the expected repeatability. There are actually two different issues here. One is the fact that some R@h WUs skip over some of the work when they predict it will be useless. This is also done on LHC, but it does not happen as often. The second is that while the current BOINC system allows for a different correction factor for different projects, it does not allow for differing correction factors between different categories of WU within a project. At present Baker et al are trying out more than a dozen different strategies and this is stretching the BOINC code further than it will go. Perhaps later BOINC versions will build in different correction factors for different categories of WU - there is in principle a possible demand for similarly wide variation from future projects. That is why I am not so sure as you that it is right to call it solely a R@h issue. We agree however that the initial fix must come from R@h, simply as this is the project that first needs such a wide variation. I have NEVER seen the R@H DCF corrected to allow more time, always less. I have on LHC. The problem here is that the variation is more severe. When the WU overruns on LHC it is a small enough overrun that the result still completes OK, and the DCF is boosted. On this project the overrun is larger, the WU aborts, and an aborted WU does not adjust the DCF as it is regarded as an error.
As I understand it, so is getting a more stable run length. My suggestion is that instead of aiming for a pre-planned number of structures in a run (currently ten structures) the app should "cheat" by aiming for a pre-planned run time +/- say 20%. It would do this by seeing how the time is going at the end of each struct.
I think this is an acute observation but, in my opinion, a wrong diagnosis. The 1% fix came at around the same time as the explosion in kinds of work unit. It is the latter that I believe has triggered this problem, combined with the already existing problem of some WU ending early - but I haven't seen the code so I can't say for sure. River~~ |
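The suggestion in this post (aim for a planned run time of +/- 20% rather than a fixed count of structures, checking the clock at the end of each structure) could look roughly like the sketch below. This is a simulation under assumptions, not Rosetta code: the function name, the stopping rule, and the use of the running average to predict the next structure are all illustrative choices.

```python
def run_until_target(structure_times, target_seconds, tolerance=0.20):
    """Simulate a run that stops by elapsed time, not structure count.

    structure_times: per-structure CPU times in seconds (simulated input).
    Returns (structures_completed, elapsed_seconds).
    """
    upper_bound = target_seconds * (1 + tolerance)
    elapsed = 0.0
    completed = 0
    for duration in structure_times:
        elapsed += duration
        completed += 1
        # Check the clock only at the end of each structure.
        if elapsed >= target_seconds:
            break
        # Predict the next structure from the average so far and stop
        # early if it would probably overshoot the allowed window.
        average = elapsed / completed
        if elapsed + average > upper_bound:
            break
    return completed, elapsed


# Fast structures: stops right at the one-hour target.
print(run_until_target([600] * 10, 3600))   # (6, 3600.0)
# Slow structures: stops after two rather than overshooting.
print(run_until_target([1500] * 10, 3600))  # (2, 3000.0)
```

A single unusually long structure can still overshoot the window, since the check only happens between structures.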
Lee Carre Send message Joined: 6 Oct 05 Posts: 96 Credit: 79,331 RAC: 0 |
I have a result that hasn't failed or anything yet, but has been going for about 7 hours at 0%. Normally Rosetta results finish sooner than 7 hours on that host. I'll leave it and see what it does though, because it's a "PRODUCTION" WU, a type I haven't seen before. The WU name is "PRODUCTION_ABINITIO_1urnA_250_1147", if that helps. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Now perhaps if you take out the fix for the 1% stalls this limit will go away. It seems to me that when that 1% stall fix was installed the problems on Max time began. As an impact to the progress of the project in terms of lost time the Max time failures far and away exceed the 1% hang issue, and the 20 second WU failures pale by comparison to either of these. River- You are correct that the two issues cloud one another, but you are wrong that one is not the cause of the other. If WUs were not forced to abort because they take too long (the 1% fix), then they would NOT be aborting because of longer run times (max time exceeded). Take out the abort that stops hung WUs, and you fix the Max time errors. It really is just that simple. As for rewriting the code: the suggestions for changing the WU run length are all done on the server, not in the client software. Removing the 1% hang solution IS a client-side application fix that would require client app programming. But the size of the WUs is all determined on the server side, so that fix is not as big a deal as you claim; it requires altering some scripts. The fix that was implemented for the 1% hang was not very elegant, and it is in fact a club where a scalpel was needed. The Max time errors are the result of applying a heavy-handed quick fix to a subtle problem. The post just ahead of this one is another example of a stuck WU, but it did not happen at 1%; it happened at 0%. These hangs occur all the time on R@H, so something is going on. While aborting the WU stopped the hang, it does not fix the root problem. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |