Message boards : Number crunching : Auto abort for Hung Work Units - Discussion
Author | Message |
---|---|
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
"is there a way to get WU that run faster rather than the slower running ones that you don't have to keep a check on to see if they have stalled?" You likely don't need the abort step, either. The project sometimes sends high-resolution work units. These are doing valuable work, but the work is down in the fractions rather than up in the big picture you can see in the graphic. So they don't appear to be doing much, and it takes them longer to complete model 1, but in general, let 'em run. There are a few other threads on the Crunching board discussing the merits of letting users select which WUs they would prefer. At present there is no way to achieve that. There were 4 WU names that had problems, which we were asked to cancel. These were getting "hung" and running for a long time. So, if it was one of those, then the abort was what they needed. Once the project saw the problems, they deleted those WUs from the server so no one else received them. I see from your WUs that very few have been aborted, that your normal run time is between 2 and 4 hours, and that the ones you aborted ran for almost a full day. I was going to tell you to be more patient, but I see now you have already done that. You're doing it right. Stay the course. The new checkpointing and watchdog they are testing on Ralph right now should resolve this rare problem for you within a week or so. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Do you still receive the finished models if a WU gets aborted by the watchdog in model 2 or 3 etc.? What about not aborting the WU but only the model, then going on and aborting the whole WU only after the 3rd aborted model? |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
tralala wrote: "Do you still receive the finished models if a WU gets aborted by the watchdog in model 2 or 3 etc.? What about not aborting the WU but only the model, then going on and aborting the whole WU only after the 3rd aborted model?" Yes, they still "output any data" on the completed models. As for aborting a model and continuing on... that would be cool, and would save bandwidth, but I'm thinking one likely reason for a hang is that the construction of the WU is flawed, so further models are likely to hang as well. Therefore ending it and getting the feedback to the project as soon as possible is preferable. Then they can pull the WU if it's happening to everyone, and reissue it once they resolve the creation problem. |
MintabiePete Joined: 5 Nov 05 Posts: 30 Credit: 418,959 RAC: 0 |
Thanks for the feedback, much appreciated :) |
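
For anyone who wants to compare the two watchdog policies discussed in this thread (end the whole WU on the first hung model, versus abort only the hung model and end the WU after the 3rd aborted model), here is a minimal sketch. It is only a simulation, not the project's actual watchdog: `run_work_unit`, `WorkUnitResult`, `model_hangs` and `max_aborted_models` are all hypothetical names made up for illustration, and a "hang" is simply given as an input flag rather than detected by a real timeout. In both policies the models that finished before the abort are still kept.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkUnitResult:
    """Outcome of one simulated work unit (hypothetical structure)."""
    completed_models: List[int] = field(default_factory=list)
    aborted_models: List[int] = field(default_factory=list)
    work_unit_aborted: bool = False


def run_work_unit(model_hangs: List[bool], max_aborted_models: int = 1) -> WorkUnitResult:
    """Simulate a watchdog supervising the models of one work unit.

    model_hangs[i] is True if model i+1 would hang (the watchdog has to
    abort it) and False if it finishes normally. This flag is an
    assumption standing in for a real "no progress for too long" check.

    max_aborted_models=1 ends the WU on the first hung model.
    max_aborted_models=3 skips hung models and ends the WU only after
    the 3rd aborted model.
    """
    result = WorkUnitResult()
    for number, hangs in enumerate(model_hangs, start=1):
        if not hangs:
            # Finished models are always kept and reported.
            result.completed_models.append(number)
            continue
        # The watchdog aborts the hung model.
        result.aborted_models.append(number)
        if len(result.aborted_models) >= max_aborted_models:
            # Too many hangs: give up on the whole work unit.
            result.work_unit_aborted = True
            break
    return result


if __name__ == "__main__":
    hangs = [False, True, False, True, True, False]
    print(run_work_unit(hangs, max_aborted_models=1))
    print(run_work_unit(hangs, max_aborted_models=3))
```

With the example input, the first policy stops after model 2 and reports model 1, while the second also gets model 3 finished before giving up at model 5. If the WU itself is badly constructed, as suggested above, the second policy mostly spends extra time on models that hang again, which is the argument for ending early and reporting back quickly.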
©2024 University of Washington
https://www.bakerlab.org