Auto abort for Hung Work Units - Discussion

Message boards : Number crunching : Auto abort for Hung Work Units - Discussion

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14596 - Posted: 25 Apr 2006, 16:28:26 UTC - in response to Message 14562.  

Is there a way to get WUs that run faster, rather than the slower-running ones, so that you don't have to keep checking whether they have stalled?


You likely don't need the abort step either. The project sometimes sends high-resolution work units. These are doing valuable work, but the work is down in the fractions rather than up in the big picture you can see in the graphic. So they don't appear to be doing much, and they take longer to complete model 1, but in general, let 'em run.

There are a few other threads on the Crunching board discussing the merits of having the user be able to select which WUs they would prefer. At present there is no way to achieve that.

There were 4 WU names that had problems and that we were asked to cancel. These were getting "hung" and running for a long time. So, if it was one of those, then the abort was what they needed. Once the project saw the problems, they deleted those WUs from the server so no one else received them.

I see from your WUs that only very few have been aborted. And that your normal run time is between 2 and 4 hours. And that the ones you aborted ran for almost a full day. I was going to tell you to be more patient, but I see now you already have done that.

You're doing it right. Stay the course. The new checkpointing and watchdog they are testing on Ralph right now should resolve this rare problem for you within a week or so.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14610 - Posted: 25 Apr 2006, 18:53:39 UTC
Last modified: 25 Apr 2006, 18:54:48 UTC

Do you still receive the finished models if a WU gets aborted by the watchdog in model 2 or 3, etc.? What about aborting only the model rather than the WU, continuing on, and aborting the whole WU only after the 3rd aborted model?

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14612 - Posted: 25 Apr 2006, 19:10:48 UTC - in response to Message 14610.  
Last modified: 25 Apr 2006, 19:13:41 UTC

Do you still receive the finished models if a WU gets aborted by the watchdog in model 2 or 3, etc.? What about aborting only the model rather than the WU, continuing on, and aborting the whole WU only after the 3rd aborted model?

Yes, they still "output any data" on the completed models.

As for aborting a model and continuing on... that would be cool, and it would save bandwidth. But I'm thinking one likely reason for a hang is that the construction of the WU is flawed, so further models are likely to hang as well. Ending it and getting the feedback to the project as soon as possible is therefore preferable. Then they can pull the WU if it's happening to everyone, and reissue it once they resolve the creation problem.
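
To make the trade-off between the two suggestions concrete, here is a rough Python sketch. It is only an illustration, not the project's actual watchdog: the names `run_work_unit`, `WorkUnitResult` and `max_hung_models` are made up for this example, and a "hang" is simply simulated with a list of booleans. Setting `max_hung_models=1` corresponds to ending the WU at the first hung model, and `max_hung_models=3` to the idea of skipping hung models and giving up only after the 3rd one. In both cases the models that finished are kept.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkUnitResult:
    """Outcome of one simulated work unit (hypothetical structure)."""
    completed_models: List[int] = field(default_factory=list)
    hung_models: List[int] = field(default_factory=list)
    aborted: bool = False


def run_work_unit(model_hangs: List[bool], max_hung_models: int = 1) -> WorkUnitResult:
    """Simulate a watchdog policy over the models of one work unit.

    model_hangs[i] is True if model i+1 would hang (an assumption used
    to stand in for a real timeout check).
    max_hung_models is how many hung models are tolerated before the
    whole WU is aborted: 1 aborts on the first hang, 3 aborts only
    after the third.
    """
    result = WorkUnitResult()
    for number, hangs in enumerate(model_hangs, start=1):
        if hangs:
            # The watchdog gives up on this model.
            result.hung_models.append(number)
            if len(result.hung_models) >= max_hung_models:
                # Too many hangs: end the WU, report what was completed.
                result.aborted = True
                break
        else:
            result.completed_models.append(number)
    return result


# A flawed WU: every model after the first one hangs.
flawed = [False, True, True, True, True]
strict = run_work_unit(flawed, max_hung_models=1)
tolerant = run_work_unit(flawed, max_hung_models=3)
print(strict.completed_models, strict.hung_models, strict.aborted)
print(tolerant.completed_models, tolerant.hung_models, tolerant.aborted)
```

With a flawed WU like the one above, the tolerant setting waits through three hung models before reaching the same result the strict setting reaches after one. With a one-off hang, such as `[False, True, False, False]`, the tolerant setting lets the remaining models finish instead.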

MintabiePete
Joined: 5 Nov 05
Posts: 30
Credit: 418,959
RAC: 0
Message 14627 - Posted: 26 Apr 2006, 0:07:55 UTC - in response to Message 14596.  
Last modified: 26 Apr 2006, 0:08:33 UTC


I see from your WUs that only very few have been aborted. And that your normal run time is between 2 and 4 hours. And that the ones you aborted ran for almost a full day. I was going to tell you to be more patient, but I see now you already have done that.

You're doing it right. Stay the course. The new checkpointing and watchdog they are testing on Ralph right now should resolve this rare problem for you within a week or so.


Thanks for the feedback, much appreciated :)


©2024 University of Washington
https://www.bakerlab.org