Message boards : Number crunching : Auto abort for Hung Work Units - Discussion
Author | Message |
---|---|
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things: (1) Call the job valid, output a data file, and the job automatically gets credit. You never know that there was a timeout on your machine... (2) Call the job invalid, still output the data file, and your job won't get credit for a few days (every time we check for claimed credit that wasn't granted). The advantage here is that you see the errors, but have to wait for credit. I'm kind of neutral -- maybe (1) would be the easiest for me, but give you guys less info. Let me know your thoughts. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I strongly suggest calling the job valid and granting credit automatically. That is what the guys over at LHC are doing. Some of their projections aren't stable and fail in a few minutes. They call them valid and say the info about failed projections is as important as the info about successful ones. Since many aborted WUs will have finished models, and even the failing ones provide you with useful information, I see no reason not to call them valid. Besides, people will be much happier with valid jobs and immediate credit than the other way around. ;-) |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Thanks for the advice; I'll follow it. It's good to know about the LHC philosophy, too; we're similar to them in that we're constantly testing new ideas and modes of Rosetta. I'll test the new "graceful" watchdog behavior on ralph today. |
XS_Duc Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0 |
The weak shall perish... |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
If using option 1, is there going to be an automatic tally somewhere so problem WUs get spotted quickly and can be dealt with the same day they are released? "Set it up and forget about it" is a great philosophy, but if you're not getting enough data back about certain machines with problems, how are you going to identify problems that are hardware related (system overheating, RAM failure, etc.)? And how are we going to know we have to do maintenance on a failing system? I'd suggest Option 1 with a new error code, or some other kind of notification when a high rate of failures happens on a particular machine. |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
"I'd suggest Option 1 with a new error code - or some other kind of notification when a high rate of failures happen on a particular machine." Good point! When the watchdog kills a WU, we will have the stderr returned with the reasons for the killing, and we can assign a different exit status code for this situation. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I vote for awarding immediate credit because the more trouble-free the project, the more people will crunch for it. It's important for the project to keep an eye on results being returned, and to react swiftly to any WU problem. People don't like to find out that the last three days' worth of WUs were the stuck-abort type and all that CPU time was simply wasted. I would want to know if one of my nodes turned into a WU killer, but the symptom of that is usually WUs erroring out early. People should be able to see when that happens. But stuck WUs are usually not the fault of the system, so there is no point in bothering people about them. BTW, does the watchdog thread take into account the speed of the system? A very slow system will need more time to complete a percent-incrementing step. |
Joined: 25 Nov 05 Posts: 129 Credit: 57,345 RAC: 0 |
S@H red-flags invalid WUs in the results page; how about yellow-flagging (or any other colour) auto-aborted WUs? That way you can have your set-and-forget system, but it highlights problem WUs. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
"I vote for awarding immediate credit because the more trouble-free the project, the more people will crunch for it." Sweet, thanks for all the input. For now, we've gone for option 1 (auto-aborts get automatic credit). Based on your great suggestions, we are going to look into making those auto-aborts show up in a different color on your results page. We'll also try to make a different BOINC exit message show up in the BOINC manager for non-fatal errors. It sounds totally feasible. |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
"BTW, does the watchdog thread take into account the speed of the system? A very slow system will need more time to complete a percent-incrementing step." By default the watchdog kills a WU when the energy doesn't change ( = no movement in the searching section of your screen saver) for 30 minutes. It's probably a pretty conservative figure even for the slow computers. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
"BTW, does the watchdog thread take into account the speed of the system? A very slow system will need more time to complete a percent-incrementing step." I'm not sure; see this WU: http://ralph.bakerlab.org/workunit.php?wuid=82603 I completed it successfully, but it was aborted on a slower computer. Btw, can't you just abort the current model, see if the next model is progressing, and kill the whole WU only after three models have failed or after the target CPU time has been met? |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Make the client trouble free. That means that if the project feels even the errors are useful, then report them to the project (i.e. select a unique exit message), but don't bother the client about it (i.e. don't show it as an error). If the client ends the WU, and returns valid results that are useful to the project... well, then it's NOT an error. Even if the result is just to identify a hung WU and Rosetta combination that needs to be investigated further. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
This is actually a case indicating that the new watchdog is improving! I think Rhiju fixed a bug in the v5.02 watchdog (the one that killed one of your WUs by mistake), and the fixed v5.03 watchdog let the WU go through! |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
"Make the client trouble free. That means that if the project feels even the errors are useful, then report them to the project (i.e. select a unique exit message), but don't bother the client about it (i.e. don't show it as an error). If the client ends the WU, and returns valid results that are useful to the project... well, then it's NOT an error. Even if the result is just to identify a hung WU and Rosetta combination that needs to be investigated further." This is exactly what we are going to do based on all your excellent suggestions! |
MattDavis Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
Why the HECK wouldn't you list on the FRONT PAGE that there were problem work units? I looked at a computer (that I usually leave alone, trusting BOINC and the projects) AND SAW A WORK UNIT WAS UP TO 127 HOURS AND STILL NOT COMPLETED. I check the news (front page) EVERY DAY but had to look in the forums for this. WHY WASN'T THIS ON THE FRONT PAGE? I've wasted HUNDREDS OF HOURS on two computers because this wasn't on the front page. |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Matt was referring to the hung work units themselves, not the auto abort. See his other post. Bin, I hope you didn't take any offense at my comments. I was simply trying to say "I vote for option #1 - call the WU valid" in response to Rhiju's original post at the bottom. |
Joined: 5 Nov 05 Posts: 30 Credit: 418,959 RAC: 0 |
Without going into this too much, I seem to be getting a lot of WUs lately that sit on a low % and don't seem to move. I just suspend them and then abort them; not too sure if that is the correct way to do it, but that's the way I do it :) I am running 4 computers and seem to get these slow-running WUs all the time. Is there a way to get WUs that run faster, which you don't have to keep checking to see if they have stalled, rather than the slower-running ones? Thanks |
John McLeod VII Joined: 17 Sep 05 Posts: 108 Credit: 195,137 RAC: 0 |
"Without going into this too much, I seem to be getting a lot of WUs lately that sit on a low % and don't seem to move. I just suspend them and then abort them; not too sure if that is the correct way to do it, but that's the way I do it :) I am running 4 computers and seem to get these slow-running WUs all the time. Is there a way to get WUs that run faster, which you don't have to keep checking to see if they have stalled, rather than the slower-running ones?" You don't need to bother with the suspend step. BOINC WIKI |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
"Is there a way to get WUs that run faster, which you don't have to keep checking to see if they have stalled, rather than the slower-running ones?" You likely don't need the abort step either. The project sometimes sends high-resolution work units. These are doing valuable work, but the work is down in the fractions rather than up in the big picture you can see in the graphic. So they don't appear to be doing much, and it takes them longer to complete model 1, but in general, let 'em run. There are a few other threads on the Crunching board discussing the merits of letting users select which WUs they would prefer. At present there is no way to do that. There were 4 WU names that had problems and that we were asked to cancel. These were getting "hung" and running for a long time. So, if it was one of those, then the abort was what they needed. Once the project saw the problems, they deleted those WUs from the server so no one else received them. I see from your WUs that only very few have been aborted, that your normal run time is between 2 and 4 hours, and that the ones you aborted ran for almost a full day. I was going to tell you to be more patient, but I see now you already have been. You're doing it right. Stay the course. The new checkpointing and watchdog they are testing on Ralph right now should resolve this rare problem for you within a week or so. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Do you still receive the finished models if a WU gets aborted by the watchdog in model 2 or 3, etc.? What about aborting only the model rather than the whole WU, carrying on, and aborting the whole WU only after the 3rd aborted model? |
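
The two rules discussed in this thread, the default "kill when the energy hasn't changed for 30 minutes" check described by Bin Qian and tralala's suggestion to abort only the stuck model and give up on the whole WU after three aborted models, can be sketched roughly as below. This is only an illustration under stated assumptions, not Rosetta's actual watchdog: the `Watchdog` class, its method names and the returned strings are all hypothetical, and the three-model limit is a proposal from the thread, not confirmed project behavior. The target-CPU-time condition tralala also mentions is left out.

```python
from dataclasses import dataclass
from typing import Optional

STALL_LIMIT_SECONDS = 30 * 60  # from the thread: 30 minutes without an energy change
MAX_ABORTED_MODELS = 3         # tralala's proposal; not confirmed project policy


@dataclass
class Watchdog:
    """Hypothetical stall detector; not the real Rosetta watchdog."""

    stall_limit: float = STALL_LIMIT_SECONDS
    max_aborted_models: int = MAX_ABORTED_MODELS
    last_energy: Optional[float] = None
    last_change_time: float = 0.0
    aborted_models: int = 0

    def start_model(self, now: float) -> None:
        """Reset the stall timer when a new model begins."""
        self.last_energy = None
        self.last_change_time = now

    def observe(self, energy: float, now: float) -> str:
        """Return 'continue', 'abort_model' or 'abort_work_unit'."""
        # Any change in energy counts as progress and resets the timer.
        if self.last_energy is None or energy != self.last_energy:
            self.last_energy = energy
            self.last_change_time = now
            return "continue"
        # Energy unchanged, but not yet for long enough to call it stuck.
        if now - self.last_change_time < self.stall_limit:
            return "continue"
        # Stuck: give up on this model, or on the whole WU after too many.
        self.aborted_models += 1
        if self.aborted_models >= self.max_aborted_models:
            return "abort_work_unit"
        return "abort_model"
```

A caller would invoke `start_model(now)` at the beginning of each model and `observe(energy, now)` periodically, with `now` in seconds. Note that the stall limit here is wall-clock time, so, as AMD_is_logical points out, a very slow machine gets no extra allowance unless the limit is scaled.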
©2025 University of Washington
https://www.bakerlab.org