Auto abort for Hung Work Units

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 14410 - Posted: 22 Apr 2006, 20:04:53 UTC
I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things:
(1) Call the job valid, output a data file, and the job automatically gets credit. You never know that there was a timeout on your machine...
(2) Call the job invalid, still output the data file, and your job won't get credit for a few days (every time we check for claimed credit that wasn't granted). The advantage here is that you see the errors, but have to wait for credit.
I'm kind of neutral -- maybe (1) would be the easiest for me, but give you guys less info. Let me know your thoughts.
ID: 14410 · Rating: 0

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Message 14411 - Posted: 22 Apr 2006, 20:37:10 UTC - in response to Message 14410. Last modified: 22 Apr 2006, 20:38:50 UTC
Rhiju stated: I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things: [...] Let me know your thoughts.
I strongly suggest calling the job valid and granting credit automatically. That is what the guys over at LHC are doing. Some of their projections aren't stable and fail in a few minutes. They call it valid and say the info about failed projections is as important as that about successful ones. Since many aborted WUs will have finished models, and even the failing ones provide you with useful information, I see no reason not to call it valid. Besides, people will be much happier with valid jobs and immediate credit than the other way around. ;-)
ID: 14411 · Rating: 1

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 14417 - Posted: 22 Apr 2006, 21:52:45 UTC - in response to Message 14411.
Thanks for the advice; I'll follow it. It's good to know about the LHC philosophy, too; we're similar to them in that we're constantly testing new ideas and modes of Rosetta. I'll test the new "graceful" watchdog behavior on ralph today.
ID: 14417 · Rating: 0

XS_Duc Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0
Message 14418 - Posted: 22 Apr 2006, 21:54:02 UTC - in response to Message 14410. Last modified: 22 Apr 2006, 21:56:30 UTC
The weak shall perish...
ID: 14418 · Rating: 0

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0
Message 14423 - Posted: 22 Apr 2006, 23:19:24 UTC - in response to Message 14410.
Rhiju stated: I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things: [...] Let me know your thoughts.
If using option 1, is there going to be an automatic tally somewhere so problem WUs get spotted quickly and can be dealt with the same day they are released? "Set it up and forget about it" is a great philosophy, but if you're not getting enough data back about certain machines with problems, how are you going to be able to identify problems that are hardware related (system overheating, RAM failed, etc.)? And how are we going to know we have to do maintenance on a failing system?
I'd suggest Option 1 with a new error code, or some other kind of notification when a high rate of failures happens on a particular machine.
ID: 14423 · Rating: 0
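The tally BennyRop asks about could be sketched roughly as below. This is only an illustration in Python, not anything the project is known to run: the record keys (`host_id`, `wu_name`, `watchdog_abort`), the function name `flag_problems`, and both thresholds are assumptions made up for the example.

```python
from collections import Counter


def flag_problems(results, min_results=5, max_abort_rate=0.5):
    """Return (hosts, work_units) whose watchdog-abort rate looks suspicious.

    `results` is an iterable of dicts with the hypothetical keys
    'host_id', 'wu_name' and 'watchdog_abort' (a bool).
    """
    totals_by_host, aborts_by_host = Counter(), Counter()
    totals_by_wu, aborts_by_wu = Counter(), Counter()
    for r in results:
        totals_by_host[r["host_id"]] += 1
        totals_by_wu[r["wu_name"]] += 1
        if r["watchdog_abort"]:
            aborts_by_host[r["host_id"]] += 1
            aborts_by_wu[r["wu_name"]] += 1

    def suspicious(totals, aborts):
        # Only judge keys with enough returned results to mean something.
        return sorted(
            key for key, n in totals.items()
            if n >= min_results and aborts[key] / n > max_abort_rate
        )

    return (suspicious(totals_by_host, aborts_by_host),
            suspicious(totals_by_wu, aborts_by_wu))
```

A host that aborts most of its jobs points at the machine (overheating, bad RAM), while a work unit that aborts on many different hosts points at the job itself, which is why the sketch keeps the two tallies separate.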

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Message 14425 - Posted: 23 Apr 2006, 0:06:24 UTC - in response to Message 14423.
BennyRop stated: I'd suggest Option 1 with a new error code - or some other kind of notification when a high rate of failures happen on a particular machine.
Good point! When the watchdog kills a WU we will have the stderr returned with the reasons for the killing, and we can assign a different exit status code for this situation.
ID: 14425 · Rating: 0
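As a rough illustration of what Bin Qian describes (a reason written to stderr plus a distinct exit status), here is a minimal Python sketch. The names `ExitStatus` and `finish` and the value 70 are hypothetical; they are not real BOINC or Rosetta exit codes, and the actual application would do this in its own code.

```python
import sys
from enum import IntEnum


class ExitStatus(IntEnum):
    # Hypothetical values chosen for the example only.
    OK = 0
    WATCHDOG_ABORT = 70


def finish(status, reason="", stderr=sys.stderr):
    """Log the reason for a non-OK exit to stderr and return the exit code."""
    if status is not ExitStatus.OK:
        print(f"watchdog: {reason} (exit status {int(status)})", file=stderr)
    return int(status)
```

With a separate code for watchdog aborts, the server side could still grant credit while counting or displaying those results differently from ordinary successes.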

Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0
Message 14434 - Posted: 23 Apr 2006, 2:19:06 UTC - in response to Message 14410. Last modified: 23 Apr 2006, 2:24:16 UTC
Rhiju stated: I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things: [...] Let me know your thoughts.
Rhiju, It occurs to me that there must be some way to mix these two concepts. I think it is important for the project to see errors occurring as fast as possible. Right now that happens when people start reporting errors. If the system is automatic we will all lose that reporting system. Moreover, the farmers and some of the more active users actually WANT to know when things are not going right. With a fully automatic system no one will know for a few days that anything is wrong, and the users might never find out. While this is good politically, it is not good if the problem is something on the client system that needs adjusting, or a bad Work Unit run.
What would make sense to me is similar to what was proposed by "Bin Qian" and "BennyRop". For the obvious political and project workload reasons, I would like to see the credits awarded automatically. But there should be some kind of visible notification to the user. Looking in the stderr file is a pain, and many errors will not be noticed. An error code returned on the Work Unit alone might not be noticed, because the Work Unit will have credit and not stand out in the lists the way they do now.
So perhaps there could be an error code in the result file that you can find easily, and some kind of entry in the Messages or the Tasks tab status that the user can see when the Work Unit finishes. Or an automatic system that awards the credit, but with a new Work Unit status in the stats listing, something like "non-fatal error" or "Auto Abort Called". This would let people see the problem, but still grant the credit.
{NOTE/QUESTION: Do you think we should open a discussion sticky on this topic and move the posts there to make this easier to follow?}
Moderator9, ROSETTA@home FAQ Moderator
ID: 14434 · Rating: 0

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0
Message 14442 - Posted: 23 Apr 2006, 3:23:42 UTC
I vote for awarding immediate credit because the more trouble-free the project, the more people will crunch for it.
It's important for the project to keep an eye on results being returned, and to react swiftly to any WU problem. People don't like to find out that the last three days' worth of WUs were the stuck-abort type and all that CPU was simply wasted.
I would want to know if one of my nodes turned into a WU killer, but the symptoms of that are usually WUs erroring out early. People should be able to see when that happens. But stuck WUs are usually not the fault of the system, so there is no point in bothering people about them.
BTW, does the watchdog thread take into account the speed of the system? A very slow system will need more time to complete a percent-incrementing step.
ID: 14442 · Rating: 0

Trog Dog Joined: 25 Nov 05 Posts: 129 Credit: 57,345 RAC: 0
Message 14444 - Posted: 23 Apr 2006, 3:33:40 UTC - in response to Message 14434.
Moderator9 stated: Rhiju, It occurs to me that there must be some way to mix these two concepts. [...] This would let people see the problem, but still grant the credit.
S@H red-flags invalid WUs in the results page; how about yellow-flagging (or any other colour) auto-aborted WUs? That way you can have your set-and-forget system, but it highlights problem WUs.
ID: 14444 · Rating: 0

Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0
Message 14446 - Posted: 23 Apr 2006, 4:12:16 UTC
I am starting this thread for a discussion of an auto abort feature for stuck/hung Work Units. The original question, as posed by the development team, is:
Rhiju stated: I'm testing the watchdog thread over on ralph now, and wanted advice from the users posting here. When the watchdog detects that the job is stuck or taking too long, we can do two things: [...] Let me know your thoughts.
Moderator9, ROSETTA@home FAQ Moderator
ID: 14446 · Rating: 0

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 14447 - Posted: 23 Apr 2006, 4:17:24 UTC - in response to Message 14442.
Sweet, thanks for all the input. For now, we've gone for option 1 (auto-aborts get automatic credit). Based on your great suggestions, we are going to look into making those auto-aborts show up in different colors in your results page. We'll also try to make a different BOINC exit message show up in the BOINC manager for non-fatal errors. It sounds totally feasible.
ID: 14447 · Rating: 0

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Message 14460 - Posted: 23 Apr 2006, 8:14:23 UTC - in response to Message 14442.
AMD_is_logical stated: BTW, does the watchdog thread take into account the speed of the system? A very slow system will need more time to complete a percent-incrementing step.
By default the watchdog kills a WU when the energy doesn't change ( = no movement on the searching section of your screen saver) for 30 minutes. It's probably a pretty conservative figure even for the slow computers.
ID: 14460 · Rating: 0
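The rule Bin Qian describes (no change in energy for 30 minutes means the job is stuck) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual Rosetta watchdog: the class name `Watchdog`, its methods, and the injectable `clock` parameter are all invented for the example.

```python
import threading
import time


class Watchdog:
    """Flags a job as stuck when its energy has not changed for `timeout` seconds."""

    def __init__(self, timeout=30 * 60, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the timing can be tested or swapped
        self._lock = threading.Lock()
        self._last_energy = None
        self._last_change = clock()

    def report_energy(self, energy):
        """Called by the worker after each step; resets the timer on any change."""
        with self._lock:
            if energy != self._last_energy:
                self._last_energy = energy
                self._last_change = self.clock()

    def is_stuck(self):
        with self._lock:
            return self.clock() - self._last_change >= self.timeout

    def run(self, on_stuck, poll_interval=60, stop_event=None):
        """Poll until the job looks stuck, then call on_stuck(); run in its own thread."""
        stop_event = stop_event or threading.Event()
        while not stop_event.wait(poll_interval):
            if self.is_stuck():
                on_stuck()
                return
```

On AMD_is_logical's question about slow machines: the sketch measures elapsed time with `time.monotonic`. Passing `time.process_time` as `clock` would count CPU time instead, which might be gentler on slow or heavily shared hosts, though nothing in the thread says which of the two the real watchdog uses.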

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Message 14473 - Posted: 23 Apr 2006, 12:39:25 UTC - in response to Message 14460.
Bin Qian stated: By default the watchdog kills a WU when the energy doesn't change ( = no movement on the searching section of your screen saver) for 30 minutes. It's probably a pretty conservative figure even for the slow computers.
I'm not sure, see this WU: http://ralph.bakerlab.org/workunit.php?wuid=82603
I completed it successfully, but it was aborted on a slower computer. Btw, can't you just abort the current model and see if the next model is progressing, and only kill the whole WU after three models have failed or after the target CPU time has been met?
ID: 14473 · Rating: 0
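tralala's proposal (give up on the stuck model only, and end the whole WU after three failed models or once the target CPU time is reached) might look like this in outline. It is a Python sketch under assumptions: `run_work_unit`, the `run_model` callable and its `(ok, cpu_seconds)` return value, and the three-hour default are all hypothetical.

```python
def run_work_unit(models, run_model, max_failed=3, target_cpu_seconds=3 * 3600):
    """Run models in turn, skipping stuck ones.

    `run_model(model)` is a hypothetical callable returning (ok, cpu_seconds),
    where ok is False if the watchdog had to abort that model.
    Stops once `max_failed` models have failed or the CPU budget is used up.
    """
    finished, failed, cpu_used = [], 0, 0.0
    for model in models:
        if failed >= max_failed or cpu_used >= target_cpu_seconds:
            break
        ok, cpu_seconds = run_model(model)
        cpu_used += cpu_seconds
        if ok:
            finished.append(model)
        else:
            failed += 1
    return finished, failed, cpu_used
```

The point of the design is that one bad model costs at most one watchdog timeout, and the models that did finish are still returned.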

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
Message 14476 - Posted: 23 Apr 2006, 12:56:59 UTC
Make the client trouble free. That means that if the project feels even the errors are useful, then report them to the project (i.e. select a unique exit message), but don't bother the client about it (i.e. don't show it as an error).
If the client ends the WU, and returns valid results that are useful to the project... well, then it's NOT an error. Even if the result is just to identify a hung WU and Rosetta combination that needs to be investigated further.
Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
ID: 14476 · Rating: 1

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Message 14485 - Posted: 23 Apr 2006, 16:05:08 UTC - in response to Message 14473.
This is actually a case indicating the new watchdog is improving! I think Rhiju fixed a bug in the v5.02 watchdog (the one that killed one of your WUs by mistake), and the fixed v5.03 watchdog let the WU go through!
ID: 14485 · Rating: 0

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Message 14486 - Posted: 23 Apr 2006, 16:06:03 UTC - in response to Message 14476.
This is exactly what we are going to do based on all your excellent suggestions!
ID: 14486 · Rating: 1

MattDavis Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0
Message 14547 - Posted: 24 Apr 2006, 20:36:08 UTC
Why the HECK wouldn't you list on the FRONT PAGE that there were problem work units? I looked at a computer (that I usually leave alone, trusting BOINC and the projects) AND SAW A WORK UNIT WAS UP TO 127 HOURS AND STILL NOT COMPLETED.
I check the news (front page) EVERY DAY but had to look in the forums for this. WHY WASN'T THIS ON THE FRONT PAGE? I've wasted HUNDREDS OF HOURS on two computers because this wasn't on the front page.
ID: 14547 · Rating: 0

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
Message 14551 - Posted: 24 Apr 2006, 22:05:13 UTC
Matt was referring to the hung work units themselves, not the auto abort. See his other post.
Bin, I hope you didn't take any offense at my comments. I was simply trying to say "I vote for option #1 - call the WU valid" in response to Rhiju's original post at the bottom.
Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
ID: 14551 · Rating: 0

MintabiePete Joined: 5 Nov 05 Posts: 30 Credit: 418,959 RAC: 0
Message 14562 - Posted: 25 Apr 2006, 2:43:36 UTC
Without going into this too much, I seem to be getting a lot of WUs lately that sit on a low % and don't seem to move. I just suspend them and then abort them; not too sure if that is the correct way to do it, but that's the way I do it :)
I am running 4 computers and seem to get these slow-running WUs all the time. Is there a way to get WUs that run faster, ones you don't have to keep a check on to see if they have stalled, rather than the slower-running ones? Thanks
ID: 14562 · Rating: 0

John McLeod VII Joined: 17 Sep 05 Posts: 108 Credit: 195,137 RAC: 0
Message 14565 - Posted: 25 Apr 2006, 3:53:50 UTC - in response to Message 14562.
MintabiePete stated: [...] I just suspend them and then abort them [...]
You don't need to bother with the suspend step.
BOINC WIKI
ID: 14565 · Rating: 0
