Rosetta WU delivery out of control

Author	Message
Aurum Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0	Message 94252 - Posted: 12 Apr 2020, 16:11:57 UTC Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 94271 - Posted: 12 Apr 2020, 20:12:04 UTC - in response to Message 94252. With your hosts hidden, it is difficult to offer you any suggestions to improve your situation. It sounds like perhaps you hit a batch of work that failed rather immediately, and the BOINC Manager started to think that was a normal runtime. The project obviously needs to stop sending tasks that fail, and that will then avoid the side-effect you seem to be observing. Rosetta Moderator: Mod.Sense

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94293 - Posted: 13 Apr 2020, 0:15:44 UTC - in response to Message 94271. Last modified: 13 Apr 2020, 0:17:28 UTC "With your hosts hidden, it is difficult to offer you any suggestions to improve your situation. It sounds like perhaps you hit a batch of work that failed rather immediately, and the BOINC Manager started to think that was a normal runtime. The project obviously needs to stop sending tasks that fail, and that will then avoid the side-effect you seem to be observing." If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times. Grant Darwin NT

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94295 - Posted: 13 Apr 2020, 0:20:01 UTC - in response to Message 94252. "Its queue was set to 0.5/0.1 days." Given the huge number of projects you are attached to, 0.25 + 0.02 would probably be a better cache setting (although it still would've had issues with the sheer number of faulty Tasks that were sent out). Grant Darwin NT

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 94303 - Posted: 13 Apr 2020, 0:42:31 UTC - in response to Message 94293. Last modified: 13 Apr 2020, 13:25:15 UTC "If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times." If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team. Rosetta Moderator: Mod.Sense

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94305 - Posted: 13 Apr 2020, 1:21:58 UTC - in response to Message 94303. Last modified: 13 Apr 2020, 2:21:28 UTC "If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times." "If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team." I've not a clue, but put out a request for help to those that might. Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when trying to make sense of it all in the past. From the looks of it, what is happening here at Rosetta shouldn't be happening- it appears you don't need to explicitly exclude Invalid or Error results. There is meant to be a function that excludes outlier results. eg a Task that has an Estimated completion time of 8 hours and finishes in 7 hours will be used to calculate further Estimated completion times. Likewise one Estimated to take 8 hours that actually takes 9hrs will be used for further estimates. But one that goes 4 hours over the Estimated completion time, or one that finishes in less than half the time Estimated should be discarded from Estimated completion time calculations. Links: Job Runtime estimation, Credit New. Grant Darwin NT
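The outlier rule described in the post above can be sketched roughly as follows. This is only an illustration: the thresholds (less than half the estimate, or 4 or more hours over it) are taken from the post rather than from BOINC's code, and the `CompletedTask` type, the function names and the use of a plain mean for the next estimate are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompletedTask:
    estimated_hours: float  # estimate shown when the task was issued
    actual_hours: float     # runtime the host actually reported


def is_runtime_outlier(task: CompletedTask,
                       min_fraction: float = 0.5,
                       max_overrun_hours: float = 4.0) -> bool:
    """Apply the thresholds from the post (assumed, not BOINC's own values)."""
    too_short = task.actual_hours < min_fraction * task.estimated_hours
    too_long = task.actual_hours - task.estimated_hours >= max_overrun_hours
    return too_short or too_long


def next_estimate(history: List[CompletedTask]) -> Optional[float]:
    """Mean runtime of non-outlier tasks; None if nothing usable remains."""
    usable = [t.actual_hours for t in history if not is_runtime_outlier(t)]
    if not usable:
        return None
    return sum(usable) / len(usable)


history = [
    CompletedTask(8.0, 7.0),    # slightly fast: kept
    CompletedTask(8.0, 9.0),    # slightly slow: kept
    CompletedTask(8.0, 0.01),   # crashed within seconds: discarded
    CompletedTask(8.0, 12.5),   # more than 4 hours over: discarded
]
naive = sum(t.actual_hours for t in history) / len(history)
print(f"naive mean: {naive:.2f} h, filtered mean: {next_estimate(history):.2f} h")
```

With only one crashed task in four the naive mean is dragged down a little; a whole batch of tasks failing in seconds would pull it towards zero, which is the situation the thread is about.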

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94312 - Posted: 13 Apr 2020, 3:58:03 UTC - in response to Message 94303. "If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team." I've had a look, and come up with a WAG (Wild Arse Guess). For each Task, the project supplies an estimate of the FLOPs used by a job (wu.fpops_est) and a limit on FLOPs, after which the job will be aborted (wu.fpops_bound). Rosetta allows for a 4 hour overrun from the Target CPU Time (this is a fixed time, regardless of Target CPU time? 2hr or 36hr Target CPU time, 4hrs overrun till the watchdog timer ends the Task?). So for Tasks of only 2hr Target time, the estimate of the FLOPs for that Task (wu.fpops_est) would be very small, but the limit on FLOPs after which the job will be aborted (wu.fpops_bound) would have to be very, very large to allow the 4 hour overrun before the Watchdog timer ends the Task. And it would have to be very, very, very, very large to allow for high clock speed CPUs- a lot more FLOPs done during that 4hrs than with a slower CPU. As near as i can tell, the wu.fpops_bound value is used for the Sanity check for Task size, estimated completion time & actual completion time used for keeping track of Runtimes & Estimated completion time. The extremely large wu.fpops_bound value (necessary for the 4 hour cutoff for the Watchdog timer) appears to break the Sanity check, so extremely short completion times (ie Tasks erroring out in seconds) are included in Estimated completion time calculations instead of being excluded. Does the project track how many tasks exceed their Target CPU time? By how much they exceed that time? Maybe that 4 hours could be reduced to 1hr, or even 30min?
If my WAG is correct, that would then (maybe, hopefully) allow the Sanity check to work as intended to exclude extreme outlier runtimes (ie Tasks erroring out in seconds or even minutes) and help reduce people getting more work than they can handle- at least when things go haywire (overly optimistic caches will of course still cause their own issues). I'll leave it to those with more of a clue as to how BOINC works to figure out if i'm barking up the wrong tree or not. Grant Darwin NT
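The arithmetic behind that guess can be sketched as below. It rests on simplifying assumptions: that a host's runtime limit is roughly fpops_bound divided by its speed, that the estimate is sized for a slow host while the bound has to let even a fast host run for target plus overrun, and that the two host speeds (2 and 20 GFLOPS) are made-up illustrative figures. None of this is Rosetta's actual configuration.

```python
HOUR = 3600.0  # seconds per hour


def fpops_for(hours: float, host_flops: float) -> float:
    """FLOPs a host running at host_flops performs in the given hours."""
    return hours * HOUR * host_flops


def time_limit_hours(fpops_bound: float, host_flops: float) -> float:
    """Hours a host can run before reaching fpops_bound (simplified model)."""
    return fpops_bound / host_flops / HOUR


# Hypothetical host speeds, chosen only for illustration.
SLOW_HOST = 2e9    # 2 GFLOPS
FAST_HOST = 20e9   # 20 GFLOPS

target_hours, overrun_hours = 2.0, 4.0  # 2hr target, 4hr overrun from the post

# Assumption: estimate sized for the slow host, bound sized so that the
# fast host can still run for target + overrun before being aborted.
fpops_est = fpops_for(target_hours, SLOW_HOST)
fpops_bound = fpops_for(target_hours + overrun_hours, FAST_HOST)

print(f"bound / estimate ratio: {fpops_bound / fpops_est:.0f}x")
print(f"limit on fast host: {time_limit_hours(fpops_bound, FAST_HOST):.0f} h")
print(f"limit on slow host: {time_limit_hours(fpops_bound, SLOW_HOST):.0f} h")
```

Under these assumptions the bound ends up far larger than the estimate, and a slow host is allowed to run for days before the bound is reached, so any sanity check tied to the bound would be very loose.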

JAMES Joined: 5 May 07 Posts: 8 Credit: 275,386 RAC: 0	Message 94317 - Posted: 13 Apr 2020, 5:43:35 UTC - in response to Message 94252. "Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire." Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94319 - Posted: 13 Apr 2020, 6:32:35 UTC - in response to Message 94317. Last modified: 13 Apr 2020, 6:33:13 UTC "Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each." Some Rosetta Tasks can use up to 1GB of HDD space (actually it's probably more than that, as many Tasks use less- at one stage i had 12GB of HDD space in use by Rosetta with 12 Tasks running). Grant Darwin NT

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 94320 - Posted: 13 Apr 2020, 6:43:16 UTC - in response to Message 94305. "If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times." "If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team." "I've not a clue, but put out a request for help to those that might. Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when trying to make sense of it all in the past." And help has arrived. The answer- The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker SETI@Home Informational message -9 result_overflow NOTE: The number of results detected equals the storage space allocated. in MB tasks, and the percentage of radar blanking in AP tasks. Tell them to look at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers Grant Darwin NT

Sid Celery Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 474	Message 94354 - Posted: 13 Apr 2020, 16:03:06 UTC - in response to Message 94252. "Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire." Good choice. While scheduling is a Boinc issue, not Rosetta, Rosetta's initial runtime setting for new program versions makes it worse. But with a 0.5+0.1 queue and 8hr runtime, Boinc will time out the initially wrongly sent tasks within 24hrs and give you until the 3-day deadline to send back and be credited for as many as can be completed by then even if you don't manually intervene. Intervening may even make matters worse, so save yourself the trouble. This is a start-up, one-time issue for new hosts and/or new program versions. The project isn't about the first few days but however long the host contributes, so no need to obsess over it in the first day or two. The host will contribute the maximum it can either way.

Aurum Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0	Message 95806 - Posted: 2 May 2020, 14:23:43 UTC It has nothing to do with WUs failing or the runtime estimate being wrong. I can crunch any Rosetta WU they send. Rosetta just simply does not respect BOINC settings and DLs far too many WUs. Client side fix is to set all computers to No New Work and abort a few thousand a day.

Sid Celery Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 474	Message 95819 - Posted: 2 May 2020, 16:04:44 UTC - in response to Message 95806. "It has nothing to do with WUs failing or the runtime estimate being wrong" What is the <expected> runtime of your tasks now? Has it got closer to the runtime that you set? It will by now. If so, it'll only be grabbing what you can complete within the deadline with a maximum cache setting of 1.5 days. That's how it works.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 95832 - Posted: 2 May 2020, 18:07:32 UTC My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine. Rosetta Moderator: Mod.Sense
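A rough worked example shows how far a 1H 27M estimate can overfill a cache when tasks really run for 24 hours. It is a simplified model, not BOINC's actual work-fetch logic: it assumes the client simply fills cache days x cores with tasks of the estimated length, and it borrows the 0.5 + 0.1 day cache and quad-core host from the first post purely as illustrative inputs. The function names are hypothetical.

```python
import math


def tasks_fetched(cache_days: float, cores: int, estimated_hours: float) -> int:
    """Tasks needed to fill the cache if every task took estimated_hours."""
    buffer_core_hours = cache_days * 24.0 * cores
    return math.floor(buffer_core_hours / estimated_hours)


def days_to_finish(n_tasks: int, cores: int, actual_hours: float) -> float:
    """Wall-clock days to run n_tasks that each really take actual_hours."""
    return n_tasks * actual_hours / cores / 24.0


cache_days = 0.5 + 0.1        # cache setting from the first post
cores = 4                     # quad-core host from the first post
estimated_hours = 87 / 60     # the 1H 27M estimate
actual_hours = 24.0           # 24 hour runtime preference

n_bad = tasks_fetched(cache_days, cores, estimated_hours)
n_good = tasks_fetched(cache_days, cores, actual_hours)

print(f"with the 1H 27M estimate: {n_bad} tasks, "
      f"{days_to_finish(n_bad, cores, actual_hours):.2f} days of real work")
print(f"with a correct estimate:  {n_good} tasks, "
      f"{days_to_finish(n_good, cores, actual_hours):.2f} days of real work")
```

Under this model the bad estimate fetches well over a week of real work for a cache meant to hold about half a day, which would not fit inside the 3-day deadline mentioned earlier in the thread.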

Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,390,659 RAC: 0	Message 95858 - Posted: 2 May 2020, 20:58:02 UTC - in response to Message 95832. Last modified: 2 May 2020, 21:00:38 UTC "My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine." I decided to give Ralph a spin on my Mac. My cache setting is set to 0.1 + 0 days and it still somehow managed to download nearly too much work. That's because the estimated completion times on my tasks are 47 minutes and 39 seconds, which is a new low... sigh

CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0	Message 95862 - Posted: 2 May 2020, 21:13:46 UTC - in response to Message 95858. "I decided to give Ralph a spin on my Mac. My cache setting is set to 0.1 + 0 days and it still somehow managed to download nearly too much work. That's because the estimated completion times on my tasks are 47 minutes and 39 seconds, which is a new low... sigh" You will find that most Ralph tasks run for about 1 hour or so regardless of the estimated completion time. Disregard anything about the % complete, or the time complete, they shoot to 100% suddenly without warning. Run a few, you'll see what I mean.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 95867 - Posted: 2 May 2020, 21:35:51 UTC - in response to Message 95862. Ralph has a runtime preference as well. It also tests WUs sometimes that limit the number of models produced. Rosetta Moderator: Mod.Sense

Grant (SSSF) Joined: 28 Mar 20 Posts: 1925 Credit: 18,534,891 RAC: 0	Message 95887 - Posted: 2 May 2020, 23:40:05 UTC - in response to Message 94252. "Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted." Hence my suggestion to fix the problem with Estimated completion times for new hosts/applications. Of course the fact you have such a large cache setting while running so many projects just exacerbates the severity of your problem. Grant Darwin NT

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 95901 - Posted: 3 May 2020, 5:54:16 UTC Last modified: 3 May 2020, 6:05:49 UTC I'm not too sure if this would help, but I set a minimal task cache with zero additional days of work; my setup is 0.1 / 0. For now it seems to work on a Pi4; I'm not sure about the rest. I'm not too sure if there are any boinc client configs that can further limit the number of tasks downloaded. In the most extreme case it may take a custom boinc-client to fix it, I'd think. How about making an entry in the boinc forums to see if they could help? Perhaps they could provide a new option to limit the work cache based on the number of tasks.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 95943 - Posted: 3 May 2020, 19:23:11 UTC - in response to Message 95901. Scheduler changes are being tested on Ralph that will help avoid getting more work than can be completed within the deadline. Rosetta Moderator: Mod.Sense