Rosetta WU delivery out of control

Aurum

Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 0
Message 94252 - Posted: 12 Apr 2020, 16:11:57 UTC

Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted.
I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire.
ID: 94252
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94271 - Posted: 12 Apr 2020, 20:12:04 UTC - in response to Message 94252.  

With your hosts hidden, it is difficult to offer you any suggestions to improve your situation.

It sounds like perhaps you hit a batch of work that failed rather immediately, and the BOINC Manager started to think that was a normal runtime. The project obviously needs to stop sending tasks that fail, and that will then avoid the side-effect you seem to be observing.
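A rough way to see the effect described above is to model the work fetch as "fill the cache, going by the current estimate". The sketch below is a simplified assumption, not the actual BOINC client logic; the function name and the 5-minute skewed estimate are hypothetical, while the 0.5 + 0.1 day cache and the quad-core host come from the first post.

```python
import math


def tasks_to_fill_cache(cache_days: float, n_cores: int, est_runtime_hours: float) -> int:
    """Simplified model: number of tasks that fit in the requested buffer,
    judged purely by the current per-task runtime estimate."""
    buffer_core_hours = cache_days * 24 * n_cores
    return math.floor(buffer_core_hours / est_runtime_hours)


# Settings from the first post: 0.5 + 0.1 days of cache on a quad-core host.
cache_days, n_cores = 0.5 + 0.1, 4

# With a realistic 8-hour estimate the buffer holds only a handful of tasks.
realistic = tasks_to_fill_cache(cache_days, n_cores, est_runtime_hours=8.0)

# Hypothetical: the estimate has collapsed to 5 minutes after a run of
# tasks that failed almost immediately.
skewed = tasks_to_fill_cache(cache_days, n_cores, est_runtime_hours=5 / 60)

print(realistic, skewed)
```

Under this model the same cache setting asks for roughly a hundred times more tasks once the estimate collapses.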
Rosetta Moderator: Mod.Sense
ID: 94271
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94293 - Posted: 13 Apr 2020, 0:15:44 UTC - in response to Message 94271.  
Last modified: 13 Apr 2020, 0:17:28 UTC

With your hosts hidden, it is difficult to offer you any suggestions to improve your situation.

It sounds like perhaps you hit a batch of work that failed rather immediately, and the BOINC Manager started to think that was a normal runtime. The project obviously needs to stop sending tasks that fail, and that will then avoid the side-effect you seem to be observing.
If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.
Grant
Darwin NT
ID: 94293
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94295 - Posted: 13 Apr 2020, 0:20:01 UTC - in response to Message 94252.  

Its queue was set to 0.5/0.1 days.
Given the huge number of projects you are attached to, 0.25 + 0.02 would probably be a better cache setting (although it still would've had issues with the sheer number of faulty Tasks that were sent out).
Grant
Darwin NT
ID: 94295
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94303 - Posted: 13 Apr 2020, 0:42:31 UTC - in response to Message 94293.  
Last modified: 13 Apr 2020, 13:25:15 UTC

If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.


If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team.
Rosetta Moderator: Mod.Sense
ID: 94303
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94305 - Posted: 13 Apr 2020, 1:21:58 UTC - in response to Message 94303.  
Last modified: 13 Apr 2020, 2:21:28 UTC

If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.
If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team.
I've not a clue, but put out a request for help to those that might.



Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when trying to make sense of it all in the past.

From the looks of it, what is happening here at Rosetta shouldn't be happening- it appears you don't need to explicitly exclude Invalid or Error results. There is meant to be a function that excludes outlier results. eg a Task that has an Estimated completion time of 8 hours and finishes in 7 hours will be used to calculate further Estimated completion times. Likewise one Estimated to take 8 hours that actually takes 9hrs will be used for further estimates.
But one that goes 4 hours over the Estimated completion time, or one that finishes in less than half the time Estimated should be discarded from Estimated completion time calculations.
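The rule described above can be sketched as a small filter. This is only an illustration of the thresholds as stated in this post (more than 4 hours over the estimate, or under half of it); the function name and parameters are hypothetical and are not taken from the BOINC source.

```python
def usable_for_estimate(estimated_hours: float, actual_hours: float,
                        max_overrun_hours: float = 4.0,
                        min_fraction: float = 0.5) -> bool:
    """Return True if this result's runtime should feed future estimates.

    Thresholds follow the description in the post, not BOINC's real code.
    """
    if actual_hours > estimated_hours + max_overrun_hours:
        return False  # ran far longer than estimated: discard
    if actual_hours < estimated_hours * min_fraction:
        return False  # finished suspiciously early, e.g. an instant error
    return True


print(usable_for_estimate(8, 7))      # slightly under the estimate
print(usable_for_estimate(8, 9))      # slightly over the estimate
print(usable_for_estimate(8, 13))     # more than 4 hours over
print(usable_for_estimate(8, 0.01))   # failed within seconds
```

Only the first two cases would count towards later estimates under this rule; the last one is the kind of result that skews estimates when it is not filtered out.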


Job Runtime estimation
Credit New
Grant
Darwin NT
ID: 94305
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94312 - Posted: 13 Apr 2020, 3:58:03 UTC - in response to Message 94303.  

If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team.
I've had a look, and come up with a WAG (Wild Arse Guess).

For each Task, the project supplies
an estimate of the FLOPs used by a job (wu.fpops_est)
a limit on FLOPs, after which the job will be aborted (wu.fpops_bound).
Rosetta allows for a 4 hour overrun from the Target CPU Time (this is a fixed time, regardless of Target CPU time? 2hr or 36hr Target CPU time, 4hrs overrun till the watchdog timer ends the Task?).
So for Tasks of only 2hr Target time, the estimate of the FLOPs for that Task (wu.fpops_est) would be very small, but the limit on FLOPs after which the job will be aborted (wu.fpops_bound) would have to be very, very large to allow the 4 hour overrun before the Watchdog timer ends the Task. And it would have to be very, very, very, very large to allow for high clock speed CPUs- a lot more FLOPs done during that 4hrs than with a slower CPU.

As near as i can tell, the wu.fpops_bound value is used for the Sanity check for Task size, estimated completion time & actual completion time used for keeping track of Runtimes & Estimated completion time. The extremely large wu.fpops_bound value (necessary for the 4 hour cutoff for the Watchdog timer) appears to break the Sanity check, so extremely short completion times (ie Tasks erroring out in seconds) are included in Estimated completion time calculations instead of being excluded.
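To make the guess above concrete, here is a minimal sketch. It assumes that the estimated runtime is simply the FLOPs estimate divided by the host's speed, and that the bound has to cover the Target CPU time plus the 4-hour overrun; the helper names and the 4 GFLOPS host speed are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def fpops_for_hours(hours: float, host_flops: float) -> float:
    """FLOPs a host of the given speed performs in the given number of hours."""
    return hours * SECONDS_PER_HOUR * host_flops


def bound_to_estimate_ratio(target_hours: float, overrun_hours: float = 4.0,
                            host_flops: float = 4e9) -> float:
    """Ratio of the abort limit to the estimate if the limit must allow a
    fixed overrun on top of the Target CPU time (an assumption of this sketch)."""
    fpops_est = fpops_for_hours(target_hours, host_flops)
    fpops_bound = fpops_for_hours(target_hours + overrun_hours, host_flops)
    return fpops_bound / fpops_est


ratio_2h = bound_to_estimate_ratio(2.0)    # short Target CPU time
ratio_36h = bound_to_estimate_ratio(36.0)  # long Target CPU time
print(ratio_2h, round(ratio_36h, 2))
```

With a fixed 4-hour overrun, the bound is three times the estimate for a 2-hour Task but only about 1.1 times for a 36-hour Task. If the guess is right, any check expressed relative to the bound would be much looser for short Target CPU times.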


Does the project track how many tasks exceed their Target CPU time? By how much they exceed that time? Maybe that 4 hours could be reduced to 1hr, or even 30min? If my WAG is correct, that would then (maybe, hopefully) allow the Sanity check to work as intended to exclude extreme outlier runtimes (ie Tasks erroring out in seconds or even minutes) and help reduce people getting more work than they can handle- at least when things go haywire (overly optimistic caches will of course still cause their own issues).

I'll leave it to those with more of a clue as to how BOINC works to figure out if i'm barking up the wrong tree or not.
Grant
Darwin NT
ID: 94312
JAMES

Joined: 5 May 07
Posts: 8
Credit: 275,386
RAC: 0
Message 94317 - Posted: 13 Apr 2020, 5:43:35 UTC - in response to Message 94252.  

Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted.
I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire.


Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each.
ID: 94317
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94319 - Posted: 13 Apr 2020, 6:32:35 UTC - in response to Message 94317.  
Last modified: 13 Apr 2020, 6:33:13 UTC

Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each.
Some Rosetta Tasks can use up to 1GB of HDD space (actually it's probably more than that, as many Tasks use less- at one stage i had 12GB of HDD space in use by Rosetta with 12 Tasks running).
Grant
Darwin NT
ID: 94319
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 94320 - Posted: 13 Apr 2020, 6:43:16 UTC - in response to Message 94305.  

If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.
If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team.
I've not a clue, but put out a request for help to those that might.

Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when trying to make sense of it all in the past.
And help has arrived.


The answer-
The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker

SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.
in MB tasks, and the percentage of radar blanking in AP tasks.

Tell them to look at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
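As a rough illustration of the "runtime outlier" idea, and not the actual validator code (which is project-specific, as noted above), the sketch below flags any result whose output contains the overflow marker and leaves it out of the runtimes used for estimates. The dictionary layout and function names are hypothetical.

```python
OVERFLOW_MARKER = "result_overflow"  # marker text quoted in the post above


def is_runtime_outlier(stderr_text: str) -> bool:
    """Treat a result as a runtime outlier if its output carries the marker."""
    return OVERFLOW_MARKER in stderr_text


def runtimes_for_estimation(results: list) -> list:
    """Keep only runtimes from results that were not flagged as outliers."""
    return [r["runtime_s"] for r in results if not is_runtime_outlier(r["stderr"])]


# Hypothetical sample: one normal result and one that overflowed after 12 s.
results = [
    {"runtime_s": 28800.0, "stderr": "finished normally"},
    {"runtime_s": 12.0,
     "stderr": "SETI@Home Informational message -9 result_overflow"},
]

kept = runtimes_for_estimation(results)
print(kept)
```

The design point is that the outlier decision is made from something the application itself reports, rather than from the runtime alone, so a legitimately short task is not confused with a failed one.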

Grant
Darwin NT
ID: 94320
Sid Celery

Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 94354 - Posted: 13 Apr 2020, 16:03:06 UTC - in response to Message 94252.  

Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted.
I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire.

Good choice. While scheduling is a Boinc issue, not Rosetta, Rosetta's initial runtime setting for new program versions makes it worse.
But with a 0.5+0.1 queue and 8hr runtime, Boinc will time out the initially wrongly sent tasks within 24hrs and give you until the 3-day deadline to send back and be credited for as many as can be completed by then even if you don't manually intervene. Intervening may even make matters worse, so save yourself the trouble.
This is a start-up, one-time issue for new hosts and/or new program versions.
The project isn't about the first few days but however long the host contributes, so no need to obsess over it in the first day or two. The host will contribute the maximum it can either way.
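The arithmetic behind this can be sketched as follows, assuming one task per core running around the clock and no other projects sharing the host (both simplifications). The 8-hour runtime and 3-day deadline are from this post, the quad-core host and the 999 tasks from the first post; the function name is hypothetical.

```python
def max_tasks_before_deadline(n_cores: int, deadline_days: float, runtime_hours: float) -> int:
    """Upper limit on tasks a host can finish before the deadline,
    running one task per core without interruption."""
    tasks_per_core = int((deadline_days * 24) // runtime_hours)
    return n_cores * tasks_per_core


completable = max_tasks_before_deadline(n_cores=4, deadline_days=3, runtime_hours=8)
left_over = 999 - completable  # tasks that could never start in time
print(completable, left_over)
```

Under these assumptions only a few dozen of the delivered tasks can be completed, and the rest will expire whether or not they are aborted by hand.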
ID: 94354
Aurum

Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 0
Message 95806 - Posted: 2 May 2020, 14:23:43 UTC

It has nothing to do with WUs failing or the runtime estimate being wrong. I can crunch any Rosetta WU they send. Rosetta just simply does not respect BOINC settings and DLs far too many WUs. Client side fix is to set all computers to No New Work and abort a few thousand a day.
ID: 95806
Sid Celery

Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 95819 - Posted: 2 May 2020, 16:04:44 UTC - in response to Message 95806.  

It has nothing to do with WUs failing or the runtime estimate being wrong

What is the <expected> runtime of your tasks now? Has it got closer to the runtime that you set? It will by now.
If so, it'll only be grabbing what you can complete within the deadline with a maximum cache setting of 1.5 days. That's how it works.
ID: 95819
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95832 - Posted: 2 May 2020, 18:07:32 UTC

My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine.
Rosetta Moderator: Mod.Sense
ID: 95832
Tomcat雄猫

Joined: 20 Dec 14
Posts: 180
Credit: 5,386,173
RAC: 0
Message 95858 - Posted: 2 May 2020, 20:58:02 UTC - in response to Message 95832.  
Last modified: 2 May 2020, 21:00:38 UTC

My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine.


I decided to give Ralph a spin on my Mac. My cache setting is set to 0.1 + 0 days and it still somehow managed to download nearly too much work. That's because the estimated completion times on my tasks are 47 minutes and 39 seconds, which is a new low...
*sigh*
ID: 95858
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95862 - Posted: 2 May 2020, 21:13:46 UTC - in response to Message 95858.  


I decided to give Ralph a spin on my Mac. My cache setting is set to 0.1 + 0 days and it still somehow managed to download nearly too much work. That's because the estimated completion times on my tasks are 47 minutes and 39 seconds, which is a new low...
*sigh*


You will find that most Ralph tasks run for about 1 hour or so regardless of the estimated completion time. Disregard anything about the % complete, or the time complete, they shoot to 100% suddenly without warning.

Run a few, you'll see what I mean.
ID: 95862
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95867 - Posted: 2 May 2020, 21:35:51 UTC - in response to Message 95862.  

Ralph has a runtime preference as well. It also tests WUs sometimes that limit the number of models produced.
Rosetta Moderator: Mod.Sense
ID: 95867
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1677
Credit: 17,757,861
RAC: 22,931
Message 95887 - Posted: 2 May 2020, 23:40:05 UTC - in response to Message 94252.  

Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted.
Hence my suggestion to fix the problem with Estimated completion times for new hosts/applications.
Of course the fact you have such a large cache setting while running so many projects just exacerbates the severity of your problem.
Grant
Darwin NT
ID: 95887
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 95901 - Posted: 3 May 2020, 5:54:16 UTC
Last modified: 3 May 2020, 6:05:49 UTC

i'm not too sure if this would help but i set zero task cache , store additional zero days of work
my setup is 0.1 / 0
for now it seemed to work on Pi4 i'm not sure about the rest

i'm not too sure if there is any boinc client configs that can further limit the number of tasks downloaded.
in the most extreme it may take a custom boinc-client to fix it i'd think.
how about make an entry in the boinc forums to see if they could help? perhaps provide a new option to limit work cache based on the number of tasks?
ID: 95901
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95943 - Posted: 3 May 2020, 19:23:11 UTC - in response to Message 95901.  

Scheduler changes are being tested on Ralph that will help avoid getting more work than can be completed within the deadline.
Rosetta Moderator: Mod.Sense
ID: 95943
©2024 University of Washington
https://www.bakerlab.org