Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109161 - Posted: 24 Apr 2024, 0:49:35 UTC - in response to Message 109157.  

I've got 15 tasks returned after deadline and they've all validated and credited.
I have a further 6 awaiting validation.

Just checking further, the tasks I returned after deadline have been reissued to other users 10 minutes before I returned them.
One of the reissues has been cancelled by the Server. The others haven't.

Good for you.
I have a lot of "cancelled by the server"

So do I (although mostly they run OK). There seems to be something wrong with the server. It sends out a task, and before it returns its result or times out it sends the same one to me. Then the first user returns the result, and mine gets cancelled. Just plain sloppy.

It's a consequence of the whole site being down.
It seems like, once the site came back up, it timed-out tasks that missed deadline straight away and reissued them, but the host didn't re-poll the server until its timer ran out - could've been 4-5hrs after the site came back up - to report they were completed.
It's just unfortunate.
ID: 109161
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109162 - Posted: 24 Apr 2024, 6:28:59 UTC - in response to Message 109159.  

Changing the target runtime back to 8hrs, even with the folding@home contention, will take 7-11hrs out of the running tasks and a further 7-11hrs out of the cached tasks.
14-22hrs less processing time to complete tasks will make a huge difference to whether Panic mode arises.
All that does is stop Panic mode from occurring most of the time- there will still be times where it does occur (because of all the other projects all taking longer to complete their Tasks than they expect to as well).
Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.


It can certainly be solved your way, but that gets a bit fiddly imo and doesn't resolve the confusion Rosetta runtime introduces.
How is it fiddly?
I'm changing one value, and fixing the cause of the problem (over committed CPU).
You're changing one value, and fixing the symptom (Panic mode occurring).

In both cases, only one value needs to be changed.
Although it does require some thought to fix the problem, to determine what % "Use at most..." should be set to.
87% leaves 1 core/thread free for non-BOINC work (7/8=0.875).
75% leaves 2 cores/threads free for non-BOINC work (6/8=0.75).
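For anyone wanting to work the figure out for a different core count, here is a minimal sketch in Python of the same division. The function name is made up for illustration. It only computes the exact fraction; how the BOINC client rounds a value such as 87% of 8 to a whole number of threads is not covered in this thread, so treat the output as a starting point.

```python
def cpu_percentage(total_threads: int, threads_to_leave_free: int) -> float:
    """Exact percentage of threads left to BOINC after reserving some.

    Hypothetical helper, not part of BOINC itself.
    """
    if not 0 <= threads_to_leave_free < total_threads:
        raise ValueError("at least one thread must remain for BOINC")
    return 100.0 * (total_threads - threads_to_leave_free) / total_threads


if __name__ == "__main__":
    # The two cases from the post: an 8-thread system with 1 or 2 threads kept free.
    for free in (1, 2):
        print(f"8 threads, {free} kept free: {cpu_percentage(8, free)}%")
```

On a 16-thread system the same call with 2 threads kept free gives 87.5%, which shows why the percentage has to be recalculated per machine.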

Not really a big effort required IMHO.
Grant
Darwin NT
ID: 109162
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109169 - Posted: 25 Apr 2024, 5:14:48 UTC
Last modified: 25 Apr 2024, 5:24:56 UTC

And once again we've got problems.
The Validators & Assimilators are down, so the backlog of that work continues to pile up. And if it backs up enough, then the disks end up full & things crash and fall over all over again.


Edit- looks like they're all on the one server- boinc-process
Grant
Darwin NT
ID: 109169
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109172 - Posted: 25 Apr 2024, 10:00:17 UTC - in response to Message 109162.  

Changing the target runtime back to 8hrs, even with the folding@home contention, will take 7-11hrs out of the running tasks and a further 7-11hrs out of the cached tasks.
14-22hrs less processing time to complete tasks will make a huge difference to whether Panic mode arises.
All that does is stop Panic mode from occurring most of the time- there will still be times where it does occur (because of all the other projects all taking longer to complete their Tasks than they expect to as well).

Given panic-mode means Boinc realises tasks can't be completed within deadline, preventing Panic mode occurring is the entire solution.
It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
Missing deadlines has all sorts of consequences both sides of the server divide. Meeting deadlines has none.

For some reason I now want to quote Mr Micawber from Charles Dickens' David Copperfield:
“Annual income twenty pounds, annual expenditure nineteen pounds, nineteen and six, result happiness.
Annual income twenty pounds, annual expenditure twenty pounds ought and six, result misery”
Point being, the detail isn't relevant as long as you succeed.

Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.

First, I dispute the sharing of cores & threads is a) a problem and b) one that needs fixing.
Second, that it's any business of the user as long as the computer doesn't crash and completes its work successfully and within the envelope of time allowed.
If the user is happy for more tasks to be running simultaneously, outside of their individual planned time, but still within the overall deadline, that's entirely up to them.
Your alternative being a smaller number of tasks run for each project, but with a core/thread dedicated to them, which is fine but will fall flat when there's a lack of task availability.
It's a choice. I recognise it, but I wouldn't personally opt for your one either.
ID: 109172
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109173 - Posted: 25 Apr 2024, 10:18:19 UTC

I was going to edit the last post, but decided it's worth a new message.

I notice adrianxw hasn't reappeared here to comment, so I looked at his tasks and he's taken Rosetta off "no new tasks".
I believe he's now set Target Run Time to the default. Not to 8hrs explicitly, but the default. That is "Not Selected".

However, his completed tasks now use ~10,800secs (3hrs) of CPU time rather than 43,200secs (12hrs), taking ~15,000secs of wall-clock time rather than ~112,000secs.
This will definitely provide a solution for him imo. Fine.

At some point somewhere - and quite recently - Rosetta's default appears to have changed to 3hrs, meaning tasks get completed and used up far more quickly than intended.
And I'm not sure about this, but I think Boinc is forced to assume and schedule Rosetta tasks to run for 8hrs, which is now not right.

Can people check what they have set up?
Is it 8 hrs or "Not Selected"?
Do tasks run for 8hrs or 3 for "Not Selected"?
I believe it's the latter.
What does Boinc assume runtime will be at download?

Something's gone wrong imo.
ID: 109173
kotenok2000
Joined: 22 Feb 11
Posts: 259
Credit: 486,566
RAC: 352
Message 109174 - Posted: 25 Apr 2024, 10:39:06 UTC

It sets 8 hours for 4.20 and 3 for 6.05
ID: 109174
Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 109175 - Posted: 25 Apr 2024, 10:45:50 UTC - in response to Message 109172.  

Given panic-mode means Boinc realises tasks can't be completed within deadline, preventing Panic mode occurring is the entire solution.
Eliminating the reason for the panic mode is the entire solution, everything else is a workaround, which might fail as soon as something changes (new WU type, new project, whatever) or even before.


It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
It's not just not pretty, highly overcommitting the system might slow down the overall production, in particular with hyperthreading CPUs; many people leave 1-2 threads for non-BOINC stuff.
.
ID: 109175
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109176 - Posted: 25 Apr 2024, 11:07:33 UTC - in response to Message 109172.  

Given panic-mode means Boinc realises tasks can't be completed within deadline, preventing Panic mode occurring is the entire solution.
No, Panic mode doesn't mean they won't be completed. It means there is a high risk of not being completed if not processed immediately.
Which fixing the overcommitted CPU does resolve.


It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
No one in their right mind would think taking 12 hours to do 6 hours' work is good (which is double the time required- on another project it's taking them 4 times as long).


Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.
First, I dispute the sharing of cores & threads is a) a problem and b) one that needs fixing.
You may dispute that, but it doesn't make it any less true.
And it needs fixing because the poster keeps complaining about it. If they don't complain about it, then no it doesn't need fixing.


If the user is happy for more tasks to be running simultaneously, outside of their individual planned time, but still within the overall deadline, that's entirely up to them.
Yep.
But in this case it may cause problems with deadlines, resulting in Panic Mode, which the poster has an issue with, so it is an issue that should be addressed.
Why fix the symptom, when fixing the problem would result in more work being done- even with less cores/threads available to BOINC, the amount of work done for BOINC would be almost triple what it presently is.


Your alternative being a smaller number of tasks run for each project, but with a core/thread dedicated to them, which is fine but will fall flat when there's a lack of task availability.
Why would you think that?????
All my setting does is stop 9 things, or 10 things or more from trying to run on 8 cores/threads at the same time. It does not in any way stop cores/threads from being used by different projects at the same time. What it does stop is BOINC from trying to use cores/threads that are being heavily used by non BOINC applications.
If there are 10 projects with work, or only one, all available cores/threads will be used.
Grant
Darwin NT
ID: 109176
tazzduke

Joined: 2 Jul 09
Posts: 2
Credit: 1,234,765
RAC: 44
Message 109177 - Posted: 25 Apr 2024, 11:25:45 UTC - in response to Message 109176.  

Greetings,

Well I have 3 systems running at the moment, all using the default location in preferences with target cpu set to 2hrs

Ryzen 5700x (#1) 8c/16t - only using 8c, cpu times are averaging 3hrs (Win11)

Ryzen 5700x (#2) 8c/16t - only using 8c, cpu times are averaging 2hrs (Win11)

Dual Xeon E5-2470v2 (20c/40t) - only using 8c, cpu times are averaging 2hrs (Linux Mint 21.3) also LHC using some cores as well.

I have set work fetch preferences to 0.1 days & 0.1 days, which keeps a small amount of workunits in cache on each machine; it's how I like it.

But as Grant has already mentioned, the validators are still offline, as pendings are growing.

I also fine-tuned the core usage on these machines, as I have app_config files in each project, because I sometimes run various other projects; again, my preference only.

When pushing hard on some projects and I start using the hyperthreads, I still as a rule, leave 2 threads in reserve for each cpu, for the OS and GPU to use, again my preference only.

Hope you have a good day :-)
ID: 109177
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109178 - Posted: 25 Apr 2024, 11:28:30 UTC - in response to Message 109173.  

At some point somewhere - and quite recently - Rosetta's default appears to have changed to 3hrs, meaning tasks get completed and used up far more quickly than intended.
And I'm not sure about this, but I think Boinc is forced to assume and schedule Rosetta tasks to run for 8hrs, which is now not right.
The default Runtime is still 8 hours.
Rosetta 4.20 Tasks generally still take that long. However, Rosetta Beta Tasks generally only require 3 hours.


The initial Estimated completion time for Rosetta has been broken ever since i joined Rosetta.
When i joined, the initial Estimated Completion time was way, way, way less than the actual Runtime, and people would download hundreds (even thousands for the huge multicore systems) of Tasks and most would time out, but eventually the Estimated Completion time would reflect the actual Runtime.

The best fix would have been to use the mechanism that every other project uses- an Initial Estimated completion time based on the Estimated amount of work to do, but Rosetta doesn't work that way.
The next best fix would have been to use the average Runtime for all tasks for a given application (or the previous application, or that new application from the Ralph Runtimes) for the Initial Estimated Completion time, which would eventually end up matching the actual Runtime.
The next best fix would have been to set the Initial Estimated Completion time to match the Target CPU Runtime set by each cruncher for their systems, and which would eventually end up matching the actual Runtime.
The next best fix would have been to set the Initial Estimated Completion time to match the project's default Target CPU time, and it would eventually end up matching the actual Runtime.
The next best fix would have been to set the Initial Estimated Completion time to match the Target CPU Runtime set by each cruncher for their systems, and not update it using their actual Runtimes.
The next best fix would have been to set the Initial Estimated Completion time to match the project's default Target CPU time, and not update it using their actual Runtimes. And that's what we ended up with.
While it was a huge improvement over what was used before, it was nowhere near as good as it could have been.
Grant
Darwin NT
ID: 109178
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109182 - Posted: 25 Apr 2024, 23:39:03 UTC - in response to Message 109174.  

It sets 8 hours for 4.20 and 3 for 6.05

Oh! I didn't even think of that.
I assume this is for Target CPU Runtime = "Not Selected".

So, does Boinc assume they're all 8hr tasks before they run, then rapidly reduce Remaining Time as the 6.04/6.05 task works its way through?

So that Boinc schedules the same as tasks run, I'd set Target CPU Runtime explicitly to 8hrs, not "Not Selected"
ID: 109182
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109183 - Posted: 25 Apr 2024, 23:57:10 UTC - in response to Message 109175.  
Last modified: 25 Apr 2024, 23:57:45 UTC

Given panic-mode means Boinc realises tasks can't be completed within deadline, preventing Panic mode occurring is the entire solution.
Eliminating the reason for the panic mode is the entire solution, everything else is a workaround, which might fail as soon as something changes (new WU type, new project, whatever) or even before.

The root cause reason for panic mode is holding too large an offline cache.
Aside from the number of days you chose to hold, if Rosetta actively misleads Boinc on top of that, which it certainly does, then that's what has to be resolved before anything else.

It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
It's not just not pretty, highly overcommitting the system might slow down the overall production, in particular with hyperthreading CPUs; many people leave 1-2 threads for non-BOINC stuff.

Might it? Does it? To what extent?
I don't leave any cores/threads spare.
Occasionally I get WCG GPU tasks for Open Pandemics and genuinely don't notice any deterioration/extension of wall clock times as a result.
There may be some, but not so that I notice.
Other GPU tasks with other projects may be different, but I don't run them.
I understand the theoretical point. I just don't see any practical difference.
ID: 109183
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109184 - Posted: 26 Apr 2024, 0:44:31 UTC - in response to Message 109176.  

It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
No one in their right mind would think taking 12 hours to do 6 hours' work is good (which is double the time required- on another project it's taking them 4 times as long).

But that's not what's happening, is it?
It's taking 12hrs to do 6hrs work because it's taking that same 12hrs to run 6hrs of non-Boinc work. So in 12hrs it's doing 6+6hrs work=12hrs.
This isn't a problem, because Adrian (in this case) said both projects are important to him.

Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.
First, I dispute the sharing of cores & threads is a) a problem and b) one that needs fixing.
You may dispute that, but it doesn't make it any less true.
And it needs fixing because the poster keeps complaining about it. If they don't complain about it, then no it doesn't need fixing.

It's not true at all because you're only counting Boinc work as work.

If the user is happy for more tasks to be running simultaneously, outside of their individual planned time, but still within the overall deadline, that's entirely up to them.
Yep.
But in this case it may cause problems with deadlines, resulting in Panic Mode, which the poster has an issue with, so it is an issue that should be addressed.
Why fix the symptom, when fixing the problem would result in more work being done- even with less cores/threads available to BOINC, the amount of work done for BOINC would be almost triple what it presently is.

Because there's more than one problem and you're not acknowledging the significance of the first of them.
Unfortunately (because it's boring) I'm going to have to spell it out.

I hold a 0.5 + 0.1day cache and deliberately run Rosetta for 12hrs rather than 8. (like Adrian unwittingly does/did)
When the website came back up, a load of tasks came down and started at the same time.

Boinc saw them as 8hr (0.33day) tasks so downloaded 1 more task per core (=0.66 days for Boinc, but Target runtime 1.0 days)

After about 2hrs, the total of tasks drops below 0.6 days according to Boinc, but is actually 22hrs according to my target runtime, so Boinc downloads another task per core.
Boinc sees this as 3 x 8hr tasks per core, minus 2hrs, which equals 22hrs of work.
My CPU runtime settings mean it's actually 36(-2)=34hrs work.
That's 12hrs difference already because of what Rosetta does that Boinc won't recognise until 6hrs in, when Boinc tells me there's 4hrs remaining (still 2hrs short) or 7hrs in when Boinc now says there's 3h45m remaining (still 1h15m short) until finally at 8hrs in Boinc finally realises there's still 4hrs remaining. Plus 2 lots of 8hrs per core, that are really 12hrs per core.
So at the start, Boinc saw 3*8hrs of work, whereas after 8hrs of processing there's actually still 28hrs left
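The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in this post, assuming 3 queued tasks per core, the 8hr figure Boinc assumes and the 12hr Target CPU Runtime actually set; the function and constant names are made up for the example.

```python
ASSUMED_HOURS_PER_TASK = 8   # what Boinc assumes for every Rosetta task
TARGET_HOURS_PER_TASK = 12   # the Target CPU Runtime actually set


def queued_hours(tasks_per_core: int, hours_per_task: int, hours_already_run: int) -> int:
    """Hours of work left per core for a queue of equal-length tasks."""
    return tasks_per_core * hours_per_task - hours_already_run


# Two hours in, with a third task per core just downloaded.
boinc_view = queued_hours(3, ASSUMED_HOURS_PER_TASK, 2)
actual = queued_hours(3, TARGET_HOURS_PER_TASK, 2)
shortfall = actual - boinc_view

# Eight hours in, once Boinc has finally caught up on the first task.
actual_after_8hrs = queued_hours(3, TARGET_HOURS_PER_TASK, 8)

if __name__ == "__main__":
    print(f"Boinc thinks {boinc_view}hrs are queued, really {actual}hrs ({shortfall}hrs short)")
    print(f"After 8hrs of processing there are still {actual_after_8hrs}hrs left")
```

Changing TARGET_HOURS_PER_TASK to 8 makes the shortfall zero, which is the whole point about non-standard runtimes.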

And all that's without Folding at home messing things up.

So while I entirely take your point about what Folding@home does to Boinc project runtimes, there's a massive elephant in the room to deal with <first> when using non-standard Rosetta runtimes.

Which is why I went on to ask what a default runtime currently means because it's making a right mess of Boinc's scheduling with some very weird and unexpected consequences
ID: 109184
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109185 - Posted: 26 Apr 2024, 1:02:07 UTC - in response to Message 109177.  

Greetings,

Well I have 3 systems running at the moment, all using the default location in preferences with target cpu set to 2hrs

Welcome.
All that sounds good except for your Target CPU runtime.
The default is supposed to be 8hrs.
By setting target CPU runtime at 2hrs, you're throwing away 6hrs of results per task for the project AND throwing away 6/8ths of the credit you could be earning, while messing up Boinc's scheduling on your PCs, and also making tasks less available to others while we're a bit hand to mouth for task availability in recent months.
Everything else is fine just as you prefer it, but if you could consider changing that runtime, it would be appreciated.
ID: 109185
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109186 - Posted: 26 Apr 2024, 1:09:19 UTC - in response to Message 109178.  

At some point somewhere - and quite recently - Rosetta's default appears to have changed to 3hrs, meaning tasks get completed and used up far more quickly than intended.
And I'm not sure about this, but I think Boinc is forced to assume and schedule Rosetta tasks to run for 8hrs, which is now not right.
The default Runtime is still 8 hours.
Rosetta 4.20 Tasks generally still take that long. However, Rosetta Beta Tasks generally only require 3 hours.

The initial Estimated completion time for Rosetta has been broken ever since i joined Rosetta.
When i joined, the initial Estimated Completion time was way, way, way less than the actual Runtime, and people would download hundreds (even thousands for the huge multicore systems) of Tasks and most would time out, but eventually the Estimated Completion time would reflect the actual Runtime.

It wasn't an issue for me at the time, so I kind of glossed over the reasoning, but yes I think it was to do with estimated runtimes for new users being way out of kilter that caused what we've got right now.

My main PC had a major problem the other month (RAM failure causing endless blue screens) and I had to reinstall everything and my first Rosetta task runtimes were still all over the place before the 8hr thing finally cut in.
Same applied to WCG tbf.
ID: 109186
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109188 - Posted: 26 Apr 2024, 5:03:25 UTC - in response to Message 109182.  

So that Boinc schedules the same as tasks run, I'd set Target CPU Runtime explicitly to 8hrs, not "Not Selected"
It doesn't work that way.
The project has hard coded 8 hours as the default Estimated time remaining, regardless of how long they run or what you have set your Target CPU time to.
As they run, the Estimated Completion time will go up or down as necessary till it eventually comes close to what the actual Runtime ends up being, but when the next Task starts, its Estimated Completion time will always be 8 hours.
Grant
Darwin NT
ID: 109188
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109189 - Posted: 26 Apr 2024, 5:51:31 UTC - in response to Message 109184.  
Last modified: 26 Apr 2024, 6:00:31 UTC

It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
No one in their right mind would think taking 12 hours to do 6 hours' work is good (which is double the time required- on another project it's taking them 4 times as long).
But that's not what's happening, is it?
Yes, YES, YES, that is exactly what is happening, as i pointed out in one of my earlier posts, that you even quoted in one of yours.

And the same issue is happening with your other projects.
Asteroids- 2hrs Runtime, 1hr CPU time.
SIdock- 31.5hrs Runtime, 27hrs 40min CPU time.
Denis- 3hr 40min Runtime, 1hr CPU time.

For reference- CPU time is the amount of time spent by the CPU processing the Task. Runtime- that is the time (think of a clock on the wall) it actually takes to process the Task. From the time it starts running, to the time it finishes & uploads the result.
So for Denis, on his system, Tasks that should take 1 hour to process actually take 3hrs 40min from the time they start to the time they end: 220min for something that should take 60min.
That is exactly what is happening.
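To put a number on each of those, the slowdown is simply Runtime divided by CPU time. A minimal Python sketch using only the three figures quoted above, converted to minutes; the dictionary and function names are made up for illustration.

```python
# (Runtime in minutes, CPU time in minutes), taken from the figures above.
TASK_TIMES = {
    "Asteroids": (2 * 60, 1 * 60),
    "SIdock": (31.5 * 60, 27 * 60 + 40),
    "Denis": (3 * 60 + 40, 1 * 60),
}


def slowdown(runtime_minutes: float, cpu_minutes: float) -> float:
    """How many times longer the Task takes on the wall clock than on the CPU."""
    return runtime_minutes / cpu_minutes


if __name__ == "__main__":
    for project, (runtime, cpu) in TASK_TIMES.items():
        print(f"{project}: {slowdown(runtime, cpu):.2f}x")
```

A value close to 1.00 means the Task had a core/thread more or less to itself; the further above 1.00, the more it was waiting for CPU.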



It's taking 12hrs to do 6hrs work because it's taking that same 12hrs to run 6hrs of non-Boinc work. So in 12hrs it's doing 6+6hrs work=12hrs.
This isn't a problem, because Adrian (in this case) said both projects are important to him.
It is a problem because that isn't what is happening. It is only taking 2 to 4 times as long to process BOINC Tasks, because they run at a very low priority.
Folding@home runs at a much higher priority, so it isn't affected in any way shape or form.

For Folding it would take 1 hour to do 1 hour's worth of work (ie CPU Time = Runtime), whereas here at BOINC it's taking from 31.5 hrs to do 27hrs 40min of SIdock work, to taking twice as long to do Rosetta & Asteroids work, to taking almost 4 times as long to do Denis work.



Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.
First, I dispute the sharing of cores & threads is a) a problem and b) one that needs fixing.
You may dispute that, but it doesn't make it any less true.
And it needs fixing because the poster keeps complaining about it. If they don't complain about it, then no it doesn't need fixing.
It's not true at all because you're only counting Boinc work as work.
Not at all- Folding is work, watching movies on the computer is work, doing emails is work, editing photos is work, transcoding movies is work, and as far as the computer is concerned- playing games is work, surfing the internet is work.
And each and every one of these things requires CPU time in order to do it.

In the case of the vast majority of those tasks the amount of CPU time they require is negligible.
In the case of Folding, and transcoding, it is not negligible & it requires a full core/thread (or more) to do. If it is being fully used by those programmes, then BOINC trying to make use of it as well will result in that BOINC work running much, much, much slower than it would if it wasn't sharing that core/thread (as the times i reposted above show- up to almost 4 times slower in some cases).



Because there's more than one problem and you're not acknowledging the significance of the first of them.
It's not a case of not acknowledging the significance of it, because it's not that significant.
What you are proposing will only fix the High Priority issue with Rosetta (and as Link pointed out, it could easily occur if they add or remove a Project, a new application is released here or on another project.)
What i propose fixes the High priority issue with Rosetta, and it also fixes the ridiculously long Run times here & on the other BOINC projects as well.
An issue that affects multiple projects compared to one that affects only a single project is much more significant IMHO. And so something that fixes all of the issues is much more significant than one which doesn't even fix one issue, it only fixes the symptom, not the underlying cause.



Unfortunately (because it's boring) I'm going to have to spell it out.
No you don't.
I am fully aware of what the effect of having a fixed Initial Estimated Completion time is when the actual Runtime isn't necessarily anywhere near that.
What you don't seem to be appreciating is that by having the system overcommitted, there will still be situations where problems arise because the Estimated Completion time can never, ever match the actual Runtime, because the Runtime (how long it actually takes) is so far out from the CPU time (how long it should actually take).

Just look at Rosetta 4.20 Tasks. They take 8 hours- which is the default CPU Target time. But because the system is overcommitted it will take 16 hours to actually process that Task.
The Initial Estimated completion time & the actual CPU time match perfectly, but problems will still occur because neither of those times comes close to the actual Runtime- all because the system is over committed.
Grant
Darwin NT
ID: 109189
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1678
Credit: 17,776,441
RAC: 22,704
Message 109190 - Posted: 26 Apr 2024, 7:06:25 UTC - in response to Message 109183.  

I don't leave any cores/threads spare.
And you don't really need to as your non-BOINC CPU usage (or GPU usage requiring CPU support) is generally quite light, but still heavier than mine.
eg- My two systems
Run time 7 hours 30 min 42 sec
CPU time 7 hours 29 min 37 sec

Run time 6 hours 54 min 13 sec
CPU time 6 hours 52 min 53 sec
My CPU Time and Run times are very close, there is a slightly bigger difference on the system that i make use of daily. The other is a cruncher only (unless the main system dies, then i've got a spare to use).

Your two systems
Run time 12 hours 12 min 6 sec
CPU time 11 hours 59 min 57 sec

Run time 12 hours 20 min 23 sec
CPU time 11 hours 59 min 56 sec
A bigger gap between CPU time and Run time, but still not large. Which indicates the systems are getting some non-BOINC use, but not a lot. Or they're running a GPU application (BOINC or otherwise), that doesn't require very much CPU support.

And from one of the systems in the top 10 of the Top Hosts list.
Run time 8 hours 0 min 44 sec
CPU time 7 hours 59 min 51 sec
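The gap in each pair above can be expressed as a percentage of the CPU time. A minimal Python sketch, using two of the Run time/CPU time pairs listed above; the helper names are made up for illustration.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an hours/min/sec figure from the task page into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def overhead_percent(run_seconds: int, cpu_seconds: int) -> float:
    """Extra wall-clock time, as a percentage of the CPU time actually used."""
    return 100.0 * (run_seconds - cpu_seconds) / cpu_seconds


# First pair from each of the two sets of systems above.
light_use = overhead_percent(to_seconds(7, 30, 42), to_seconds(7, 29, 37))
heavier_use = overhead_percent(to_seconds(12, 12, 6), to_seconds(11, 59, 57))

if __name__ == "__main__":
    print(f"7h30m42s vs 7h29m37s: {light_use:.2f}% overhead")
    print(f"12h12m6s vs 11h59m57s: {heavier_use:.2f}% overhead")
```

Both come out at well under 2%, compared with the 100% or more seen when a core/thread is being shared with another heavy load.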




Occasionally I get WCG GPU tasks for Open Pandemics and genuinely don't notice any deterioration/extension of wall clock times as result.
There may be some, but not so that I notice.

1 It's a BOINC project, so it shares its time & processing with all the other BOINC projects.
2 It's a GPU application, and the amount of CPU support required can vary hugely between applications, from almost none at all, to needing a full CPU core/thread for each running GPU Task.



Other GPU tasks with other projects may be different, but I don't run them.
I understand the theoretical point. I just don't see any practical difference.
The practical difference is if a GPU application needs a full CPU core/thread for each running Task, and it has to share that core/thread with another Task being processed on the CPU, not only will the CPU processing times suffer, but the GPU output can tank massively.
I can't remember the actual numbers, but a GPU that can process a Task in 4 min with a full core/thread supporting it (if it needs it of course), may take 40min (or more) if it has to share that core/thread with another CPU heavy load.
Only doing 360 Tasks per day when over 3,600 is possible is a pretty poor choice to make.
That is the practical difference.
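A quick sketch of that arithmetic, taking the 4 min and 40 min figures at face value (they were given from memory). The `concurrent_tasks` parameter is an assumption added here: the 360 versus 3,600 per day figures only line up with those task times if about ten Tasks are in flight at once.

```python
MINUTES_PER_DAY = 24 * 60

def tasks_per_day(minutes_per_task, concurrent_tasks=1):
    """Whole Tasks completed in 24 hours at a fixed time per Task."""
    return concurrent_tasks * MINUTES_PER_DAY // minutes_per_task

dedicated = tasks_per_day(4)    # GPU Task has a core/thread to itself
shared = tasks_per_day(40)      # GPU Task shares the core/thread with CPU work

print(f"one at a time: {dedicated} vs {shared} Tasks per day")
print(f"ten at a time (assumed): {tasks_per_day(4, 10)} vs {tasks_per_day(40, 10)}")
```

Whatever the concurrency, the ratio is the same: a tenfold drop in output.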
Grant
Darwin NT
ID: 109190 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109191 - Posted: 26 Apr 2024, 10:59:21 UTC - in response to Message 109188.  

So that Boinc schedules the same as tasks run, I'd set Target CPU Runtime explicitly to 8hrs, not "Not Selected"
It doesn't work that way.
The project has hard coded 8 hours as the default Estimated time remaining, regardless of how long they run or what you have set your Target CPU time to.
As they run, the Estimated Completion time will go up or down as necessary till it eventually comes close to what the actual Runtime ends up being, but when the next Task starts, its Estimated Completion time will always be 8 hours.
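The behaviour described above can be sketched as a blend of the fixed initial estimate and a projection from progress so far. This is a simplified model under stated assumptions, not the BOINC client's actual code: the function name and the squared weighting are illustrative choices.

```python
def estimated_total_hours(initial_estimate_h, elapsed_h, fraction_done):
    """Blend the fixed initial estimate with a progress-based projection.

    Assumed model: the projection gets more weight as the Task progresses.
    """
    if fraction_done <= 0:
        return initial_estimate_h          # nothing to go on yet: 8 hours
    if fraction_done >= 1:
        return elapsed_h                   # finished: the real Runtime
    projected = elapsed_h / fraction_done  # what progress so far implies
    weight = fraction_done ** 2            # assumed weighting, for illustration
    return weight * projected + (1 - weight) * initial_estimate_h

# A Task that will really take 12 hours, starting from the fixed 8 hour estimate.
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    elapsed = 12 * fraction
    print(f"{fraction:.0%} done: estimate {estimated_total_hours(8, elapsed, fraction):.2f}h")
```

In this model the estimate only drifts towards the real figure while the Task runs; every Task still waiting in the cache sits at the initial 8 hours.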

I have to tell you, I'm absolutely amazed that you think Boinc scheduling being wrong by 50% one way or 2-300% the other way for the bulk of the time a task is processing - and 100% of the time it's sitting waiting in the cache - is no kind of problem, but losing the odd few seconds or minutes during processing is a big issue. (Talking about my PCs here).

Your standards compared to mine are on different planets.
ID: 109191 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,184,189
RAC: 10,001
Message 109192 - Posted: 26 Apr 2024, 11:42:28 UTC - in response to Message 109189.  

It may not be pretty in that processes are sharing cores, but imo no-one in their right mind cares which bit of a process of which task runs at what time as long as 1) the CPUs are being fully utilised and 2) tasks complete within deadline without further manual intervention.
No one in their right mind would think taking 12 hours to do 6 hours' work is good (which is double the time required- on another project it's taking them 4 times as long).
But that's not what's happening, is it.
Yes, YES, YES, that is exactly what is happening, as I pointed out in one of my earlier posts, that you even quoted in one of yours.

And the same issue is happening with your other projects.
Asteroids- 2hrs Runtime, 1hr CPU time.
SIdock- 31.5hrs Runtime, 27hrs 40min CPU time.
Denis- 3hr 40min Runtime, 1hr CPU time.

For reference- CPU time is the amount of time spent by the CPU processing the Task. Runtime- that is the time (think of a clock on the wall) it actually takes to process the Task. From the time it starts running, to the time it finishes & uploads the result.
So for Denis, on his system, Tasks that should take 1 hour to process actually take 3hrs 40min from the time they start to the time they end. 220min for something that should take 60min.
That is exactly what is happening.
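The slowdown for each of those projects is just Runtime divided by CPU time. A small sketch using the figures quoted above (the dictionary layout and the `slowdown` helper are made up for illustration):

```python
def slowdown(run_minutes, cpu_minutes):
    """How many times longer the Task took on the wall clock than on the CPU."""
    return run_minutes / cpu_minutes

# project: (Runtime in minutes, CPU time in minutes), from the list above
projects = {
    "Asteroids": (2 * 60, 60),
    "SIdock": (31.5 * 60, 27 * 60 + 40),
    "Denis": (3 * 60 + 40, 60),
}

for name, (run, cpu) in projects.items():
    print(f"{name}: {slowdown(run, cpu):.2f}x")
```

A factor of 1.00 would mean the Task had a core/thread to itself for its whole run.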

It's taking 12hrs to do 6hrs work because it's taking that same 12hrs to run 6hrs of non-Boinc work. So in 12hrs it's doing 6+6hrs work=12hrs.
This isn't a problem, because Adrian (in this case) said both projects are important to him.
It is a problem because that isn't what is happening. It is only taking 2 to 4 times as long to process BOINC Tasks, because they run at a very low priority.
Folding@home runs at a much higher priority, so it isn't affected in any way shape or form.

For Folding it would take 1 hour to do 1 hour's worth of work (ie CPU Time = Runtime), whereas here at BOINC it's taking anywhere from 31.5hrs to do 27hrs 40min of SIdock work, to twice as long to do Rosetta & Asteroids work, to almost 4 times as long to do Denis work.

First thing to say is I didn't appreciate Folding runs at a different (normal compared to low I assume) priority to Rosetta or other projects. I assumed they were all low priority.
But to hear folding runs at a higher priority - nominally 1 to 1 CPU to wallclock time - makes me think that's massively better than I thought.
Yes, Denis is particularly bad, Asteroids isn't great - but their tasks are very short so bygones - but Sidock looks pretty good by my standards in that context. If I was getting 1-to-1 for Folding on top of that, I'd be pretty happy.
On the proviso they all meet their respective deadlines.

Stopping the sharing of cores & threads will fix the actual problem, not just the symptoms.
First, I dispute the sharing of cores & threads is a) a problem and b) one that needs fixing.
You may dispute that, but it doesn't make it any less true.
And it needs fixing because the poster keeps complaining about it. If they don't complain about it, then no it doesn't need fixing.
It's not true at all because you're only counting Boinc work as work.
Not at all - Folding is work, watching movies on the computer is work, doing emails is work, editing photos is work, transcoding movies is work, and as far as the Computer is concerned- playing games is work, surfing the internet is work.
And each and every one of these things requires CPU time in order to do it.

In the case of the vast majority of those tasks the amount of CPU time they require is negligible.
In the case of Folding, and transcoding, it is not negligible & it requires a full core/thread (or more) to do. If it is being fully used by those programmes, then BOINC trying to make use of it as well will result in that BOINC work running much, much, much slower than it would if it wasn't sharing that core/thread (as the times I reposted above show- up to almost 4 times slower in some cases).
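One way to stop that sharing is to limit how many cores/threads BOINC may use, via the "Use at most N % of the CPUs" computing preference. The sketch below works out a percentage that leaves a chosen number of threads free for the non-BOINC load. The thread counts are hypothetical, and the assumption that the client rounds the resulting thread count down is mine, so check the result on the actual host.

```python
def boinc_cpu_percent(total_threads, reserved_threads):
    """Smallest whole percentage that still lets BOINC use every unreserved thread."""
    usable = total_threads - reserved_threads
    if usable < 1:
        raise ValueError("must leave at least one thread for BOINC")
    return (100 * usable + total_threads - 1) // total_threads  # integer ceiling

def threads_boinc_would_use(total_threads, percent):
    """Assumed client behaviour: round down, but never below one thread."""
    return max(1, total_threads * percent // 100)

# Hypothetical hosts: 16 threads with 4 kept for Folding, 12 threads with 1 kept.
for total, reserved in ((16, 4), (12, 1)):
    pct = boinc_cpu_percent(total, reserved)
    print(f"{total} threads, {reserved} reserved: set {pct}% "
          f"(BOINC uses {threads_boinc_would_use(total, pct)})")
```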

This is all self-evident. But you've missed out where the problem is.
All the things you've pointed out are things you've chosen to do.
And from the outset we all understand that Boinc runs in the gaps when we're not fully utilising our computers, not ever 100% of the time.
And if you chose to do one thing you're prioritising that over Boinc.
Personally I insist on that because if I ever got bogged down in writing or viewing or whatever I'd consider that a big problem.
So if I <only> got the "losses" in task processing time that you later point me to, the first thing I'd think is I'm wasting my time having a computer because I'm not doing anything with it but donating it to distributed computing.
Frankly, I'm not that rich nor generous.
I like distributed computing, but not that much. If I didn't already have a computer for my own needs, I wouldn't buy one to run Boinc (or non-Boinc) tasks.

Unfortunately (because it's boring) I'm going to have to spell it out.
No you don't.
I am fully aware of what the effect of having a fixed Initial Estimated Completion time is when the actual Runtime isn't necessarily anywhere near that.
What you don't seem to be appreciating is that by having the system overcommitted, there will still be situations where problems arise because the Estimated Completion time can never, ever match the actual Runtime, because the Runtime (how long it actually takes) is so far out from the CPU time (how long it should actually take).

Just look at Rosetta 4.20 Tasks. They take 8 hours - which is the default CPU Target time. But because the system is overcommitted it will take 16 hours to actually process that Task.
The Initial Estimated completion time & the actual CPU time match perfectly, but problems will still occur because neither of those times comes close to the actual Runtime- all because the system is overcommitted.
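A minimal sketch of why that matters for the cache. Only the 8 hour and 16 hour figures come from the post above; the cache size, thread count and deadline are hypothetical values picked to make the point, and the batch model (Tasks run `threads` at a time) is a simplification.

```python
import math

def wall_hours_to_clear(cached_tasks, threads, cpu_hours_per_task, slowdown):
    """Wall-clock hours to finish every cached Task, running `threads` at a time."""
    batches = math.ceil(cached_tasks / threads)
    return batches * cpu_hours_per_task * slowdown

CACHED_TASKS = 48     # hypothetical cache
THREADS = 16          # hypothetical host
DEADLINE_HOURS = 36   # hypothetical deadline

expected = wall_hours_to_clear(CACHED_TASKS, THREADS, 8, slowdown=1.0)
actual = wall_hours_to_clear(CACHED_TASKS, THREADS, 8, slowdown=2.0)

print(f"scheduler expects {expected:.0f}h, within deadline: {expected <= DEADLINE_HOURS}")
print(f"overcommitted host needs {actual:.0f}h, within deadline: {actual <= DEADLINE_HOURS}")
```

The estimate and the CPU time agree at 8 hours per Task, yet the cache still overruns because each Task occupies 16 hours of wall-clock time.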

I think our difference here is in your definition of over-committing.
There are two causes for this imo. One is over-scheduling (my issue) and the other is under-processing <compared to the ideal> (your issue).
The first is a Boinc issue, the second a user issue.
Rosetta actively misleads Boinc when using non-standard runtimes (and, we now discover, standard/default runtimes).
If that's what the user intended, fine but account for it in the way I said because Boinc is slow to adapt during processing and is prevented from adapting for the cache.
Regarding under-processing, first the user expects a certain amount of that (at my level) and it may only be a consequence of the conflict between Boinc and non-Boinc processing so CPUs are fully utilised.
Imo you only need to 'take a view' on that to ensure your offline cache is small enough to account for the discrepancy - even if it's a big one - because none of it matters, within tasks, as long as deadlines are met.
Any non-processing of particular tasks is taken up either by non-Boinc processing or what you've chosen to do as a user, all of which the user actively prioritises ahead of low-priority Boinc processing and by definition is not a problem.

That's my view. I'm clear it's not yours.
ID: 109192 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org