out of work

dcdc
Joined: 3 Nov 05
Posts: 1833
Credit: 120,009,519
RAC: 6,828
Message 26074 - Posted: 5 Sep 2006, 6:48:30 UTC

Are you getting work again now?

As your machines are hidden we can't see the results of the jobs returned so it makes working out the problem a bit more difficult. (if you unhide your computers then no-one can see their names other than you - have a look at mine and you'll see you only get limited info if they're not your machines)

HTH
Danny

Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 26075 - Posted: 5 Sep 2006, 7:22:59 UTC
Last modified: 5 Sep 2006, 7:25:00 UTC

AFAIK, the daily quota is per CPU as long as you have up to 4 CPUs. So there is an absolute maximum for a machine, which is four times the daily quota per CPU (400 in the case of Rosetta), allowing about 400 hours of CPU time in the smallest WU setting.

So an 8 CPU machine will receive work for about 50 hours per CPU per day in the small WU setting, and about 1200 hours per CPU per day in the largest WU setting. Even if the WUs are usually a little below the target time, that should keep that box busy :-)

Damaged and lost results reduce the daily quota, so getting the "daily quota reached" message must be caused by some problem on the machine.
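
For anyone who wants to check the arithmetic above, here is a rough sketch in Python. It assumes the smallest WU setting is 1 hour and the largest is 24 hours (inferred from the figures in this thread), and the function name and parameters are made up for illustration; this is not how the scheduler itself is written.

```python
def daily_cpu_hours(cpus, wu_hours, quota_per_cpu=100, max_counted_cpus=4):
    """Estimate how much work a host can fetch per day under the quota.

    quota_per_cpu=100 and max_counted_cpus=4 follow the numbers quoted
    in the post; wu_hours is the target runtime per work unit (assumed).
    """
    # The quota scales with the CPU count only up to 4 CPUs.
    max_wus = quota_per_cpu * min(cpus, max_counted_cpus)
    total_hours = max_wus * wu_hours
    # Spread the fetched work across every CPU in the box.
    hours_per_cpu = total_hours / cpus
    return max_wus, total_hours, hours_per_cpu


# 8 CPU machine, smallest (1 h) and largest (24 h) WU settings
print(daily_cpu_hours(8, 1))
print(daily_cpu_hours(8, 24))
```

Since a day only has 24 hours per CPU, either setting leaves a healthy 8 CPU box with more work available than it can crunch.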

SuperG //1.303.02%
Joined: 4 May 06
Posts: 14
Credit: 1,561,763
RAC: 0
Message 26118 - Posted: 5 Sep 2006, 17:52:58 UTC - in response to Message 26075.  

Thanks to doc, dcdc, and Ananas. Your comments helped determine root causes.
I'll convey what happened so others may benefit...

1) With 8core machines, set to 24hr work unit, and 2 days network connect,
we wound up with 120 days (!!!) of work in the machine queue. This was true
and consistent amongst all those 8core machines. Not realizing the
consequences (newbies to Rosetta), we reset to lower cpu target time, and
more frequent network connect. And then committed suicide by manually
aborting the processes which had not yet started...

2) The result (predictable to those who knew) was the daily quota problem.
Otherwise known as "pilot error." That would be my fault.

3) Had tried "Reset project" but that did nothing to change the daily
quota numbers. Considered "Detaching" from project, then re-attaching
later, and merging stats at another time. Finally decided to let things
settle down overnight and see how things were in the AM.

4) All is back to normal now, machines being fed work, and results happily
sending back to Rosetta servers.

Again thanks folks, you were a big help.

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 26120 - Posted: 5 Sep 2006, 18:19:04 UTC - in response to Message 26118.  

You should have listened to Feet1st:
https://boinc.bakerlab.org/forum_thread.php?id=2236#25957. ;-)

It would be very interesting if you could unhide your hosts. As others pointed out, no information that would identify your hosts is shown to other users, just the specs and OS plus the credits. :-)

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 26126 - Posted: 5 Sep 2006, 20:05:47 UTC
Last modified: 5 Sep 2006, 20:06:04 UTC

OK, so now we know what happened. For every valid result you return your daily WU quota doubles, so just crunch what you have, report it back and you should have sufficient quota to keep you busy.

For each WU that failed, your daily quota was reduced by 1. So, by default, the daily quota is 100. But, for obvious reasons, your quota cannot be any less than 1, and it will quickly recover normally.
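
To see how fast the quota comes back, here is a small Python sketch of the rule as described in this post: double on each valid result, minus 1 on each failure, never below 1. Capping the quota at the default of 100 is an assumption on my part, and the names are invented for the example; the real server code may differ.

```python
DEFAULT_QUOTA = 100  # default daily quota quoted in the post
MIN_QUOTA = 1        # the quota cannot drop below this


def update_quota(quota, valid):
    """Apply one returned result to the daily quota."""
    if valid:
        # Assumption: doubling stops at the default quota.
        return min(quota * 2, DEFAULT_QUOTA)
    return max(quota - 1, MIN_QUOTA)


def results_to_recover(quota):
    """Count the valid results needed to get back to the default."""
    count = 0
    while quota < DEFAULT_QUOTA:
        quota = update_quota(quota, True)
        count += 1
    return count


# Aborting a big batch of unstarted WUs drives the quota to the floor...
quota = DEFAULT_QUOTA
for _ in range(150):
    quota = update_quota(quota, False)
print(quota)

# ...but only a handful of valid results are needed to restore it.
print(results_to_recover(quota))
```

Under these assumptions the doubling means even a quota of 1 is back at the default after a few good results, which fits what SuperG saw overnight.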

In the short term, if you don't have all of your CPUs busy, you might set the WU runtime preference to an hour and get a few WUs reported.

Another tip: while I'm tinkering, trying to get work downloaded or new WU runtimes established, I select "no new tasks" on the Projects tab to prevent the machine from getting too much work based on short WU runtimes. The problem is always remembering to set it back when your estimated runtime is in line with your WU runtime preference. You can also select the option to suspend network activity, which is under the Activity tab. This is handy when you want to avoid getting any more WUs until you've completed the ones you have (to recognize their longer runtime perhaps).
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

SuperG //1.303.02%
Joined: 4 May 06
Posts: 14
Credit: 1,561,763
RAC: 0
Message 26179 - Posted: 6 Sep 2006, 14:13:56 UTC - in response to Message 26126.  

Thanks Feet1st, tralala, doc, dcdc, and Ananas.

Don't want to bore anyone, nor get too deeply into this, however.... ONLY the 4P/dual-core machines were affected. The 8P/duals, 8P/singles, 4P/singles, 2P/duals, 2P/singles were not. Nor did we make big changes to both WU settings and reconnect time, only the WU setting.

I'm sure you see the problem... the General and Rosetta settings are universal, but the ridiculous work amount was only sent to the 4P/dual machines. Hence they were the only ones where we aborted un-started work, so they were the ones that got their quota cut, and so their problem. Once we left them alone for 12 hours, they got new work. Throughout the episode, all the other machines were kept busy 100% of the time.

BTW - I do understand why folks would like for our computers to be visible, but given the testing environment, really can't happen for NDA reasons. And it is exactly the specs and OS that can't be visible.

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 29178 - Posted: 11 Oct 2006, 17:19:21 UTC - in response to Message 25828.  

I get this message.

2006-09-01 16:40:33|rosetta@home|No work from project

Anybody?

Anders n



Is it time again ??

Anders n

Christoph
Joined: 10 Dec 05
Posts: 57
Credit: 1,512,386
RAC: 0
Message 29181 - Posted: 11 Oct 2006, 17:26:12 UTC

I couldn't reach the server for a few hours. Now I'm getting this message too.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 29182 - Posted: 11 Oct 2006, 17:33:13 UTC

The server status page says there are only 2 WUs "Ready to send". I suppose this is because all the new WUs are bombing out with download errors.

Mod.Tymbrimi
Volunteer moderator
Joined: 22 Aug 06
Posts: 148
Credit: 153
RAC: 0
Message 29188 - Posted: 11 Oct 2006, 17:52:17 UTC

Passed this on to the Rosetta Team, so there's probably someone at the download server trying to teach the NIC how to speak Internet again. <attempt at humor>

We should get a response here when they've tracked down the problem, and given us a batch of error free WUs to download.
Rosetta Moderator: Mod.Tymbrimi

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29190 - Posted: 11 Oct 2006, 18:16:35 UTC - in response to Message 29188.  

Probably because we just quickly plowed through all the tasks with errors ;-)

Could you also pass on another error that they may not detect, since the tasks return valid results:

# random seed: 2214495
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
WARNING! error deleting file .aa1d5m.out
======================================================
DONE :: 1 starting structures built 36 (nstruct) times
This process generated 36 decoys from 36 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>



Bold added to emphasise the error. It seems to be happening with all the results I've returned with the new 5.32 client (if they haven't had the file transfer error).

What happened to Ralph testing :-D
Team mauisun.org

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 29193 - Posted: 11 Oct 2006, 19:22:40 UTC

Please see this post.

The warning below can be ignored.

Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 29199 - Posted: 11 Oct 2006, 20:51:40 UTC - in response to Message 29190.  

This is not a real error but more like a warning. The reason is that on the Windows platform there is a problem removing the original source file after it is gzipped. We can probably turn this warning off in the next update and avoid confusion. The actual result is gzipped and validated correctly...
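
As a rough illustration of the pattern described above (compress the output, then try to remove the original, and only warn if the removal fails), here is a minimal Python sketch. It is not the Rosetta code; the function name is hypothetical, and the idea that the delete fails because something still holds the file open is only an assumption about one way this can happen on Windows.

```python
import gzip
import os
import shutil
import sys


def gzip_and_remove(path):
    """Compress path to path + '.gz', then try to delete the original."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    try:
        os.remove(path)
    except OSError:
        # The compressed result is already complete at this point, so a
        # failed delete is reported as a warning and not treated as fatal.
        print(f"WARNING! error deleting file {path}", file=sys.stderr)
    return gz_path
```

The point of the design is that the upload only depends on the .gz file, so a leftover source file does not affect whether the result validates.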

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29236 - Posted: 12 Oct 2006, 12:26:54 UTC - in response to Message 29199.  

Well, as long as the file isn't left behind (i.e. is eventually deleted), all is OK.
Team mauisun.org