Message boards : Number crunching : out of work
Author | Message |
---|---|
Joined: 3 Nov 05 Posts: 1833 Credit: 120,009,519 RAC: 6,828 |
Are you getting work again now? As your machines are hidden, we can't see the results of the jobs returned, which makes working out the problem a bit more difficult. (If you unhide your computers, no one other than you can see their names. Have a look at mine and you'll see you only get limited info if they're not your machines.) HTH Danny |
Ananas Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
AFAIK, the daily quota is per CPU as long as you have up to 4 CPUs. So there is an absolute maximum for a machine, which is four times the daily quota per CPU (400 in the case of Rosetta), allowing about 400 hours of CPU time in the smallest WU setting. So an 8 CPU machine will receive work for about 50 hours per CPU per day in the small WU setting, and about 1200 hours per CPU per day in the largest WU setting. Even if the WUs are usually a little below the target time, that should keep that box busy :-) Damaged and lost results reduce the daily quota, so getting the "daily quota reached" message must be caused by some problem on the machine. |
SuperG //1.303.02% Joined: 4 May 06 Posts: 14 Credit: 1,561,763 RAC: 0 |
Thanks to doc, dcdc, and Ananas. Your comments helped determine root causes. I'll convey what happened so others may benefit... 1) With 8-core machines, set to a 24hr work unit and 2 days network connect, we wound up with 120 days (!!!) of work in the machine queue. This was true and consistent amongst all those 8-core machines. Not realizing the consequences (newbies to Rosetta), we reset to a lower CPU target time and more frequent network connect. And then committed suicide by manually aborting the processes which had not yet started... 2) The result (predictable to those who knew) was the daily quota problem. Otherwise known as "pilot error." That would be my fault. 3) Had tried "Reset project" but that did nothing to change the daily quota numbers. Considered "Detaching" from the project, then re-attaching later, and merging stats at another time. Finally decided to let things settle down overnight and see how things were in the AM. 4) All is back to normal now, machines being fed work, and results happily sending back to Rosetta servers. Again, thanks folks, you were a big help. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Thanks to doc, dcdc, and Ananas. Your comments helped determine root causes. You should have listened to Feet1st: https://boinc.bakerlab.org/forum_thread.php?id=2236#25957. ;-) It would be very interesting if you could unhide your hosts. As others pointed out, no information that would identify your hosts will be shown to other users, just the specs and OS plus the credits. :-) |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
OK, so now we know what happened. For every valid result you return, your daily WU quota doubles, so just crunch what you have, report it back, and you should have sufficient quota to keep you busy. For each WU that failed, your daily quota was reduced by 1. By default, the daily quota is 100. But, for obvious reasons, your quota cannot be any less than 1, and it will quickly recover to normal. In the short term, if you don't have all of your CPUs busy, you might set the WU runtime preference to an hour and get a few WUs reported. Another tip I use: while I'm tinkering, trying to get work downloaded or new WU runtimes established, you can select "no new tasks" on the Projects tab to prevent your machine from getting too much work based on short WU runtimes. The problem is always remembering to set it back when your estimated runtime is in line with your WU runtime preference. You can also select the option to suspend network activity. This is under the Activity tab. This is handy when you want to avoid getting any more WUs until you've completed the ones you have (to recognize their longer runtime, perhaps). Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
SuperG //1.303.02% Joined: 4 May 06 Posts: 14 Credit: 1,561,763 RAC: 0 |
Thanks Feet1st, tralala, doc, dcdc, and Ananas. Don't want to bore anyone, nor get too deeply into this, however.... ONLY the 4P/dual-core machines were affected. The 8P/duals, 8P/singles, 4P/singles, 2P/duals, 2P/singles were not. Nor did we make big changes to both WU settings and reconnect time, only the WU setting. I'm sure you see the problem... the General and Rosetta settings are universal, but the ridiculous work amount was only sent to the 4P/dual machines. Hence they were the only ones where we aborted un-started work, so they were the ones that got their quota cut, and so their problem. Once we left them alone for 12 hours, they got new work. Throughout the episode, all the other machines were kept busy 100% of the time. BTW - I do understand why folks would like our computers to be visible, but given the testing environment, it really can't happen for NDA reasons. And it is exactly the specs and OS that can't be visible. |
Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
I get this message. Is it time again?? Anders n |
Christoph Joined: 10 Dec 05 Posts: 57 Credit: 1,512,386 RAC: 0 |
I couldn't reach the server for a few hours. Now I'm getting this message too. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
The server status page says there are only 2 WUs "Ready to send". I suppose this is because all the new WUs are bombing out with download errors. |
Mod.Tymbrimi Volunteer moderator Joined: 22 Aug 06 Posts: 148 Credit: 153 RAC: 0 |
Passed this on to the Rosetta Team, so there's probably someone at the download server trying to teach the NIC how to speak Internet again. <attempt at humor> We should get a response here when they've tracked down the problem, and given us a batch of error free WUs to download. Rosetta Moderator: Mod.Tymbrimi ROSETTA@home FAQ Moderator Contact |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Passed this on to the Rosetta Team, so there's probably someone at the download server trying to teach the NIC how to speak Internet again. <attempt at humor> Probably because we just quickly plowed through all the tasks with errors ;-) Could you also pass on another error that they may not detect, since the tasks still return valid results: # random seed: 2214495 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 WARNING! error deleting file .aa1d5m.out ====================================================== DONE :: 1 starting structures built 36 (nstruct) times This process generated 36 decoys from 36 attempts 0 starting pdbs were skipped ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> Bold added to emphasize the error. It seems to be happening with all the results I've returned with the new 5.32 client (if they haven't had the file transfer error). What happened to Ralph testing? :-D Team mauisun.org |
Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
|
Chu Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
This is not a real error but more like a warning. The reason is that on the Windows platform, there is a problem removing the original source file after it is gzipped. We can probably turn this warning off in the next update to avoid confusion. The actual result is gzipped and validated correctly... Passed this on to the Rosetta Team, so there's probably someone at the download server trying to teach the NIC how to speak Internet again. <attempt at humor> |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
this is not a real error but more like a warning. The reason is that on Windows platform, there is a problem of removing the original source file after it is gzipped. We can probably turn this warning off in the next update and avoid confusion. The actual result is gzipped and validated correctly... Well, as long as the file isn't left behind (i.e. it is eventually deleted), all is OK. Team mauisun.org |
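
For anyone who wants to check the quota arithmetic from earlier in the thread, here is a minimal Python sketch. It assumes the numbers quoted in the posts above: a default daily quota of 100 per CPU (Feet1st), counted for at most 4 CPUs (Ananas), doubling on each valid result and dropping by 1 on each failed one, never below 1. Capping the doubled quota at the default value is an assumption, as are all the function and constant names; this is not the BOINC server code.

```python
# Sketch of the daily-quota arithmetic discussed in this thread.
# All names are hypothetical; the values come from the forum posts.

DEFAULT_QUOTA_PER_CPU = 100  # default daily quota, as quoted by Feet1st
MAX_CPUS_COUNTED = 4         # per Ananas, quota scales with CPUs only up to 4


def max_tasks_per_day(n_cpus, quota_per_cpu=DEFAULT_QUOTA_PER_CPU):
    """Upper bound on work units a host can be sent in one day."""
    return min(n_cpus, MAX_CPUS_COUNTED) * quota_per_cpu


def hours_per_cpu_per_day(n_cpus, wu_hours, quota_per_cpu=DEFAULT_QUOTA_PER_CPU):
    """Hours of work available to each CPU per day at a given WU runtime."""
    return max_tasks_per_day(n_cpus, quota_per_cpu) * wu_hours / n_cpus


def update_quota(quota, valid, default=DEFAULT_QUOTA_PER_CPU):
    """Apply one returned result to the per-CPU quota.

    A valid result doubles the quota (capped at the default, which is an
    assumption); a failed or aborted result lowers it by 1, floor of 1.
    """
    if valid:
        return min(default, quota * 2)
    return max(1, quota - 1)


def results_needed_to_recover(quota, default=DEFAULT_QUOTA_PER_CPU):
    """Count the valid results needed to climb back to the default quota."""
    count = 0
    while quota < default:
        quota = update_quota(quota, valid=True, default=default)
        count += 1
    return count


if __name__ == "__main__":
    # Ananas's example: an 8 CPU host with 1 hour and 24 hour work units.
    print(max_tasks_per_day(8))          # 400
    print(hours_per_cpu_per_day(8, 1))   # 50.0
    print(hours_per_cpu_per_day(8, 24))  # 1200.0

    # SuperG's case: many aborted tasks push the quota down to its floor.
    quota = DEFAULT_QUOTA_PER_CPU
    for _ in range(150):
        quota = update_quota(quota, valid=False)
    print(quota)                             # 1
    print(results_needed_to_recover(quota))  # 7
```

Under these assumptions, a host that has been knocked down to a quota of 1 needs only seven valid results to get back to 100, which fits Feet1st's advice to set a short WU runtime and report a few results.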
©2025 University of Washington
https://www.bakerlab.org