Message boards : Number crunching : Issues with 4.82
Author | Message |
---|---|
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
The increased frequency of problems with version 4.82, we think, is probably due to the increased average work unit run time. If a significant fraction of your work units are having problems, please reduce the target run time to two hours (the default is currently 8 hours)--this should reduce the chance of an error during the run by a factor of four. We will also reduce the default target time to four hours. On RALPH we didn't see these problems, probably because the default time was set to one hour so we could get test results back quickly. David Kim is working hard to get stack tracing implemented so we can eliminate the sources of the errors as soon as possible. This is our number one priority. On the few Windows machines we have locally, we have seen almost no errors with 4.82. Of course, it would be very useful to know what machine configurations are most correlated with errors. For example, perhaps optimized clients could be more likely to have problems? It would be very useful if people who read this could briefly describe their machines and the fraction of work units that are having problems--hopefully patterns will emerge which will help isolate the problems. |
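The "factor of four" estimate above can be checked with a small sketch. It assumes errors strike at a constant rate per hour of run time, independent of how long the work unit has already been running; that assumption, the function names, and the example rates are illustrative and do not come from the project.

```python
import math


def failure_probability(rate_per_hour, run_hours):
    """Chance of at least one error in a run, assuming a constant hourly error rate."""
    return 1.0 - math.exp(-rate_per_hour * run_hours)


def improvement_factor(rate_per_hour, long_hours=8.0, short_hours=2.0):
    """How many times less likely a short run is to fail than a long run."""
    long_run = failure_probability(rate_per_hour, long_hours)
    short_run = failure_probability(rate_per_hour, short_hours)
    return long_run / short_run


if __name__ == "__main__":
    # Hypothetical hourly error rates, from rare to frequent.
    for rate in (0.001, 0.01, 0.1):
        print(f"rate={rate}/h  8h vs 2h factor={improvement_factor(rate):.2f}")
```

Under this model the factor approaches four when errors are rare and falls below four as the hourly error rate rises. The figure is per work unit; the number of errors per hour of crunching is unchanged, but each error costs less lost work.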
[B@H] Ray Send message Joined: 20 Sep 05 Posts: 118 Credit: 100,251 RAC: 0 |
Dave, no problems yet over here. Two finished in about 8 hours each. Pizza@Home Rays Place Rays place Forums |
Nothing But Idle Time Send message Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
...we will also reduce the default target time to four hours. I haven't encountered any errors -- yet -- and would like to leave the run time at 8 hours but don't see any option to select 8 hours, per se. On the few windows machines we have locally, we have seen almost no errors with 4.82. For what it's worth: My machine is 3GHz/HT Windows XP with no WU errors so far (fingers crossed). In fact, I rarely encounter any error on any WU on any project (except for ghost WUs). Makes me wonder if my machine setup is super good (yeah, right!) or the problems are more related to individual computer setups like networking or over-clocking, or using machines at near minimum requirements. My setup is straight out of the box from Dell and I don't tamper with it (don't know how anyway), though I did increase the memory considerably because I could afford it and I like big safety margins even if it's wasted. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
From here you can change your "target CPU time". The default (i.e., not selected) is 8 hours. |
arklms Send message Joined: 17 Dec 05 Posts: 7 Credit: 177,488 RAC: 0 |
The workunits I have had error have all crashed within the first few seconds of startup. Most of these have occurred on a P3 Coppermine 667MHz, but I've seen the occasional one on a Dual AMD setup too. |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
The workunits I have had error have all crashed within the first few seconds of startup. Most of these have occurred on a P3 Coppermine 667MHz, but I've seen the occasional one on a Dual AMD setup too. See this post by David Kim. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
...we will also reduce the default target time to four hours. Thanks! Collectively we should be able to figure out what machine configurations are having the most errors and then track down the problems. Please post any ideas and success rates on your machines (good and bad). |
Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
This one has bombed out twice.... |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
This one has bombed out twice.... Yes, but it won't do so again because the batch (*fullatom*318*) has been cancelled by David Kim (see my link to his post further down in this thread). |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I have an Athlon XP 2400 with 512MB and Win 2000 sp2 with no errors (has crunched about a dozen 8-hour WUs). https://boinc.bakerlab.org/rosetta/results.php?hostid=109566 It does have an occasional ghost WU. When this happened the message log said:
Wed Feb 22 00:11:57 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
Wed Feb 22 00:11:57 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Wed Feb 22 00:11:57 2006|rosetta@home|Reason: To fetch work
Wed Feb 22 00:11:57 2006|rosetta@home|Requesting 17537 seconds of new work
Wed Feb 22 00:12:04 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
Wed Feb 22 00:12:04 2006|rosetta@home|Throughput 10382 bytes/sec
Wed Feb 22 00:14:03 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi failed with a return value of 500
Wed Feb 22 00:14:03 2006|rosetta@home|No schedulers responded
Wed Feb 22 00:15:03 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Wed Feb 22 00:15:03 2006|rosetta@home|Reason: To fetch work
Wed Feb 22 00:15:03 2006|rosetta@home|Requesting 17352 seconds of new work, and reporting 1 results
Wed Feb 22 00:15:08 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
Wed Feb 22 00:15:08 2006|rosetta@home|Message from server: Not sending work - last RPC too recent: 185 sec
Wed Feb 22 00:15:08 2006|rosetta@home|No work from project
The above was for WU https://boinc.bakerlab.org/rosetta/result.php?resultid=11872367
With an earlier ghost the message log said:
Sat Feb 18 04:18:26 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1cei__312_341_0_0
Sat Feb 18 04:18:26 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Sat Feb 18 04:18:26 2006|rosetta@home|Reason: To fetch work
Sat Feb 18 04:18:26 2006|rosetta@home|Requesting 2286 seconds of new work
Sat Feb 18 04:18:31 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
Sat Feb 18 04:18:31 2006|rosetta@home|Message from server: Server can't open database
Sat Feb 18 04:18:31 2006|rosetta@home|Project is down
Sat Feb 18 04:18:32 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1cei__312_341_0_0
Sat Feb 18 04:18:32 2006|rosetta@home|Throughput 10921 bytes/sec
Sat Feb 18 05:18:32 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Sat Feb 18 05:18:32 2006|rosetta@home|Reason: To fetch work
Sat Feb 18 05:18:32 2006|rosetta@home|Requesting 6523 seconds of new work, and reporting 1 results
Sat Feb 18 05:22:02 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi failed with a return value of 500
Sat Feb 18 05:22:02 2006|rosetta@home|No schedulers responded
That was for WU https://boinc.bakerlab.org/rosetta/result.php?resultid=11663100
There is no sign that the files for these ghost WUs were ever downloaded. Perhaps there can be a problem if work is requested when an upload is in progress? I have a bunch of Linux machines that haven't had any problems, but they seem to be using 4.81. |
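One way to test the guess that ghost WUs appear when work is requested during an upload is to scan a saved message log for scheduler requests sent between a "Started upload" and the matching "Finished upload". The sketch below is only an illustration: it assumes the `timestamp|project|message` layout seen in the log excerpts above and English day and month names, and the function names are made up for this example rather than taken from any BOINC tool.

```python
from datetime import datetime

STARTED = "Started upload of "
FINISHED = "Finished upload of "
REQUEST = "Sending scheduler request"


def parse_line(line):
    """Split one 'timestamp|project|message' log line into its three fields."""
    stamp, project, message = line.strip().split("|", 2)
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return when, project, message


def requests_during_upload(lines):
    """Return the timestamps of scheduler requests sent while an upload was open."""
    open_uploads = set()
    hits = []
    for line in lines:
        when, _project, message = parse_line(line)
        if message.startswith(STARTED):
            open_uploads.add(message[len(STARTED):])
        elif message.startswith(FINISHED):
            open_uploads.discard(message[len(FINISHED):])
        elif message.startswith(REQUEST) and open_uploads:
            hits.append(when)
    return hits


# A few lines copied from the Feb 22 excerpt above.
SAMPLE = [
    "Wed Feb 22 00:11:57 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0",
    "Wed Feb 22 00:11:57 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi",
    "Wed Feb 22 00:12:04 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0",
    "Wed Feb 22 00:15:03 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi",
]

if __name__ == "__main__":
    for when in requests_during_upload(SAMPLE):
        print("scheduler request during upload at", when)
```

In both excerpts the first scheduler request goes out in the same second the upload starts, which fits the guess, but two cases show a correlation at most. Running something like this over logs from several hosts would show whether every ghost WU follows the same pattern.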
uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0 |
The increased frequency of problems with version 4.82 we think is probably due to the increased average work unit run time. If a significant fraction of your work units are having problems, please reduce the target run time to two hours (the default is currently 8 hours)--this should reduce the chance of an error during the run by a factor of four. we will also reduce the default target time to four hours. on RALPH we didn't see these problems probably because the default time was set to one hour so we could get test results back quickly. Not an error exactly, but a complaint: I just got my first batch of the new 4.82 work units, which will take vastly more time than BOINC requested. I'm not clear on how the scheduler decides how many work units will fulfill a request for x seconds of work, but apparently this was not adjusted with the new search mode (requested 48 hours of work, received possibly 120). |
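The over-fetch described above can be reproduced with simple arithmetic. The thread does not say how the scheduler actually sizes a reply, so the sketch below assumes a deliberately simple model: the server sends enough work units to cover the request using a per-unit run time estimate, and that estimate is stale. The 3.2-hour estimate is hypothetical, picked only because it reproduces the "48 requested, about 120 received" figures; the function names are illustrative too.

```python
def work_units_sent(requested_seconds, estimated_seconds_per_wu):
    """Units needed to cover the request, rounding up (integer ceiling division)."""
    return -(-requested_seconds // estimated_seconds_per_wu)


def actual_queue_seconds(requested_seconds, estimated_seconds_per_wu, actual_seconds_per_wu):
    """Real run time of the queue when each unit takes longer than estimated."""
    count = work_units_sent(requested_seconds, estimated_seconds_per_wu)
    return count * actual_seconds_per_wu


if __name__ == "__main__":
    requested = 48 * 3600   # 48 hours of work requested
    estimate = 11520        # hypothetical stale estimate: 3.2 hours per unit
    actual = 8 * 3600       # units that really run for the 8-hour default
    total = actual_queue_seconds(requested, estimate, actual)
    print(work_units_sent(requested, estimate), "units,", total // 3600, "hours of real work")
```

If something like this model holds, the mismatch should shrink once the client has finished a few of the longer units and corrected its run time estimate.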
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
The increased frequency of problems with version 4.82 we think is probably due to the increased average work unit run time. If a significant fraction of your work units are having problems, please reduce the target run time to two hours (the default is currently 8 hours)--this should reduce the chance of an error during the run by a factor of four. we will also reduce the default target time to four hours. on RALPH we didn't see these problems probably because the default time was set to one hour so we could get test results back quickly. You can adjust the run length of the WUs in your preferences. See this post in the Rosetta FAQs for details. Moderator9 ROSETTA@home FAQ Moderator Contact |
genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 694,934 RAC: 555 |
Just had another WU fail on this machine: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=13228 but this time it was with a Ralph WU. On Ralph the link to the machine is this: http://ralph.bakerlab.org/show_host_detail.php?hostid=953 The failure was exactly the same as with the three 4.82 WU's that failed earlier. BTW the last one I got completed successfully. A little bit of log around the error:
The machine is a dual P3 1GHz with 1GB of ram, running WinXP SP2. Running with "Leave in Memory" = YES, several other BOINC projects, ... did I forget anything? Oh yes, BOINC CC 5.2.15. Failed Rosetta WU's: https://boinc.bakerlab.org/rosetta/result.php?resultid=11823719 https://boinc.bakerlab.org/rosetta/result.php?resultid=11805479 https://boinc.bakerlab.org/rosetta/result.php?resultid=11796212 Failed Ralph WU: http://ralph.bakerlab.org/result.php?resultid=6153 |
Beezlebub Send message Joined: 18 Oct 05 Posts: 40 Credit: 260,375 RAC: 0 |
I am running 3 machines (1 P4 2ghz, 1 AMD 2800 2.1ghz, 1 P D820 2.8ghz) on Rosetta, Seti, and Einstein, all running 24/7. With 91 total Rosetta results only 2 were unsuccessful due to download problems (server I think). I believe the people with multiple failures need to check out their hardware and stability (overclocking, heat, antivirus, etc.) BEFORE complaining about the program itself. I'm not attacking anyone, I'm just pointing out the fact that my mix of machines and others who posted "no problems" point to something OTHER than the 4.82 being the problem. My machine specs: "Black" P4 2ghz 1024mb PC2700 DDR Foxcon mb. "Boinc" AMD 2800+ 2.1ghz 1024mb Kingmax PC3200 DDR. "TxEagle" Pentium D820 dual core 2.8ghz. 1gig OCZ Gold PC2-5400 DDR2 dual channel, ASUS P5LD2-VM MB. All are running XP Pro Sp2 e6600 quad @ 2.5ghz 2418 floating point 5227 integer e6750 dual @ 3.71ghz 3598 floating point 7918 integer |
uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0 |
Ah, I hadn't realized that you could do that for units already downloaded. Still, that wouldn't have fixed the problem I experienced. At best, I can set my requested time back to two, and try to set it back before I start my last result so that I can correct the time correction factor for next time. Thanks for your help. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
See if this post helps you at all. Moderator9 ROSETTA@home FAQ Moderator Contact |
uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0 |
That post was definitely helpful. I think a new FAQ entry or stickied thread specifically about the effects you will see before the BOINC clients adjust to the new units would be useful. What I posted here was an attempt, but I haven't worked through my queue yet, so I don't even know that it's correct. The parts I'm specifically referring to are: that this is a temporary problem; that it can be mitigated temporarily by doing X; that it will happen every time you download work units after increasing proc time; and that aborting all your units won't fix the problem until you have completed at least a few of the new units. Please correct me if I've got something wrong, I'd hate to be disseminating wrong info. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 5,737 |
I had one PC running an 'optimised' client and it started getting lots of errors. I swapped boinc.exe for the standard one from one of my other PCs and the errors seem to have stopped. Anything reported after the 24th Feb has been run using the standard updated client: https://boinc.bakerlab.org/rosetta/results.php?hostid=53007 Might be coincidence, or another unrelated problem, but it might be the problem with a number of the clients out there... |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I have added most of the content of my original post to the FAQ on adjusting the time, and modified a small portion to include some of the items you mentioned. I will have to work on it a little more to add more text on some of your points. Good suggestions, thank you. Moderator9 ROSETTA@home FAQ Moderator Contact |
©2024 University of Washington
https://www.bakerlab.org