Message boards : Number crunching : Help us solve the 1% bug!
Author | Message |
---|---|
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
What is the size of your BOINC directory? How many days' worth of workunits do you have? Which projects are attached? Would you be willing to make a copy of the directory and, in the copy, abort all of the other workunits except the one that is stalling, then zip everything up and send it to me? ----- Rom My Blog |
genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 702,872 RAC: 1,035 |
What is the size of your BOINC directory? Hi Rom, My BOINC directory is 1.3GB. I am attached to CPDN (regular and seasonal), Rosetta, Ralph, Einstein, Seti, and Seti Beta. I currently have a CPDN seasonal and a CPDN sulphur WU, a ready-to-report Rosetta and the suspended Rosetta, a Seti Beta and an Einstein. I've set everything to "no new tasks" for now. Running BOINC CC 5.3.28. I keep a 0.1 day cache, so I don't have a lot of WU's around. I would not be happy to abort the CPDN WU's. I don't mind suspending everything for the time it takes to zip, etc., or aborting the other WU's. [edit] Wait a minute. Did I misunderstand you -- you mean abort the other WU's *after making a copy*, then send you that copy, then go about my merry way... sure, I'll do that. Please advise on where and how to send... [/edit] |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this |
genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 702,872 RAC: 1,035 |
Rom- I tried this: First I suspended network activity and work. I made a backup copy of my BOINC directory, then I restarted BOINC in its original directory. I aborted everything but the stuck Rosetta. I let the Rosetta go, and it passed the stuck point. I killed BOINC, then deleted everything from Program Files\BOINC. Copied back the contents of BOINC_backup, started up again. Unsuspended the stuck Rosetta, and it got stuck again. So, I have this backup copy of my BOINC directory where this Rosetta WU will stick, but it seems to require the other processes to be running. I can burn this backup to a DVD-R and send it to you, how about that? [edit] BTW, 4 at a time, dual Xeon with HT. [/edit] [edit] Going to sleep now, will check again in the AM...[/edit] |
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
|
Laurenu2 Send message Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this. Not so. Today I aborted 3 that were at 1% for 28 to 38 hours. Your self-abort is not working. I hope it at least sends you back data as to why it did not abort and why it got stuck. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Not so. Today I aborted 3 that were at 1% for 28 to 38 hours. Your self-abort is not working. I hope it at least sends you back data as to why it did not abort and why it got stuck. If you look at the WU ID page (NOT the result ID) it gives a creation date for the WU. What are the creation dates for those stuck WUs? The "All work units sent out since Friday" would refer to the creation date, not when you actually got the WU. Your computers are hidden, so I couldn't figure out which WUs you are talking about. |
dag Send message Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=12293043 https://boinc.bakerlab.org/rosetta/result.php?resultid=15133540 dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I have one stuck at 1% (7:50:25) that was created on the 25th (Sat) at 22:19 UTC... I suspect the fix is unfixed... (Oooops! I just noticed David's comment about 24 hours) Result ID: 14982362; Name: HB_BARCODE_30_5croA_351_22702_0; Workunit: 12161077; Created: 25 Mar 2006 22:19:20 UTC; Sent: 26 Mar 2006 8:33:29 UTC; Received: ---; Server state: In Progress; Outcome: Unknown; Client state: New; Exit status: 0 (0x0); Computer ID: 159713; Report deadline: 9 Apr 2006 8:33:29 UTC; CPU time: 0; stderr out: (empty); Validate state: Initial |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I have one stuck at 1% (7:50:25) that was created on the 25th (Sat) at 22:19 UTC... I suspect the fix is unfixed... Jobs beginning HB_BARCODE... were queued before we reduced the maximum CPU time, and we can't change the time limit retroactively. If you are having a lot of trouble with stuck WUs, you can delete these work units. |
Pappateam Send message Joined: 9 Jan 06 Posts: 2 Credit: 1,610,324 RAC: 0 |
Still having WU's stuck every day at 1%. Computers range from Duron800 to T2300 (most of them are AMD) and there is no difference between them. Sometimes I notice the problem after about 50 hours, so this problem is very bad. Is there a solution on the horizon? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Still having WU's stuck every day at 1%. Computers range from Duron800 to T2300 (most of them are AMD) and there is no difference between them. Sometimes I notice the problem after about 50 hours, so this problem is very bad. The new work units should not be getting stuck at 1%. Could you try removing all pre-4.83 (on Windows) work units and let us know what happens? |
[DPC]TeamHC~LostPoints Send message Joined: 19 Mar 06 Posts: 1 Credit: 272,665 RAC: 0 |
Got the same 1% problem over here. Killing the WU didn't help; the next one also got the 1% problem. Then I reset the project. (I have a Dutch version, so I don't know the exact English name for the button.) After resetting the project, all WU's were deleted and new ones were downloaded. Now the system runs perfectly, and since then no 1% errors have occurred. |
Pappateam Send message Joined: 9 Jan 06 Posts: 2 Credit: 1,610,324 RAC: 0 |
The new work units should not be getting stuck at 1%. Could you try removing all pre-4.83 (on Windows) work units and let us know what happens? This really seems to have solved the problem! Big thanks David! |
Osku87 Send message Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0 |
Nicely done, except there is one little flaw. It may be called the 1.042% bug. (The last number can be found in the graphics.) The WU stopped after about fifteen minutes of crunching. Rebooting the client or suspending and resuming the WU doesn't help. Now aborting. There went 9 hours of crunching... Stage: Full atom relax Model: 1 Step: 320044 Program version is 4.83 https://boinc.bakerlab.org/rosetta/result.php?resultid=16235196 Hope this was the only one. |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
The 042 in the 1.042% is supposed to give the programmers a much better idea of where the program is getting stuck. But there's a few other numbers being passed around - so .042 may not be the only sticking point. By reporting the whole number of where the WU was stuck, they'll hopefully kill off the last traces of this bug. |
Corgi Send message Joined: 17 Oct 05 Posts: 2 Credit: 389,209 RAC: 0 |
I've got another one. Here's a copy of all the text on the BOINC display, plus the URL of a screenshot of the same. When I ran the test from the command prompt, it stopped at exactly the same point -- 39 min+ so far at time of this posting. FA_RLXpt_hom006_1ptq__361_426_1 (left in memory) ------------------------ 1.042% Complete CPU time: 9 hr 13 min 58 sec Corgi - Total credit: 1064.71 - RAC: 16.7777 GasBuddy Stage: Full atom relax Model: 1 Step: 314653 Accepted RMSD: 10.78 Accepted Energy: -51.5163 Rosetta@home v4.83 [URL] Screenshot: http://pics.livejournal.com/sff_corgi/pic/000k21q6 (39.6Kb) ------------------------ PC ID: 23940 'Sothis' GenuineIntel Intel(R) Pentium(R) M processor 1500MHz Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00) Corgi |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
I've got another one. Here's a copy of all the text on the BOINC display, plus the URL of a screenshot of the same. When I ran the test from the command prompt, it stopped at exactly the same point -- 39 min+ so far at time of this posting. Apparently this is one of the "old" pre-4.83 WUs (its date is 22-Mar-06) which obviously has a problem, as it failed on another PC: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11819500 I would just abort it. PS: AFAIK, the only info needed when reporting a stuck WU, is just WU number e.g. #11819500 in this case (or just its name). If you just abort it, the project will also know the random-seed (it shows in stderr.txt output in resultid) Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Mike Send message Joined: 21 Dec 05 Posts: 9 Credit: 35,252 RAC: 0 |
Hi All. I have a 2.4 gb pc with 256mb of ram. Running Windows XP Home with SP2. I have had no failures since I turned off all screen savers (I turn the display off) and leave unfinished WU in memory (i.e. Hard drive.) I run Rosetta,Seti and Predictor. No failures since 17/03/06. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
PS: AFAIK, the only info needed when reporting a stuck WU, is just WU number e.g. #11819500 in this case (or just its name). If you just abort it, the project will also know the random-seed (it shows in stderr.txt output in resultid) I believe they would like to know the exact percentage complete that the WU was stuck at. |
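
Several posts above describe the same symptom: the progress figure freezes at a fixed value (1%, or 1.042% in the graphics) while CPU time keeps climbing, and the exact percentage is what the developers ask for. Below is a minimal sketch of how one might spot that from progress readings noted down by hand or by a script. It is only an illustration: `Sample`, `stuck_percentage` and the 2-hour threshold are hypothetical names and values, not part of BOINC or Rosetta@home.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One progress reading for a single work unit (hypothetical record)."""
    elapsed_hours: float  # hours since the task started, as noted by the user
    percent_done: float   # progress as displayed by the client, e.g. 1.042


def stuck_percentage(samples: List[Sample],
                     min_stall_hours: float = 2.0) -> Optional[float]:
    """Return the percentage a task appears stuck at, or None.

    A task counts as stuck when its most recent readings all show the
    same percentage over at least `min_stall_hours` (an assumed threshold).
    """
    if not samples:
        return None
    last = samples[-1]
    stall_start = last.elapsed_hours
    # Walk backwards while the displayed percentage is unchanged.
    # Exact comparison is fine here because the values are copied
    # from the same display, not computed.
    for s in reversed(samples):
        if s.percent_done != last.percent_done:
            break
        stall_start = s.elapsed_hours
    if last.elapsed_hours - stall_start >= min_stall_hours:
        return last.percent_done
    return None


# Example: readings similar to the 1.042% reports in this thread.
readings = [Sample(0.1, 0.5), Sample(0.25, 1.042),
            Sample(4.0, 1.042), Sample(9.2, 1.042)]
print(stuck_percentage(readings))
```

If this returns a value, that exact figure (together with the WU number or name) is the thing to include in a report.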
©2024 University of Washington
https://www.bakerlab.org