Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

Rom Walton (BOINC)
Volunteer moderator
Project developer

Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12747 - Posted: 28 Mar 2006, 5:05:26 UTC

Contact me offline and I'll let you know where to send it.

----- Rom

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12748 - Posted: 28 Mar 2006, 7:33:48 UTC - in response to Message 12743.  

"All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this."

Not so. Today I aborted 3 that had been at 1% for 28 to 38 hours. Your self-abort is not working. I hope it at least sends you back data on why it did not abort and why it got stuck.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 12760 - Posted: 28 Mar 2006, 15:28:01 UTC - in response to Message 12748.  

If you look at the WU ID page (NOT the result ID) it gives a creation date for the WU. What are the creation dates for those stuck WUs? The "All work units sent out since Friday" would refer to the creation date, not when you actually got the WU.

Your computers are hidden, so I couldn't figure out which WUs you are talking about.

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12763 - Posted: 28 Mar 2006, 18:06:21 UTC
Last modified: 28 Mar 2006, 18:07:41 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=12293043
https://boinc.bakerlab.org/rosetta/result.php?resultid=15133540

dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12764 - Posted: 28 Mar 2006, 18:57:35 UTC
Last modified: 28 Mar 2006, 19:04:08 UTC

I have one stuck at 1% (7:50:25) that was created on the 25th (Sat) at 22:19 UTC... I suspect the fix is unfixed...

(Oooops! I just noticed David's comment about 24 hours)

Result ID 14982362
Name HB_BARCODE_30_5croA_351_22702_0
Workunit 12161077
Created 25 Mar 2006 22:19:20 UTC
Sent 26 Mar 2006 8:33:29 UTC
Received ---
Server state In Progress
Outcome Unknown
Client state New
Exit status 0 (0x0)
Computer ID 159713
Report deadline 9 Apr 2006 8:33:29 UTC
CPU time 0
stderr out

Validate state Initial
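
For anyone who wants to compare the Created and Sent times of a result like the one above, here is a minimal Python sketch. It assumes the listing is pasted as plain "Label value" lines; the field labels are copied from this post, while the function names `parse_result_listing` and `parse_utc` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime, timezone

# Field labels copied from the result listing in this post. Some contain
# spaces, so they are matched as prefixes rather than split on whitespace.
FIELD_LABELS = [
    "Result ID", "Name", "Workunit", "Created", "Sent", "Received",
    "Server state", "Outcome", "Client state", "Exit status",
    "Computer ID", "Report deadline", "CPU time", "Validate state",
]


def parse_result_listing(text):
    """Turn 'Label value' lines into a dict keyed by label (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        for label in FIELD_LABELS:
            if line.startswith(label + " "):
                fields[label] = line[len(label):].strip()
                break
    return fields


def parse_utc(stamp):
    """Parse a timestamp such as '25 Mar 2006 22:19:20 UTC' (English month names)."""
    naive = datetime.strptime(stamp.replace(" UTC", ""), "%d %b %Y %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


listing = """\
Result ID 14982362
Name HB_BARCODE_30_5croA_351_22702_0
Workunit 12161077
Created 25 Mar 2006 22:19:20 UTC
Sent 26 Mar 2006 8:33:29 UTC
"""

fields = parse_result_listing(listing)
queued_for = parse_utc(fields["Sent"]) - parse_utc(fields["Created"])
print(fields["Workunit"], "created", fields["Created"], "waited", queued_for)
```

The gap between the two timestamps is how long the work unit sat on the server before being sent, which is why the date you received a WU can be later than its creation date.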

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12779 - Posted: 29 Mar 2006, 3:21:01 UTC - in response to Message 12764.  

Jobs beginning with HB_BARCODE... were queued before we reduced the maximum CPU time, and we can't change the time limit retroactively. If you are having a lot of trouble with stuck WUs, you can delete these work units.
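
A rough Python sketch of how one might pick these jobs out of a list of work unit names. Only the HB_BARCODE prefix comes from this post; the function name and the sample queue are made up for illustration, and other older batches may exist that this check would not catch.

```python
# Prefix named in this post; jobs carrying it were queued before the
# maximum CPU time was reduced. Other old batches are NOT covered here.
OLD_LIMIT_PREFIXES = ("HB_BARCODE",)


def has_old_time_limit(workunit_name):
    """Return True if the name starts with a prefix known to predate the new limit."""
    return workunit_name.startswith(OLD_LIMIT_PREFIXES)


# Hypothetical local queue, using two work unit names quoted in this thread.
queue = [
    "HB_BARCODE_30_5croA_351_22702_0",
    "FA_RLXpt_hom006_1ptq__361_426_1",
]

to_review = [name for name in queue if has_old_time_limit(name)]
print(to_review)
```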

Pappateam
Joined: 9 Jan 06
Posts: 2
Credit: 1,610,324
RAC: 0
Message 12892 - Posted: 31 Mar 2006, 22:34:32 UTC

Still having WUs stuck every day at 1%. Computers range from a Duron 800 to a T2300 (most of them are AMD), with no difference between them. Sometimes I only notice the problem after about 50 hours, so this problem is very bad.
Is there a solution on the horizon?

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12893 - Posted: 1 Apr 2006, 0:18:00 UTC - in response to Message 12892.  
Last modified: 1 Apr 2006, 1:35:38 UTC

The new work units should not be getting stuck at 1%. Could you try removing all pre-4.83 (on Windows) work units and let us know what happens?

[DPC]TeamHC~LostPoints
Joined: 19 Mar 06
Posts: 1
Credit: 272,665
RAC: 0
Message 12908 - Posted: 1 Apr 2006, 14:29:46 UTC

Got the same 1% problem over here.
Killing the WU didn't help; the next one also got the 1% problem.
Then I reset the project. (I have a Dutch version, so I don't know the exact English name of the button.)

After resetting the project, all WUs were deleted and new ones were downloaded.
Now the system runs perfectly, and no 1% errors have occurred since then.

Pappateam
Joined: 9 Jan 06
Posts: 2
Credit: 1,610,324
RAC: 0
Message 12974 - Posted: 3 Apr 2006, 9:04:26 UTC - in response to Message 12893.  

This really seems to have solved the problem! Big thanks David!

Osku87
Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 13096 - Posted: 5 Apr 2006, 21:01:33 UTC

Nicely done, except there is one little flaw. It might be called the 1.042% bug. (The last digits can be found in the graphics.) The WU stopped after about fifteen minutes of crunching. Restarting the client or suspending and resuming the WU doesn't help. Now aborting. There went 9 hours of crunching...

Stage: Full atom relax
Model: 1 Step: 320044

Program version is 4.83

https://boinc.bakerlab.org/rosetta/result.php?resultid=16235196

Hope this was the only one.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13102 - Posted: 6 Apr 2006, 0:12:12 UTC

The 042 in 1.042% is supposed to give the programmers a much better idea of where the program is getting stuck. But there are a few other numbers being passed around, so .042 may not be the only sticking point. By reporting the full number at which the WU got stuck, we'll hopefully help them kill off the last traces of this bug.
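
To make sure the full number gets reported rather than a rounded "1%", a small Python sketch like the following could split the progress text into its whole and fractional parts. The function name is made up, and what each fractional value means inside Rosetta is not stated in this thread, so the code only extracts the digits.

```python
def split_progress(progress_text):
    """Split text like '1.042% Complete' into (whole percent, fractional digits).

    The fraction is kept as a string so leading zeros ('042') survive.
    """
    number = progress_text.strip().split("%")[0]
    whole, _, fraction = number.partition(".")
    return int(whole), fraction


print(split_progress("1.042% Complete"))
print(split_progress("1%"))
```

Keeping the fraction as text matters here: converting "1.042" to a float and printing it with default rounding in a client could lose exactly the digits the developers asked for.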

Corgi
Joined: 17 Oct 05
Posts: 2
Credit: 389,209
RAC: 0
Message 13146 - Posted: 7 Apr 2006, 2:44:51 UTC

I've got another one. Here's a copy of all the text on the BOINC display, plus the URL of a screenshot of the same. When I ran the test from the command prompt, it stopped at exactly the same point -- 39 min+ so far at time of this posting.

FA_RLXpt_hom006_1ptq__361_426_1 (left in memory)
------------------------
1.042% Complete
CPU time: 9 hr 13 min 58 sec

Corgi - Total credit: 1064.71 - RAC: 16.7777
GasBuddy

Stage: Full atom relax
Model: 1 Step: 314653
Accepted RMSD: 10.78
Accepted Energy: -51.5163

Rosetta@home v4.83 [URL]

Screenshot: http://pics.livejournal.com/sff_corgi/pic/000k21q6 (39.6Kb)
------------------------
PC ID: 23940 'Sothis'
GenuineIntel Intel(R) Pentium(R) M processor 1500MHz
Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00)
Corgi
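
For those who would rather not retype the display text by hand, here is a hedged Python sketch that pulls the main fields out of a copy like the one above. It assumes the text uses the same wording as in this post ("% Complete", "CPU time: H hr M min S sec", "Stage:", "Model: N Step: N"); the function name is made up, and other client versions may format the display differently.

```python
import re
from datetime import timedelta


def parse_display_text(text):
    """Extract progress fields from text copied off the BOINC display.

    Any field that is not found is returned as None.
    """
    pct = re.search(r"([\d.]+)% Complete", text)
    cpu = re.search(r"CPU time:\s*(\d+) hr (\d+) min (\d+) sec", text)
    stage = re.search(r"Stage:\s*(.+)", text)
    step = re.search(r"Model:\s*(\d+)\s+Step:\s*(\d+)", text)
    cpu_time = None
    if cpu:
        hours, minutes, seconds = (int(part) for part in cpu.groups())
        cpu_time = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return {
        "percent": pct.group(1) if pct else None,  # kept as text, e.g. "1.042"
        "cpu_time": cpu_time,
        "stage": stage.group(1).strip() if stage else None,
        "model": int(step.group(1)) if step else None,
        "step": int(step.group(2)) if step else None,
    }


sample = """\
1.042% Complete
CPU time: 9 hr 13 min 58 sec
Stage: Full atom relax
Model: 1 Step: 314653
"""

info = parse_display_text(sample)
print(info)
```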

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13147 - Posted: 7 Apr 2006, 3:19:12 UTC - in response to Message 13146.  

Apparently this is one of the "old" pre-4.83 WUs (its date is 22-Mar-06) which obviously has a problem, as it failed on another PC:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11819500

I would just abort it.

PS: AFAIK, the only info needed when reporting a stuck WU is just the WU number, e.g. #11819500 in this case (or just its name). If you just abort it, the project will also know the random seed (it shows in the stderr.txt output on the result ID page).
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity

Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 13162 - Posted: 7 Apr 2006, 11:02:39 UTC
Last modified: 7 Apr 2006, 11:07:45 UTC

Hi all. I have a 2.4 gb PC with 256 MB of RAM, running Windows XP Home with SP2. I have had no failures since I turned off all screen savers (I turn the display off) and started leaving unfinished WUs in memory (i.e. hard drive). I run Rosetta, SETI and Predictor. No failures since 17/03/06.


AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13195 - Posted: 7 Apr 2006, 22:23:04 UTC - in response to Message 13147.  

I believe they would like to know the exact percentage complete that the WU was stuck at.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13198 - Posted: 8 Apr 2006, 0:59:19 UTC - in response to Message 13195.  

Yes, we need to know the exact percentage complete at which the WU got stuck. The name of the work unit is also helpful, as we can then see at a glance whether particular types of work units are having the most problems.
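
One way to keep reports consistent is to fill in the same few fields every time. The Python sketch below is only an illustration of that idea: the class name, field names, and output layout are made up, and the project has not specified any required format beyond asking for the work unit name and the exact percentage.

```python
from dataclasses import dataclass


@dataclass
class StuckReport:
    """Hypothetical container for the details requested in this thread."""
    workunit_name: str
    percent_complete: str  # kept as text so "1.042" is not rounded
    app_version: str = ""  # optional, e.g. "4.83"

    def as_post(self):
        """Format the report as plain text suitable for pasting into a post."""
        lines = [
            f"Work unit: {self.workunit_name}",
            f"Stuck at: {self.percent_complete}%",
        ]
        if self.app_version:
            lines.append(f"Program version: {self.app_version}")
        return "\n".join(lines)


report = StuckReport("FA_RLXpt_hom006_1ptq__361_426_1", "1.042", "4.83")
print(report.as_post())
```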

Corgi
Joined: 17 Oct 05
Posts: 2
Credit: 389,209
RAC: 0
Message 13233 - Posted: 8 Apr 2006, 14:26:17 UTC - in response to Message 13198.  

Heh, I'd rather be providing too much than too little. Would you remind me where I can find the ID # and date for any suspect workunits again, please?

Corgi

Desti
Joined: 16 Sep 05
Posts: 50
Credit: 3,018
RAC: 0
Message 13250 - Posted: 8 Apr 2006, 17:08:16 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=16615299

2 hours of CPU time and still at 1%; I will abort it now.
LUE

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13256 - Posted: 8 Apr 2006, 17:57:31 UTC - in response to Message 13233.  

I go to the particular PC's host page on Rosetta's BOINC server (in your case it would be https://boinc.bakerlab.org/rosetta/results.php?hostid=23940) and click on the "Work Unit ID" to see which one has the name you see in your BOINC client. Obviously this is not easy to do if one has 50 nameless PCs and/or a PC that downloads 30 WUs at a time.

Probably just reporting the WU name, e.g. HBLR_1.0_1di2_425... etc., along with the % done (or Model/Step #s?) at which it got stuck is just as useful and easier after all, as others suggested.
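
For reference, the three page addresses that appear in this thread (host results, work unit, and result) follow the same pattern, so they can be built from an ID. A small Python sketch, with made-up function names and assuming the server keeps these query parameters:

```python
from urllib.parse import urlencode

# Base address as it appears in the links posted in this thread.
BASE = "https://boinc.bakerlab.org/rosetta"


def host_results_url(host_id):
    """Page listing all results for one computer."""
    return f"{BASE}/results.php?{urlencode({'hostid': host_id})}"


def workunit_url(wu_id):
    """Page for a work unit (shows its creation date)."""
    return f"{BASE}/workunit.php?{urlencode({'wuid': wu_id})}"


def result_url(result_id):
    """Page for a single result (includes the stderr output)."""
    return f"{BASE}/result.php?{urlencode({'resultid': result_id})}"


print(host_results_url(23940))
print(workunit_url(11819500))
print(result_url(16615299))
```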
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity



©2025 University of Washington
https://www.bakerlab.org