Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

Stephen Miller

Joined: 18 Sep 05
Posts: 13
Credit: 16,294,215
RAC: 0
Message 12501 - Posted: 22 Mar 2006, 9:52:51 UTC - in response to Message 12489.  
Last modified: 22 Mar 2006, 10:08:29 UTC



[quote]
as long as the graphics show movement, the calculation is proceeding, so best to stick with it..
[/quote]



I've got a stuck unit too.

FA_RLXpt_hom004_1ptq_361_27_0 is stuck at 8.63% at 48:41:25 CPU time in BOINC.

Per the instructions at the bottom of this thread, I launched:

rosetta_4.82_windows_intelx86.exe xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom004_ -frags_name_prefix hom004_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2484844

which ran for 19 minutes (it started with 18 minutes on the clock, so 37 minutes total) and stuck at 22.7%, Stage: Full atom relax, Model 2, step 255492. There is no graphic movement and no step changes.

CPU time is now 0 hr 48 min 0 sec.

Hope this helps.

I have a screen shot of the BOINC application if desired.

I am restarting BOINC to see if it will finish.

On this particular computer, Rosetta is the only project being processed.

update - after a reboot, BOINC is continuing to process the unit. It is currently at 20 minutes 27 secs and at model 3 step 67000+. It took only 10 minutes to get to this point.

Stephen
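For anyone repeating the standalone test, here is a minimal Python sketch that rebuilds the command line above with a fixed seed. The argument string is copied from the post; the function name `build_standalone_command` is hypothetical, and nothing here is an official Rosetta or BOINC tool.

```python
# Positional arguments and flags copied verbatim from the command quoted in the post.
REPORTED_ARGS = (
    "xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 "
    "-relax_score_filter -new_centroid_packing -abrelax -output_chi_silent "
    "-stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 "
    "-short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters "
    "-nstruct 10 -protein_name_prefix hom004_ -frags_name_prefix hom004_ "
    "-filter1 -45 -filter2 -55 -termini -cpu_run_time 7200"
)


def build_standalone_command(executable, seed):
    """Return the argument list for a standalone run that reuses a fixed seed.

    Hypothetical helper: it only assembles strings and does not start anything.
    """
    return [executable, *REPORTED_ARGS.split(), "-constant_seed", "-jran", str(seed)]


if __name__ == "__main__":
    cmd = build_standalone_command("rosetta_4.82_windows_intelx86.exe", 2484844)
    print(" ".join(cmd))
```

Keeping the arguments as a list means the same values could be handed to `subprocess.run` from the work unit's directory, assuming the input files are present there.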

Mike
Joined: 21 Dec 05
Posts: 9
Credit: 35,252
RAC: 0
Message 12505 - Posted: 22 Mar 2006, 10:38:29 UTC

Hi. OK, I'm running Rosetta, SETI and Predictor. Since I turned off all screen savers and started keeping results in memory (hard disc), I've had no further problems.
The PC runs 24/7. I just turn off the monitor when I'm away.
bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 12510 - Posted: 22 Mar 2006, 13:17:26 UTC

Hi All,

I experienced the 1 percent bug, though not at 1 percent but at 15 percent. It had been spinning its wheels at 15% for 15 hours before I realized it. I turned BOINC off and back on, and Rosetta went back to zero and started all over. I checked on BOINC 8 hours later and found the same thing, stuck at 15 percent, so I just aborted the whole unit.

https://boinc.bakerlab.org/rosetta/result.php?resultid=14382390

FA_RLXpt_hom006_1ptq__361_86_0
Dorphas
Joined: 14 Feb 06
Posts: 2
Credit: 60,275
RAC: 0
Message 12511 - Posted: 22 Mar 2006, 14:05:30 UTC

this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
Urban
Joined: 4 Oct 05
Posts: 6
Credit: 119,893
RAC: 0
Message 12512 - Posted: 22 Mar 2006, 15:21:46 UTC - in response to Message 12430.  

Arrgggg.....looks like the 1% stuck wu's are back:

FA_RLXc9_1c9oA_359_372_0

1% after 19 hr 44 min.


For me, that's reason enough to leave the Rosetta project until this bug is really fixed!

Urban
Urban
Joined: 4 Oct 05
Posts: 6
Credit: 119,893
RAC: 0
Message 12513 - Posted: 22 Mar 2006, 15:25:00 UTC - in response to Message 12240.  

What do you have to offer those of us with large unattended farms?

If the WU goes past 2 or 3 times the user's selected run time, why not abort it? If I see it, that's what I'm going to do manually. One WU lost is not going to make any difference to the science, and we don't have the issue of holding up credit awards. Chances are very good it's a 1% problem, not some big ooglie new type of WU. Those big new ooglie things should probably have a hard lower limit for run time that overrides the user preference to get at least one model crunched.
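The "2 or 3 times the user's selected run time" rule could be sketched as below. This is only an illustration of the suggestion, not anything BOINC or Rosetta is known to implement; the function name `should_abort` and the default factor of 3 are assumptions.

```python
def should_abort(cpu_seconds, preferred_runtime_seconds, factor=3.0):
    """Return True once CPU time exceeds `factor` times the preferred run time.

    Hypothetical watchdog rule based on the suggestion in this thread.
    """
    if preferred_runtime_seconds <= 0 or factor <= 0:
        raise ValueError("preferred run time and factor must be positive")
    return cpu_seconds > factor * preferred_runtime_seconds


if __name__ == "__main__":
    # A unit at 19 hr 44 min of CPU time, checked against an assumed 6 hour preference.
    cpu = 19 * 3600 + 44 * 60
    print(should_abort(cpu, 6 * 3600, factor=3.0))
```

A rule like this trades one lost work unit for a bounded amount of wasted CPU time, which is the trade-off the poster is arguing for.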

I doubt many serious crunchers are going to be watching cycle-sucking screen savers... those are for the SETI LGM searchers. Most will only be running boinc.exe in CLI mode, and monitoring perhaps with BoincView.


Actually all WUs will produce at least one model no matter how long that takes and no matter what the users time setting is. So there is a low limit of one model. In some cases that model may take 6 or 8 hours. During that time the percent will only show 1% complete.

I agree many farmers do not use the screen saver. But there are more users that are not farmers and that is why I suggest people use the Display function to look at the graphic. While it may leave a residual function open when you close the window on some systems, that can be harmlessly aborted. In any case the display function does not eat cycles the way that the screen saver does so long as it is not in full display mode. So I am not suggesting you leave the display running all the time. Just use it to take a look as a diagnostic function. The point is that you need to be able to tell if the model is stepping or not. Boincview will not tell you that, the display will.


That isn't correct! I've had runtimes of 19 hours and 32 hours; normally it shows that it should complete in 6 hours. The computers where I run Rosetta are all configured so that the executable is 100% in memory!

As I say, I'll start crunching for Rosetta again once this bug is fixed.

Urban
m.mitch
Joined: 10 Feb 06
Posts: 34
Credit: 1,928,904
RAC: 0
Message 12514 - Posted: 22 Mar 2006, 15:25:42 UTC - in response to Message 12511.  

[quote]
this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
[/quote]

I only have a hobby farm and I still had trouble finding the stuck WUs. Unfortunately or otherwise, our team tactics have changed, so I'm only crunching four projects at the moment. I've run out of RAH WUs but will be back when we have achieved our next team goal ;-)


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12517 - Posted: 22 Mar 2006, 15:43:54 UTC - in response to Message 12511.  

[quote]
this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
[/quote]

I understand! This is why all of our efforts are now directed at fixing this problem. In the meantime we are lowering the maximum time cutoff so that a machine cannot be stuck for more than a day. (See thread below.)
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 12520 - Posted: 22 Mar 2006, 16:18:15 UTC - in response to Message 12513.  

Urban said -

[quote]
That isn't correct! I've had runtimes of 19 hours and 32 hours; normally it shows that it should complete in 6 hours. The computers where I run Rosetta are all configured so that the executable is 100% in memory!

As I say, I'll start crunching for Rosetta again once this bug is fixed.

Urban
[/quote]

You should understand what Mod 9 is telling you. The "Max CPU Time" Dr. Baker is talking about has NOTHING to do with what YOU set as a max. Even if YOU set the max to 6 hours, a unit stuck at 1% can go way over that! Dr. Baker is talking about setting the max CPU time within the WU to 24 hours... it is currently WAY over that, to allow peeps to set times of a week or more in THEIR max CPU setting in their profile.
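The difference between the two limits can be pictured with a toy model. This is a sketch of the behaviour as described in this thread only (the preferred run time is checked when a model finishes, while a separate work-unit cap applies regardless); it is not taken from Rosetta or BOINC source code, and the name `next_action` and its return values are hypothetical.

```python
def next_action(cpu_hours, just_finished_model, user_target_hours, wu_cap_hours=24.0):
    """Toy decision rule for a running work unit.

    Assumptions, based only on the discussion in this thread:
    - the user's preferred run time is consulted only when a model completes;
    - the work unit carries its own hard cap (24 hours is the proposed value).
    """
    if cpu_hours >= wu_cap_hours:
        return "abort"      # hard cap reached, even if no model ever finished
    if just_finished_model and cpu_hours >= user_target_hours:
        return "finish"     # normal end: report the models produced so far
    return "continue"       # keep crunching the current model


if __name__ == "__main__":
    # A unit stuck mid-model sails past a 6 hour preference...
    print(next_action(19.5, False, 6.0))
    # ...and is only stopped by the work-unit cap.
    print(next_action(24.0, False, 6.0))
```

Under these assumptions a unit that never completes a model is never stopped by the user's setting, which is why lowering the cap inside the WU is what bounds the wasted time.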

Stephen Miller
Joined: 18 Sep 05
Posts: 13
Credit: 16,294,215
RAC: 0
Message 12521 - Posted: 22 Mar 2006, 17:14:28 UTC - in response to Message 12520.  
Last modified: 22 Mar 2006, 17:17:58 UTC

[quote]
I've got a stuck unit too.

FA_RLXpt_hom004_1ptq_361_27_0 is stuck at 8.63% at 7:28:12 CPU time in BOINC.

[/quote]

It hung again at 60.52% on Model 9, step 237186.

It had the same random seed as earlier before I dumped it.

I've aborted it and moved on.

This is the first one that failed to complete after a reboot.


BadThad
Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 12526 - Posted: 22 Mar 2006, 20:04:11 UTC - in response to Message 12511.  

[quote]
this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
[/quote]

Indeed, that is my problem: I cannot babysit my machines. I've had one PC hung since December 13 that I simply cannot get to... not for another month or two at least. I'm growing closer and closer to "bugging out" of Rosetta.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12581 - Posted: 23 Mar 2006, 20:26:25 UTC - in response to Message 12501.  



Hi Stephen, so on your computer the identical work unit does not get stuck at the same point when you run it outside BOINC? Are other people seeing this as well? Thanks, David
Stephen Miller
Joined: 18 Sep 05
Posts: 13
Credit: 16,294,215
RAC: 0
Message 12602 - Posted: 24 Mar 2006, 6:17:16 UTC - in response to Message 12581.  



[quote]
Hi Stephen, so on your computer the identical work unit does not get stuck at the same point when you run it outside boinc? are other people seeing this as well? thanks, David
[/quote]

Correct, it hung at a different place when run outside BOINC, and it hung at a different place within BOINC too.
Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12663 - Posted: 25 Mar 2006, 7:50:22 UTC
Last modified: 25 Mar 2006, 8:00:43 UTC

I had one stuck at 1% showing 5 hours 56 minutes CPU time (using Boinc) ...(HB_Barcode_30_1ctf_351_16616_0)... I'm currently running it outside of Boinc, and it's at 44.6% in 54.5 minutes. Question: If I let it finish outside will I be able to send it in using Boinc or should I abort it??
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12664 - Posted: 25 Mar 2006, 8:12:45 UTC - in response to Message 12663.  

[quote]
I had one stuck at 1% showing 5 hours 56 minutes CPU time (using Boinc) ...(HB_Barcode_30_1ctf_351_16616_0)... I'm currently running it outside of Boinc, and it's at 44.6% in 54.5 minutes. Question: If I let it finish outside will I be able to send it in using Boinc or should I abort it??
[/quote]

I'm not sure if you can send it in using BOINC, but this tells us that the "stuck" problem on your computer is not an infinite loop inside Rosetta but something about the Rosetta-BOINC interaction. Thanks, David
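The reasoning here can be written down as a rough triage of what people report. The sketch below is an assumption-laden summary of this thread's logic, not a diagnostic the project provides; `triage` and its result strings are hypothetical, and a hang point is represented as a `(model, step)` pair read off the graphic.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[int, int]  # (model, step) at which a run stopped advancing


def triage(boinc_hangs: Sequence[Point], standalone_hang: Optional[Point]) -> str:
    """Classify a stuck work unit from hang points seen with the same seed.

    boinc_hangs: where the unit stopped on each attempt under BOINC.
    standalone_hang: where the standalone run stopped, or None if it ran
    past the point where BOINC got stuck.
    """
    if not boinc_hangs:
        return "no hang observed under BOINC"
    if standalone_hang is None:
        return "suspect rosetta-boinc interaction"
    if all(point == standalone_hang for point in boinc_hangs):
        return "suspect deterministic problem inside rosetta"
    return "hang point varies between runs"


if __name__ == "__main__":
    # Hang points reported earlier in the thread: Model 9, step 237186 under BOINC
    # and Model 2, step 255492 standalone.
    print(triage([(9, 237186)], (2, 255492)))
```

The comparison is only meaningful when every run uses the same seed, which is what the `-constant_seed -jran` arguments are for.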

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12667 - Posted: 25 Mar 2006, 8:35:45 UTC

I re-started it inside Boinc and it stopped at the same spot that it did the first time. I aborted it... <shrug>
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12671 - Posted: 25 Mar 2006, 10:17:09 UTC - in response to Message 12663.  
Last modified: 25 Mar 2006, 10:21:55 UTC

[quote]
I had one stuck at 1% showing 5 hours 56 minutes CPU time (using Boinc) ...(HB_Barcode_30_1ctf_351_16616_0)... I'm currently running it outside of Boinc, and it's at 44.6% in 54.5 minutes. Question: If I let it finish outside will I be able to send it in using Boinc or should I abort it??
[/quote]

Same here! Also an HB_Barcode, at 1% after 9 hours. Now I'm running it manually, and in the console it started at 1% after 17 minutes :s What were the other 8.25 hours of CPU time used for? :s

David, what's the full atom relax stage? The steps are slowing down in that stage, slowing down a lot!
Greg C. TNO
Joined: 18 Jan 06
Posts: 2
Credit: 250,065
RAC: 0
Message 12699 - Posted: 25 Mar 2006, 19:50:27 UTC - in response to Message 12511.  

[quote]
this 1% bug, i think, is a big turnoff for a lot of people. especially the ones with "farms" and can not get to them daily. i had 3 computers at my 2nd job that got stuck for 6 days last week. i reset them saturday and now it looks like 2 of them are stuck on 1% again for the past 2 days. our team is even talking about moving on to something else because of the 1% bug and wasted cpu cycles. we really like rosetta as a whole but it seems to require a lot more monitoring than other projects. hope it is solved soon.
[/quote]

I'm on the same team as Dorphas. I have a 'farm'. The 1% issue is annoying, but the new work units that begin with 'FA' are truly awful. They hang randomly, at 40%, 88% etc... Overnight I had multiple machines spinning their wheels; as soon as they're freed they run into another. I can live with the 1% issue; it reared its ugly head occasionally and it is a bug. Bugs happen, and I know you're working on it. But things seem to be getting worse, not better.

I have two remote machines that have not reported results in 2 weeks. One machine is 350 miles away, I just 'unstuck' it and it seems to have run into a problem on the very next w/u. Obviously it is hard for me to administer that particular machine easily, I will have to re-assign it to another project as I can't run back and forth checking it constantly.

Regards
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 12734 - Posted: 28 Mar 2006, 0:33:04 UTC

I've got one now, HB_BARCODE_30_2ci2I_351_30593_0.
This result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=15136445

It is currently suspended so other units can run. I just found it when I came home, stuck for ~5 hours. I stopped/restarted BOINC, it ran for about a minute, then got stuck at step 21292, Acc. RMSD 9.045, Acc. Energy 0.6126684. I stopped/restarted BOINC again (2 more times total) and it keeps getting stuck in the exact same spot at 1 minute, 14 seconds.

I'll try running it outside of BOINC later tonight when I get a chance. First stuck WU on this machine (but it IS a new machine).

Machine: Dual Xeon 3.06GHz, 2GB ram, WinXP SP2. HT is on, running 4 BOINC processes, leave in memory = YES (not that it matters for this WU).
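Several posts in this thread judge a unit "stuck" by whether the model and step counters are still changing. A minimal sketch of that check follows, assuming someone (or some script) notes the counters from time to time; how the readings are obtained is left open, and `Sample`, `is_stalled` and the 10 minute default are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    cpu_seconds: float  # CPU time when the reading was taken
    model: int          # model number shown in the graphic
    step: int           # step counter shown in the graphic


def is_stalled(samples: Sequence[Sample], min_cpu_seconds: float = 600.0) -> bool:
    """True if model and step have not changed over at least `min_cpu_seconds`.

    Walks backwards from the newest reading; any earlier reading with a
    different (model, step) inside the window means the unit is still stepping.
    """
    if len(samples) < 2:
        return False
    last = samples[-1]
    for earlier in reversed(samples[:-1]):
        if (earlier.model, earlier.step) != (last.model, last.step):
            return False
        if last.cpu_seconds - earlier.cpu_seconds >= min_cpu_seconds:
            return True
    return False
```

The window guards against flagging a unit that is merely slow: two identical readings a few seconds apart are not enough to call it stalled.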

genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 12740 - Posted: 28 Mar 2006, 3:18:18 UTC
Last modified: 28 Mar 2006, 3:20:23 UTC

OK, I ran the standalone test on the WU in my previous post, HB_BARCODE_30_2ci2I_351_30593_0.

As expected, in standalone mode it blew right past the spot it stopped at under BOINC. Interestingly, it already had the argument -constant_seed -jran xxxx on the "command executed" line. I killed the standalone process which had gotten much farther along by then, restarted BOINC, and unsuspended the WU. It started from the beginning, and hung at exactly the same spot.
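To check whether a "command executed" line already pins the seed, something like the following would do. It is a small sketch that only inspects the text of the line; the helper name `extract_seed` is hypothetical, and the only flags it looks for are the `-constant_seed` and `-jran` arguments mentioned in this thread.

```python
def extract_seed(command_line):
    """Return the integer after -jran if -constant_seed is also present, else None."""
    tokens = command_line.split()
    if "-constant_seed" not in tokens or "-jran" not in tokens:
        return None
    index = tokens.index("-jran")
    if index + 1 >= len(tokens):
        return None  # -jran was the last token, so no value follows it
    try:
        return int(tokens[index + 1])
    except ValueError:
        return None  # value is not a plain integer


if __name__ == "__main__":
    # Seed value taken from the command line posted earlier in the thread.
    print(extract_seed("rosetta_4.82_windows_intelx86.exe -constant_seed -jran 2484844"))
```

If the seed is already fixed, a standalone run and a BOINC run of the same unit should be following the same trajectory, which makes a difference in where they hang more informative.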

It is now sitting there suspended. I await any suggestions as to what to do with it. (I know, stick it where the sun don't shine...)

This machine is also running Ralph, but hasn't had any problems there as yet.




©2024 University of Washington
https://www.bakerlab.org