Message boards : Number crunching : Help us solve the 1% bug!
Author | Message |
---|---|
Stephen Miller Joined: 18 Sep 05 Posts: 13 Credit: 16,294,215 RAC: 0 |
I've got a stuck unit too. FA_RLXpt_hom004_1ptq_361_27_0 is stuck at 8.63% at 48:41:25 CPU time in BOINC. Per the instructions at the bottom of this thread, I launched: `rosetta_4.82_windows_intelx86.exe xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom004_ -frags_name_prefix hom004_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2484844` which ran for 19 minutes (started with 18 minutes = 37 minutes total) and stuck at 22.7%, Stage: Full atom relax, Model 2, step 255492. There is no graphic movement and the step count does not change. CPU time is now 0 hr 48 min 0 sec. Hope this helps. I have a screenshot of the BOINC application if desired. I am restarting BOINC to see if it will finish. On this particular computer, Rosetta is the only project being processed. Update: after a reboot, BOINC is continuing to process the unit. It is currently at 20 minutes 27 seconds and at Model 3, step 67000+. It took only 10 minutes to get to this point. Stephen |
Mike Joined: 21 Dec 05 Posts: 9 Credit: 35,252 RAC: 0 |
Hi. OK, I'm running Rosetta, SETI, and Predictor. Since I turned off all screen savers and started keeping results in memory (hard disc), I've had no further problems. The PC runs 24/7. I just turn off the monitor when I'm away. |
bruce boytler Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
Hi all, I experienced the 1 percent bug, not at 1 percent but at 15 percent. It had been spinning its wheels at 15% for 15 hours before I realized it. I turned BOINC off and back on, and Rosetta went back to zero and started all over. I checked on BOINC 8 hours later and found the same thing, stuck at 15 percent, so I just aborted the whole unit. https://boinc.bakerlab.org/rosetta/result.php?resultid=14382390 FA_RLXpt_hom006_1ptq__361_86_0 |
Dorphas Joined: 14 Feb 06 Posts: 2 Credit: 60,275 RAC: 0 |
This 1% bug, I think, is a big turnoff for a lot of people, especially those with "farms" who cannot get to them daily. I had 3 computers at my 2nd job that got stuck for 6 days last week. I reset them Saturday and now it looks like 2 of them have been stuck on 1% again for the past 2 days. Our team is even talking about moving on to something else because of the 1% bug and wasted CPU cycles. We really like Rosetta as a whole, but it seems to require a lot more monitoring than other projects. Hope it is solved soon. |
Urban Joined: 4 Oct 05 Posts: 6 Credit: 119,893 RAC: 0 |
Arrgggg... looks like the 1% stuck WUs are back. For me, that's reason enough to leave the Rosetta project until this bug is really fixed! Urban |
Urban Joined: 4 Oct 05 Posts: 6 Credit: 119,893 RAC: 0 |
What do you have to offer those of us with large unattended farms? That isn't correct! I've had runtimes of 19 hours and 32 hours; normally it shows that it should complete in 6 hours. The computers where I run Rosetta are all configured so that the executable stays 100% in memory! As I said, I'll start crunching for Rosetta again once this bug is fixed. Urban |
m.mitch Joined: 10 Feb 06 Posts: 34 Credit: 1,928,904 RAC: 0 |
[quote]This 1% bug, I think, is a big turnoff for a lot of people, especially those with "farms" who cannot get to them daily. I had 3 computers at my 2nd job that got stuck for 6 days last week. I reset them Saturday and now it looks like 2 of them have been stuck on 1% again for the past 2 days. Our team is even talking about moving on to something else because of the 1% bug and wasted CPU cycles. We really like Rosetta as a whole, but it seems to require a lot more monitoring than other projects. Hope it is solved soon.[/quote] I only have a hobby farm and I still had trouble finding the stuck WUs. Unfortunately or otherwise, our team tactics have changed, so I'm only crunching four projects at the moment. I've run out of RAH WUs but will be back when we have achieved our next team goal ;-) |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
[quote]This 1% bug, I think, is a big turnoff for a lot of people, especially those with "farms" who cannot get to them daily. I had 3 computers at my 2nd job that got stuck for 6 days last week. I reset them Saturday and now it looks like 2 of them have been stuck on 1% again for the past 2 days. Our team is even talking about moving on to something else because of the 1% bug and wasted CPU cycles. We really like Rosetta as a whole, but it seems to require a lot more monitoring than other projects. Hope it is solved soon.[/quote] I understand! This is why all of our efforts are now directed at fixing this problem. In the meantime, we are lowering the maximum time cutoff so a machine cannot be stuck for more than a day. (see thread below) |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Urban said - You should understand what Mod 9 is telling you. The "Max CPU Time" Dr. Baker is talking about has NOTHING to do with what YOU set as a max. Even if YOU set the max to 6 hours, a WU stuck at 1% can go way over that! Dr. Baker is talking about setting the max CPU time within the WU to 24 hours... it is currently WAY over that, to allow people to set times of a week or more in THEIR max CPU setting in their profile. |
Stephen Miller Joined: 18 Sep 05 Posts: 13 Credit: 16,294,215 RAC: 0 |
[quote]I've got a stuck unit too. FA_RLXpt_hom004_1ptq_361_27_0 is stuck at 8.63% at 7:28:12 CPU time in BOINC.[/quote] It hung again at 60.52% on Model 9, step 237186. It had the same random seed as earlier, before I dumped it. I've aborted it and moved on. This is the first one that failed to complete after a reboot. |
BadThad Joined: 8 Nov 05 Posts: 30 Credit: 71,834,523 RAC: 0 |
[quote]This 1% bug, I think, is a big turnoff for a lot of people, especially those with "farms" who cannot get to them daily. I had 3 computers at my 2nd job that got stuck for 6 days last week. I reset them Saturday and now it looks like 2 of them have been stuck on 1% again for the past 2 days. Our team is even talking about moving on to something else because of the 1% bug and wasted CPU cycles. We really like Rosetta as a whole, but it seems to require a lot more monitoring than other projects. Hope it is solved soon.[/quote] Indeed, that is my problem: I cannot babysit my machines. I've had one PC hung since December 13 that I simply cannot get to... not for another month or two at least. I'm growing closer and closer to "bugging out" of Rosetta. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Hi Stephen, so on your computer the identical work unit does not get stuck at the same point when you run it outside BOINC? Are other people seeing this as well? Thanks, David |
Stephen Miller Joined: 18 Sep 05 Posts: 13 Credit: 16,294,215 RAC: 0 |
Correct, it hung at a different place when run outside BOINC, and it hung at a different place within BOINC too. |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I had one stuck at 1% showing 5 hours 56 minutes CPU time (using BOINC): HB_Barcode_30_1ctf_351_16616_0. I'm currently running it outside of BOINC, and it's at 44.6% in 54.5 minutes. Question: if I let it finish outside, will I be able to send it in using BOINC, or should I abort it? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
[quote]I had one stuck at 1% showing 5 hours 56 minutes CPU time (using BOINC): HB_Barcode_30_1ctf_351_16616_0. I'm currently running it outside of BOINC, and it's at 44.6% in 54.5 minutes. Question: if I let it finish outside, will I be able to send it in using BOINC, or should I abort it?[/quote] I'm not sure if you can send it in using BOINC, but this tells us that the "stuck" problem on your computer is not an infinite loop inside Rosetta but something about the Rosetta-BOINC interaction. Thanks, David |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I restarted it inside BOINC and it stopped at the same spot that it did the first time. I aborted it... <shrug> |
Team_Elteor_Borislavj~Intelligence Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0 |
[quote]I had one stuck at 1% showing 5 hours 56 minutes CPU time (using BOINC): HB_Barcode_30_1ctf_351_16616_0. I'm currently running it outside of BOINC, and it's at 44.6% in 54.5 minutes. Question: if I let it finish outside, will I be able to send it in using BOINC, or should I abort it?[/quote] Same here! Also an HB_Barcode, at 1% after 9 hours. Now I'm running it manually, and in the console window it reached 1% at 17 minutes :s What was the other 8.25 hours of CPU time used for? :s David, what's the full atom relax stage? The steps slow down a lot in that stage! |
Greg C. TNO Joined: 18 Jan 06 Posts: 2 Credit: 250,065 RAC: 0 |
[quote]This 1% bug, I think, is a big turnoff for a lot of people, especially those with "farms" who cannot get to them daily. I had 3 computers at my 2nd job that got stuck for 6 days last week. I reset them Saturday and now it looks like 2 of them have been stuck on 1% again for the past 2 days. Our team is even talking about moving on to something else because of the 1% bug and wasted CPU cycles. We really like Rosetta as a whole, but it seems to require a lot more monitoring than other projects. Hope it is solved soon.[/quote] I'm on the same team as Dorphas. I have a "farm". The 1% issue is annoying, but the new work units that begin with "FA" are truly awful. They hang randomly, at 40%, 88%, etc. Overnight I had multiple machines spinning their wheels; as soon as they're freed, they run into another. I can live with the 1% issue: it reared its ugly head occasionally and it is a bug. Bugs happen, and I know you're working on it. But things seem to be getting worse, not better. I have two remote machines that have not reported results in 2 weeks. One machine is 350 miles away; I just "unstuck" it and it seems to have run into a problem on the very next WU. Obviously that particular machine is hard for me to administer, so I will have to reassign it to another project, as I can't run back and forth checking it constantly. Regards |
genes Joined: 8 Oct 05 Posts: 60 Credit: 695,284 RAC: 504 |
I've got one now, HB_BARCODE_30_2ci2I_351_30593_0. This result: https://boinc.bakerlab.org/rosetta/result.php?resultid=15136445 It is currently suspended so other units can run. I just found it when I came home, stuck for ~5 hours. I stopped/restarted BOINC, it ran for about a minute, then got stuck at step 21292, Acc. RMSD 9.045, Acc. Energy 0.6126684. I stopped/restarted BOINC again (2 more times total) and it keeps getting stuck in the exact same spot at 1 minute, 14 seconds. I'll try running it outside of BOINC later tonight when I get a chance. First stuck WU on this machine (but it IS a new machine). Machine: Dual Xeon 3.06GHz, 2GB RAM, WinXP SP2. HT is on, running 4 BOINC processes, leave in memory = YES (not that it matters for this WU). |
genes Joined: 8 Oct 05 Posts: 60 Credit: 695,284 RAC: 504 |
OK, I ran the standalone test on the WU in my previous post, HB_BARCODE_30_2ci2I_351_30593_0. As expected, in standalone mode it blew right past the spot it stopped at under BOINC. Interestingly, it already had the argument `-constant_seed -jran xxxx` on the "command executed" line. I killed the standalone process, which had gotten much farther along by then, restarted BOINC, and unsuspended the WU. It started from the beginning and hung at exactly the same spot. It is now sitting there suspended. I await any suggestions as to what to do with it. (I know, stick it where the sun don't shine...) This machine is also running Ralph, but hasn't had any problems there as yet. |
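
The symptom reported throughout this thread (CPU time keeps climbing while the step counter stays frozen) and the one-day maximum time cutoff David Baker mentions could be combined in a small watchdog for unattended farms. Below is a minimal Python sketch under stated assumptions, not anything taken from Rosetta or BOINC: the names `Sample` and `StallWatchdog`, the one-hour default stall threshold, and the idea of feeding it periodic (CPU time, step) readings are all hypothetical. How those readings would be collected from a running client is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One reading for a work unit (hypothetical structure)."""
    cpu_seconds: float  # CPU time reported so far
    step: int           # step counter, e.g. 21292 in genes' report


class StallWatchdog:
    """Flags a work unit whose step counter stops moving (illustrative only)."""

    def __init__(self, stall_cpu_seconds: float = 3600.0,
                 hard_cutoff_seconds: float = 24 * 3600.0) -> None:
        # stall_cpu_seconds: assumed threshold for "no progress" (not from the thread)
        # hard_cutoff_seconds: mirrors the one-day cutoff mentioned by David Baker
        self.stall_cpu_seconds = stall_cpu_seconds
        self.hard_cutoff_seconds = hard_cutoff_seconds
        self._last_step: Optional[int] = None
        self._cpu_at_last_progress = 0.0

    def check(self, sample: Sample) -> str:
        """Return "ok", "stalled", or "abort" for the latest reading."""
        # Hard limit applies regardless of progress.
        if sample.cpu_seconds >= self.hard_cutoff_seconds:
            return "abort"
        # Any change in the step counter (including a restart from zero)
        # counts as progress and resets the stall clock.
        if self._last_step is None or sample.step != self._last_step:
            self._last_step = sample.step
            self._cpu_at_last_progress = sample.cpu_seconds
            return "ok"
        # Same step as before: stalled once enough CPU time has been burned.
        if sample.cpu_seconds - self._cpu_at_last_progress >= self.stall_cpu_seconds:
            return "stalled"
        return "ok"


if __name__ == "__main__":
    dog = StallWatchdog(stall_cpu_seconds=600)
    for s in [Sample(74, 21292), Sample(400, 21292), Sample(700, 21292)]:
        print(s.cpu_seconds, dog.check(s))
```

Keeping the stall check separate from the hard cutoff reflects the distinction Scribe draws above: a limit that fires only when progress stops can be short, while the absolute limit has to stay long enough for legitimately slow work units.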
©2024 University of Washington
https://www.bakerlab.org