Report stuck work units here

Message boards : Number crunching : Report stuck work units here

Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7250 - Posted: 22 Dec 2005, 20:42:32 UTC - in response to Message 7235.  

...I firmly believe that a moderator should moderate as little as possible...


Moderate moderation perhaps ??
ID: 7250 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Brett Kneisley

Send message
Joined: 17 Dec 05
Posts: 2
Credit: 3,593,841
RAC: 0
Message 7353 - Posted: 23 Dec 2005, 11:17:19 UTC

I'm new here, and I read down the thread to see what information would help in fixing a problem I am having. I have several work units waiting that start with sample 207. Two have already been listed by my system as "computational error". The original estimated work time is 2:05. My system has been running SETI as well, switching between the two every 10 minutes. When the Rosetta computation reaches a little over 10 minutes (20%) and the timer shifts to SETI and then comes back, the computation time drops to less than 10 minutes, runs to 18 - 20, stops after its 10 minutes have run, and drops back to less than 8 minutes total time. It never went over 20% complete, even after 9 hours.

I have at this time changed the computation time to 3 hours to see what will happen.

I have Windows XP if that helps any. Is there any other info that is needed?


ID: 7353 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Webmaster Yoda
Avatar

Send message
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7355 - Posted: 23 Dec 2005, 11:43:53 UTC - in response to Message 7353.  
Last modified: 23 Dec 2005, 11:51:57 UTC

When the Rosetta computation reaches over 10 minutes 20% and the timer shifts to Seti then comes back it will drop the comp time to less than 10 minutes


Sounds like the usual problem of not keeping work units in memory, in this case combined with very short times between switches (if we had an FAQ, this would probably be at the top of the list).

Check your preferences - to run Rosetta alongside other projects, you need to set "Leave applications in memory while preempted?" to yes or you will likely never finish a work unit.

I'd also change the setting for "Switch between applications every" to at least the recommended 60 minutes, but if you don't keep the work in memory, you will lose work done (back to 10%, 20%, 30% or whichever percentage it was at before the switch) every time you switch.
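To make the effect concrete, here is a toy model of the behaviour described above. It is only a sketch under stated assumptions: the 125-minute total comes from the 2:05 estimate quoted earlier, checkpoints are assumed to fall at every 10% (12.5 minutes), and the function name `simulate` and its parameters are made up for illustration. The real scheduler and application are more complicated.

```python
def simulate(total_minutes, slice_minutes, checkpoint_step,
             keep_in_memory, max_slices=1000):
    """Return how many time slices a work unit needs to finish, or None if it never does.

    Assumption: when the app is preempted and NOT kept in memory, it falls
    back to its last checkpoint (every `checkpoint_step` minutes of work).
    """
    progress = 0.0  # minutes of useful work completed so far
    for slices_used in range(1, max_slices + 1):
        progress += slice_minutes
        if progress >= total_minutes:
            return slices_used
        if not keep_in_memory:
            # Removed from memory: everything since the last checkpoint is lost.
            progress = (progress // checkpoint_step) * checkpoint_step
    return None


# 10-minute switching, not kept in memory: never reaches the first checkpoint.
print(simulate(125, 10, 12.5, keep_in_memory=False))
# Same switching interval, but kept in memory: finishes normally.
print(simulate(125, 10, 12.5, keep_in_memory=True))
# 60-minute switching, not kept in memory: finishes, but wastes some work.
print(simulate(125, 60, 12.5, keep_in_memory=False))
```

In this model a slice shorter than the checkpoint interval makes no net progress at all unless the application stays in memory, which is why both settings matter.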

Of course, there is also the problem with a bad batch of work units, mostly in batches 204 to 207. They will crash soon after starting - nothing you can do about that.
*** Join BOINC@Australia today ***
ID: 7355 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Brett Kneisley

Send message
Joined: 17 Dec 05
Posts: 2
Credit: 3,593,841
RAC: 0
Message 7356 - Posted: 23 Dec 2005, 12:20:16 UTC

The current Rosetta work unit hit 30%. I then updated my preferences to 'Leave applications in memory while preempted'. After the update the computation time did drop, but not below the 30% level. One problem fixed. I'm leaving it running to get some of the work units out of the way. If they are bad units I'll find out soon enough.

2 new units were downloaded, one says : Default **** 206 and the Second is : NO_RAND_WTS

I know there was a problem with a Default 205 and those were to be ditched. Same with these new ones?
ID: 7356 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Tern
Avatar

Send message
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 7372 - Posted: 23 Dec 2005, 14:06:52 UTC - in response to Message 7356.  

2 new units were downloaded, one says : Default **** 206 and the Second is : NO_RAND_WTS

I know there was a problem with a Default 205 and those were to be ditched. Same with these new ones?


Nope! Only the DEFAULT_xxxx_205's should be aborted. The _other_ DEFAULT ones are "good", or at least as good as any other WU generated in that batch. In other words, they _could_ be "short WUs" and fail quickly, but it's not probable. Most likely is that they're the "best" of the bunches you could get, and the most likely to both earn you credit, and to be useful to the project.

The project uses names that are descriptive of what's going on. Those who expect every WU name to be a boring random string of letters and numbers sometimes get concerned by names like "random jitter whatever", but the names are explained over in Science, and really mean something.
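For anyone who wants to sort their queue by name, the pattern can be sketched in a few lines. This is a guess at the layout inferred from names quoted in this thread (a descriptive label, a four-character target ID, a batch number, a serial number, and an optional result index); the field names and the helper functions are hypothetical, not an official format. Names that follow another scheme, such as the "topology_sample" ones, simply fail to parse.

```python
import re

# Assumed layout: LABEL_target_batch_serial[_result]
WU_NAME = re.compile(
    r"(?P<label>.+?)_(?P<target>[0-9][A-Za-z0-9]{3})"
    r"_(?P<batch>\d+)_(?P<serial>\d+)(?:_(?P<result>\d+))?"
)


def parse_wu_name(name):
    """Split a work unit name into its assumed fields, or return None."""
    match = WU_NAME.fullmatch(name)
    return match.groupdict() if match else None


def should_abort(name):
    """Apply the rule from this post: only DEFAULT_xxxx_205 units are known bad."""
    parts = parse_wu_name(name)
    return (parts is not None
            and parts["label"].upper() == "DEFAULT"
            and parts["batch"] == "205")


print(parse_wu_name("DEFAULT_2reb_219_3444_0"))
print(parse_wu_name("INCREASE_CYCLES_10_1hz6_226_241"))
print(should_abort("DEFAULT_1abc_205_12_0"))  # made-up name, for illustration only
```

The label is matched lazily so that labels containing underscores or digits, such as INCREASE_CYCLES_10, stay in one piece.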

ID: 7372 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
pieface

Send message
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 7400 - Posted: 23 Dec 2005, 19:44:53 UTC

This is just an update to my message no. 7186 from yesterday on a 'stuck' WU.
I left it (and Rosetta) suspended overnight to see if there would be any reply, and since there wasn't anything new this morning I thought I would just abort the WU and get on with it. I 'resumed' it before aborting, the percent complete went back to zero and, wouldn't you know it, the danged thing went from zero to completion in 4,892 CPU secs. Odd behavior for something that should be 'repeatable' ???

ID: 7400 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7404 - Posted: 23 Dec 2005, 19:48:09 UTC - in response to Message 7400.  

This is just an update to my message no. 7186 from yesterday on a 'stuck' WU.
I left it (and Rosetta) suspended overnight to see if there would be any reply, and since there wasn't anything new this morning I thought I would just abort the WU and get on with it. I 'resumed' it before aborting, the percent complete went back to zero and, wouldn't you know it, the danged thing went from zero to completion in 4,892 CPU secs. Odd behavior for something that should be 'repeatable' ???


Some Rosetta WUs take a random number seed from the clock time, so they are not repeatable if they go back to 0%. I don't know if that applies to your WU or not.
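A small illustration of the general idea, not of Rosetta's own code: a randomized search seeded with a fixed number retraces the same path on every restart, while one seeded from the clock almost certainly does not. The function `run_trajectory` is a made-up stand-in for the real computation.

```python
import random
import time


def run_trajectory(seed, steps=5):
    """Stand-in for a randomized search: return the first few random draws."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


# Fixed seed: a restart from 0% follows exactly the same path.
fixed_first = run_trajectory(seed=12345)
fixed_again = run_trajectory(seed=12345)
print(fixed_first == fixed_again)

# Clock-derived seed: a restart from 0% will very likely take a different path,
# so one attempt can get stuck while the next runs to completion.
clock_first = run_trajectory(seed=time.time_ns())
clock_again = run_trajectory(seed=time.time_ns())
print(clock_first == clock_again)
```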

River~~
ID: 7404 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Tern
Avatar

Send message
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 7408 - Posted: 23 Dec 2005, 20:30:23 UTC

I have moved several postings that did not relate to stuck work units to the Moderated messages moved here thread. As this thread is a staff-created sticky for a particular problem, the other discussions should take place elsewhere.

ID: 7408 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile ecafkid

Send message
Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 7613 - Posted: 25 Dec 2005, 15:44:40 UTC

default_1hz6_219_3398_0 stuck at 1% after 16:32:55 with 21:04:49 remaining.

I guess I will abort.
ID: 7613 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mgabriel

Send message
Joined: 18 Sep 05
Posts: 5
Credit: 96,494
RAC: 0
Message 7710 - Posted: 27 Dec 2005, 2:20:00 UTC

DEFAULT_2reb_219_3444_0
1% after 11:36:19 on an x2 3800+
ID: 7710 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7723 - Posted: 27 Dec 2005, 8:52:35 UTC
Last modified: 27 Dec 2005, 8:52:42 UTC

If it goes more than 4 hours, take a screenshot (PrtScn key) of the graphics display (highlight the WU while it is running and select "Show Graphics"), save the image as a JPG file (use Paint), then stop and restart BOINC. The work unit will restart at 0 (sorry) and run from there ...
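As a rough sketch of that rule of thumb, the check below takes a CPU time in the hh:mm:ss form people have been posting and flags anything over the 4-hour mark that is still sitting at a very low percentage. The 1% cutoff and the function names are assumptions made for illustration; only the 4-hour figure comes from this post.

```python
def hms_to_seconds(text):
    """Convert 'hh:mm:ss' (as shown in the BOINC manager) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def worth_reporting(cpu_time, percent_done, limit_hours=4, max_percent=1):
    """True if the WU has run past the limit and is still at or below max_percent."""
    return (hms_to_seconds(cpu_time) > limit_hours * 3600
            and percent_done <= max_percent)


print(worth_reporting("11:36:19", 1))   # a report from earlier in this thread
print(worth_reporting("02:05:00", 1))   # still within a normal run time
```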

One of the questions we have is whether the system is doing ANYTHING other than updating the clock ...

You *MAY* be able to lobby for extra credit ... :)

If you win, let me know, I have one worth 175 CS ... :)

ID: 7723 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Los Alcoholicos~DJNL

Send message
Joined: 10 Nov 05
Posts: 1
Credit: 248,497
RAC: 0
Message 7729 - Posted: 27 Dec 2005, 12:45:51 UTC

Looks stuck at 1%,

Rosetta Version 481 [workunit: DEFAULT_2reb_220_2101]
1% complete
CPU time: 7 hr 39 min 31 sec
stage: Ab Initio
Step: 2118
Accepted Rmsd: 8.359
Accepted energy: 0.5129008

It's running on an AMD Sempron 2600+, Win XP Home SP2, and is left in memory when swapped.

I have it suspended now, the stderr.txt is empty and here are the first/last 10 lines of stdout.txt:

[2005-12-27 05:28:14] :: BOINC :: boinc_init()
command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 2reb _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10
[STR OPT]New value for [-paths] frags400.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
__________________________________________________________________________________

score0 done: (best, low) rms
0 0 13.7818356
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
-12.5528154 -18.1141415 13.5991316 8.55768871
standard trials: 2000 accepts: 877 %: 43.85
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -5.522 -5.522 -5.522 22 8.558 7.168 8.558
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10


ID: 7729 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7765 - Posted: 27 Dec 2005, 23:25:25 UTC

I have had a run of errors on this Linux box. Sadly I can't give you any files from it as it is remote and I didn't have access to ssh yesterday/today - all I had was control via BOINCview.

Only one (marked with stars) is a 'stuck' WU; all the rest died early. However, in view of the suspicion of one job possibly tainting the next, I thought I would let you see the whole lot.

Also of concern is the fact that a CPDN wu died as well - with error code 11 not 131. I have seen error code 11 a few times on failed Rosetta wu over the last few days, so if there is some 'taint' it may be that it causes that error code.

Or maybe the cpdn wu would have died then anyway. Who knows.

By the way this is a twin cpu box, so the duplicated times are not an error. That again may, or may not, suggest tainting from one cpu to another.

26th
19:54:50 DEFAULT_2tif_219_7301_0 error 131 after 1h 4m 25sec
19:57:44 NO_RANDOM_WTS_OR_FRAGS_1dtj_223_812_0 error 131 after 14sec
19:57:44 NO_RANDOM_WTS_OR_FRAGS_1mky_223_530_0 error 131 after 4sec
19:57:45 DEFAULT_1b72_219_7654_0 error 131 after 29sec
20:03:23 DEFAULT_2tif_220_2806_0 error 131 after 53sec

27th,
0700-1137
*** MORE_FRAGS_2reb_222_897_0 repeatedly stuck at 80%, 6h 51m 36sec,
restarted client several times, clock restarted from 80% checkpoint & ran ok at first then stopped again at around this figure. I can't be sure it was the same time, but always just short of 7hr.

11:37:30 ditto aborted by user
14:02:34 sulphur (CPDN WU) error 11 after 18days cpu time :-(
14:04:14 DEFAULT_1ogw_220_1383_1 error 131 after 58sec

Most recent successful outcome, 14:40 on 26th. Time now 23:06 on 27th. I have two Rosetta wu apparently running OK, be interesting to see if they finish or not!

R~~
ID: 7765 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7781 - Posted: 28 Dec 2005, 2:27:29 UTC



Result = NO_BARCODE_FRAGS_1r69_227_1143_0


This ran on a Linux box for around 12 hours before I noticed it was stuck with less than 2 hrs CPU recorded.

top showed that it was not actually running, so I aborted it.

I kept the std*.txt files, and parts of these are given here, as they would stretch this thread.

One thing I noted was the heartbeat message in the stderr file - this is a known BOINC problem and is caused when the inter-process communication from the client to the app is delayed by other events in the box. The app should exit without generating an error - on other BOINC projects it simply restarts from the previous checkpoint. I wonder if this is another manifestation of the way Rosetta does not like being removed from memory? Just a guess and not fully consistent with earlier observations on another box (see earlier post).
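The heartbeat idea can be sketched as follows. This is an illustration of the general mechanism only, not BOINC's actual code: the class name, the method names and the 30-second timeout are all assumptions. The app records each heartbeat it receives from the client and, if too long passes without one, concludes the client has gone away and exits so that it can resume from its last checkpoint later.

```python
class HeartbeatWatcher:
    """Track heartbeats from the client and decide when the app should give up."""

    def __init__(self, timeout_seconds=30.0):  # timeout value is an assumption
        self.timeout = timeout_seconds
        self.last_beat = None

    def beat(self, now):
        """Record that a heartbeat arrived at time `now` (in seconds)."""
        self.last_beat = now

    def client_seems_gone(self, now):
        """True if a heartbeat was seen before but none arrived within the timeout."""
        return self.last_beat is not None and now - self.last_beat > self.timeout


watcher = HeartbeatWatcher()
for arrival in (0, 10, 20):       # regular heartbeats
    watcher.beat(arrival)
print(watcher.client_seems_gone(45))  # 25 s since the last beat: still fine
print(watcher.client_seems_gone(55))  # 35 s of silence: the app would exit here
```

If the box is busy enough to delay the heartbeat past the timeout, the app sees the same silence as if the client had really died, which matches the kind of message noted above.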
ID: 7781 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Steve Dodd

Send message
Joined: 13 Dec 05
Posts: 7
Credit: 3,773,676
RAC: 762
Message 7782 - Posted: 28 Dec 2005, 2:28:54 UTC

Sorry for the lack of information being provided. New to Rosetta.
1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram
Aborted before I found this thread. (Rosetta 4.80)
ID: 7782 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7784 - Posted: 28 Dec 2005, 2:45:50 UTC

This result was running on a single cpu linux box. Clock stopped & client restarted several times, eventually result aborted.

std*.txt files given here

similar comment as before - note heartbeat again
ID: 7784 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7786 - Posted: 28 Dec 2005, 2:49:35 UTC - in response to Message 7782.  

Sorry for the lack of information being provided. New to Rosetta.
1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram
Aborted before I found this thread. (Rosetta 4.80)


hi Steve - every little helps. But can you say if both the clock and progress were stuck, or was the clock running and only the progress stuck? If you don't remember, no worries.
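To make the distinction concrete, here is a minimal sketch of how the two kinds of 'stuck' could be told apart from two readings of the same WU taken a good while apart (hours, not seconds). The function name and the labels it returns are made up for illustration.

```python
def classify(first, second):
    """Compare two (cpu_seconds, percent_done) readings of one work unit.

    Assumes the readings were taken far enough apart that a healthy
    work unit would have shown some progress in between.
    """
    cpu_advanced = second[0] > first[0]
    progress_advanced = second[1] > first[1]
    if progress_advanced:
        return "running"
    if cpu_advanced:
        return "progress stuck, clock running"
    return "clock stopped"


print(classify((3600, 1), (38436, 1)))   # CPU time climbing, percentage frozen
print(classify((7200, 80), (7200, 80)))  # neither moving
print(classify((3600, 10), (7200, 20)))  # healthy
```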
ID: 7786 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Hoelder1in
Avatar

Send message
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 7791 - Posted: 28 Dec 2005, 4:51:38 UTC
Last modified: 28 Dec 2005, 4:52:54 UTC

I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago. In both cases 'top' showed that the task status had gone from RN to SN, i.e., the task just sat there and prevented other Rosetta work from being done. After killing the respective Rosetta job the WU continued from the last checkpoint to completion (no points were lost). Oh, and all of this happened on a hyperthreading CPU. In the second of the two cases I actually tar'ed the respective slot directory before killing the job, intending to report this after collecting a few more cases (which hasn't happened so far). If the saved slot directory is of interest I can make this available when I am back home again next week.
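For anyone who wants to check this without watching top, the same state letter can be read from /proc on Linux. The sketch below parses the state out of a /proc/<pid>/stat line; the helper names and the sample line (including the process name in it) are invented for illustration. A single 'S' reading proves little on its own, but a compute-bound task that shows 'S' on every check over several minutes is probably not doing any work.

```python
def parse_stat(line):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first '(' and the last ')'.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    command = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, command, state


def read_state(pid):
    """Read the current state letter of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return parse_stat(handle.read())[2]


# Made-up sample line: 'R' would mean running, 'S' means sleeping.
sample = "4242 (rosetta app) S 4100 4242 4100 0 -1 4194304"
print(parse_stat(sample))
```

The trailing N in top's RN and SN only marks a low-priority (niced) task, which is normal for BOINC applications; the first letter is the one that matters here.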
ID: 7791 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile River~~
Avatar

Send message
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7815 - Posted: 28 Dec 2005, 12:49:40 UTC
Last modified: 28 Dec 2005, 12:51:35 UTC

INCREASE_CYCLES_10_1hz6_226_241 got stuck for 27 hours at 1%, edit: win-2k box, clock still running, still 99%+ in top. So we are still seeing the other kind of stuck WU.

I have suspended this one, not aborted, as it has already been somewhere else first. Will abort when advised that files removed from server.

Files will be posted here

R~~
ID: 7815 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Steve Dodd

Send message
Joined: 13 Dec 05
Posts: 7
Credit: 3,773,676
RAC: 762
Message 7845 - Posted: 28 Dec 2005, 21:18:22 UTC - in response to Message 7786.  

Sorry for the lack of information being provided. New to Rosetta.
1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram
Aborted before I found this thread. (Rosetta 4.80)


hi Steve - every little helps. But can you say if both the clock and progress were stuck, or was the clock running and only the progress stuck? If you don't remember, no worries.


Clock was humming right along. Just no progress. Sorry it took so long to respond.
ID: 7845 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote

©2024 University of Washington
https://www.bakerlab.org