Message boards : Number crunching : Report stuck work units here
Author | Message |
---|---|
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
...I firmly believe that a moderator should moderate as little as possible... Moderate moderation perhaps ?? |
Brett Kneisley Joined: 17 Dec 05 Posts: 2 Credit: 3,593,841 RAC: 0 |
New here, and I read down to see what information would help in fixing a problem I am having. I have several work units waiting that start with sample 207. Two have already been listed by my system as "computational error". The original estimated work time is 2:05. My system has been running SETI as well, switching between the two every 10 minutes. When the Rosetta computation reaches over 10 minutes (20%) and the timer shifts to SETI and then comes back, it drops the comp time to less than 10 minutes, runs to 18 - 20, stops after the 10 minutes have run, and drops back to less than 8 minutes total time. It never went over 20% complete, even after 9 hours. I have now changed the computation time to 3 hours to see what will happen. I have Windows XP, if that helps any. Is there any other info that is needed? |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
"When the Rosetta computation reaches over 10 minutes 20% and the timer shifts to Seti then comes back it will drop the comp time to less than 10 minutes" Sounds like the usual problem of not keeping work units in memory, in this case combined with very short times between switches (if we had an FAQ, this would probably be at the top of the list). Check your preferences: to run Rosetta alongside other projects, you need to set "Leave applications in memory while preempted?" to yes, or you will likely never finish a work unit. I'd also change the setting for "Switch between applications every" to at least the recommended 60 minutes, but if you don't keep the work in memory, you will lose work done (back to 10%, 20%, 30% or whichever percentage it was at before the switch) every time you switch. Of course, there is also the problem with a bad batch of work units, mostly in batches 204 to 207. They will crash soon after starting, and there is nothing you can do about that. *** Join BOINC@Australia today *** |
Brett Kneisley Joined: 17 Dec 05 Posts: 2 Credit: 3,593,841 RAC: 0 |
My current Rosetta work unit hit 30%, and I then updated my preferences to 'Leave applications in memory while preempted'. After the update the computation time did drop, but not below the 30% level. One problem fixed. I'm leaving it running to get some of the work units out of the way. If they are bad units I'll find out soon enough. 2 new units were downloaded: one says Default **** 206 and the second is NO_RAND_WTS. I know there was a problem with a Default 205 and those were to be ditched. Same with these new ones? |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
"2 new units were downloaded, one says: Default **** 206 and the second is: NO_RAND_WTS" Nope! Only the DEFAULT_xxxx_205's should be aborted. The _other_ DEFAULT ones are "good", or at least as good as any other WU generated in that batch. In other words, they _could_ be "short WUs" and fail quickly, but it's not probable. Most likely is that they're the "best" of the bunches you could get, and the most likely to both earn you credit, and to be useful to the project. The project uses names that are descriptive of what's going on. Those who expect every WU name to be a boring random string of letters and numbers sometimes get concerned by names like "random jitter whatever", but the names are explained over in Science, and really mean something. |
pieface Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0 |
This is just an update to my message nr 7186 from yesterday on a 'stuck' WU. I left it (and Rosetta) suspended overnight to see if there would be any reply, and since there wasn't anything new this morning I thought I would just abort the WU and get on with it. I 'resumed' it before aborting, the pct complete went back to zero and, wouldn't you know it, the danged thing went from zero to completion in 4,892 CPU secs. Odd behavior for something that should be 'repeatable'? |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"This is just an update to my message nr 7186 from yesterday on a 'stuck' wu." Some Rosetta WUs take a random number seed from the clock time, so they are not repeatable if they go back to 0%. I don't know if that applies to your WU or not. River~~ |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
I have moved several postings that did not relate to stuck work units to the Moderated messages moved here thread. As this thread is a staff-created sticky for a particular problem, the other discussions should take place elsewhere. |
ecafkid Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 |
default_1hz6_219_3398_0 stuck at 1% after 16:32:55 with 21:04:49 remaining. I guess I will abort. |
mgabriel Joined: 18 Sep 05 Posts: 5 Credit: 96,494 RAC: 0 |
DEFAULT_2reb_219_3444_0 1% after 11:36:19 on an x2 3800+ |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
If it goes more than 4 hours, take a screenshot (PrtScn key) of the graphics display (highlight the WU when running and select "Show Graphics"), save the image as a JPG file (use Paint), then stop and restart BOINC. The work unit will restart at 0 (sorry) and run from there ... One of the questions we have is whether the system is doing ANYTHING other than updating the clock ... You *MAY* be able to lobby for extra credit ... :) If you win, let me know, I have one worth 175 CS ... :) |
Los Alcoholicos~DJNL Joined: 10 Nov 05 Posts: 1 Credit: 248,497 RAC: 0 |
Looks stuck at 1%. Rosetta Version 481 [workunit: DEFAULT_2reb_220_2101] 1% complete; CPU time: 7 hr 39 min 31 sec; stage: Ab Initio; Step: 2118; Accepted Rmsd: 8.359; Accepted energy: 0.5129008. It's running on an AMD Sempron 2600+, Win XP Home SP2, and left in memory when swapped. I have it suspended now; the stderr.txt is empty, and here are the first/last 10 lines of stdout.txt: [2005-12-27 05:28:14] :: BOINC :: boinc_init() command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 2reb _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10 [STR OPT]New value for [-paths] frags400.txt. [T/F OPT]Default FALSE value for [-unix_paths] [T/F OPT]Default FALSE value for [-version] [T/F OPT]Default FALSE value for [-score] [T/F OPT]Default FALSE value for [-abinitio] [T/F OPT]Default FALSE value for [-refine] [T/F OPT]Default FALSE value for [-assemble] [T/F OPT]Default FALSE value for [-idealize] __________________________________________________________________________________ score0 done: (best, low) rms 0 0 13.7818356 --------------------------------------------------------- score1 done: (best, low) rms (best,low) -12.5528154 -18.1141415 13.5991316 8.55768871 standard trials: 2000 accepts: 877 %: 43.85 ----------------------------------------------------- Alternate score2/score5... kk score2 score5 low_score n_low_accept rms rms_min low_rms 0 -5.522 -5.522 -5.522 22 8.558 7.168 8.558 [REAL OPT]Default value for [-cpu_frac] 0.100000001 [REAL OPT]Default value for [-frame_rate] 10 |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I have had a run of errors on this Linux box. Sadly I can't give you any files from it, as it is remote and I didn't have access to ssh yesterday/today - all I had was control via BOINCview. Only one (marked with stars) is a 'stuck' WU; all the rest died early. However, in view of the suspicion of one job possibly tainting the next, I thought I would let you see the whole lot. Also of concern is the fact that a CPDN WU died as well - with error code 11, not 131. I have seen error code 11 a few times on failed Rosetta WUs over the last few days, so if there is some 'taint' it may be that it causes that error code. Or maybe the CPDN WU would have died then anyway. Who knows. By the way, this is a twin-CPU box, so the duplicated times are not an error. That again may, or may not, suggest tainting from one CPU to another. 26th: 19:54:50 DEFAULT_2tif_219_7301_0 error 131 after 1h 4m 25sec; 19:57:44 NO_RANDOM_WTS_OR_FRAGS_1dtj_223_812_0 error 131 after 14sec; 19:57:44 NO_RANDOM_WTS_OR_FRAGS_1mky_223_530_0 error 131 after 4sec; 19:57:45 DEFAULT_1b72_219_7654_0 error 131 after 29sec; 20:03:23 DEFAULT_2tif_220_2806_0 error 131 after 53sec. 27th: 0700-1137 *** MORE_FRAGS_2reb_222_897_0 repeatedly stuck at 80%, 6h 51m 36sec, restarted client several times, clock restarted from 80% checkpoint & ran OK at first, then stopped again at around this figure. I can't be sure it was the same time, but always just short of 7hr; 11:37:30 ditto, aborted by user; 14:02:34 sulphur (CPDN WU) error 11 after 18 days CPU time :-( ; 14:04:14 DEFAULT_1ogw_220_1383_1 error 131 after 58sec. Most recent successful outcome, 14:40 on 26th. Time now 23:06 on 27th. I have two Rosetta WUs apparently running OK; it will be interesting to see if they finish or not! R~~ |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Result = NO_BARCODE_FRAGS_1r69_227_1143_0. This ran on a Linux box for around 12 hours before I noticed it was stuck with less than 2 hrs CPU recorded. top showed that it was not actually running, and it was therefore aborted. I kept the std*.txt files, and parts of these are given here, as the full files would stretch this thread. One thing I noted was the heartbeat message in the stderr file - this is a known BOINC problem and is caused when the inter-process communication from the client to the app is delayed by other events in the box. The app should exit without generating an error - on other BOINC platforms it simply restarts from the previous benchmark. I wonder if this is another manifestation of the way Rosetta does not like being removed from memory? Just a guess, and not fully consistent with earlier observations on another box (see earlier post). |
Steve Dodd Joined: 13 Dec 05 Posts: 7 Credit: 3,777,638 RAC: 888 |
Sorry for the lack of information being provided. New to Rosetta. 1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G RAM. Aborted before I found this thread. (Rosetta 4.80) |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This result was running on a single-CPU Linux box. Clock stopped and client restarted several times; eventually the result was aborted. std*.txt files given here. Similar comment as before - note heartbeat again. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"Sorry for the lack of information being provided. New to Rosetta." Hi Steve - every little helps. But can you say if both the clock and progress were stuck, or was the clock running and only the progress stuck? If you don't remember, no worries. |
Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago. In both cases 'top' showed that the task status had gone from RN to SN, i.e., the task just sat there and prevented other Rosetta work from being done. After killing the respective Rosetta job, the WU continued from the last checkpoint to completion (no points were lost). Oh, and all of this happened on a hyperthreading CPU. In the second of the two cases I actually tar'ed the respective slot directory before killing the job, intending to report this after collecting a few more cases (which hasn't happened so far). If the saved slot directory is of interest, I can make it available when I am back home again next week. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
INCREASE_CYCLES_10_1hz6_226_241 got stuck for 27 hours at 1%. Edit: win-2k box, clock still running, still 99%+ in top. So we are still seeing the other kind of stuck WU. I have suspended this one, not aborted, as it has already been somewhere else first. Will abort when advised that files removed from server. Files will be posted here. R~~ |
Steve Dodd Joined: 13 Dec 05 Posts: 7 Credit: 3,777,638 RAC: 888 |
"Sorry for the lack of information being provided. New to Rosetta." Clock was humming right along. Just no progress. Sorry it took so long to respond. |
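
The reports in this thread fall into two kinds of stuck work unit: the clock stops altogether, or the clock keeps running while the percentage complete never moves. Two readings of a task's CPU time and progress are enough to tell them apart. The Python sketch below is only an illustration of that distinction. The names `Reading` and `classify_task` are hypothetical and not part of BOINC or Rosetta, the 4-hour default is borrowed from Paul D. Buck's rule of thumb earlier in the thread, and the CPU-hour figures in the example call are made-up values standing in for "27 hours at 1% with the clock still running".

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One manual observation of a task, e.g. copied from the BOINC Manager."""
    wall_hours: float    # wall-clock hours since the task was first observed
    cpu_hours: float     # CPU time the client reports for the task
    percent_done: float  # progress the client reports for the task


def classify_task(first: Reading, second: Reading,
                  min_gap_hours: float = 4.0) -> str:
    """Classify a task from two readings (hypothetical helper, not a BOINC API)."""
    # Wait long enough between readings before calling anything stuck.
    if second.wall_hours - first.wall_hours < min_gap_hours:
        return "too soon to tell"
    # Any forward movement in the percentage means the task is alive.
    if second.percent_done > first.percent_done:
        return "progressing"
    # CPU time advanced but progress did not: the "1% after N hours" reports.
    if second.cpu_hours > first.cpu_hours:
        return "clock running, progress stuck"
    # Neither CPU time nor progress moved: the "Clock Stops" reports.
    return "clock stopped"


# Illustrative values only: stuck at 1% for 27 hours, clock still running.
print(classify_task(Reading(0.0, 0.5, 1.0), Reading(27.0, 27.0, 1.0)))
```

When reporting a stuck work unit, stating which of the last two outcomes applies answers the question River~~ asks above about whether the clock, the progress, or both had stopped.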
©2024 University of Washington
https://www.bakerlab.org