Report stuck work units here

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7869 - Posted: 29 Dec 2005, 9:13:48 UTC - in response to Message 7791.  

I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago.


It occurs to me that I have occasionally seen this on other projects - but maybe four or five times in ten months. Then I suddenly see a lot of it here - is it a separate bug, or is it that the other problems are putting more stress on the boxes so that a very rare BOINC bug is triggered more often?

Especially if it does turn out that the clock stopped problem is down to a dropped message between the client and the app, this is exactly where extra stress on the box is more likely to bring obscure issues to the surface.

Just a thought, and it may be a red herring.
River~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7875 - Posted: 29 Dec 2005, 10:03:23 UTC

I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant?

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7914 - Posted: 29 Dec 2005, 17:45:17 UTC - in response to Message 7875.  

I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant?

Don't know, what are the error codes?

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7922 - Posted: 29 Dec 2005, 19:43:42 UTC - in response to Message 7914.  

I have also noticed that the _topology_ wu seem to error out (when they do) with a different error code to the other short running jobs. Is this significant?

Don't know, what are the error codes?


The _topology_ jobs are giving [large negative number] = 0xc0000005

As far as I remember all (most?) of the other wu have been giving small positive numbers like 11, 131, etc, which is what made me notice the difference.
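
Editorial aside: the "large negative number" and 0xc0000005 are the same 32-bit value read as signed versus unsigned. On Windows, 0xC0000005 is the NTSTATUS code for an access violation. The small positive codes such as 11 may be Unix signal numbers (11 is SIGSEGV on Linux), but that reading is an assumption, not something the thread confirms. A minimal sketch of the conversion, with hypothetical helper names:

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value: int) -> int:
    """Interpret a (possibly negative) integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


# The exit code as a signed decimal, the form a results table would likely show.
print(to_signed32(0xC0000005))          # -1073741819
# And back again to the hex form quoted in the post.
print(hex(to_unsigned32(-1073741819)))  # 0xc0000005
```

Small positive codes are unchanged by either function, so the same helpers can be applied to every exit status in a results list.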

Sorry I can't be more detailed, but I have not had a chance to note much down. If I am right the exact values will be in the db, and if my memory is wrong the db will say so too.

R~~

doc :)
Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 7960 - Posted: 30 Dec 2005, 5:51:24 UTC
Last modified: 30 Dec 2005, 5:52:48 UTC

this WU was stuck for more than 3 hours at 1% on step 2733 of the ab initio phase, exiting and restarting boinc got it to 10% in less than 10 minutes.

i got a backup of the stdout.txt from before the restart if needed.

pic:
[img=http://img522.imageshack.us/img522/670/stuck2ch.th.jpg]

edit: can't figure out how to post the thumbnail correctly, bbcode is not my friend :)

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 7974 - Posted: 30 Dec 2005, 8:56:43 UTC

I have a work unit that has been at 1% for 15H 50 min.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4133381

What do u want me to do?

Anders n

John Hetherington
Joined: 7 Oct 05
Posts: 1
Credit: 28,875
RAC: 0
Message 7992 - Posted: 30 Dec 2005, 15:59:45 UTC

Not sure if this is the right place to ask - but I've had all the batch of WUs fail (message in BOINC "computation error") but with a Windows error message mentioning "fortran error" - only since upgrading to BOINC Version 5.2.13. Before that Rosetta was fine.

John



Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 7996 - Posted: 30 Dec 2005, 17:08:35 UTC - in response to Message 7992.  

Not sure if this is the right place to ask - but I've had all the batch of WUs fail (message in BOINC "computation error") but with a Windows error message mentioning "fortran error" - only since upgrading to BOINC Version 5.2.13. Before that Rosetta was fine.


Since they aren't stuck, no this isn't the place to ask... but if I might guess, you're also running Predictor, which has the "Fortran errors"; all the errors on Rosetta I see in your results look like the "bad batch" or Rosetta being removed from memory, possibly when Predictor failed. If you're not running Predictor, please open a new thread and we'll see what we can do to figure this out.


River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7999 - Posted: 30 Dec 2005, 19:04:25 UTC - in response to Message 7974.  

I have a work unit that has been at 1% for 15H 50 min.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=4133381

What do u want me to do?

Anders n


hi Anders,

the project staff are still on leave.

I can't tell you what the project want, only try to make my best guess at it.

It depends on the speed of your box. What is the longest time you have seen a Rosetta wu take on that box and still succeed in the end?

Halve that time. If a wu is stuck without the progress changing for that amount of time, I call it time to abort/suspend and move on.

Eventually you will need to abort it if credit matters to you - but please don't abort yet. Go to the work tab (not the project tab), highlight the stuck wu, and click the suspend button (which then changes to become a resume button).

BOINC should download more work (if need be) and start running it.

Later on, when the project team say it is time to abort the wu, you will highlight it again, first click abort, then click resume (the last to allow the wu to be reported for credit).
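
Editorial aside: the "halve that time" rule of thumb above can be written down as a small check. This is only a sketch of the poster's personal heuristic, not project policy; the function and parameter names are made up for illustration.

```python
def should_suspend(stalled_hours: float, longest_success_hours: float) -> bool:
    """Return True once progress has been frozen for at least half of the
    longest run time that has ever ended in success on this machine."""
    threshold_hours = longest_success_hours / 2
    return stalled_hours >= threshold_hours


# Example: the slowest successful work unit on this box took 10 hours,
# so a unit frozen for 6 hours is past the 5-hour threshold.
print(should_suspend(stalled_hours=6, longest_success_hours=10))  # True
print(should_suspend(stalled_hours=4, longest_success_hours=10))  # False
```

The threshold scales with the speed of the box, which is the point of the advice: a slow machine gets a longer grace period than a fast one.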

Hope that helps - it is what I would do
R~~

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 8005 - Posted: 30 Dec 2005, 20:05:44 UTC - in response to Message 7999.  


Halve that time. If a wu is stuck without the progress changing for that amount of time, I call it time to abort/suspend and move on.

Eventually you will need to abort it if credit matters to you - but please don't abort yet. Go to the work tab (not the project tab), highlight the stuck wu, and click the suspend button (which then changes to become a resume button).

BOINC should download more work (if need be) and start running it.

Later on, when the project team say it is time to abort the wu, you will highlight it again, first click abort, then click resume (the last to allow the wu to be reported for credit).

Hope that helps - it is what I would do
R~~



Thanks for the reply.

I did just that after some more time.

I have a snapshot of the graphics too, just in case.

Anders n

cwangersky
Joined: 6 Nov 05
Posts: 6
Credit: 325,556
RAC: 0
Message 8151 - Posted: 2 Jan 2006, 1:52:40 UTC

On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0).

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8166 - Posted: 2 Jan 2006, 9:09:24 UTC - in response to Message 8151.  

On my Windows boxes, and on my Linux boxes, I have seen a number of "clock stopped" WU. Is there a separate thread for these? Or should I report them here? Usually they stop at even multiples of 20%, and are readily identifiable by the fact that CPU usage drops to 0 when they are "running", but I had one just today that stopped (with 0 CPU and no clock update) at 99.95% (NO_BARCODE_FRAGS_1di2_227_3864_0).


Yes report them here please.

There seem to be three cases:
* clock never started, progress grinds to a halt later
* clock runs ok but progress grinds to a halt with clock still increasing
* clock and progress start ok but both stop later on

So please make clear which it is, and mention the operating system, especially in the first case.

It is believed that the first case, where cpu stays at zero throughout, is confined to win95/98/ME - so reports of it happening on Win2k/XP or Linux would be particularly relevant.
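
Editorial aside: for anyone logging their own observations, the three cases above can be told apart from a few (CPU time, progress) readings taken some minutes apart. The sketch below assumes you note those two numbers by hand from the BOINC work tab; the names and the sampling scheme are hypothetical.

```python
from enum import Enum
from typing import Sequence, Tuple


class Stall(Enum):
    NOT_STALLED = "progress is still advancing"
    CLOCK_NEVER_STARTED = "CPU time stayed at zero throughout"
    PROGRESS_ONLY = "CPU time still increasing, progress halted"
    CLOCK_AND_PROGRESS = "CPU time and progress both stopped"


def classify(samples: Sequence[Tuple[float, float]]) -> Stall:
    """samples holds (cpu_seconds, progress_percent) pairs, oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a change")
    (prev_cpu, prev_pct), (cpu, pct) = samples[-2], samples[-1]
    if pct > prev_pct:
        return Stall.NOT_STALLED
    if all(c == 0 for c, _ in samples):
        return Stall.CLOCK_NEVER_STARTED   # first case in the list
    if cpu > prev_cpu:
        return Stall.PROGRESS_ONLY         # second case
    return Stall.CLOCK_AND_PROGRESS        # third case


print(classify([(0, 0), (0, 1), (0, 1)]).name)            # CLOCK_NEVER_STARTED
print(classify([(600, 1), (1200, 1)]).name)               # PROGRESS_ONLY
print(classify([(600, 1), (1200, 20), (1200, 20)]).name)  # CLOCK_AND_PROGRESS
```

Two readings are enough for the second and third cases; the first case needs readings from the start of the run, since it depends on the clock never having moved at all.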

Thanks, R~~

Plum Ugly
Joined: 3 Nov 05
Posts: 24
Credit: 2,005,763
RAC: 0
Message 8285 - Posted: 3 Jan 2006, 18:46:24 UTC - in response to Message 8166.  
Last modified: 3 Jan 2006, 19:18:09 UTC

I have one at 1% with 12-14 hrs running. I suspended it. What is it that you need to see on it?
Defaut_1b72_220_3516_0 ran 12 hrs 33 min 06 before I suspended it at 1%. It says 16 hrs and 59 minutes to completion. Pentium 4/W2k system.

Plum Ugly
Joined: 3 Nov 05
Posts: 24
Credit: 2,005,763
RAC: 0
Message 8297 - Posted: 3 Jan 2006, 20:45:51 UTC
Last modified: 3 Jan 2006, 20:46:42 UTC

I also have one, NEW_SOFT_CENTROID_PACKING_2reb_225_2962_0, stuck at 1% after running 8 hrs 32 min. I have suspended it also. W2k on a P4 2.4.

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 8300 - Posted: 3 Jan 2006, 21:15:09 UTC

I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.


hyperfusion
Joined: 2 Oct 05
Posts: 1
Credit: 120
RAC: 0
Message 8309 - Posted: 3 Jan 2006, 22:42:22 UTC

I also have a workunit (INCREASE_CYCLES_10_2tif_226_3820) stuck for a few hours at 40%. I know it is not running, since a simple ps aux | grep ^boinc reveals that all the rosetta* processes are sleeping. I tried renicing the processes to 0 (from 19), but that didn't do anything.
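
Editorial aside: the check described above (all rosetta processes sleeping) can be scripted by reading the STAT column of `ps aux`, where a leading `S` means interruptible sleep and `R` means running or runnable. A minimal sketch, assuming the standard eleven-column `ps aux` layout; the sample text is invented for illustration and is not the poster's actual output.

```python
def rosetta_states(ps_aux_output: str, user: str = "boinc"):
    """Return (pid, state_letter, command) for each rosetta process owned by user."""
    rows = []
    for line in ps_aux_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        pid, stat, command = fields[1], fields[7], fields[10]
        if "rosetta" in command:
            rows.append((int(pid), stat[0], command))
    return rows


# Hypothetical sample; "SN" is sleeping (S) at low priority (N, i.e. niced).
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
boinc     4300  0.1  0.5  12345  4321 ?        SN   Jan02   1:02 ./boinc
boinc     4321  0.0 12.3 151234 98765 ?        SN   Jan02   0:00 rosetta_4.80_i686-pc-linux-gnu aa 2tif _
"""

states = rosetta_states(SAMPLE)
print(states[0][:2])                                # (4321, 'S')
print(all(state == "S" for _, state, _ in states))  # True
```

In real use the text would come from running `ps aux` and capturing its output, for example with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.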

Here's my stderr.txt:
*** glibc detected *** corrupted double-linked list: 0x08d728f8 ***
[0x87074eb]
[0x871f4bc]
[0xffffe420]
[0x8785674]
[0x879a4c6]
[0x879ee8d]
[0x879f463]
[0x879f8bf]
[0x8770365]
[0x87700d1]
[0x80572dd]
[0x81087e8]
[0x810814a]
[0x8785b7f]
[0x871738f]
[0x872049d]
[0x87b1cba]
*** glibc detected *** corrupted double-linked list: 0x093a0ab0 ***
[0x87074eb]
[0x871f4bc]
[0xffffe420]
[0x8785674]
[0x879a4c6]
[0x879eeda]
[0x879fb6c]
[0x87a143d]
[0x8770107]
[0x86002f1]
[0x85ef650]
[0x85f843c]
[0x860d529]
[0x860df34]
[0x840c3fe]
[0x86a48c9]
[0x85b1bbf]
[0x85b36ec]
[0x85b45f4]
[0x83ca2af]
[0x83cc2cf]
[0x877e534]
[0x8048121]
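
Editorial aside: without symbols the addresses above cannot be named, but the two backtraces can still be compared. The sketch below splits stderr.txt-style text into one trace per "glibc detected" banner and reports the frames the traces share at the top. The sample holds only the first six frames of each trace from the post, and the function names are hypothetical.

```python
def parse_backtraces(stderr_text: str):
    """Split the text into lists of frame addresses, one list per glibc banner."""
    traces = []
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if line.startswith("*** glibc detected ***"):
            traces.append([])
        elif traces and line.startswith("[0x") and line.endswith("]"):
            traces[-1].append(line[1:-1])
    return traces


def common_prefix(first, second):
    """Return the leading frames that are identical in both traces."""
    shared = []
    for a, b in zip(first, second):
        if a != b:
            break
        shared.append(a)
    return shared


# First six frames of each trace, copied from the stderr.txt above.
SAMPLE = """\
*** glibc detected *** corrupted double-linked list: 0x08d728f8 ***
[0x87074eb]
[0x871f4bc]
[0xffffe420]
[0x8785674]
[0x879a4c6]
[0x879ee8d]
*** glibc detected *** corrupted double-linked list: 0x093a0ab0 ***
[0x87074eb]
[0x871f4bc]
[0xffffe420]
[0x8785674]
[0x879a4c6]
[0x879eeda]
"""

first, second = parse_backtraces(SAMPLE)
print(common_prefix(first, second))
# ['0x87074eb', '0x871f4bc', '0xffffe420', '0x8785674', '0x879a4c6']
```

The first five frames match and the sixth differs, which would be consistent with both errors being reported through the same code path from two different call sites. That is an inference from the addresses alone and would need a symbolised build to confirm.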


The first 10 lines of stdout.txt:
[2006-01-02 16:25:55] :: BOINC :: boinc_init()
command executed: rosetta_4.80_i686-pc-linux-gnu aa 2tif _ -increase_cycles 10 -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10
[STR  OPT]New value for [-paths] frags400.txt.
[T/F  OPT]Default FALSE value for [-version]
[T/F  OPT]Default FALSE value for [-score]
[T/F  OPT]Default FALSE value for [-abinitio]
[T/F  OPT]Default FALSE value for [-refine]
[T/F  OPT]Default FALSE value for [-assemble]
[T/F  OPT]Default FALSE value for [-idealize]
[T/F  OPT]Default FALSE value for [-relax]


And stdout.txt's last 10 lines:
smooth                         trials: 80000 accepts: 2354 %: 2.9425
-----------------------------------------------------
-----------------------------------------------------
CYCLES::number is  1 x total_residue: 59
initializing full atom coordinates
starting score  2173.93115 rms  4.0322547
starting full atom minimization
CYCLES::number is  1 x total_residue: 177
starting score -109.494316 rms  3.95362806
starting full atom simulated anealing


Looking into stdout.txt, it seems something is going wrong (to me, at least):
Here are lines 1848-1864
Looking for psipred file: ./2tif_.psipred_ss2
Protein type: alpha/beta  Fraction beta:   0.615384638
disabling sheet filter
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: ./2tif_.cst
Running without distance constraints
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
 Searched for: ./2tif_.dpl
 Dipolar constraints will not be used
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment file: ./aa2tif_03_05.400_v1_3.gz
Total Residue 59
frag size: 3    frags/residue: 400
fragment file: ./aa2tif_09_05.400_v1_3.gz


My computer is a 2.66 GHz Pentium 4 (no hyperthreading) that runs Linux kernel 2.6.12.

I hope this helps you guys resolve this issue!

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 8328 - Posted: 4 Jan 2006, 5:51:44 UTC - in response to Message 8300.  

I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.


After making the copy, I restarted BOINC and resumed the "stuck" result; it restarted at 0 CPU time and finished in under 2 hours. I do still have the backup with it in the "stuck at 1%" state.


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8329 - Posted: 4 Jan 2006, 6:37:18 UTC - in response to Message 8328.  

I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes. Have suspended, will backup BOINC folder so will have any/all relevant files for whoever wants them.


After making the copy, I restarted BOINC and resumed the "stuck" result; it restarted at 0 CPU time and finished in under 2 hours. I do still have the backup with it in the "stuck at 1%" state.



What are the last 20 lines of stdout.txt?
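
Editorial aside: on systems without a `tail` command, the last 20 lines can be pulled out with a few lines of Python. This is a minimal sketch; the function name is made up, and the location of stdout.txt inside the BOINC slot directory is left to the reader since it varies by install.

```python
from collections import deque


def tail_lines(path: str, count: int = 20):
    """Return the last count lines of a text file, without trailing newlines."""
    # deque with maxlen keeps only the most recent lines while streaming the file,
    # so a large stdout.txt is never held in memory all at once.
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


# Usage (path is an assumption; point it at the stuck work unit's stdout.txt):
#     for line in tail_lines("stdout.txt"):
#         print(line)
```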

(I wonder if the "sticking" point is always in the simulated annealing as in hyperfusion's case? )

Do people have an idea about what fraction of work units are getting stuck, and whether any proteins or run conditions are getting stuck more frequently than others?
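
Editorial aside: one rough way to approach the second question is to tally the stuck work units reported in this thread by protein and by run condition. The sketch below assumes the names follow the pattern run condition, four-character protein ID, batch number, then one or two trailing numbers. That pattern is inferred from the names posted here and is not documented in the thread. Five reports are far too few to draw any conclusion from.

```python
import re
from collections import Counter

# Assumed layout: <RUN_CONDITION>_<protein id>_<batch>_<n>[_<m>]
WU_NAME = re.compile(
    r"^(?P<condition>.+)_(?P<protein>[0-9][0-9a-z]{3})_(?P<batch>\d+)_\d+(?:_\d+)?$"
)

# Stuck work units named in this thread, spelled exactly as posted.
REPORTED = [
    "NO_BARCODE_FRAGS_1di2_227_3864_0",
    "Defaut_1b72_220_3516_0",
    "NEW_SOFT_CENTROID_PACKING_2reb_225_2962_0",
    "MORE_FRAGS_2reb_222_4379_0",
    "INCREASE_CYCLES_10_2tif_226_3820",
]

by_protein = Counter()
by_condition = Counter()
for name in REPORTED:
    match = WU_NAME.match(name)
    if match is None:
        continue  # name does not fit the assumed pattern
    by_protein[match.group("protein")] += 1
    by_condition[match.group("condition")] += 1

print(by_protein.most_common(1))  # [('2reb', 2)]
print(sorted(by_condition))
```

To answer the "what fraction" question the same tally would have to be run over all results, stuck or not, which only the project database can supply.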

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 8337 - Posted: 4 Jan 2006, 7:35:56 UTC - in response to Message 8329.  

I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes.


What are the last 20 lines of stdout.txt?


RSD_WT: 1.2
RSD_WT: 1.0961
[T/F OPT]New TRUE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
SS_WT: 0.764209032 0.950430512 1.02168155 1.42587173
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms
0 0 18.3866825
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
-5.50960827 -22.704874 11.2940054 5.82286787
standard trials: 2000 accepts: 600 %: 30
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -10.644 -10.644 -10.644 35 5.823 5.442 5.823


Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 8349 - Posted: 4 Jan 2006, 12:53:02 UTC - in response to Message 8337.  
Last modified: 4 Jan 2006, 12:54:04 UTC

I finally have one; MORE_FRAGS_2reb_222_4379_0 result 5588925, stuck at 1% after 6 hours 10 minutes.


What are the last 20 lines of stdout.txt?


RSD_WT: 1.2
RSD_WT: 1.0961
[T/F OPT]New TRUE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
SS_WT: 0.764209032 0.950430512 1.02168155 1.42587173
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms
0 0 18.3866825
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
-5.50960827 -22.704874 11.2940054 5.82286787
standard trials: 2000 accepts: 600 %: 30
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -10.644 -10.644 -10.644 35 5.823 5.442 5.823


COOL! Now if SETI can just find an alien to interpret this ;)
Sorry, couldn't resist the urge.



©2024 University of Washington
https://www.bakerlab.org