Report stuck work units here

Message boards : Number crunching : Report stuck work units here

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 6902 - Posted: 20 Dec 2005, 18:40:00 UTC - in response to Message 6897.  

Where would I look for the "slots" directories?


C:\Program Files\BOINC\slots\0 (or 1, 2, 3, 4...)

I'm sure someone else will have this again soon if you can't grab it.
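For anyone who wants to grab a stuck slot before aborting, here is a minimal Python sketch of the copy step. The function name `backup_slot` and the backup folder are hypothetical, and the slots path is assumed to be the default Windows install location; suspending the client first is probably wise so files are not changing mid-copy.

```python
import shutil
import time
from pathlib import Path


def backup_slot(slots_root, slot_number, dest_root):
    """Copy one BOINC slot directory to a timestamped backup folder.

    slots_root and dest_root are assumptions; adjust them to your install.
    Returns the path of the new backup directory.
    """
    source = Path(slots_root) / str(slot_number)
    if not source.is_dir():
        raise FileNotFoundError(f"no slot directory at {source}")
    # A timestamp in the name keeps repeated backups from colliding.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / f"slot{slot_number}-{stamp}"
    shutil.copytree(source, target)
    return target
```

On a default Windows install this might be called as `backup_slot(r"C:\Program Files\BOINC\slots", 1, r"C:\slot-backups")`, where the second path is only an example.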

ID: 6902 · Rating: 0
Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 6927 - Posted: 20 Dec 2005, 20:49:49 UTC
Last modified: 20 Dec 2005, 21:05:45 UTC

1hz6A_topology_sample_106743_0 is now at 1% after 14:25:00 hours (on a P4 HT 3.0 GHz)

stderr.txt
# =====================================
# random seed: 504801
# =====================================


stdout.txt
2005-12-20 07:00:08 :: BOINC :: boinc_init()
command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.80_windows_intelx86.exe aa 1hz6 A -abrelax_mode -relax_score_filter -filter1 -110 -filter2 -145 -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1hz6.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10
[STR OPT]Default value for [-paths] paths.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
--------------------------------------------
WARNING:: paths.txt file not found!!
Setting all paths to .
Using default fragment file names:
aa*****03_05.200_v1_3
aa*****03_05.200_v1_3
--------------------------------------------
[T/F OPT]Default FALSE value for [-version]
-
-

-
-
[T/F OPT]New TRUE value for [-jitter_frag]
[REAL OPT]Default value for [-jitter_amount] 2
[STR OPT]New value for [-jitter_variation] gauss.
score0 done: (best, low) rms
0 0 22.1686611
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
19.9913731 15.6340599 15.2607765 14.5612974
standard trials: 2000 accepts: 666 %: 33.3
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 29.008 29.008 29.008 17 14.561 10.744 14.561
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10


I will give it another few hours (but I will make a copy of slot 1) before I abort it.

[edit]

Too late... it just errored out after 14:37:32 hours (Maximum CPU time exceeded)
ID: 6927 · Rating: 0
Mark Rush
Joined: 6 Oct 05
Posts: 13
Credit: 52,229,983
RAC: 9,613
Message 7035 - Posted: 21 Dec 2005, 16:28:55 UTC

Bill:

Last night my WU hit an "unrecoverable error" and so was trashed. Sorry about not getting the slots directory copied.

Mark
ID: 7035 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7042 - Posted: 21 Dec 2005, 17:01:01 UTC

I think the current application, that is having the "short" WU errors at the moment, has the fix to the "stuck at 1%" problem in it...

Still, if anyone has this error now, whether from the same cause or a new one, I'm sure the staff would love whatever information anyone could get. Thanks everyone for what you've done to help out so far!

ID: 7042 · Rating: 0
Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 7060 - Posted: 21 Dec 2005, 18:18:25 UTC - in response to Message 7042.  

I think the current application, that is having the "short" WU errors at the moment, has the fix to the "stuck at 1%" problem in it...

Still, if anyone has this error now, whether from the same cause or a new one, I'm sure the staff would love whatever information anyone could get. Thanks everyone for what you've done to help out so far!


Yes, we would be especially interested in cases of stuck at 1% that occur with 4.81. I know that these might be hard to notice while wading through the various other problems we've been having.
ID: 7060 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7128 - Posted: 22 Dec 2005, 3:51:25 UTC
Last modified: 22 Dec 2005, 3:52:22 UTC

I have been getting WUs where the time clock just stops, and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days.
My points per day are half of what they normally are. I guess I should just shut down my network until you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 7128 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7137 - Posted: 22 Dec 2005, 6:27:32 UTC - in response to Message 7128.  

I have been getting WUs where the time clock just stops, and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days.
My points per day are half of what they normally are. I guess I should just shut down my network until you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project.


Several issues here. "Time clock just stops" is a new problem, if it's really a problem. Of course, with zero information from you on this, even though you have had it occur "over 100 times", it is hard to give any information. This is your FIRST posting on the issue. When the clock stops, is the status of the result by any chance "preempted"? And please, explain why a _reboot_ would be necessary? Are you sure that the problem isn't the OPERATING SYSTEM locking up, maybe because you're way overclocked, and not anything to do with Rosetta?

If the problem is NOT as you describe, if the problem is instead the one being discussed in this thread, then you have had over 100 examples of something the project is asking for help to solve, yet you have not given the project any assistance. Instead you prefer to complain about the WU _names_ (in another thread) and now blame the project for what sounds like a problem on your end, or a total misunderstanding of the way the system works.

In general, as much as I'm sure the project appreciates your (considerable) computer power, if you are only in this for the "points per day" and not to help the project, and expect the project to cater to your whims and jump to solve your problems, while you are unwilling to give the project any help in solving these problems, my _personal_ opinion is that you WOULD be happier with another project. Somewhere that the science would be less important, and you could get all the credits you want. I would suggest SETI.

If you are here to volunteer your CPU time to a worthy effort, and not just to earn credits, then you need to start asking questions instead of jumping to conclusions. We are all happy to help anyone with a problem.

ID: 7137 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7138 - Posted: 22 Dec 2005, 6:32:02 UTC

Does anyone know which posting is "stretching" this thread?

I see several that have long lines of text from stderr files, but none that I can say shouldn't have wrapped.

If we can identify the posting, I can copy it and repost it and delete the original.

If we can't identify it, I may create a new thread and start moving posts around until I can see which is the problem...

ID: 7138 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7143 - Posted: 22 Dec 2005, 7:38:55 UTC
Last modified: 22 Dec 2005, 7:45:17 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=680#6927

the long command line....maybe not....shrug...
ID: 7143 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7146 - Posted: 22 Dec 2005, 8:29:00 UTC - in response to Message 7143.  
Last modified: 22 Dec 2005, 8:33:18 UTC

It's message 6479 and a few others that have the long command
line wrapped in a <pre> element, which means the formatting
will be preserved.

Remove the <pre> and </pre> from those posts (or insert some
line breaks) and they should wrap.
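The "insert some line breaks" option could be sketched in Python with the standard `textwrap` module. This is only an illustration: the function name `wrap_command_line` and the default width of 80 are assumptions, not anything the forum software does.

```python
import textwrap


def wrap_command_line(command, width=80):
    """Split a long command line into lines no wider than `width`.

    Breaks only at whitespace, so flags such as -filter2 stay whole.
    """
    return textwrap.wrap(
        command,
        width=width,
        break_long_words=False,  # never cut a token in the middle
        break_on_hyphens=False,  # do not break right after the '-' of a flag
    )
```

Joining the returned lines with newlines gives text that is safe to paste inside a pre block without stretching the page, and joining them with single spaces gives back the original command.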



*** Join BOINC@Australia today ***
ID: 7146 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7151 - Posted: 22 Dec 2005, 9:06:40 UTC
Last modified: 22 Dec 2005, 9:14:38 UTC

The following information was originally entered by River~~ in Message 6477 - Posted 16 Dec 2005 22:20:26 UTC - Last modified: 16 Dec 2005 22:25:44 UTC. The original has been moved to thread 750. Below is the original information, with formatting changes ONLY.

BBCode is pretty limited - and apparently the 'pre' tag forces no-line-wrap.

Same box, result = 1n0u__topology_sample_128114_0

Aborted after 1700sec, fraction done = 0.01, no checkpoint, nothing written to stdout for over ten minutes.
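The "nothing written to stdout" check can be approximated by looking at the file's modification time. A rough Python sketch follows; the function names are hypothetical, the 600-second threshold just mirrors the "over ten minutes" above, and buffered output may mean the file on disk lags behind what the application has actually printed.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_quiet(path, quiet_seconds=600, now=None):
    """True if the file has not been written to for `quiet_seconds`.

    The threshold is an assumption, not a project-defined limit.
    """
    return seconds_since_last_write(path, now) >= quiet_seconds
```

Pointing `is_quiet` at the stdout.txt inside the slot directory gives a quick yes/no answer before deciding whether to abort.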

[edit begins]
Following this, I detached the box and attached again as this new host. The downloads worked fine, and the first result had checkpointed twice in the first 486 seconds.

Is it possible that after some problem with downloads, one or more files is missing or corrupt and this is not corrected? Or is this a complete red herring?
[edit ends]

stdout contained the same warnings about those 2 missing files, but I haven't posted them this time. I have kept stdout & stderr from this and previous aborted result in case you'd like them emailing to you.


$ cat stderr.txt
# =====================================
# random seed: 1217781
# =====================================


$ head stdout.txt

BOINC :: 2005-12-16 16:54:07 :: boinc_init()
command executed: rosetta_4.79_i686-pc-linux-gnu aa 1n0u _ -relax_score_filter -filter1 -100
-filter2 -140 -abrelax_mode -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega
-sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments
-barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file
1n0u.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10

[STR  OPT]Default value for [-paths] paths.txt.
--------------------------------------------
WARNING::  paths.txt file not found!!
           Setting all paths to ./
           Using default fragment file names:
             aa*****03_05.200_v1_3
             aa*****03_05.200_v1_3
--------------------------------------------


$ tail stdout.txt

[T/F  OPT]New TRUE value for [-relax_score_filter]
[T/F  OPT]New TRUE value for [-filter1]
[T/F  OPT]New TRUE value for [-filter1]
[REAL OPT]New value for [-filter1]  -100
[T/F  OPT]New TRUE value for [-filter2]
[REAL OPT]New value for [-filter2]  -140
CYCLES::number is  1 x total_residue: 207
starting score -114.071526 rms  9.88577175
starting full atom simulated anealing
pre-computing chuck/gunn move set for frag length 1




ID: 7151 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7152 - Posted: 22 Dec 2005, 9:11:38 UTC
Last modified: 22 Dec 2005, 9:15:29 UTC

Followup post, also from River~~, same wrap problem: Message 6479 - Posted 16 Dec 2005 22:57:44 UTC - Last modified: 16 Dec 2005 23:20:47 UTC

Same box, result = 1n0u__topology_sample_128114_0

Aborted after 1700sec, fraction done = 0.01, no checkpoint, nothing written to stdout for over ten minutes.

[edit begins]
Following this, I detached the box and attached again as this new host. The downloads worked fine, and the first result had checkpointed twice in the first 486 seconds.

Is it possible that after some problem with downloads, one or more files is missing or corrupt and this is not corrected? Or is this a complete red herring?
[edit ends]

stdout contained the same warnings about those 2 missing files, but I haven't posted them this time. I have kept stdout & stderr from this and previous aborted result in case you'd like them emailing to you.


$ cat stderr.txt
# =====================================
# random seed: 1217781
# =====================================


$ head stdout.txt

BOINC :: 2005-12-16 16:54:07 :: boinc_init()
command executed: rosetta_4.79_i686-pc-linux-gnu aa 1n0u _ -relax_score_filter -filter1 -100
-filter2 -140 -abrelax_mode -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega
-sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments
-barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file
1n0u.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10
[STR  OPT]Default value for [-paths] paths.txt.
--------------------------------------------
WARNING::  paths.txt file not found!!
           Setting all paths to ./
           Using default fragment file names:
             aa*****03_05.200_v1_3
             aa*****03_05.200_v1_3
--------------------------------------------


$ tail stdout.txt

[T/F  OPT]New TRUE value for [-relax_score_filter]
[T/F  OPT]New TRUE value for [-filter1]
[T/F  OPT]New TRUE value for [-filter1]
[REAL OPT]New value for [-filter1]  -100
[T/F  OPT]New TRUE value for [-filter2]
[REAL OPT]New value for [-filter2]  -140
CYCLES::number is  1 x total_residue: 207
starting score -114.071526 rms  9.88577175
starting full atom simulated anealing
pre-computing chuck/gunn move set for frag length 1




ID: 7152 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7181 - Posted: 22 Dec 2005, 15:08:19 UTC - in response to Message 7137.  

if you are only in this for the "points per day" ... I would suggest SETI.

CPDN is another good candidate based on my experienced CS/sec ... :)

Before the last batch of optimized clients the CS/sec was nearly double that of other projects ...

YMMV
ID: 7181 · Rating: 0
pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 7186 - Posted: 22 Dec 2005, 16:05:49 UTC

I think I have one of those 'stuck' WU's as well. I have 'suspended' rosetta for a bit, and took a full backup of the BOINC directory if you want it (or any part of it). Let me know if you want it aborted.

Rosetta Version 481 [workunit: 1hz6a_abrelaxmode_test_20349]
1% complete
CPU time: 6 hr 46 min 43 sec
stage: Ab Initio
Step: 2699
Accepted Rmsd: 14.14
Accepted energy: 29.42311

It's running on a P4 2 GHz machine, Win XP Home SP2, BM 5.2.13, sharing 50/50 with Einstein, and left in memory when swapped. Both the CPU time and time to completion increased every 5 secs or so. The 'step' hasn't changed since I noticed it was having a problem.
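The symptom described here (CPU time climbing while the step counter stays put) can be written down as a simple check. The Python sketch below assumes the readings are noted by hand or by some script as (cpu_seconds, step) pairs; the function name `looks_stuck` and the 600-second threshold are hypothetical, not project guidance.

```python
def looks_stuck(samples, min_cpu_seconds=600):
    """Return True if the step counter has not moved while CPU time has.

    samples: list of (cpu_seconds, step) tuples in chronological order.
    min_cpu_seconds: how much CPU time must pass on one step before the
    work unit is flagged (an arbitrary assumption).
    """
    if len(samples) < 2:
        return False
    last_cpu, last_step = samples[-1]
    # Walk backwards to find when the step first took its current value.
    stalled_since = last_cpu
    for cpu, step in reversed(samples):
        if step != last_step:
            break
        stalled_since = cpu
    return last_cpu - stalled_since >= min_cpu_seconds
```

For example, readings of step 2699 at 300, 600 and 1200 CPU seconds would be flagged, while a step counter that changes at every reading would not.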


ID: 7186 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7206 - Posted: 22 Dec 2005, 18:18:28 UTC - in response to Message 7137.  

I have been getting WUs where the time clock just stops, and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days.
My points per day are half of what they normally are. I guess I should just shut down my network until you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project.


Several issues here. "Time clock just stops" is a new problem, if it's really a problem. Of course, with zero information from you on this, even though you have had it occur "over 100 times", it is hard to give any information. This is your FIRST posting on the issue. When the clock stops, is the status of the result by any chance "preempted"? And please, explain why a _reboot_ would be necessary? Are you sure that the problem isn't the OPERATING SYSTEM locking up, maybe because you're way overclocked, and not anything to do with Rosetta?

If the problem is NOT as you describe, if the problem is instead the one being discussed in this thread, then you have had over 100 examples of something the project is asking for help to solve, yet you have not given the project any assistance. Instead you prefer to complain about the WU _names_ (in another thread) and now blame the project for what sounds like a problem on your end, or a total misunderstanding of the way the system works.

In general, as much as I'm sure the project appreciates your (considerable) computer power, if you are only in this for the "points per day" and not to help the project, and expect the project to cater to your whims and jump to solve your problems, while you are unwilling to give the project any help in solving these problems, my _personal_ opinion is that you WOULD be happier with another project. Somewhere that the science would be less important, and you could get all the credits you want. I would suggest SETI.

If you are here to volunteer your CPU time to a worthy effort, and not just to earn credits, then you need to start asking questions instead of jumping to conclusions. We are all happy to help anyone with a problem.

Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away.
If this is what you want, just say the word and I can pull the plug.

I do not overclock any of my nodes. The OS is Win ME, and the clock just stops. I am sorry, I am new to this project and do not know how to get you the info to get help. I am just a DUMB plumber, but that should be no reason to act in such a belittling way.

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 7206 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7209 - Posted: 22 Dec 2005, 18:45:57 UTC - in response to Message 7206.  
Last modified: 22 Dec 2005, 18:47:40 UTC

Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away.


First, realize that _I_ am not "project staff" - I'm a volunteer participant just like you are. That tag to the left under my name says "forum moderator", not "project" anything. However, I volunteer my time to help people who have a problem and ask for help on these boards.

I do not overclock any of my nodes. The OS is Win ME, and the clock just stops. I am sorry, I am new to this project and do not know how to get you the info to get help. I am just a DUMB plumber, but that should be no reason to act in such a belittling way.


Ok. We are now getting some information from you, namely that you're on ME and the clock just stops. Can you narrow down _when_ the clock just stops? Is it when projects are switched? Are you running multiple projects? Do you have the preference "leave applications in memory when preempted" set to "yes"? (If not, please do so.) Are you running the graphics when this happens? Running any other programs on the system? The more info you give us, the better.

I have no problem helping you if you ask for help. But if you come in with the "I'll just shut down if you don't solve your problem" attitude, then you're going to get attitude right back. There is NO reason to reboot a system because of ANY Rosetta problem. So the first step in solving this is to stop rebooting and instead, describe what is happening, copy/paste any messages from the Messages tab, and give us some information so we can begin to solve the problem.

ID: 7209 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7225 - Posted: 22 Dec 2005, 19:31:31 UTC - in response to Message 7209.  

Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away.


First, realize that _I_ am not "project staff" - I'm a volunteer participant just like you are. That tag to the left under my name says "forum moderator", not "project" anything. However, I volunteer my time to help people who have a problem and ask for help on these boards.

Well, maybe you should take a look at your style of help.
When people come here looking for help, or just expressing what they see as a problem,
they may not express themselves in a clear or to-the-point manner.
If this is a hard thing for you to handle, perhaps you should stop giving help.
I did not come here to get insulted or to be made a fool of by you, or to do damage to this project, just to express things that I am having a problem with.





If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 7225 · Rating: 1
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7226 - Posted: 22 Dec 2005, 19:39:32 UTC - in response to Message 7225.  
Last modified: 22 Dec 2005, 19:53:34 UTC

If this is a hard thing for you to handle, perhaps you should stop giving help.


I'll make a deal with you - I'll stop giving you help. There are plenty of others here that can do so if they choose.

EDIT:: I just double-checked something. I know I said I'd stop helping, but... Windows ME is not supported by Rosetta. Seems it doesn't report CPU times back to the application correctly.

ID: 7226 · Rating: -1
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7231 - Posted: 22 Dec 2005, 19:50:07 UTC - in response to Message 7151.  
Last modified: 22 Dec 2005, 19:52:32 UTC

The following information was originally entered by River~~ ...


yep, mea culpa!

The bbcode [ pre ] translates directly to the html < pre > which preserves formatting. It can be important to know where the line breaks occur in a file, so as we were asked for lines to be posted from a file I used pre.

It also stretches the page, so is less helpful if the thread turns into discussion rather than simply a place to 'upload' error files.

Thanks for fixing it.
ID: 7231 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 7235 - Posted: 22 Dec 2005, 19:56:17 UTC - in response to Message 7231.  

It can be important to know where the line breaks occur in a file, so as we were asked for lines to be posted from a file I used pre.


That's why I moved them rather than _just_ copying and re-pasting. :-)

(Well, that and I firmly believe that a moderator should moderate as little as possible...)

ID: 7235 · Rating: 0
©2024 University of Washington
https://www.bakerlab.org