Report stuck work units here

Author	Message
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 6902 - Posted: 20 Dec 2005, 18:40:00 UTC - in response to Message 6897. Where would I look for the "slots" directories? C:\Program Files\BOINC\slots\0 (or 1, 2, 3, 4...) I'm sure someone else will have this again soon if you can't grab it. ID: 6902 · Rating: 0 · rate: / Reply Quote
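
For anyone who would rather script the lookup than browse by hand, here is a minimal Python sketch. It assumes the BOINC directory is supplied by the caller; the Windows path in the example is the one quoted in the post and may well differ on your machine. `list_slot_dirs` is a hypothetical helper name, not part of BOINC.

```python
from pathlib import Path


def list_slot_dirs(boinc_dir):
    """Return the numbered slot directories under <boinc_dir>/slots, sorted by number."""
    slots_root = Path(boinc_dir) / "slots"
    if not slots_root.is_dir():
        # No slots folder here: wrong path, or BOINC keeps its data elsewhere.
        return []
    numbered = [p for p in slots_root.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(numbered, key=lambda p: int(p.name))


if __name__ == "__main__":
    # Assumed location, taken from the post above; adjust for your installation.
    for slot in list_slot_dirs(r"C:\Program Files\BOINC"):
        print(slot, sorted(f.name for f in slot.iterdir()))
```

Each running result normally has its own slot, so the listing should show which directory holds the stdout.txt and stderr.txt of the stuck work unit.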

Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0	Message 6927 - Posted: 20 Dec 2005, 20:49:49 UTC Last modified: 20 Dec 2005, 21:05:45 UTC 1hz6A_topology_sample_106743_0 is now at 1% after 14:25:00 hours (on a P4 HT 3.0 GHz) stderr.txt # ===================================== # random seed: 504801 # ===================================== stdout.txt 2005-12-20 07:00:08 :: BOINC :: boinc_init() command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.80_windows_intelx86.exe aa 1hz6 A -abrelax_mode -relax_score_filter -filter1 -110 -filter2 -145 -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1hz6.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 [STR OPT]Default value for [-paths] paths.txt. [T/F OPT]Default FALSE value for [-unix_paths] -------------------------------------------- WARNING:: paths.txt file not found!! Setting all paths to . Using default fragment file names: aa***03_05.200_v1_3 aa***03_05.200_v1_3 -------------------------------------------- [T/F OPT]Default FALSE value for [-version] - - - - [T/F OPT]New TRUE value for [-jitter_frag] [REAL OPT]Default value for [-jitter_amount] 2 [STR OPT]New value for [-jitter_variation] gauss. score0 done: (best, low) rms 0 0 22.1686611 --------------------------------------------------------- score1 done: (best, low) rms (best,low) 19.9913731 15.6340599 15.2607765 14.5612974 standard trials: 2000 accepts: 666 %: 33.3 ----------------------------------------------------- Alternate score2/score5... 
kk score2 score5 low_score n_low_accept rms rms_min low_rms 0 29.008 29.008 29.008 17 14.561 10.744 14.561 [REAL OPT]Default value for [-cpu_frac] 0.100000001 [REAL OPT]Default value for [-frame_rate] 10 [REAL OPT]Default value for [-cpu_frac] 0.100000001 [REAL OPT]Default value for [-frame_rate] 10 [REAL OPT]Default value for [-cpu_frac] 0.100000001 [REAL OPT]Default value for [-frame_rate] 10 I will give it another few hours (but I will make a copy of slot 1) before I abort it. [edit] Too late... it just errored out after 14:37:32 hours (Maximum cpu time exceeded) ID: 6927 · Rating: 0 · rate: / Reply Quote
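
Since the result errored out before the copy of slot 1 was made, a scripted snapshot taken as soon as a work unit looks suspicious might help. The following is only a sketch: `backup_slot` and the backup folder layout are made up for illustration, and nothing here talks to BOINC itself. Suspending the project before copying may give a more consistent snapshot.

```python
import shutil
import time
from pathlib import Path


def backup_slot(slot_dir, backup_root):
    """Copy slot_dir into backup_root under a timestamped name and return the new path."""
    slot_dir = Path(slot_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Hypothetical naming scheme: slot<number>-<timestamp>, e.g. slot1-20051220-210000.
    target = Path(backup_root) / f"slot{slot_dir.name}-{stamp}"
    # copytree creates the target (and missing parents) and fails if it already exists,
    # so an earlier snapshot is never overwritten.
    shutil.copytree(slot_dir, target)
    return target
```

Keeping the timestamp in the folder name lets several snapshots of the same slot sit side by side, which could be useful if staff ask for the state at more than one point in time.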

Mark Rush Send message Joined: 6 Oct 05 Posts: 13 Credit: 54,331,609 RAC: 0	Message 7035 - Posted: 21 Dec 2005, 16:28:55 UTC Bill: Last night my WU hit an "unrecoverable error" and so was trashed. Sorry about not getting the slots directory copied. Mark ID: 7035 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7042 - Posted: 21 Dec 2005, 17:01:01 UTC I think the current application, that is having the "short" WU errors at the moment, has the fix to the "stuck at 1%" problem in it... Still, if anyone has this error now, whether from the same cause or a new one, I'm sure the staff would love whatever information anyone could get. Thanks everyone for what you've done to help out so far! ID: 7042 · Rating: 0 · rate: / Reply Quote

Jack Schonbrun Send message Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0	Message 7060 - Posted: 21 Dec 2005, 18:18:25 UTC - in response to Message 7042. I think the current application, that is having the "short" WU errors at the moment, has the fix to the "stuck at 1%" problem in it... Still, if anyone has this error now, whether from the same cause or a new one, I'm sure the staff would love whatever information anyone could get. Thanks everyone for what you've done to help out so far! Yes, we would be especially interested in cases of stuck at 1% that occur with 4.81. I know that these might be hard to notice while wading through the various other problems we've been having. ID: 7060 · Rating: 0 · rate: / Reply Quote

Laurenu2 Send message Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 7128 - Posted: 22 Dec 2005, 3:51:25 UTC Last modified: 22 Dec 2005, 3:52:22 UTC I have been getting WUs where the time clock just stops and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days. My points per day are 1/2 of what they normally are. I guess I should just shut down my network till you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- ID: 7128 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7137 - Posted: 22 Dec 2005, 6:27:32 UTC - in response to Message 7128. I have been getting WUs where the time clock just stops and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days. My points per day are 1/2 of what they normally are. I guess I should just shut down my network till you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project. Several issues here. "Time clock just stops" is a new problem, if it's really a problem. Of course, with zero information from you on this, even though you have had it occur "over 100 times", it is hard to give any information. This is your FIRST posting on the issue. When the clock stops, is the status of the result by any chance "preempted"? And please, explain why a _reboot_ would be necessary? Are you sure that the problem isn't the OPERATING SYSTEM locking up, maybe because you're way overclocked, and not anything to do with Rosetta? If the problem is NOT as you describe, if the problem is instead the one being discussed in this thread, then you have had over 100 examples of something the project is asking for help to solve, yet you have not given the project any assistance. Instead you prefer to complain about the WU _names_ (in another thread) and now blame the project for what sounds like a problem on your end, or a total misunderstanding of the way the system works. In general, as much as I'm sure the project appreciates your (considerable) computer power, if you are only in this for the "points per day" and not to help the project, and expect the project to cater to your whims and jump to solve your problems, while you are unwilling to give the project any help in solving these problems, my _personal_ opinion is that you WOULD be happier with another project. 
Somewhere that the science would be less important, and you could get all the credits you want. I would suggest SETI. If you are here to volunteer your CPU time to a worthy effort, and not just to earn credits, then you need to start asking questions instead of jumping to conclusions. We are all happy to help anyone with a problem. ID: 7137 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7138 - Posted: 22 Dec 2005, 6:32:02 UTC Does anyone know which posting is "stretching" this thread? I see several that have long lines of text from stderr files, but none that I can say shouldn't have wrapped. If we can identify the posting, I can copy it and repost it and delete the original. If we can't identify it, I may create a new thread and start moving posts around until I can see which is the problem... ID: 7138 · Rating: 0 · rate: / Reply Quote

Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0	Message 7143 - Posted: 22 Dec 2005, 7:38:55 UTC Last modified: 22 Dec 2005, 7:45:17 UTC https://boinc.bakerlab.org/rosetta/forum_thread.php?id=680#6927 the long command line....maybe not....shrug... ID: 7143 · Rating: 0 · rate: / Reply Quote

Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0	Message 7146 - Posted: 22 Dec 2005, 8:29:00 UTC - in response to Message 7143. Last modified: 22 Dec 2005, 8:33:18 UTC It's message 6479 and a few others that have the long command line wrapped in a <pre> element, which means the formatting will be preserved. Remove the <pre> and </pre> from those posts (or insert some line breaks) and they should wrap. * Join BOINC@Australia today * ID: 7146 · Rating: 0 · rate: / Reply Quote
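
The "insert some line breaks" option can be done mechanically before pasting a log inside a pre block. Below is a small Python sketch using the standard `textwrap` module; `wrap_for_forum` and the 80-column default are arbitrary choices for illustration, not anything the forum software requires.

```python
import textwrap


def wrap_for_forum(text, width=80):
    """Wrap each line of text to at most `width` columns, breaking only at whitespace."""
    wrapped = []
    for line in text.splitlines():
        if not line.strip():
            # Keep blank lines so the original paragraph structure survives.
            wrapped.append("")
            continue
        wrapped.extend(
            textwrap.wrap(
                line,
                width=width,
                break_long_words=False,  # never split a flag or file name in half
                break_on_hyphens=False,  # keep names like i686-pc-linux-gnu intact
            )
        )
    return "\n".join(wrapped)
```

Because breaks are only placed at existing spaces, the command-line flags stay intact and can still be copied back out of the post; a single token longer than the width is left unbroken and would still stretch the page.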

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7151 - Posted: 22 Dec 2005, 9:06:40 UTC Last modified: 22 Dec 2005, 9:14:38 UTC The following information was originally entered by River~~ in Message 6477 - Posted 16 Dec 2005 22:20:26 UTC - Last modified: 16 Dec 2005 22:25:44 UTC. The original has been moved to thread 750. Below is the original information, with formatting changes ONLY. BBCode is pretty limited - and apparently the 'pre' tag forces no-line-wrap. Same box, result = 1n0u__topology_sample_128114_0 Aborted after 1700sec, fraction done = 0.01, no checkpoint, nothing written to stdout for over ten minutes. [edit begins] Following this I detached the box and attached again as this new host, the downloads worked fine, and the first result had checkpointed twice in the first 486 seconds. Is it possible that after some problem with downloads, one or more files is missing or corrupt and this is not corrected? Or is this a complete red herring? [edit ends] stdout contained the same warnings about those 2 missing files, but I haven't posted them this time. I have kept stdout & stderr from this and previous aborted result in case you'd like them emailing to you. $ cat stderr.txt # ===================================== # random seed: 1217781 # ===================================== $ head stdout.txt BOINC :: 2005-12-16 16:54:07 :: boinc_init() command executed: rosetta_4.79_i686-pc-linux-gnu aa 1n0u _ -relax_score_filter -filter1 -100 -filter2 -140 -abrelax_mode -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1n0u.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 [STR OPT]Default value for [-paths] paths.txt. -------------------------------------------- WARNING:: paths.txt file not found!! 
Setting all paths to ./ Using default fragment file names: aa***03_05.200_v1_3 aa***03_05.200_v1_3 -------------------------------------------- $ tail stdout.txt [T/F OPT]New TRUE value for [-relax_score_filter] [T/F OPT]New TRUE value for [-filter1] [T/F OPT]New TRUE value for [-filter1] [REAL OPT]New value for [-filter1] -100 [T/F OPT]New TRUE value for [-filter2] [REAL OPT]New value for [-filter2] -140 CYCLES::number is 1 x total_residue: 207 starting score -114.071526 rms 9.88577175 starting full atom simulated anealing pre-computing chuck/gunn move set for frag length 1 ID: 7151 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7152 - Posted: 22 Dec 2005, 9:11:38 UTC Last modified: 22 Dec 2005, 9:15:29 UTC Followup post, also from River~~, same wrap problem: Message 6479 - Posted 16 Dec 2005 22:57:44 UTC - Last modified: 16 Dec 2005 23:20:47 UTC Same box, result = 1n0u__topology_sample_128114_0 Aborted after 1700sec, fraction done = 0.01, no checkpoint, nothing written to stdout for over ten minutes. [edit begins] Following this I detached the box and attached again as this new host, the downloads worked fine, and the first result had checkpointed twice in the first 486 seconds. Is it possible that after some problem with downloads, one or more files is missing or corrupt and this is not corrected? Or is this a complete red herring? [edit ends] stdout contained the same warnings about those 2 missing files, but I haven't posted them this time. I have kept stdout & stderr from this and previous aborted result in case you'd like them emailing to you. $ cat stderr.txt # ===================================== # random seed: 1217781 # ===================================== $ head stdout.txt BOINC :: 2005-12-16 16:54:07 :: boinc_init() command executed: rosetta_4.79_i686-pc-linux-gnu aa 1n0u _ -relax_score_filter -filter1 -100 -filter2 -140 -abrelax_mode -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1n0u.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 [STR OPT]Default value for [-paths] paths.txt. -------------------------------------------- WARNING:: paths.txt file not found!! 
Setting all paths to ./ Using default fragment file names: aa***03_05.200_v1_3 aa***03_05.200_v1_3 -------------------------------------------- $ tail stdout.txt [T/F OPT]New TRUE value for [-relax_score_filter] [T/F OPT]New TRUE value for [-filter1] [T/F OPT]New TRUE value for [-filter1] [REAL OPT]New value for [-filter1] -100 [T/F OPT]New TRUE value for [-filter2] [REAL OPT]New value for [-filter2] -140 CYCLES::number is 1 x total_residue: 207 starting score -114.071526 rms 9.88577175 starting full atom simulated anealing pre-computing chuck/gunn move set for frag length 1 ID: 7152 · Rating: 0 · rate: / Reply Quote

Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 7181 - Posted: 22 Dec 2005, 15:08:19 UTC - in response to Message 7137. if you are only in this for the "points per day" ... I would suggest SETI. CPDN is another good candidate based on the CS/sec I have experienced ... :) Before the last batch of optimized clients the CS/sec was nearly double that of other projects ... YMMV ID: 7181 · Rating: 0 · rate: / Reply Quote

pieface Send message Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0	Message 7186 - Posted: 22 Dec 2005, 16:05:49 UTC I think I have one of those 'stuck' WUs as well. I have 'suspended' Rosetta for a bit, and took a full backup of the BOINC directory if you want it (or any part of it). Let me know if you want it aborted. Rosetta Version 481 [workunit: 1hz6a_abrelaxmode_test_20349] 1% complete CPU time: 6 hr 46 min 43 sec stage: Ab Initio Step: 2699 Accepted Rmsd: 14.14 Accepted energy: 29.42311 It's running on a P4 2 GHz machine, Win XP Home SP2, BM 5.2.13, sharing 50/50 with Einstein, and left in memory when swapped. Both the CPU time and time to completion increased every 5 secs or so. The 'step' hasn't changed since I noticed it was having a problem. ID: 7186 · Rating: 0 · rate: / Reply Quote
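
The symptom described here (CPU time keeps climbing while the step counter stays put) can be written down as a simple check. This is only a sketch: the readings are assumed to be noted by hand from the graphics window, `looks_stuck` is a made-up name, and the one-hour threshold is an arbitrary assumption, not a project guideline.

```python
def looks_stuck(samples, min_cpu_seconds=3600):
    """Decide whether a work unit looks stuck.

    samples: list of (cpu_seconds, step) readings in the order they were taken.
    Returns True when the latest step value has stayed the same while the CPU
    time advanced by at least min_cpu_seconds.
    """
    if len(samples) < 2:
        return False
    last_cpu, last_step = samples[-1]
    # Walk backwards to the earliest consecutive reading that shows the same step.
    first_cpu_same_step = last_cpu
    for cpu, step in reversed(samples):
        if step != last_step:
            break
        first_cpu_same_step = cpu
    return last_cpu - first_cpu_same_step >= min_cpu_seconds


# Illustrative readings only; the last one matches the post (6 hr 46 min 43 sec, step 2699).
readings = [(0, 100), (1800, 2699), (9000, 2699), (24403, 2699)]
print(looks_stuck(readings))
```

A long-running step is not proof of a bug, so a positive result is best treated as a cue to copy the slot directory and report it rather than to abort straight away.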

Laurenu2 Send message Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 7206 - Posted: 22 Dec 2005, 18:18:28 UTC - in response to Message 7137. I have been getting WUs where the time clock just stops and it takes a reboot to get it going again. I have had to reboot over 100 systems in the past 2 days. My points per day are 1/2 of what they normally are. I guess I should just shut down my network till you can solve this problem, as I have little time to babysit your client with the holidays here, or just change to another project. Several issues here. "Time clock just stops" is a new problem, if it's really a problem. Of course, with zero information from you on this, even though you have had it occur "over 100 times", it is hard to give any information. This is your FIRST posting on the issue. When the clock stops, is the status of the result by any chance "preempted"? And please, explain why a _reboot_ would be necessary? Are you sure that the problem isn't the OPERATING SYSTEM locking up, maybe because you're way overclocked, and not anything to do with Rosetta? If the problem is NOT as you describe, if the problem is instead the one being discussed in this thread, then you have had over 100 examples of something the project is asking for help to solve, yet you have not given the project any assistance. Instead you prefer to complain about the WU _names_ (in another thread) and now blame the project for what sounds like a problem on your end, or a total misunderstanding of the way the system works. In general, as much as I'm sure the project appreciates your (considerable) computer power, if you are only in this for the "points per day" and not to help the project, and expect the project to cater to your whims and jump to solve your problems, while you are unwilling to give the project any help in solving these problems, my _personal_ opinion is that you WOULD be happier with another project. 
Somewhere that the science would be less important, and you could get all the credits you want. I would suggest SETI. If you are here to volunteer your CPU time to a worthy effort, and not just to earn credits, then you need to start asking questions instead of jumping to conclusions. We are all happy to help anyone with a problem. Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away. If this is what you want, just say the word and I can pull the plug. I do not overclock any of my nodes. The OS is Win ME; the clock just stops. I am sorry, I am new to this project and do not know how to get you the info to get help. I am just a DUMB Plumber, but that should be no reason to act in such a belittling way. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- ID: 7206 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7209 - Posted: 22 Dec 2005, 18:45:57 UTC - in response to Message 7206. Last modified: 22 Dec 2005, 18:47:40 UTC Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away. First, realize that _I_ am not "project staff" - I'm a volunteer participant just like you are. That tag to the left under my name says "forum moderator", not "project" anything. However, I volunteer my time to help people who have a problem and ask for help on these boards. I do not overclock any of my nodes. The OS is Win ME; the clock just stops. I am sorry, I am new to this project and do not know how to get you the info to get help. I am just a DUMB Plumber, but that should be no reason to act in such a belittling way. Ok. We are now getting some information from you, namely that you're on ME and the clock just stops. Can you narrow down _when_ the clock just stops? Is it when projects are switched? Are you running multiple projects? Do you have the preference "leave applications in memory when preempted" set to "yes"? (If not, please do so.) Are you running the graphics when this happens? Running any other programs on the system? The more info you give us, the better. I have no problem helping you if you ask for help. But if you come in with the "I'll just shut down if you don't solve your problem" attitude, then you're going to get attitude right back. There is NO reason to reboot a system because of ANY Rosetta problem. So the first step in solving this is to stop rebooting and instead, describe what is happening, copy/paste any messages from the Messages tab, and give us some information so we can begin to solve the problem. ID: 7209 · Rating: 0 · rate: / Reply Quote

Laurenu2 Send message Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 7225 - Posted: 22 Dec 2005, 19:31:31 UTC - in response to Message 7209. Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away. First, realize that _I_ am not "project staff" - I'm a volunteer participant just like you are. That tag to the left under my name says "forum moderator", not "project" anything. However, I volunteer my time to help people who have a problem and ask for help on these boards. Well, maybe you should take a look at your style of help. When people come here looking for help, or just expressing what they see as a problem, they may not express themselves in a clear or to-the-point manner. If this is a hard thing for you to handle, perhaps you should stop giving help. I did not come here to get insulted or to be made a fool of by you, or to do damage to this project, just to express things that I am having a problem with. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- ID: 7225 · Rating: 1 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7226 - Posted: 22 Dec 2005, 19:39:32 UTC - in response to Message 7225. Last modified: 22 Dec 2005, 19:53:34 UTC if this is a hard thing for you to handle perhaps you should stop giving help I'll make a deal with you - I'll stop giving you help. There are plenty of others here that can do so if they choose. EDIT:: I just double-checked something. I know I said I'd stop helping, but... Windows ME is not supported by Rosetta. Seems it doesn't report CPU times back to the application correctly. ID: 7226 · Rating: -1 · rate: / Reply Quote

River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7231 - Posted: 22 Dec 2005, 19:50:07 UTC - in response to Message 7151. Last modified: 22 Dec 2005, 19:52:32 UTC The following information was originally entered by River~~ ... yep, mea culpa! The bbcode [ pre ] translates directly to the html < pre > which preserves formatting. It can be important to know where the line breaks occur in a file, so as we were asked for lines to be posted from a file I used pre. It also stretches the page, so is less helpful if the thread turns into discussion rather than simply a place to 'upload' error files. Thanks for fixing it. ID: 7231 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7235 - Posted: 22 Dec 2005, 19:56:17 UTC - in response to Message 7231. It can be important to know where the line breaks occur in a file, so as we were asked for lines to be posted from a file I used pre. That's why I moved them rather than _just_ copying and re-pasting. :-) (Well, that and I firmly believe that a moderator should moderate as little as possible...) ID: 7235 · Rating: 0 · rate: / Reply Quote