Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14829 - Posted: 28 Apr 2006, 7:08:25 UTC - in response to Message 14825.  
Last modified: 28 Apr 2006, 7:14:27 UTC

Hiya:

If you see a workunit running for more than four times your preferred CPU run time (the default has been 3 hours, so more than 12 hours), I'd abort the job. We had some reports of old WUs getting stuck on some machines. We've added a feature to the newest application, Rosetta@home 5.06 (a "watchdog" timer), that should automatically abort a job that has been running too long. So hopefully this will be the last time you'll need to manually abort jobs that seem to be going on forever. Also, please note that we will grant credit for your aborted jobs even if they are reported as errors, about a week after you abort them.
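
For anyone who would rather script the check than eyeball it, here is a minimal sketch of the rule of thumb above. The function name and parameters are made up for illustration; they are not part of BOINC or Rosetta@home, and this is not how the watchdog itself is implemented.

```python
def should_abort(cpu_hours: float, preferred_hours: float = 3.0, factor: float = 4.0) -> bool:
    """Return True if a workunit has run longer than factor x the preferred CPU run time.

    preferred_hours=3.0 mirrors the default mentioned above (an assumption
    if you have changed your own preference); factor=4.0 is the rule of thumb.
    """
    if cpu_hours < 0 or preferred_hours <= 0 or factor <= 0:
        raise ValueError("times and factor must be positive")
    # Strictly greater than: a job at exactly 12 hours is not yet over the limit.
    return cpu_hours > factor * preferred_hours


if __name__ == "__main__":
    # Run times taken from the reports in this thread, checked against the 3-hour default.
    for hours in (10.0, 30.0, 100.0):
        print(f"{hours:6.1f} h -> abort? {should_abort(hours)}")
```

If you have set a different target run time, pass it as `preferred_hours` so the threshold scales with it.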

Hello,

I have several units that are into the very high numbers for computing.

NO_TERM__STRAND_1ogw_423_2138_1 (v5.01)
NO_TERM__STRAND_1ogw_423_6238_1 (v5.01)

Both have run for approx 100 hours on a dual PII 233. I know that they are still processing, as the Graphics option shows the step counter increasing. How many steps are in the work units?

I have another unit: HB_BARCODE_30_1bm8__351_25694_2 (v 5.01) that is at over 30 hours on a P4 3GHz with 2 GB RAM. It is possible that the hours of computation for this unit got fubarred by a system reboot, but it should not be that bad.

Any ideas what is going on?



ID: 14829 · Rating: 0
yoner

Joined: 17 Sep 05
Posts: 10
Credit: 2,581,874
RAC: 0
Message 14899 - Posted: 28 Apr 2006, 18:04:57 UTC

Thanks,

As a side note, I found out exactly what was happening with the unit that was running on the P4. The unit completes model 1 and then starts over from step 1 again. I happened to catch it as it was doing that.

The other two units are still counting upwards on the Dual PII though, going to see what happens.


ID: 14899 · Rating: 0
Walter Roberson

Joined: 5 Dec 05
Posts: 2
Credit: 13,937
RAC: 0
Message 14959 - Posted: 29 Apr 2006, 3:55:12 UTC - in response to Message 14706.  

I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 MB,
running under BOINC.

This was the first WU issued to me after the recent Rosetta upgrade. Now that
I have aborted it, I will run another unit and see if the same problem occurs.


Clarification: the "recent Rosetta upgrade" I referred to was about April 12th,
one of the 4.x improvements.

When I allowed new work, 5.x was downloaded, and so far the WUs have been
progressing fine with that.
ID: 14959 · Rating: 0
Bespin Reactor Shaft

Joined: 29 Nov 05
Posts: 1
Credit: 100,592
RAC: 0
Message 14973 - Posted: 29 Apr 2006, 9:16:33 UTC - in response to Message 8741.  

OK. Here's one:

rosetta 5.01
FACONTACTS_RECENTER_NOFILTERS_1b3aA_448_266_2
CPU time: 35:52:47
Progress: 1.15%
To completion: 38:23:33
Deadline: 6 May 2006


ID: 14973 · Rating: 0
Winkle

Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 19042 - Posted: 21 Jun 2006, 8:02:03 UTC

I have t307__CASP7_ABRELAX_SAVE_ALL_OUT_BARCODE_hom001__714_20997_0 using Rosetta version 5.22 and it has been running now for 24 hrs. It has been stuck on 100% for at least the last hour I have been watching it. Mem usage of Rosetta was 88M and is now 94M after 30 mins. Now 97M and climbing.
CPU usage doesn't change when I suspend the task from the BOINC manager.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=20861564

The show graphics screen says...
68.601% complete
CPU time: 24 hr 0 min
Stage: Ab initio + relax
Model 116 step 0
Accepted Energy 44.55485

Nothing is changing on the screen. The protein looks like a single zig-zag line.

Target CPU time is set to 8 hrs.

The machine became unworkable, but is back to normal after the abort.
ID: 19042 · Rating: 0
Rich

Joined: 30 Nov 05
Posts: 5
Credit: 594,384
RAC: 0
Message 19315 - Posted: 26 Jun 2006, 12:08:02 UTC

Good morning. I have just submitted 2:
FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec00_1.pdb_771_81 and FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec23_4.pdb_771_80. Both originally were in the 33hr range, one at 1.65% and one around 1.07%. I also noticed that my stats were not updating, so I rebooted. After an hour or so they got stuck again, this time at 19.15% and 18.51% respectively. I did another reboot, they regressed to 18.50% and 17.90% and stayed there for more than an hour.

I have to run to work now but hope that this information might be useful.

Take care and have a good day.

Rich
Rich Seyfert
Eatontown, NJ
SeyfertR@att.net
ID: 19315 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 19321 - Posted: 26 Jun 2006, 14:40:10 UTC

This thread was originally started in January, and has been retired. Please use the appropriate link in this post for reporting errors, including stuck or aborted Work units.
Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 19321 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org