Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hiya: If you see a workunit running for more than four times your preferred CPU run time (the default has been 3 hours, so more than 12 hours), I'd abort the job. We had some reports of old WUs getting stuck on some machines. The newest application, Rosetta@home 5.06, includes a "watchdog" timer that should automatically abort a job that has been running too long. So hopefully this will be the last time you need to manually abort jobs that seem to go on forever. Also, please note that we will grant credit for your aborted jobs even if they are reported as errors, about a week after you abort them. |
yoner Joined: 17 Sep 05 Posts: 10 Credit: 2,581,874 RAC: 0 |
Thanks. As a side note, I found out exactly what was happening with the unit that was running on the P4: the unit completes model 1 and then starts over from step 1 again. I happened to catch it as it was doing that. The other two units are still counting upwards on the dual PII, though, so I'm going to see what happens. |
Walter Roberson Joined: 5 Dec 05 Posts: 2 Credit: 13,937 RAC: 0 |
I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 MB, Clarification: the "recent Rosetta upgrade" I referred to was about April 12th, one of the 4.x improvements. When I allowed new work, 5.x was downloaded, and so far the WUs have been progressing fine with that. |
Bespin Reactor Shaft Joined: 29 Nov 05 Posts: 1 Credit: 100,592 RAC: 0 |
OK. Here's one: rosetta 5.01, FACONTACTS_RECENTER_NOFILTERS_1b3aA_448_266_2. CPU time: 35:52:47. Progress: 1.15%. To completion: 38:23:33. Deadline: 6 May 2006. |
Winkle Joined: 22 May 06 Posts: 88 Credit: 1,354,930 RAC: 0 |
I have t307__CASP7_ABRELAX_SAVE_ALL_OUT_BARCODE_hom001__714_20997_0 using Rosetta version 5.22, and it has now been running for 24 hrs. It has been stuck on 100% for at least the last hour I have been watching it. Memory usage of Rosetta was 88M and is now 94M after 30 mins. Now 97M and climbing. CPU usage doesn't change when I suspend the task from the BOINC manager. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=20861564 The show graphics screen says: 68.601% complete; CPU time: 24 hr 0 min; Stage: Ab initio + relax; Model 116, step 0; Accepted Energy 44.55485. Nothing is changing on the screen. The protein looks like a single zig-zag line. Target CPU time is set to 8 hrs. The machine became unworkable, but is back to normal after the abort. |
Rich Joined: 30 Nov 05 Posts: 5 Credit: 594,384 RAC: 0 |
Good morning. I have just submitted 2: FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec00_1.pdb_771_81 and FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec23_4.pdb_771_80. Both were originally in the 33hr range, one at 1.65% and one around 1.07%. I also noticed that my stats were not updating, so I rebooted. After an hour or so they got stuck again, this time at 19.15% and 18.51% respectively. I did another reboot; they regressed to 18.50% and 17.90% and stayed there for more than an hour. I have to run to work now but hope that this information might be useful. Take care and have a good day. Rich Rich Seyfert Eatontown, NJ SeyfertR@att.net |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
|
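
For anyone who wants to apply the rule of thumb from the first post without doing the arithmetic by hand, here is a minimal sketch. It assumes only what the post states (abort when CPU time exceeds four times the preferred run time, default 3 hours). The function names `parse_hms` and `looks_stuck` are made up for illustration and are not part of BOINC or Rosetta@home, and this is not how the built-in watchdog is implemented.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an 'H:MM:SS' CPU-time string such as '35:52:47'."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def looks_stuck(cpu_time: timedelta,
                preferred_hours: float = 3.0,
                factor: float = 4.0) -> bool:
    """Return True if cpu_time exceeds factor * preferred run time.

    Defaults follow the thread: 3-hour preferred run time, 4x threshold.
    Both parameters are hypothetical knobs for this sketch only.
    """
    return cpu_time > timedelta(hours=preferred_hours * factor)


if __name__ == "__main__":
    # CPU time from the rosetta 5.01 report in this thread, default 3 h target.
    print(looks_stuck(parse_hms("35:52:47")))
    # 24 h of CPU time against an 8 h target (threshold would be 32 h).
    print(looks_stuck(timedelta(hours=24), preferred_hours=8))
```

By this rule alone, a task at 24 hours against an 8-hour target has not yet crossed the 32-hour threshold, so other symptoms reported above (progress not changing, memory climbing) are still worth a manual check.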
©2024 University of Washington
https://www.bakerlab.org