Message boards : Number crunching : Report stuck work units here
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
In my view you won't get a reliable answer to this, David, as the variation between machines is enormous - probably luck of the draw rather than anything systematic. I had 17 machines running Rosetta over the break. I could pick out spurious patterns on one and refute them on another. Four boxes got 90% downtime for a couple of days, one box got off unscathed, and the others varied in between. If I'd had just one of those boxes, the story I'd give you would depend on which box it was. I'd suggest getting an SQL wizard to coax the frequencies of errors etc. and run-times out of the returned work pile. Even better, because it is less dependent on CPU speed: frequencies of errors against credit claims for the various flavours of WU. It won't have so much detail, but at least it won't depend so much on the vagaries of which WU went to which observant/too-busy users. River~~ |
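A rough sketch of the tally River~~ describes, written in Python rather than SQL because the project's database schema is not shown in this thread. Everything here is an assumption for illustration: the record layout `(result_name, had_error, claimed_credit)` is hypothetical, and the rule that the "flavour" is whatever precedes the four-character protein code (such as `1hz6` or `2reb`) is inferred only from the result names quoted in this thread.

```python
from collections import defaultdict


def flavour_of(name):
    """Guess the WU flavour: the part of the name before the protein code.

    Assumption: the protein code is a 4-character field that starts with a
    digit and contains at least one letter (e.g. 1hz6, 2reb).
    """
    fields = name.split("_")
    for index, field in enumerate(fields):
        looks_like_protein_code = (
            len(field) == 4
            and field[0].isdigit()
            and any(ch.isalpha() for ch in field)
        )
        if looks_like_protein_code and index > 0:
            return "_".join(fields[:index])
    return name  # no recognisable protein code: treat the whole name as the flavour


def error_rates(results):
    """Summarise errors per flavour.

    `results` is an iterable of (result_name, had_error, claimed_credit)
    tuples; this layout is hypothetical, not the project's real schema.
    """
    totals = defaultdict(lambda: {"results": 0, "errors": 0, "credit": 0.0})
    for name, had_error, claimed_credit in results:
        row = totals[flavour_of(name)]
        row["results"] += 1
        row["errors"] += 1 if had_error else 0
        row["credit"] += claimed_credit

    report = {}
    for flavour, row in totals.items():
        report[flavour] = {
            "error_fraction": row["errors"] / row["results"],
            # Errors per unit of claimed credit; None when nothing was claimed.
            "errors_per_credit": (
                row["errors"] / row["credit"] if row["credit"] else None
            ),
        }
    return report
```

Dividing by claimed credit instead of run-time is what makes the figure less sensitive to CPU speed, as the post suggests; the numbers fed in would have to come from the project's own returned-work records.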
[DPC]FOKschaap~Jumparound Send message Joined: 17 Dec 05 Posts: 2 Credit: 60,626 RAC: 0 |
Please check this WU and look at the points wasted :( (and yes, the last 95 points were mine :() https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3761771 |
buffylove Send message Joined: 2 Nov 05 Posts: 2 Credit: 41,715 RAC: 0 |
Do you mean one like this? |
buffylove Send message Joined: 2 Nov 05 Posts: 2 Credit: 41,715 RAC: 0 |
Along with that, the computation errors, and bandwidth required for data downloads, it's not looking so good. |
Cureseekers~Nightanimal Send message Joined: 20 Nov 05 Posts: 19 Credit: 26,396 RAC: 0 |
Job NO_SIM_ANNEAL_1hz6_228_5495_0 got stuck here: 12 hrs at 2% on a PIII 733 MHz. The signature is away at the moment, just leave a message after the beep |
Ian_D Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
Job MORE_FRAGS_W_BARCODE_2reb_229_5862 stuck for 9hrs at 1% MORE_FRAGS_W_BARCODE_2reb_229_5862 XP2100+ o/c 13x166 (Lister) stable prime95 / memtest86+ et al |
O&O Send message Joined: 11 Dec 05 Posts: 25 Credit: 66,900 RAC: 0 |
Hi,... MORE_FRAGS_W_BARCODE_2reb_229_6182_0 (Result ID 5827255) has been running for 5:30 of CPU time, with 1% progress and more than 12 hours to completion. Should I abort? Will it be credited? O&O |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
These "stuck at 1%" results _seem_ to restart and complete successfully if you exit BOINC and relaunch it. The project has not said that credit will be granted if you abort them, so I can't advise that - it would help them locate the problem, however, if you could get the last 20 lines from the stdout.txt file in the slots directory and paste them here before you relaunch BOINC. |
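For anyone unsure how to grab those last 20 lines, here is a minimal sketch in Python. The path `slots/0/stdout.txt` is only a placeholder assumption: the slot number and the location of the BOINC data directory differ from machine to machine, so point it at the slot that is running the stuck result.

```python
from collections import deque
from pathlib import Path


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file as a list of strings."""
    # A deque with maxlen keeps only the most recent `count` lines in memory.
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    # Hypothetical location: adjust to your own BOINC directory and slot number.
    stdout_path = Path("slots") / "0" / "stdout.txt"
    if stdout_path.exists():
        print("\n".join(tail_lines(stdout_path)))
    else:
        print(f"{stdout_path} not found; set stdout_path to the stuck result's slot")
```

Run it before exiting BOINC, then paste the printed lines into a reply.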
O&O Send message Joined: 11 Dec 05 Posts: 25 Credit: 66,900 RAC: 0 |
These "stuck at 1%" results _seem_ to restart and complete successfully if you exit BOINC and relaunch it. The project has not said that credit will be granted if you abort them, so I can't advise that - it would help them locate the problem, however, if you could get the last 20 lines from the stdout.txt file in the slots directory and paste them here before you relaunch BOINC. Thanks Bill. I did abort the result before having the chance to read your advice... which resulted in... 07/01/2006 17:14:42|rosetta@home|Unrecoverable error for result MORE_FRAGS_W_BARCODE_2reb_229_6182_0 (aborted via GUI RPC) 07/01/2006 17:14:43|rosetta@home|Computation for result MORE_FRAGS_W_BARCODE_2reb_229_6182_0 finished. If this result had not got stuck, those 5 hours or so of CPU crunching time could have gone to reporting 2 good results for Rosetta or PrimeGrid, or been added to a Climateprediction unit... instead they turned out to be a waste of normal BOINC operation. The project did not say that by crunching its results you may experience lower average credits/day either, did it? Until "they" locate the problems causing such "abnormalities", I'll be allowing a Rosetta result 30 minutes to progress beyond 1%... otherwise I'll abort it. O&O (Systems management 101: Kill the monster while it is young) |
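The 30-minute rule of thumb above can be written down as a small check. This is only a sketch of the rule as stated in the post: the function name and parameters are made up for illustration, the progress and CPU time would have to be read off the BOINC manager by hand, and it only flags a result as suspect. Whether to abort it or to exit and relaunch BOINC, as suggested earlier in the thread, is left to the user.

```python
def looks_stuck(fraction_done, cpu_seconds, max_fraction=0.01, grace_seconds=30 * 60):
    """Flag a result that is still at or below `max_fraction` progress
    after at least `grace_seconds` of CPU time.

    Defaults mirror the rule in the post: 1% progress, 30 minutes.
    """
    return fraction_done <= max_fraction and cpu_seconds >= grace_seconds


if __name__ == "__main__":
    # 5 h 30 min of CPU time at 1%, as reported for one result in this thread.
    print(looks_stuck(0.01, 5 * 3600 + 30 * 60))
    # Ten minutes in at 1% is still within the grace period.
    print(looks_stuck(0.01, 10 * 60))
```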
ExtraTerrestrial Apes Send message Joined: 3 Jan 06 Posts: 3 Credit: 6,087,435 RAC: 2,473 |
Hi, I didn't read much of this thread, so I'm just assuming you still need information. I had this WU which stayed at 1% for 2 hours. It was still in the ab initio phase, which I've never seen before. I restarted BOINC and it completed normally. MrS Scanning for our furry friends since Jan 2002 |
Yeti Send message Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
Wao, I got one :-) WU-Name: INCREASE_CYCLES_10_1hz6_226_6922_0 The last 20 lines of stdout: Size: 3 NUMBER OF FRAGS FOR POS: 53 200 Size: 3 NUMBER OF FRAGS FOR POS: 54 200 Size: 3 NUMBER OF FRAGS FOR POS: 55 200 Size: 3 NUMBER OF FRAGS FOR POS: 56 200 Size: 3 NUMBER OF FRAGS FOR POS: 57 200 Size: 3 NUMBER OF FRAGS FOR POS: 58 200 Size: 3 NUMBER OF FRAGS FOR POS: 59 200 [T/F OPT]New TRUE value for [-jitter_frag] [REAL OPT]Default value for [-jitter_amount] 2 [STR OPT]New value for [-jitter_variation] gauss. score0 done: (best, low) rms 2 0 10.9471054 --------------------------------------------------------- score1 done: (best, low) rms (best,low) -4.41134071 -17.6313858 8.64116001 4.5229826 standard trials: 20000 accepts: 585 %: 2.925 ----------------------------------------------------- Alternate score2/score5... kk score2 score5 low_score n_low_accept rms rms_min low_rms 0 -27.503 -12.841 -27.503 28 4.523 3.605 4.523 I have saved the whole boinc-directory, so, if you are interested, I can zip it for you and put it on one of my servers so that you can download it from me The WU should normally take 6 hours on this machine; until now it has taken 6 hours and says 1% Supporting BOINC, a great concept ! |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Wao, I got one :-) thanks. I'll try to reproduce this locally. |
O&O Send message Joined: 11 Dec 05 Posts: 25 Credit: 66,900 RAC: 0 |
David,... I have two INCREASE_CYCLES_10_1xxx_226_xxxx_0 (Same Batch I presume) in a "ready to run" status, should I abort? Thanks, O&O |
©2024 University of Washington
https://www.bakerlab.org