Message boards : Number crunching : Report stuck work units here
Author | Message |
---|---|
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Our apologies for the apparent problems with the recent batch of jobs. We should be able to track down the infinite loop, if there is one, pretty quickly with your help. Please post screen shots of your stuck work units here (or alternatively the information at the top and bottom of the screensaver)--this will help us identify the problem work units and the stage in the calculations where jobs are getting stuck. |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
a) At what point do we define a WU as stuck? I have one stuck at 1% after 1 hour, 30 minutes; the total estimate on this box for this kind of WU is 6 hours, 30 minutes. b) My boxes run BOINC as a service under a dedicated user account, so I can't see the graphics. Is there another way to get you the information you need? Supporting BOINC, a great concept ! |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
"a) At what point do we define a WU as stuck? I have one stuck at 1% after 1 hour, 30 minutes; the total estimate on this box for this kind of WU is 6 hours, 30 minutes." 1% after 1.5 hours sounds stuck to me. If you can't send graphics, the complete Work Unit would still help. |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
"Only the name of the unit or the whole active Slot ?" Hi Yeti, sorry to be a censor, but can you edit the profanity out of your previous post? I'm not sure if your WU is still a candidate for being hung, perhaps you are having a different problem. Until we figure out a good way for you to give us the whole slot, if you could go into the active Slot and post the content of stderr.txt, as well as the first 10 and last 10 lines of stdout.txt, that would help a lot. Thanks, Jack |
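For volunteers who cannot use the graphics, the excerpts requested above (all of stderr.txt plus the first 10 and last 10 lines of stdout.txt) could be gathered with a short script. This is only a minimal Python sketch, not a project tool: the function names `head_and_tail` and `slot_report` are made up for illustration, and the only assumption about the slot directory is that it contains `stderr.txt` and `stdout.txt`, as the post describes.

```python
from pathlib import Path


def head_and_tail(lines, n=10):
    """Return (first n lines, last n lines).

    For short files (2*n lines or fewer) everything is returned as the
    head and the tail is empty, so no line is reported twice.
    """
    if len(lines) <= 2 * n:
        return lines, []
    return lines[:n], lines[-n:]


def slot_report(slot_dir, n=10):
    """Build a text report from stderr.txt and stdout.txt in slot_dir."""
    slot = Path(slot_dir)
    # errors="replace" keeps the script going if a log holds odd bytes.
    stderr_text = (slot / "stderr.txt").read_text(errors="replace").rstrip()
    stdout_lines = (slot / "stdout.txt").read_text(errors="replace").splitlines()

    head, tail = head_and_tail(stdout_lines, n)
    parts = ["--- stderr.txt ---", stderr_text]
    if tail:
        parts += [f"--- stdout.txt, first {n} lines ---", *head,
                  f"--- stdout.txt, last {n} lines ---", *tail]
    else:
        parts += ["--- stdout.txt, complete ---", *head]
    return "\n".join(parts)
```

Calling `print(slot_report(path_to_slot))` with the path of the active slot directory would produce text that can be pasted into a post. The path is left to the caller on purpose, since the location of the slots directory depends on the installation.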
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
Okay, edit done, sorry stderr: # ===================================== # random seed: 1639541 # ===================================== stdout first: 2005-12-15 22:44:20 :: BOINC :: boinc_init() command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.80_windows_intelx86.exe aa 1ogw _ -abrelax_mode -stringent_relax -more_relax_cycles -relax_score_filter -filter1 -105 -filter2 -145 -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1ogw.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 [STR OPT]Default value for [-paths] paths.txt. [T/F OPT]Default FALSE value for [-unix_paths] -------------------------------------------- WARNING:: paths.txt file not found!! Setting all paths to . Using default fragment file names: aa*****03_05.200_v1_3 aa*****03_05.200_v1_3 -------------------------------------------- [T/F OPT]Default FALSE value for [-version] Stderr last: Size: 3 NUMBER OF FRAGS FOR POS: 69 50 Size: 3 NUMBER OF FRAGS FOR POS: 70 50 Size: 3 NUMBER OF FRAGS FOR POS: 71 50 Size: 3 NUMBER OF FRAGS FOR POS: 72 50 Size: 3 NUMBER OF FRAGS FOR POS: 73 50 Size: 3 NUMBER OF FRAGS FOR POS: 74 50 score0 done: (best, low) rms 0 0 21.2227993 --------------------------------------------------------- score1 done: (best, low) rms (best,low) 8.89822865 3.30363035 18.3562622 13.2358065 standard trials: 2000 accepts: 629 %: 31.45 ----------------------------------------------------- Alternate score2/score5... kk score2 score5 low_score n_low_accept rms rms_min low_rms 0 10.208 17.806 10.208 17 13.236 11.080 13.236 1 8.439 18.236 13.531 21 15.868 10.943 13.702 2 -44.739 -33.730 -44.738 36 11.227 8.677 11.227 3 -71.017 -60.008 -60.002 40 11.134 8.677 11.134 Supporting BOINC, a great concept ! |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
Thank you Yeti, we will look into this. For completeness, do you also have the Work Unit name? Also is the percentage complete now climbing at a more typical rate? |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
"I'm not sure if your WU is still a candidate for being hung, perhaps you are having a different problem." I only had truly hung WUs several weeks ago, when I started with Rosetta. In the last week, I have seen several times that WUs stay at 1% for a very long time, but normally, after 1, 2, or 3 hours, they jump to 10%. After this, they progress much faster. (The jump from 1% to 10% came at 1:35; now the WU is at 2:05 and says 20%.)
I have made a copy of the whole slot and saved it; I can zip or rar it and e-mail it to you. If you don't want to post an e-mail address, send me the address via my registered e-mail address. Or I can give you an address where you can download the slot from one of my servers ... The WU name: 1ogw_topology_sample_88400_0 Supporting BOINC, a great concept ! |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 169 |
Also is the percentage complete now climbing at a more typical rate? ========== How can anybody answer that with the Rosetta WUs? I see some WUs take 3 hours to get to 20%, then jump 30% to 50% in the next 10 minutes. So what's a typical rate for that WU ... ??? These WUs have a mind of their own and I don't think there is any set rate for them to progress. I can finish 10 WUs and none of them take the same amount of time to finish @ 100%; there may be a variance of 3 or 4 hours between them. This is not a rant, just stating my observations of the Rosetta WUs ... :) |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
Fair enough :) At the moment we'd like to find the problem that's causing WUs to stick on 1% for over 10 hours. This should not be typical. Yeti: If you see a case where it is stuck for more than 10 hours I'd very much like to see your Slot. In the meantime I'll try to figure out what was going on with the one you already sent. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
This box, running 1ogw__topology_sample_84611_2, did over 6000 sec, and ./boinc_cmd --get_results reported fraction done = 0.01. I am sorry, I didn't get the stderr & stdout; I had not realised they would disappear as soon as I aborted the result! This happened because the client promptly reported the result and deleted the files. One thing I did notice, though, was that the result never reached its first checkpoint, which might help you pinpoint where it enters its infinite loop. On my 600 MHz Linux box the first checkpoint came at 322 sec, whereas this box runs at 700 MHz, so the first checkpoint should have come a little sooner than that, all things being equal. ***Please note: if you wait till after the abort, and wait a little longer, you may be seeing the stdout & stderr relating to the new work, not the aborted work. I think the options are to disable the network (to stop the client reporting) or to copy the info while the result is still running. Which technique would be more useful, please? On my Linux boxes I am using the command line only. regards, |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
River~~, this is perfect, just the information we need. It's possible that this is normal behavior; we are testing that now. You've given us what we need to do this. Some of the protocols take a while before they hit the first checkpoint. The ones that we'd really like to catch are those that are stuck at 1% for 10 or more hours. Jack |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"...The ones that we'd really like to catch are those that are stuck at 1% for 10 or more hours." OK, if I get another one I will leave it overnight before aborting & see what happens. I'm off to bed now, as I live in the UTC timezone... Jack: Do please notice the addition I edited into my previous post, as you were posting as I was editing. R~~ |
Hammer Joined: 11 Dec 05 Posts: 2 Credit: 9,597 RAC: 0 |
"Until we figure out a good way for you to give us the whole slot, if you could go into the active Slot and post the content of stderr.txt, as well as the first 10 and last 10 lines of stdout.txt, that would help a lot." 1ogw_topology_sample_106451_2 and 1ogw_topology_sample_131011_0: stuck at 1% for some time, but just like Yeti's they got bumped up to 10% and then kept going. Had another stuck at 80% the other day for 2 days before I noticed. As has been described before, the percent jumps are always odd, sometimes taking an hour to move 10%, and sometimes taking only a few minutes, all on the same WU. A WU can sit at 1% for an hour before it jumps to 10%. Hard to tell if it's actually stuck. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"At the moment we'd like to find the problem that's causing WUs to stick on 1% for over 10 hours." Should we scale this to machine speed? I.e., if 10 hours is the reporting point on a 2.8 GHz box, it would seem premature to be reporting at 10 hours on a 665 MHz box as well. I've got boxes running at both those speeds already attached to Rosetta, so it is a practical question from me. It's useful to have a guideline like 10 hours, and I'm suggesting it would be even more helpful for you to give a guideline for a mythical 1 GHz box, so that donors can scale it appropriately up or down for their slower or faster boxes. R~~ |
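The scaling proposed here is simple arithmetic, sketched below in Python. It rests on a loud assumption: that run time is inversely proportional to clock speed, which is only a rough approximation across different CPU designs. The function name `scaled_threshold_hours` is hypothetical, and the project has not said which machine speed its 10-hour figure refers to, so the reference speed is a parameter the caller must supply.

```python
def scaled_threshold_hours(reference_hours, reference_mhz, machine_mhz):
    """Scale a 'report if stuck this long' guideline to another machine.

    Assumes run time is inversely proportional to clock speed, so a
    slower box gets a proportionally longer waiting period.
    """
    if reference_mhz <= 0 or machine_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    return reference_hours * reference_mhz / machine_mhz


# If 10 hours were the reporting point on a 2.8 GHz box, the
# equivalent waiting period on a 665 MHz box would be:
hours_on_slow_box = scaled_threshold_hours(10, 2800, 665)  # about 42 hours
```

Under that assumption, a guideline stated for a 1 GHz reference box converts to any other box by multiplying by 1000 and dividing by the box's speed in MHz.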
Mark Rush Joined: 6 Oct 05 Posts: 13 Credit: 52,170,536 RAC: 7,314 |
OK, I have a unit that is 1% complete after 28 hours of work. I don't know how to do a screen shot, so here's the stuff at the top of the screen saver: rosetta [workunit:2reb_abrelax_rand_len10_jit02_omega_sim_filters_53131] At the bottom left of the screen is: 1% complete CPU time: 27 hr 53 min 10 sec Mark Rush - Total credit: 2014.34 - RAC 30.7302 Rosetta Fools Rosetta@home v0 http://boinc.bakerlab.org/rosetta/ At the bottom right of the screen is: Stage: Ab initio Step: 2291 Accepted RMSD: 7.908 Accepted Energy: -3.025729 I won't abort this unit for a while in case you need more information from it. Also, for what it's worth, this computer is using BOINC Manager 4.45, running SETI, climateprediction, Einstein, LHC (though there are no WUs), and Predictor. It's a 3.0 GHz machine running Windows XP Pro with 512 MB of RAM. The WUs stay in memory after they pause. Mark |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,359 RAC: 13 |
Mark, if you can locate which one of the "Slots" directories that Rosetta is running in, and just make a copy of that whole directory before you do anything else it would probably be the biggest help. Someone from the project can tell you what part of that they actually need. Normally for a "backup" of any files, you have to quit BOINC first, but in this case I would think it would be better to grab it "open". Heck, if you have the disk space, I'd just make a copy of the whole BOINC folder! |
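Copying a slot directory while the work unit is still running can be scripted. The Python sketch below is one way to do it, under stated assumptions: `snapshot_slot` is a hypothetical helper name, both paths must be supplied by the user because the location of the slots directory depends on the installation, and files the application is writing at the moment of the copy may be captured in a partial state.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir, backup_root):
    """Copy slot_dir into backup_root under a timestamped name.

    Returns the path of the new copy. The source is only read, so the
    running work unit is left untouched.
    """
    slot = Path(slot_dir)
    if not slot.is_dir():
        raise FileNotFoundError(f"not a directory: {slot}")

    # A timestamp in the name lets repeated snapshots sit side by side.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{slot.name}-{stamp}"

    # copytree creates dest (and missing parents) and fails if dest exists.
    shutil.copytree(slot, dest)
    return dest
```

Taking the snapshot before aborting matters because, as reported earlier in the thread, the client can report the result and delete the files as soon as the result is aborted.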
Mark Rush Joined: 6 Oct 05 Posts: 13 Credit: 52,170,536 RAC: 7,314 |
Bill: Where would I look for the "slots" directories? And, I apologize in advance but I have meetings all afternoon and so probably won't get a chance to look until tomorrow. Mark |
©2024 University of Washington
https://www.bakerlab.org