Message boards : Number crunching : Crunching stucks on 83.33%
Author | Message |
---|---|
S@NL - SIMMEL Send message Joined: 27 Sep 05 Posts: 2 Credit: 40,398 RAC: 0 |
Hi all, I see Rosetta's progress stuck at 83.33% for all my work units (on 3 out of 3 threads). CPU time and time to completion keep increasing; one has already been running for more than 24 hrs. Is this normal, or are there actions to take? Greetz, Simmel |
[B^S] Paul@home Send message Joined: 18 Sep 05 Posts: 34 Credit: 393,096 RAC: 0 |
Hi, when you say 'more than 24 hours', do you mean more than 24 hours estimated remaining, or more than 24 hours of CPU time used already? The % done does not increase at a steady rate while the work unit is being processed. It jumps to a new value at several (12) points during processing, and its value depends on what stage of processing the unit is at. 83.33% seems to be a normal value for the WU to sit at; I see mine sitting there regularly. Keep an eye on it; it will probably move on after some time. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
Hmm, I don't know. There are definitely some problems with some work units. I have a 2.4GHz Pentium 4 with 333MHz RAM and a 1.7GHz Celeron with 266MHz RAM. Both run XP and both have 512MB. They started on a work unit at about the same time (just under 3 hours ago), and the Celeron, which is normally about 50% slower, is ahead of the P4 (at 75% and 58.33% respectively). That same P4 was stuck on a work unit at 100% for at least half an hour last night, before it got shut down for the night. On restarting this morning, it was back at 83%, but it eventually finished and returned a seemingly valid result. The result makes interesting reading, though (pasted here for the devs): <core_client_version>4.45</core_client_version> <stderr_txt> ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x004A8E1D read attempt to address 0x0A4D49E8 Exiting... ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x7C910F29 read attempt to address 0x3ED1F7FA Exiting... </stderr_txt> Wonder what the current WU says... *** Join BOINC@Australia today *** |
Red Squirrel Send message Joined: 26 Sep 05 Posts: 13 Credit: 3,613 RAC: 0 |
If this is happening on your slower computer and you're switching between projects, the normal one-hour run slot may not be enough for the WU to reach the next checkpoint (at 91.66%), so each time it restarts it drops back to 83.33%. The computing is quite intensive for the last 2 steps of each WU. Most people suggest setting the WU to remain in memory when preempted, or you could increase the time slice for each project to, say, 90 mins. Regards, Alan |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
----- If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next check point (at 91.66%), so each time it restarts it's dropping back to 83.33%. ----- If you're talking about my post... It's the faster PC that's getting stuck. It's only running Rosetta, and even if it weren't, it's set to keep work units in memory, of which it has 512MB. It is only the odd WU that gets stuck. This PC has crunched over 50 Rosetta work units and I know the drill. I have restarted the computer (and thus BOINC and the work unit) and it has since overtaken the slower computer, which shows that it did get stuck. But if you're referring to the original post by Simmel... Rosetta, unlike other projects, will cause problems on slow computers that take hours between steps, unless the devs can do something about the time between steps (e.g. save at intermediate intervals). I'm sure it can be done, since all the other projects I have run can do it (Einstein, Predictor, SETI, CPDN and LHC). |
Red Squirrel Send message Joined: 26 Sep 05 Posts: 13 Credit: 3,613 RAC: 0 |
Hi Yoda, I was referring to the original post by Simmel, as that could have been his problem. But I take your point: there do seem to be occasional WU's that "get stuck" at 83.33%, and this problem does need to be sorted out. And the WU's do need more save points - this definitely needs attention from the devs, especially before they start sending out more complex proteins to work on. Alan |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
----- And the WU's do need more save points - this definitely needs attention from the devs, especially before they start sending out more complex proteins to work on. ----- Yep, I'll go along with that. If a Predictor (or whatever) WU gets stuck, it soon becomes obvious. But with Rosetta, there's no way of knowing how long it should take before the WU moves to the next step. Will it be another 10 minutes? 2 hours? Work units already vary a lot in processing time on the same PC, and without an accurate guide to where a WU is at, we may THINK it is stuck when it's just a longer WU. At least with the other projects, the progress meter moves regularly (usually in increments of 0.01%), so you know whether it's working or stuck. That's what I'd like to see on Rosetta. Even 1% increments (and saves) would be an improvement. |
S@NL - SIMMEL Send message Joined: 27 Sep 05 Posts: 2 Credit: 40,398 RAC: 0 |
Hi all, It seems (going by your posts) that all was working fine, and it was. After more than 24 hrs of computing time (P4 3.2 GHz) the WU finished. I gave all three WUs that seemed to be stuck all the CPU time, and they finished after all. Now, though, I have a WU (1btn_abrelax_no_cst_23604_0) with a %done of 125% that is still crunching. How about that! Time already spent on this one: 18 hrs. I know Rosetta is beta, so let's fix this one too. Simmel |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
HI all, whoops, another bug to fix. :) |
FZB Send message Joined: 17 Sep 05 Posts: 84 Credit: 4,948,999 RAC: 0 |
|
Padanian Send message Joined: 27 Sep 05 Posts: 14 Credit: 15,190 RAC: 0 |
The dreaded 83.33% bug is affecting many of us. There's another thread about this topic... |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
----- The dreaded 83.33% bug is affecting many of us. ----- Since this is the point where the current WUs do full-atom relax, which takes a bit longer to run per structure and uses much more memory, it may be that it is just taking longer, particularly if not enough memory is available and virtual memory is being used. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
----- btw, what does _no_cst_ stand for? I have one of those WUs as well. ----- _no_cst_ identifies the WUs that are not using constraint information for the final full-atom relax predictions. Constraints are derived from homologous structures (homology being determined by similarity in amino acid sequence) that have been experimentally determined. |
Padanian Send message Joined: 27 Sep 05 Posts: 14 Credit: 15,190 RAC: 0 |
One more of those stuck WU. Detaching Rosetta. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 93 |
Nevermind, I don't have a clue on how this Project runs ... |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
----- I noticed both WU's were about 1/2 hour past the total amount of time it should have took them to finish for that Computer. ----- How do you know "the total amount of time it should have took"? There's no "standard" work unit time. Some work units may take 3 times as long as others, maybe even more than that. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 93 |
----- I noticed both WU's were about 1/2 hour past the total amount of time it should have took them to finish for that Computer. ----- Ya your right, I'm Stupid and don't know what I'm talking about. I'll stop reporting abnormally running WU's and just let them run for 2 days without any progress before I abort them. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
----- Ya your right, I'm Stupid and don't know what I'm talking about ----- No need for sarcasm. I was only trying to say that half an hour longer (a figure you mentioned) is not necessarily cause for alarm, given the big variation between work units. My laptop finishes an Einstein WU in 8 hours and 20 minutes, give or take 5 minutes. But on Rosetta I have no idea how long each WU will take. The longest has taken about 4 hours, while some were done in half that time. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42 |
----- If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next check point (at 91.66%), so each time it restarts it's dropping back to 83.33%. ----- If you have your General Prefs set to "Leave in memory", this should not be an issue though, since the model just suspends and does not need to reload from a checkpoint. Unless they've goofed, of course, and are loading the checkpoint on restart anyway! Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 93 |
The slowest computer I have is a P4 3.06 HT CPU; the WUs were on a P4 3.4 HT computer and Rosetta was the only project running. In fact, it was the only project I had WUs for on that computer at the time. I also have my preferences set to 300 mins before switching and to Leave in Memory, but those are not issues if I only have WUs from one project. |
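
The checkpoint behaviour discussed in this thread (progress reported at 12 points, and a preempted task dropping back to its last checkpoint unless it is left in memory) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Rosetta's or BOINC's actual code: the function names, the stage durations, and the idea that checkpoints are written only at stage boundaries are all hypothetical. Truncating the percentage to two decimals is also an assumption, chosen because it reproduces the 83.33 and 91.66 figures quoted above.

```python
STAGES = 12  # from the thread: % done jumps at 12 points during processing


def percent_done(stages_completed, total_stages=STAGES):
    """Progress after a whole number of stages, truncated to two decimals."""
    return (10000 * stages_completed // total_stages) / 100


def run_slices(stage_minutes, slice_minutes, leave_in_memory, max_slices=50):
    """Simulate a client that preempts the task every `slice_minutes`.

    stage_minutes   -- CPU minutes each stage needs (hypothetical numbers)
    leave_in_memory -- if False, work done since the last checkpoint is lost
                       at every preemption (assumed: checkpoints are written
                       only at stage boundaries)
    Returns (stages_completed, slices_used).
    """
    done = 0        # stages that have been checkpointed
    carried = 0.0   # minutes of work on the current stage kept in memory
    for used in range(1, max_slices + 1):
        budget = slice_minutes + carried
        # Finish as many whole stages as the available minutes allow.
        while done < len(stage_minutes) and budget >= stage_minutes[done]:
            budget -= stage_minutes[done]
            done += 1
        if done == len(stage_minutes):
            return done, used
        # Partial work survives preemption only if the task stays in memory.
        carried = budget if leave_in_memory else 0.0
    return done, max_slices


# Hypothetical work unit: ten quick stages, then two long final stages.
wu = [10] * 10 + [90, 90]

stuck, _ = run_slices(wu, slice_minutes=60, leave_in_memory=False)
print(percent_done(stuck))                                    # stalls after stage 10
print(run_slices(wu, slice_minutes=60, leave_in_memory=True))
print(run_slices(wu, slice_minutes=90, leave_in_memory=False))
```

With these made-up durations, a 60-minute slice without "leave in memory" never gets past stage 10, while either keeping the task in memory or lengthening the slice to 90 minutes lets it finish. Splitting the long stages into smaller checkpointed pieces (for example `[10] * 28`) has the same effect, which is the "more save points" remedy requested in the thread.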
©2024 University of Washington
https://www.bakerlab.org