Crunching gets stuck at 83.33%

S@NL - SIMMEL
Joined: 27 Sep 05
Posts: 2
Credit: 40,398
RAC: 0
Message 968 - Posted: 5 Oct 2005, 3:47:23 UTC

Hi all,
I see the progress of Rosetta stuck at 83.33% for all my work units (on 3 out of 3 threads). CPU time and time to completion keep increasing, for one of them already more than 24 hrs.

Is this normal or are there actions to take?

Greetz, Simmel
Simmel

ID: 968 · Rating: 0
[B^S] Paul@home
Joined: 18 Sep 05
Posts: 34
Credit: 393,096
RAC: 0
Message 971 - Posted: 5 Oct 2005, 8:26:00 UTC

Hi,

When you say 'more than 24 hours', do you mean more than that estimated remaining, or more than 24 hours of CPU time used already?

The % done does not increase at a steady rate while the work unit is being processed - it jumps to various % done values at several (12) points during processing, with the value depending on what stage of processing the unit is at. 83.33% seems to be a normal value for the WU to sit at - I see mine sitting there regularly.

Keep an eye on it.. it will probably move on after some time.
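
The stepwise progress described in this post can be illustrated with a small sketch. It assumes the % done is simply the number of completed stages divided by 12 (the stage count comes from the post; the formula and the function name `percent_done` are assumptions for illustration, not taken from the Rosetta source). Under that assumption, 83.33% corresponds to 10 of 12 stages completed.

```python
# Sketch only: assumes % done = completed stages / total stages.
TOTAL_STAGES = 12  # the "several (12)" progress jumps mentioned in the post


def percent_done(completed_stages: int, total: int = TOTAL_STAGES) -> float:
    """Return the progress value shown after a given number of completed stages."""
    return 100.0 * completed_stages / total


if __name__ == "__main__":
    # Print the progress value reached after each stage.
    for stage in range(1, TOTAL_STAGES + 1):
        print(f"stage {stage:2d}: {percent_done(stage):6.2f}%")
```

If this assumption holds, the meter can only ever show 12 distinct values, so a long pause on any one of them does not by itself mean the work unit has hung.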


Wanna visit BOINC Synergy team site? Click below!

Join BOINC Synergy Team
ID: 971 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 972 - Posted: 5 Oct 2005, 8:58:25 UTC
Last modified: 5 Oct 2005, 9:00:41 UTC

Hmm, I don't know - there are definitely some problems with some work units.

I have a 2.4GHz Pentium 4 with 333MHz RAM and a 1.7GHz Celeron with 266MHz RAM. Both are running XP and both have 512MB.

They started on a work unit about the same time (just under 3 hours ago) and the Celeron, which is normally about 50% slower, is ahead of the P4 (at 75% and 58.33% respectively).

That same P4 was stuck on a work unit at 100% for at least half an hour last night, before it got shut down for the night. On restarting this morning, it was back at 83% but it eventually finished and returned a seemingly valid result. Interesting reading through the result though (pasted here for the devs):

<core_client_version>4.45</core_client_version>
<stderr_txt>

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x004A8E1D read attempt to address 0x0A4D49E8

Exiting...

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C910F29 read attempt to address 0x3ED1F7FA

Exiting...

</stderr_txt>


Wonder what the current WU says...
*** Join BOINC@Australia today ***
ID: 972 · Rating: 0
Red Squirrel
Joined: 26 Sep 05
Posts: 13
Credit: 3,613
RAC: 0
Message 973 - Posted: 5 Oct 2005, 9:43:16 UTC

If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next checkpoint (at 91.66%), so each time it restarts it's dropping back to 83.33%. The computing is quite intensive for the last 2 steps of each WU. Most people suggest that you have the WU set to remain in memory when preempted, or you could increase the time slice for each project to, say, 90 mins.
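
The scenario in this post can be sketched as a small simulation. Everything numeric here is hypothetical: the stage lengths (ten 20-minute stages followed by two 75-minute stages) are made up to mimic "the last 2 steps are intensive", and `slices_to_finish` is an illustrative helper, not part of BOINC or Rosetta. The sketch assumes that work is only saved at stage boundaries, and that a preempted task loses its partial stage unless it is kept in memory.

```python
def slices_to_finish(stage_minutes, slice_minutes, keep_in_memory, max_slices=1000):
    """Count the run slots needed to finish all stages.

    stage_minutes: CPU minutes needed between consecutive checkpoints.
    Returns None if the task makes no further progress within max_slices.
    """
    stage = 0      # index of the stage currently being computed
    carried = 0.0  # minutes of partial work that survived the last preemption
    for slot in range(1, max_slices + 1):
        budget = slice_minutes
        while stage < len(stage_minutes):
            needed = stage_minutes[stage] - carried
            if needed <= budget:
                # Stage completes inside this slot: checkpoint reached.
                budget -= needed
                carried = 0.0
                stage += 1
            else:
                # Preempted mid-stage: partial work survives only in memory.
                carried = carried + budget if keep_in_memory else 0.0
                break
        if stage == len(stage_minutes):
            return slot
    return None


if __name__ == "__main__":
    stages = [20] * 10 + [75, 75]  # hypothetical stage lengths in minutes
    print(slices_to_finish(stages, 60, keep_in_memory=False))  # None
    print(slices_to_finish(stages, 60, keep_in_memory=True))   # 6
    print(slices_to_finish(stages, 90, keep_in_memory=False))  # 5
```

With these made-up numbers, a 60-minute slot without "leave in memory" never gets past the tenth checkpoint, while either remedy suggested in the post (keeping the task in memory, or a 90-minute slot) lets it finish.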
Regards, Alan

ID: 973 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 974 - Posted: 5 Oct 2005, 10:07:19 UTC - in response to Message 973.  
Last modified: 5 Oct 2005, 10:14:26 UTC

If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next checkpoint (at 91.66%), so each time it restarts it's dropping back to 83.33%.


If you're talking about my post... It's the faster PC that's getting stuck, it's only running Rosetta and even if it wasn't, it's set to keep work units in memory, of which it has 512MB. It is only the odd WU that gets stuck - this PC has crunched over 50 Rosetta work units and I know the drill.

I have restarted the computer (and thus BOINC and the work unit) and it has since overtaken the slower computer, which highlights that it did get stuck.

But if you're referring to the original post by Simmel... Rosetta, as opposed to other projects, will cause problems on slow computers that take hours between steps, unless the devs can do something about the time between steps (e.g. save at intermediate intervals). I'm sure it can be done, since all the other projects I have run can do it (Einstein, Predictor, SETI, CPDN and LHC).
*** Join BOINC@Australia today ***
ID: 974 · Rating: 0
Red Squirrel
Joined: 26 Sep 05
Posts: 13
Credit: 3,613
RAC: 0
Message 975 - Posted: 5 Oct 2005, 11:07:23 UTC

Hi Yoda,
I was referring to the original post by Simmel, as that could have been his problem. But I take your point - there do seem to be occasional WU's that "get stuck" at 83.33%, and this problem does need to be sorted out. And the WU's do need more save points - this definitely needs attention from the devs, especially before they start sending out more complex proteins to work on.
Alan
ID: 975 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 976 - Posted: 5 Oct 2005, 11:22:02 UTC - in response to Message 975.  
Last modified: 5 Oct 2005, 11:22:55 UTC

And the WU's do need more save points - this definitely needs attention from the devs, especially before they start sending out more complex proteins to work on.


Yep, I'll go along with that. If a Predictor (or whatever) WU gets stuck, it soon becomes obvious. But with Rosetta, there's no way of knowing how long it should take before the WU moves to the next step. Will it be another 10 minutes? 2 hours?

Work units vary a lot in time to process on the same PC already and without having an accurate guide as to where it's at, we may THINK a WU is stuck when it's just a longer WU. At least with the other projects, the progress meter moves regularly (usually at increments of 0.01%), so you know whether it's working or stuck.

That's what I'd like to see on Rosetta - even 1% increments (and saves) would be an improvement.
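
As a rough illustration of why more save points would help: a minimal sketch, assuming save points are evenly spaced in CPU time (earlier posts in this thread suggest they are not, since the last steps take longer) and using a hypothetical 240-minute work unit. The function name `worst_case_lost_minutes` is made up for this example.

```python
def worst_case_lost_minutes(total_minutes: float, save_points: int) -> float:
    """Upper bound on work lost at a restart, assuming evenly spaced save points."""
    if save_points < 1:
        raise ValueError("need at least one save point")
    return total_minutes / save_points


if __name__ == "__main__":
    wu_minutes = 240  # hypothetical work unit length
    for points in (12, 100):
        lost = worst_case_lost_minutes(wu_minutes, points)
        print(f"{points:3d} save points: up to {lost:.1f} minutes lost per restart")
```

Under these assumptions, going from 12 save points to 1% increments cuts the worst-case loss per restart from 20 minutes to 2.4 minutes, and the progress meter would move often enough to tell a slow WU from a stuck one.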
*** Join BOINC@Australia today ***
ID: 976 · Rating: 0
S@NL - SIMMEL
Joined: 27 Sep 05
Posts: 2
Credit: 40,398
RAC: 0
Message 1006 - Posted: 6 Oct 2005, 3:41:10 UTC

Hi all,

It seems (going by your posts) that all should work fine, and it did. After more than 24 hrs of computing time (P4 3.2 GHz) the WU finished. I gave all three WUs that seemed to be stuck all the CPU time they needed, and they finished after all.

Now, though, I have a WU (1btn_abrelax_no_cst_23604_0) with a % done of 125% and still crunching. How about that! It has already spent 18 hrs on this one. I know Rosetta is beta, so let's fix this one too.

Simmel
Simmel

ID: 1006 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1007 - Posted: 6 Oct 2005, 4:03:47 UTC - in response to Message 1006.  

Hi all,

It seems (going by your posts) that all should work fine, and it did. After more than 24 hrs of computing time (P4 3.2 GHz) the WU finished. I gave all three WUs that seemed to be stuck all the CPU time they needed, and they finished after all.

Now, though, I have a WU (1btn_abrelax_no_cst_23604_0) with a % done of 125% and still crunching. How about that! It has already spent 18 hrs on this one. I know Rosetta is beta, so let's fix this one too.

Simmel


Whoops, another bug to fix. :)
ID: 1007 · Rating: 0
FZB
Joined: 17 Sep 05
Posts: 84
Credit: 4,948,999
RAC: 0
Message 1018 - Posted: 6 Oct 2005, 7:37:18 UTC

Btw, what does the _no_cst_ stand for? I have one of those WUs as well.
--
Florian
www.domplatz1.de
ID: 1018 · Rating: 0
Padanian
Joined: 27 Sep 05
Posts: 14
Credit: 15,190
RAC: 0
Message 1036 - Posted: 6 Oct 2005, 19:48:34 UTC

The dreaded 83.33% bug is affecting many of us.
There's another thread about this topic...
ID: 1036 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1039 - Posted: 6 Oct 2005, 20:57:47 UTC - in response to Message 1036.  

The dreaded 83.33% bug is affecting many of us.
There's another thread about this topic...


Since this is the point where the current WUs do full-atom relax which takes a bit longer to run per structure and uses much more memory, it may be that it is just taking longer, particularly if enough memory is not available and virtual memory is being used.
ID: 1039 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1040 - Posted: 6 Oct 2005, 21:05:08 UTC - in response to Message 1018.  

Btw, what does the _no_cst_ stand for? I have one of those WUs as well.


_no_cst_ identifies the WU's that are not using constraint information for the final full-atom relax predictions. When used, constraints are based on homologous structures (homology being determined by similarity in amino acid sequence) that have been experimentally determined.
ID: 1040 · Rating: 0
Padanian
Joined: 27 Sep 05
Posts: 14
Credit: 15,190
RAC: 0
Message 1105 - Posted: 8 Oct 2005, 12:48:00 UTC

One more of those stuck WU. Detaching Rosetta.
ID: 1105 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1148 - Posted: 9 Oct 2005, 9:25:43 UTC
Last modified: 9 Oct 2005, 10:02:26 UTC

Never mind, I don't have a clue how this project runs ...
ID: 1148 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1149 - Posted: 9 Oct 2005, 9:40:39 UTC - in response to Message 1148.  
Last modified: 9 Oct 2005, 9:41:58 UTC

I noticed both WU's were about 1/2 hour past the total amount of time it should have took them to finish for that Computer.


How do you know "the total amount of time it should have took"? There's no "standard" work unit time - some work units may take 3 times as long as others. Maybe even more than that
*** Join BOINC@Australia today ***
ID: 1149 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1150 - Posted: 9 Oct 2005, 10:00:01 UTC - in response to Message 1149.  

I noticed both WU's were about 1/2 hour past the total amount of time it should have took them to finish for that Computer.


How do you know "the total amount of time it should have took"? There's no "standard" work unit time - some work units may take 3 times as long as others. Maybe even more than that


Ya, you're right, I'm stupid and don't know what I'm talking about. I'll stop reporting abnormally running WU's and just let them run for 2 days without any progress before I abort them.

ID: 1150 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1154 - Posted: 9 Oct 2005, 11:04:39 UTC - in response to Message 1150.  
Last modified: 9 Oct 2005, 11:05:54 UTC

Ya, you're right, I'm stupid and don't know what I'm talking about


No need for sarcasm. I was only trying to say that half an hour longer (a figure you mentioned) is not necessarily cause for alarm, given the big variation between work units.

My laptop finishes an Einstein WU in 8 hours and 20 minutes, give or take 5 minutes. But on Rosetta I have no idea how long each WU will take - the longest has taken about 4 hours, while some were done in half that time.
*** Join BOINC@Australia today ***
ID: 1154 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 1167 - Posted: 9 Oct 2005, 14:52:30 UTC
Last modified: 9 Oct 2005, 14:53:36 UTC

-----
If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next checkpoint (at 91.66%), so each time it restarts it's dropping back to 83.33%.
-----

If you have your General Prefs set to "Leave in memory", this should not be an issue though, since the model just suspends, and does not need to reload from a checkpoint. Unless they've goofed of course and are loading the checkpoint on restart anyway!
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 1167 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1168 - Posted: 9 Oct 2005, 15:08:25 UTC - in response to Message 1167.  

-----
If this is happening on your slower computer, and you're switching between projects, the normal one hour run slot may not be enough for the WU to reach the next checkpoint (at 91.66%), so each time it restarts it's dropping back to 83.33%.
-----

If you have your General Prefs set to "Leave in memory", this should not be an issue though, since the model just suspends, and does not need to reload from a checkpoint. Unless they've goofed of course and are loading the checkpoint on restart anyway!


The slowest computer I have is a P4 3.06 HT CPU. The WU's were on a P4 3.4 HT computer, and Rosetta was the only project running. In fact, it was the only project I had WU's for on that computer at the time. I also have my preferences set to 300 minutes before switching and to Leave in Memory, but those are not issues if I only have WU's from one project.
ID: 1168 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org