Increased to 512MB as recommended memory requirement

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1446 - Posted: 17 Oct 2005, 23:53:06 UTC
Last modified: 18 Oct 2005, 0:14:46 UTC

Hey ARmassey, how you doing, good job you're doing for the Team ... :)

David should be aware of this problem or at least he was at one time as people were sending him the stdout.txt from the slots of Hung WU's. I personally sent him 6 of them myself. Maybe he thought the problem went away because nobody's really been saying much about it lately, I don't know.

But I can assure him the problem is still here & not just @ 1% either. So far I've seen it @ 1% - 8.33% - 75% & 91.66% ... I've had WU's Hung at all 4 of those % Points just today alone ... I only made mention of it in the earlier post because I hadn't seen any response from any of the Dev's on it recently.

Hopefully something can be done about it because it's frustrating as all get out to wake up in the morning & see WU's with 7 hr's of CPU Time still @ the 1% Mark. I would just as soon see them Error out after a set time and go on to the next WU rather than just sit there at the same point hour after hour.
ID: 1446 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1447 - Posted: 17 Oct 2005, 23:59:42 UTC - in response to Message 1446.  

Hey ARmassey, how you doing, good job you're doing for the Team ... :)

David should be aware of this problem or at least he was at one time as people were sending him the stdout.txt from the slots of Hung WU's. I personally sent him 6 of them myself. Maybe he thought the problem went away because nobody's really been saying much about it lately, I don't know.

But I can assure him the problem is still here & not just @ 1% either. So far I've seen it @ 1% - 8.33% - 75% & 91.66% ... I've had WU's Hung at all 4 of those % Points just today alone ... I only made mention of it in the earlier post because I hadn't seen any response from any of the Dev's on it recently.

Hopefully something can be done about it because it's frustrating as all get out to wake up in the morning & see WU's with 7 hr's of CPU Time still @ the 1% Mark.


I am aware of this problem and have been looking into it. I actually ran into this myself on my laptop running a 1acf_abrelax WU, which allowed me to look closely at the issue and try to debug... but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run, so it looks like it may not be an issue with the Rosetta application but possibly with the BOINC client. I'll keep looking into it.
ID: 1447 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1449 - Posted: 18 Oct 2005, 0:05:57 UTC

but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run
=========

Yes, that's very common for the WU to continue the Second try after you Shut the Manager down & restart it.
ID: 1449 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1450 - Posted: 18 Oct 2005, 0:29:50 UTC - in response to Message 1449.  
Last modified: 18 Oct 2005, 0:32:46 UTC

but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run
=========

Yes, that's very common for the WU to continue the Second try after you Shut the Manager down & restart it.


Actually, I am running the tests in standalone mode. What I mean is that when I try the WU again with the exact same app, computer, and random number seed (which should give an identical run, and in fact the numbers that are returned in stdout are exactly the same, which confirms this), it does not get stuck. If it was the app, it should get stuck in the exact same place. I guess this is equivalent to restarting the manager, since it also should use the same random number seed.
ID: 1450 · Rating: 0
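
To illustrate the reasoning in the post above: a program whose only source of randomness is a seeded generator should replay identically when given the same seed, so a hang that does not recur under the same seed points away from the application's own logic. The following is a minimal Python sketch of that idea, not Rosetta code; `simulate_run` and the seed values are hypothetical stand-ins.

```python
import random


def simulate_run(seed, steps=5):
    """Hypothetical stand-in for a seeded simulation.

    Returns the sequence of values the run would write to stdout.
    """
    rng = random.Random(seed)  # private generator, isolated from global state
    return [rng.random() for _ in range(steps)]


# Two runs with the same seed produce the same sequence of numbers.
first = simulate_run(seed=12345)
second = simulate_run(seed=12345)

# A run with a different seed is expected to diverge.
different = simulate_run(seed=54321)

print("same seed, identical output:", first == second)
print("different seed, identical output:", first == different)
```

If the replayed output matches the original run up to the point of the hang and then continues, the stall most likely came from something outside the seeded computation (for example the client, the API layer, or the operating system).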
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1451 - Posted: 18 Oct 2005, 0:34:22 UTC

Okay, I get what you mean ... but like I said, & other people have said also, if we restart the WU again by some other means, most often it will run normally. If that's worth anything or means anything ... :)
ID: 1451 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1452 - Posted: 18 Oct 2005, 0:40:58 UTC

I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with.
ID: 1452 · Rating: 1
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1453 - Posted: 18 Oct 2005, 1:04:47 UTC - in response to Message 1452.  
Last modified: 18 Oct 2005, 1:15:23 UTC

I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with.

> That is great news, David. How soon will you issue the new app? As I have stated before, I wasn't having any problems until you updated your servers, which in my mind also points to a random BOINC bug. After I updated my boxes to 5.2.1, it also reduced the problems by 90%. Again, a BOINC issue. With your updated app using the most current BOINC API code, we all should be on the same page. Thanks for the feedback....Cheers, Rog.
ID: 1453 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 1454 - Posted: 18 Oct 2005, 1:21:39 UTC - in response to Message 1453.  

After I updated my boxes to 5.2.1, it also reduced the problems by 90%. Again, a BOINC issue. With your updated app using the most current BOINC API code, we all should be on the same page. Thanks for the feedback....Cheers, Rog.


For whatever it is worth, BOINC 5.2.2 is out now.
Regards,
Bob P.
ID: 1454 · Rating: 0
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1455 - Posted: 18 Oct 2005, 1:27:43 UTC - in response to Message 1446.  
Last modified: 18 Oct 2005, 1:29:27 UTC

Hey ARmassey, how you doing, good job you're doing for the Team ... :)

> Thanks, Bob. Until I upgraded to 5.2.1 and it reduced the 'hang' rate I was tempted to bail out as well. Thanks for hanging in there and hopefully David is on to something with their new app. Also, it was good that you brought the subject up again as they maybe didn't realize what a pain it was to people who have multiple boxes. I rather suspect they don't want to lose someone like yourself who has such impressive processing capability.....Cheers, Rog.
ID: 1455 · Rating: 0
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1456 - Posted: 18 Oct 2005, 1:32:51 UTC - in response to Message 1454.  
Last modified: 18 Oct 2005, 1:33:34 UTC

For whatever it is worth, BOINC 5.2.2 is out now.
> Good to know....Thanks, Bob. Will check it out. Cheers, Rog.
ID: 1456 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1458 - Posted: 18 Oct 2005, 2:38:59 UTC

5.2.2 is not "out" except as a test release. Use at your own risk ...
ID: 1458 · Rating: 0
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1460 - Posted: 18 Oct 2005, 3:04:09 UTC - in response to Message 1458.  

5.2.2 is not "out" except as a test release. Use at your own risk ...

OK, thanks for the 'heads up', Paul. (maybe 'heads up' has a different meaning for an ex-sailor??:) Cheers, Rog.
ID: 1460 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 93
Message 1463 - Posted: 18 Oct 2005, 10:06:18 UTC - in response to Message 1452.  
Last modified: 18 Oct 2005, 10:13:07 UTC

I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with.


What will the new App # be David, right now I have the Rosetta 4.77 WU's ... ???
The reason I ask is because I'm trying to run the WU's I have now down to just a couple on each Computer, so when the New App comes out I can start running them right away and see if they Hang or not at certain % Points ... :)
ID: 1463 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1474 - Posted: 18 Oct 2005, 17:09:39 UTC - in response to Message 1460.  


OK, thanks for the 'heads up', Paul. (maybe 'heads up' has a different meaning for an ex-sailor??:) Cheers, Rog.

Yeah, no more lines ... :)
ID: 1474 · Rating: 0
kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 1477 - Posted: 18 Oct 2005, 17:39:26 UTC - in response to Message 1447.  
Last modified: 18 Oct 2005, 17:47:29 UTC


I am aware of this problem and have been looking into it. I actually ran into this myself on my laptop running a 1acf_abrelax WU, which allowed me to look closely at the issue and try to debug... but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run, so it looks like it may not be an issue with the Rosetta application but possibly with the BOINC client. I'll keep looking into it.


I am now running one of these WU's as well, it paused @ 1%, 8.33% now at 16.67%, and still sitting there. Tried the exiting of BOINC and rebooting and still stuck. Just posting for info. Thanks

Jeremy
[edit] Now paused at 25%.[/edit]

ID: 1477 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 1481 - Posted: 18 Oct 2005, 19:57:58 UTC - in response to Message 1477.  

David,

A lot of people are concerned over these 1% (and other points) hangs. In your learned opinion, how long (in CPU time) should a person wait before they actually call it a hang? Be generous, some folks have slow computers. My concern is that people are giving up early. I have seen some "hang" for about 1.5 hrs and then continue on (they were not hung)... and this is on a Pent 4 running 2.4GHz.

ID: 1481 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1482 - Posted: 18 Oct 2005, 20:10:49 UTC

If it hangs for over 3 hours, it is likely to be stuck. I am going to change the rsc_fpops_bound value for new WU's so that it will abort WU's that are stuck (exceed the time it takes to do this upper bound of floating-point operations based on the computer's benchmark). I am hoping the updated BOINC api will deal with this issue because it does not seem like it is our application.
ID: 1482 · Rating: 0
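
The abort rule described above amounts to dividing the work unit's floating-point operation bound by the host's benchmarked speed. A rough Python sketch follows; the function names and the example numbers are illustrative assumptions, not values used by the project, and the real client may apply details not covered here.

```python
def abort_time_seconds(rsc_fpops_bound, benchmark_flops):
    """Time a host needs to perform `rsc_fpops_bound` operations
    at its benchmarked speed (operations per second)."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark_flops must be positive")
    return rsc_fpops_bound / benchmark_flops


def is_over_bound(cpu_seconds, rsc_fpops_bound, benchmark_flops):
    """True when a work unit has used more CPU time than the bound allows."""
    return cpu_seconds > abort_time_seconds(rsc_fpops_bound, benchmark_flops)


# Hypothetical numbers: a bound of 2e13 operations on a host
# benchmarked at 1e9 floating-point operations per second.
limit = abort_time_seconds(2e13, 1e9)
print(f"abort after {limit:.0f} s ({limit / 3600:.2f} h)")

# A work unit sitting at 7 hours of CPU time would exceed that limit.
print(is_over_bound(7 * 3600, 2e13, 1e9))
```

Because the limit scales with the benchmark, a slower host gets proportionally more time before the same work unit is aborted, which fits the concern raised earlier about slow computers.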
AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1484 - Posted: 18 Oct 2005, 20:31:18 UTC - in response to Message 1481.  

David,

A lot of people are concerned over these 1% (and other points) hangs. In your learned opinion, how long (in CPU time) should a person wait before they actually call it a hang? Be generous, some folks have slow computers. My concern is that people are giving up early. I have seen some "hang" for about 1.5 hrs and then continue on (they were not hung)... and this is on a Pent 4 running 2.4GHz.

>Rest assured we are talking about the 8-12 hours or so range. Your point is well taken though as the % readout is not linear. Thanks for the fix, David. Cheers,Rog.
ID: 1484 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1495 - Posted: 18 Oct 2005, 23:11:13 UTC
Last modified: 18 Oct 2005, 23:11:44 UTC

Just for the record.

I just restarted BOINC as I had a WU going nowhere - still on 1% after 2 hours on a 3.4GHz P4 with 1GB RAM. stderr.txt was empty (zero bytes). Since restarting, that same WU (1btn_abrelax_04549_2) has crunched for just 25 minutes and it's at 41.67%

David, if these things happen (for those of us willing and able to spend time monitoring) is there anything we can glean from the stdout.txt file (or another file in the slot) in terms of seeing whether it's stuck or not?
*** Join BOINC@Australia today ***
ID: 1495 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1498 - Posted: 19 Oct 2005, 0:04:08 UTC - in response to Message 1495.  

Just for the record.

I just restarted BOINC as I had a WU going nowhere - still on 1% after 2 hours on a 3.4GHz P4 with 1GB RAM. stderr.txt was empty (zero bytes). Since restarting, that same WU (1btn_abrelax_04549_2) has crunched for just 25 minutes and it's at 41.67%

David, if these things happen (for those of us willing and able to spend time monitoring) is there anything we can glean from the stdout.txt file (or another file in the slot) in terms of seeing whether it's stuck or not?



Good question. The stdout file should grow in smaller time increments than the structures are produced, so if you do not see output being appended to the stdout file for over an hour, particularly when it is still at the initial stages (1%), it is most likely stuck.
ID: 1498 · Rating: 0
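
The check described above can be approximated by looking at how long ago the stdout file was last written. This is a minimal sketch in Python, assuming the file's modification time reflects the last append; the one-hour threshold comes from the post, while the function names and the example path are hypothetical.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stuck(path, threshold_seconds=3600, now=None):
    """True when nothing has been written to `path` for longer than
    `threshold_seconds` (one hour by default)."""
    return seconds_since_last_write(path, now) > threshold_seconds
```

A call such as `looks_stuck("slots/0/stdout.txt")` (path assumed for illustration) would return True once the file has gone more than an hour without a write. A quiet file is only a hint, so it is worth combining with the progress percentage before aborting anything.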
©2024 University of Washington
https://www.bakerlab.org