Restarting Results ad infinitum II

Message boards : Number crunching : Restarting Results ad infinitum II

AMDave

Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 29513 - Posted: 17 Oct 2006, 14:18:32 UTC

Running BOINC Mgr v5.2.13, Rosetta v5.32

Been having a problem for more than a month now. I've experienced this before (see the thread "Restarting Results ad infinitum"), but I don't remember doing anything in particular to alleviate it. In my preferences, "Leave applications in memory while suspended?" = YES. As before, I have selected Reset Project from the BOINC mgr.

Take a look at the following excerpt from the Messages tab:

10/16/2006 11:47:16 PM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/16/2006 11:47:16 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/16/2006 11:47:16 PM||request_reschedule_cpus: process exited
10/16/2006 11:47:16 PM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 12:44:09 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 12:44:09 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 12:44:09 AM||request_reschedule_cpus: process exited
10/17/2006 12:44:09 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 1:40:48 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 1:40:48 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 1:40:48 AM||request_reschedule_cpus: process exited
10/17/2006 1:40:48 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 2:37:30 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 2:37:30 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 2:37:30 AM||request_reschedule_cpus: process exited
10/17/2006 2:37:30 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 3:20:35 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 3:20:35 AM|rosetta@home|Reason: To fetch work
10/17/2006 3:20:35 AM|rosetta@home|Requesting 1203 seconds of new work, and reporting 1 results
10/17/2006 3:20:40 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 3:20:42 AM|rosetta@home|Started download of 1mkyA.fasta.gz
10/17/2006 3:20:42 AM|rosetta@home|Started download of 1mkyA.psipred_ss2.gz
10/17/2006 3:20:43 AM|rosetta@home|Finished download of 1mkyA.fasta.gz
10/17/2006 3:20:43 AM|rosetta@home|Throughput 270 bytes/sec
10/17/2006 3:20:43 AM|rosetta@home|Finished download of 1mkyA.psipred_ss2.gz
10/17/2006 3:20:43 AM|rosetta@home|Throughput 1821 bytes/sec
10/17/2006 3:20:43 AM|rosetta@home|Started download of aa1mkyA03_05.400_v1_3.gz
10/17/2006 3:20:43 AM|rosetta@home|Started download of aa1mkyA09_05.400_v1_3.gz
10/17/2006 3:21:18 AM|rosetta@home|Finished download of aa1mkyA03_05.400_v1_3.gz
10/17/2006 3:21:18 AM|rosetta@home|Throughput 43828 bytes/sec
10/17/2006 3:21:18 AM|rosetta@home|Started download of 1mky.pdb.gz
10/17/2006 3:21:19 AM|rosetta@home|Finished download of 1mky.pdb.gz
10/17/2006 3:21:19 AM|rosetta@home|Throughput 12323 bytes/sec
10/17/2006 3:21:45 AM|rosetta@home|Finished download of aa1mkyA09_05.400_v1_3.gz
10/17/2006 3:21:45 AM|rosetta@home|Throughput 63873 bytes/sec
10/17/2006 3:21:46 AM||request_reschedule_cpus: files downloaded
10/17/2006 3:34:19 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 3:34:19 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 3:34:19 AM||request_reschedule_cpus: process exited
10/17/2006 3:34:19 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 3:56:10 AM||request_reschedule_cpus: process exited
10/17/2006 3:56:10 AM|rosetta@home|Computation for result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 finished
10/17/2006 3:56:10 AM|rosetta@home|Starting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 3:56:12 AM|rosetta@home|Started upload of DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0_0
10/17/2006 3:56:22 AM|rosetta@home|Finished upload of DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0_0
10/17/2006 3:56:22 AM|rosetta@home|Throughput 26289 bytes/sec
10/17/2006 4:30:46 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 4:30:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 4:30:46 AM||request_reschedule_cpus: process exited
10/17/2006 4:30:46 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 5:27:27 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 5:27:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 5:27:27 AM||request_reschedule_cpus: process exited
10/17/2006 5:27:27 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 6:23:43 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 6:23:43 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 6:23:43 AM||request_reschedule_cpus: process exited
10/17/2006 6:23:43 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 7:00:48 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 7:00:48 AM|rosetta@home|Reason: To fetch work
10/17/2006 7:00:48 AM|rosetta@home|Requesting 415 seconds of new work, and reporting 1 results
10/17/2006 7:00:53 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 7:00:55 AM||request_reschedule_cpus: files downloaded
10/17/2006 7:20:00 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 7:20:00 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 7:20:00 AM||request_reschedule_cpus: process exited
10/17/2006 7:20:00 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 8:02:05 AM||request_reschedule_cpus: process exited
10/17/2006 8:02:05 AM|rosetta@home|Computation for result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 finished
10/17/2006 8:02:05 AM|rosetta@home|Starting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 8:02:07 AM|rosetta@home|Started upload of 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0_0
10/17/2006 8:02:13 AM|rosetta@home|Finished upload of 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0_0
10/17/2006 8:02:13 AM|rosetta@home|Throughput 24001 bytes/sec
10/17/2006 8:16:27 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 8:16:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 8:16:27 AM||request_reschedule_cpus: process exited
10/17/2006 8:16:27 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 9:12:38 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 9:12:38 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 9:12:38 AM||request_reschedule_cpus: process exited
10/17/2006 9:12:38 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 9:24:08 AM||request_reschedule_cpus: project op
10/17/2006 9:24:12 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 9:24:12 AM|rosetta@home|Reason: Requested by user
10/17/2006 9:24:12 AM|rosetta@home|Reporting 1 results
10/17/2006 9:24:17 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 10:09:00 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 10:09:00 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 10:09:00 AM||request_reschedule_cpus: process exited
10/17/2006 10:09:00 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532


Thanks for your help.
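
One thing worth measuring in the excerpt above is the spacing between the "exited with zero status" messages, since a steady interval would point toward something periodic on the machine rather than a particular work unit. The following is a minimal Python sketch, assuming the Messages tab text has been pasted into a list of lines in the `timestamp|project|message` layout shown above. The names `exit_gaps`, `EXIT_MARKER` and `STAMP_FORMAT` are made up for this illustration and are not part of BOINC.

```python
from datetime import datetime

# Text that marks the event of interest, copied from the excerpt above.
EXIT_MARKER = "exited with zero status but no 'finished' file"
# Timestamp layout used in the excerpt, e.g. "10/17/2006 1:40:48 AM".
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def exit_gaps(lines):
    """Return the seconds between consecutive 'no finished file' exits."""
    stamps = []
    for line in lines:
        parts = line.strip().split("|", 2)
        if len(parts) == 3 and EXIT_MARKER in parts[2]:
            stamps.append(datetime.strptime(parts[0], STAMP_FORMAT))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

# Three exit events and one unrelated line, taken from the excerpt.
sample = [
    "10/16/2006 11:47:16 PM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
    "10/16/2006 11:47:16 PM||request_reschedule_cpus: process exited",
    "10/17/2006 12:44:09 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
    "10/17/2006 1:40:48 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
]

for gap in exit_gaps(sample):
    print(f"{gap // 60} min {gap % 60} s")
```

In the excerpt the exits recur roughly every 56 to 57 minutes, including across the switch from one result to the next, which may be a useful clue when looking for a scheduled task or timer on the system.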
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29515 - Posted: 17 Oct 2006, 14:41:46 UTC

This question was posed on the dev mailing list. Here was Dr. Anderson's reply:

David Anderson to Andre, boinc_projects
Oct 2

The Manager isn't involved here.
This error means that the core client stopped running
(i.e. it crashed, or it's stopped in a debugger)

-- David


Andre Kerstens wrote:
> Hi all,
>
> Some of our crunchers at D@H have the following message in their stderr.txt:
>
> No heartbeat from core client for 31 sec - exiting
>
> Is this a problem where the boincmgr cannot contact the boinc core
> client or could it be a problem with our app? I suspect it is the first
> one, but like to find out if that is the case.
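
For readers unfamiliar with the heartbeat idea behind that stderr message, here is a toy Python sketch of a worker that gives up when its supervisor goes quiet. It is not BOINC's code: the class name `HeartbeatWatchdog` is hypothetical, and the 30-second threshold is an assumption inferred from the "31 sec" in the quoted message.

```python
class HeartbeatWatchdog:
    """Toy model of a worker that exits when its supervisor stops signalling."""

    def __init__(self, timeout_sec=30.0, now=0.0):
        self.timeout_sec = timeout_sec  # assumed threshold, not taken from BOINC source
        self.last_beat = now            # time of the most recent heartbeat

    def beat(self, now):
        """Record that the supervisor was heard from at time `now`."""
        self.last_beat = now

    def should_exit(self, now):
        """True once the silence has lasted longer than the timeout."""
        return (now - self.last_beat) > self.timeout_sec

watchdog = HeartbeatWatchdog(timeout_sec=30.0, now=0.0)
watchdog.beat(now=10.0)
print(watchdog.should_exit(now=35.0))  # 25 s of silence
print(watchdog.should_exit(now=41.0))  # 31 s of silence
```

Times are passed in explicitly so the behaviour can be checked without waiting; a real worker would read a clock instead.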

Then Rom Walton added:

Rom Walton to Nicolas, Andre, boinc_projects
Oct 2

I have also seen this happen when the science applications have a memory
corruption issue.

An example would be when an application defines a static character array
of x number of characters but uses a function like sprintf and formats a
string of x+25 in size. Since the runtime library uses a bunch of
static variables the linker lumps them all together when generating the
binary and so they are sensitive to buffer overruns.


----- Rom
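
The overrun Rom describes is a C problem, but its effect can be imitated in a few lines of Python. This is only a toy model under stated assumptions: one `bytearray` stands in for the region where the linker placed two static variables back to back, `unchecked_write` mimics an unbounded `sprintf`, and `bounded_write` mimics a size-limited `snprintf`. All names and sizes here are invented for the illustration.

```python
BUF_SIZE = 8          # declared size of the pretend "static char buf[8]"
SENTINEL = b"OKAY"    # stands in for an unrelated neighboring static variable

def fresh_memory():
    """Return the buffer followed directly by its neighbor."""
    return bytearray(BUF_SIZE) + bytearray(SENTINEL)

def unchecked_write(mem, text):
    """Mimic sprintf: copy every byte plus a NUL, ignoring BUF_SIZE."""
    data = text.encode("ascii") + b"\x00"
    mem[0:len(data)] = data

def bounded_write(mem, text):
    """Mimic snprintf: truncate so the text plus NUL fits in BUF_SIZE."""
    data = text.encode("ascii")[:BUF_SIZE - 1] + b"\x00"
    mem[0:len(data)] = data

def neighbor(mem):
    """Read back the bytes that belong to the neighboring variable."""
    return bytes(mem[BUF_SIZE:BUF_SIZE + len(SENTINEL)])

overrun = fresh_memory()
unchecked_write(overrun, "result_0042")   # 11 characters plus NUL = 12 bytes
safe = fresh_memory()
bounded_write(safe, "result_0042")

print(neighbor(overrun) == SENTINEL)  # compares the neighbor after the unbounded write
print(neighbor(safe) == SENTINEL)     # compares the neighbor after the bounded write
```

In the unbounded case the last four bytes land in the neighbor, which is the kind of silent damage that can make a program misbehave far from the line that caused it.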

WendyR
Joined: 7 Dec 05
Posts: 10
Credit: 215,574
RAC: 0
Message 29517 - Posted: 17 Oct 2006, 15:12:44 UTC

I have seen those "exited with zero status but no 'finished' file" errors quite a few times too.

I found a couple of events that "triggered" them in my case. I got one of those each time I closed the cover on my laptop to move it somewhere else. That triggers a "go to sleep" mode in my laptop, and the exact order in which things happen during the "sleep" and the corresponding "wakeup" process is probably causing this.

I also get that message when I try to do something with the "client_state.xml" file at the same time that the BOINC manager wants it. In my case, I was opening "BOINC Debt Viewer" when BOINC decided to switch between tasks.

I know that other people are doing stuff with grep to monitor things in that file. Are you running something else that is attempting to look at that file? Is your virus checking software examining that file a lot because it is getting changed regularly? Do you run some automated backup or indexing software that is hitting that file regularly?

Just some things to think about...
Michael G.R.
Joined: 11 Nov 05
Posts: 264
Credit: 11,247,510
RAC: 0
Message 29520 - Posted: 17 Oct 2006, 16:51:03 UTC - in response to Message 29513.  

> Running BOINC Mgr v5.2.13, Rosetta v5.32


I have no idea if it has anything to do with your problem, but I would recommend upgrading to the latest version of BOINC: 5.4.11
AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 29583 - Posted: 18 Oct 2006, 15:00:02 UTC

To Wendy R:
No to all Qs.

Short of installing a newer mgr, is there anything else that I can do/look for to correct this? I've had good luck with this version of the mgr and would consider installing another as a last resort. Can anyone offer any links, perhaps, where I could further research this matter?

To mmciastro:
Could you suggest any particular links/avenues for me to pursue?
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29584 - Posted: 18 Oct 2006, 15:14:01 UTC

My suggestion is to do nothing.

Yes, that's right, nothing.

That error message in and of itself is not uncommon. I see your last eleven WUs have not errored out. The only two that have errored out with just that message are both FRA 2rio WUs, and there is evidence from others that there seems to be an issue with that WU on some systems (they haven't nailed it down, as far as I know). The only other errors were due to you "aborting" the work.

So, I'd do nothing and just keep an eye on your upcoming work.

tony
AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 29587 - Posted: 18 Oct 2006, 15:52:01 UTC
Last modified: 18 Oct 2006, 15:52:40 UTC

I aborted those WUs b/c they were assigned to the previous version of Rosetta (5.25 I think) and I wanted to see if the new version would fix the issue.

I understand the 'if it ain't broke, don't fix it' basis for your suggestion. However, I'm inclined to believe that it is broke, b/c it's not functioning properly. This issue is wasting electricity and, unfortunately, I don't have the financial means to ignore this any further. Let me add that I've dedicated 100% of my DC resources to Rosetta and would like to give the project the most bang for its buck.

Also, when something unexpected happens with regard to either the hardware or software comprising my system, I get very curious and go into research mode.
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29588 - Posted: 18 Oct 2006, 16:15:52 UTC

OK, if you want to research, keep an eye on your tasks until one is completed and another starts. While it's running, start and stop BOINC a set number of times, then let it finish without letting it go into screensaver or changing any other conditions. After it's uploaded and reported, check the Result ID of that WU and see if you don't get that error once for each stop/start of BOINC. Play around with the screensaver settings on the next WU, forcing the screensaver to stop/start a recordable number of times, and see if you get them. You might find a correlation between them. This will only address that message. If you look at all the recent threads, you'll notice that many of the reported errors pertain to users attempting to run FRA 2rio. If the FRA 2rio WUs have problems on certain computers, then it falls to the project scientists to look at this issue and figure out why.
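
The bookkeeping for that experiment can be sketched in Python: count the error messages per result and subtract the number of stop/starts that were done on purpose. The sample lines come from the log earlier in this thread; the `recorded_restarts` figure is a hypothetical notebook entry, and the function names are invented for this illustration.

```python
import re
from collections import Counter

# Captures the result name from the message text seen in the log.
EXIT_PATTERN = re.compile(
    r"Result (\S+) exited with zero status but no 'finished' file"
)

def exits_per_result(lines):
    """Count 'no finished file' exits for each result name."""
    counts = Counter()
    for line in lines:
        match = EXIT_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def unexplained_exits(observed, recorded):
    """Exits left over after subtracting the deliberate stop/starts."""
    return {name: seen - recorded.get(name, 0) for name, seen in observed.items()}

log_lines = [
    "10/17/2006 8:16:27 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
    "10/17/2006 8:16:27 AM||request_reschedule_cpus: process exited",
    "10/17/2006 9:12:38 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
    "10/17/2006 10:09:00 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file",
]

# Hypothetical: no deliberate stop/starts were made while this result ran.
recorded_restarts = {"DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0": 0}

observed = exits_per_result(log_lines)
print(unexplained_exits(observed, recorded_restarts))
```

If the leftover count stays at zero across several work units, the message is probably just tracking the manual stop/starts; a leftover that keeps growing suggests something else is interrupting the task.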
AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 29590 - Posted: 18 Oct 2006, 18:13:00 UTC

I'll try the stop/start method. Regarding the screensaver, I've never used it. Instead, I have one initiated by the OS. The only times my system is off are either during an electrical storm, or when I need to work on another PC.

How long have the FRA 2rio WUs been circulating? Remember, this problem has been affecting my system for about a month now. This leads me to wonder if the WU 'family' is a moot point.
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29591 - Posted: 18 Oct 2006, 18:22:03 UTC

The earliest FRA 2rio on my record was 14 Oct. The no heartbeat message has been around since early in the history of boinc.
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29598 - Posted: 18 Oct 2006, 19:57:33 UTC - in response to Message 29583.  
Last modified: 18 Oct 2006, 20:04:00 UTC

> To Wendy R:
> No to all Qs.
>
> Short of installing a newer mgr, is there anything else that I can do/look for to correct this? I've had good luck with this version of the mgr and would consider installing another as a last resort. Can anyone offer any links, perhaps, where I could further research this matter?
>
> To mmciastro:
> Could you suggest any particular links/avenues for me to pursue?



Honestly, there is nothing wrong with the 5.4.11 version, and it fixes bugs in the earlier versions.
If you are worried (I don't know why, as it is in wide use now), stop BOINC, copy the folder, then install the updated version. If all goes wrong you can just delete and replace :-)

Though in fact you can just install the update; if it crashes and burns, then just uninstall and reinstall the older version. Nothing lost.
It's a much quicker test than sitting and watching.
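
The "copy the folder" step can be sketched in Python as follows, assuming BOINC has already been stopped. The function name `backup_folder` and the label are invented for this illustration, and the demonstration runs on throwaway directories rather than a real BOINC install, whose location varies by system.

```python
import shutil
import tempfile
from pathlib import Path

def backup_folder(source, backup_root, label):
    """Copy the whole folder to backup_root/<name>-<label> and return the new path."""
    source = Path(source)
    target = Path(backup_root) / f"{source.name}-{label}"
    shutil.copytree(source, target)  # raises FileExistsError if the backup already exists
    return target

# Demonstration on scratch directories so nothing real is touched.
with tempfile.TemporaryDirectory() as scratch:
    fake_boinc = Path(scratch) / "BOINC"
    fake_boinc.mkdir()
    (fake_boinc / "client_state.xml").write_text("<client_state/>")
    copy = backup_folder(fake_boinc, scratch, "pre-5.4.11")
    print(copy.name, sorted(p.name for p in copy.iterdir()))
```

Rolling back is then a matter of deleting the upgraded folder and moving the copy into its place, as described above.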
Team mauisun.org
AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 29610 - Posted: 19 Oct 2006, 1:02:05 UTC

It's not that I'm apprehensive about upgrading; it's just that I'm curious. I installed it last Dec, it worked fine for about six months, developed an issue, the issue went away (if only I could remember what I did to make it go away), it worked fine for a couple more months, then the same issue appeared.

I performed some alterations on my system this summer, but this issue resurfaced some time after the fact. The end result most likely will be that I'll perform the upgrade. Until then, I'd like to try some investigating first.
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29626 - Posted: 19 Oct 2006, 8:44:56 UTC - in response to Message 29610.  

> It's not that I'm apprehensive about upgrading; it's just that I'm curious. I installed it last Dec, it worked fine for about six months, developed an issue, the issue went away (if only I could remember what I did to make it go away), it worked fine for a couple more months, then the same issue appeared.
>
> I performed some alterations on my system this summer, but this issue resurfaced some time after the fact. The end result most likely will be that I'll perform the upgrade. Until then, I'd like to try some investigating first.


I'm not saying it will work either, but I haven't seen that sort of error since the 5.2.x series. Don't know if it was the BOINC client or not though ;-)

Could it be a virus scanner, firewall (there are known issues with some newer firewalls, hence the release of 5.4.11), defrag program, etc. kicking in and causing it to get pushed out of the way?
Maybe your computer is trying to suspend or hibernate at that time, or your hard drive for some reason is trying to stop spinning (though an OS drive should not stop spinning).


It is not just a particular type of task that is doing it; all of your tasks seem to be doing it.

I would recommend a BOINC install as the test; it could just be something set up wrong. Also reset your preferences to default to start with.

Team mauisun.org