Message boards : Number crunching : Restarting Results ad infinitum II
Author | Message |
---|---|
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
Running BOINC Mgr v5.2.13, Rosetta v5.32

Been having a problem for more than a month now. I've experienced this before (see this thread "Restarting Results ad infinitum"), however, I don't remember doing anything in particular to alleviate the problem. In my preferences, "Leave applications in memory while suspended?" = YES. As before, I have selected Reset Project from the BOINC mgr. Take a look at the following excerpt from the Messages tab:

10/16/2006 11:47:16 PM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/16/2006 11:47:16 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/16/2006 11:47:16 PM||request_reschedule_cpus: process exited
10/16/2006 11:47:16 PM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 12:44:09 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 12:44:09 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 12:44:09 AM||request_reschedule_cpus: process exited
10/17/2006 12:44:09 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 1:40:48 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 1:40:48 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 1:40:48 AM||request_reschedule_cpus: process exited
10/17/2006 1:40:48 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 2:37:30 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 2:37:30 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 2:37:30 AM||request_reschedule_cpus: process exited
10/17/2006 2:37:30 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 3:20:35 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 3:20:35 AM|rosetta@home|Reason: To fetch work
10/17/2006 3:20:35 AM|rosetta@home|Requesting 1203 seconds of new work, and reporting 1 results
10/17/2006 3:20:40 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 3:20:42 AM|rosetta@home|Started download of 1mkyA.fasta.gz
10/17/2006 3:20:42 AM|rosetta@home|Started download of 1mkyA.psipred_ss2.gz
10/17/2006 3:20:43 AM|rosetta@home|Finished download of 1mkyA.fasta.gz
10/17/2006 3:20:43 AM|rosetta@home|Throughput 270 bytes/sec
10/17/2006 3:20:43 AM|rosetta@home|Finished download of 1mkyA.psipred_ss2.gz
10/17/2006 3:20:43 AM|rosetta@home|Throughput 1821 bytes/sec
10/17/2006 3:20:43 AM|rosetta@home|Started download of aa1mkyA03_05.400_v1_3.gz
10/17/2006 3:20:43 AM|rosetta@home|Started download of aa1mkyA09_05.400_v1_3.gz
10/17/2006 3:21:18 AM|rosetta@home|Finished download of aa1mkyA03_05.400_v1_3.gz
10/17/2006 3:21:18 AM|rosetta@home|Throughput 43828 bytes/sec
10/17/2006 3:21:18 AM|rosetta@home|Started download of 1mky.pdb.gz
10/17/2006 3:21:19 AM|rosetta@home|Finished download of 1mky.pdb.gz
10/17/2006 3:21:19 AM|rosetta@home|Throughput 12323 bytes/sec
10/17/2006 3:21:45 AM|rosetta@home|Finished download of aa1mkyA09_05.400_v1_3.gz
10/17/2006 3:21:45 AM|rosetta@home|Throughput 63873 bytes/sec
10/17/2006 3:21:46 AM||request_reschedule_cpus: files downloaded
10/17/2006 3:34:19 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 3:34:19 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 3:34:19 AM||request_reschedule_cpus: process exited
10/17/2006 3:34:19 AM|rosetta@home|Restarting result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 3:56:10 AM||request_reschedule_cpus: process exited
10/17/2006 3:56:10 AM|rosetta@home|Computation for result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 finished
10/17/2006 3:56:10 AM|rosetta@home|Starting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 3:56:12 AM|rosetta@home|Started upload of DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0_0
10/17/2006 3:56:22 AM|rosetta@home|Finished upload of DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0_0
10/17/2006 3:56:22 AM|rosetta@home|Throughput 26289 bytes/sec
10/17/2006 4:30:46 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 4:30:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 4:30:46 AM||request_reschedule_cpus: process exited
10/17/2006 4:30:46 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 5:27:27 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 5:27:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 5:27:27 AM||request_reschedule_cpus: process exited
10/17/2006 5:27:27 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 6:23:43 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 6:23:43 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 6:23:43 AM||request_reschedule_cpus: process exited
10/17/2006 6:23:43 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 7:00:48 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 7:00:48 AM|rosetta@home|Reason: To fetch work
10/17/2006 7:00:48 AM|rosetta@home|Requesting 415 seconds of new work, and reporting 1 results
10/17/2006 7:00:53 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 7:00:55 AM||request_reschedule_cpus: files downloaded
10/17/2006 7:20:00 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
10/17/2006 7:20:00 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 7:20:00 AM||request_reschedule_cpus: process exited
10/17/2006 7:20:00 AM|rosetta@home|Restarting result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 using rosetta version 532
10/17/2006 8:02:05 AM||request_reschedule_cpus: process exited
10/17/2006 8:02:05 AM|rosetta@home|Computation for result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 finished
10/17/2006 8:02:05 AM|rosetta@home|Starting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 8:02:07 AM|rosetta@home|Started upload of 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0_0
10/17/2006 8:02:13 AM|rosetta@home|Finished upload of 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0_0
10/17/2006 8:02:13 AM|rosetta@home|Throughput 24001 bytes/sec
10/17/2006 8:16:27 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 8:16:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 8:16:27 AM||request_reschedule_cpus: process exited
10/17/2006 8:16:27 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 9:12:38 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 9:12:38 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 9:12:38 AM||request_reschedule_cpus: process exited
10/17/2006 9:12:38 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532
10/17/2006 9:24:08 AM||request_reschedule_cpus: project op
10/17/2006 9:24:12 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/17/2006 9:24:12 AM|rosetta@home|Reason: Requested by user
10/17/2006 9:24:12 AM|rosetta@home|Reporting 1 results
10/17/2006 9:24:17 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/17/2006 10:09:00 AM|rosetta@home|Result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 10:09:00 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
10/17/2006 10:09:00 AM||request_reschedule_cpus: process exited
10/17/2006 10:09:00 AM|rosetta@home|Restarting result DOC_1IAI_pose_u_pert_with_bbmin_1282_868_0 using rosetta version 532

Thanks for your help. |
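For anyone who wants to tally an excerpt like the one above rather than eyeball it, here is a rough Python sketch. It assumes the Messages tab text has been pasted into a string in the `date time|project|message` layout shown in the post; the function names (`parse_exits`, `gaps_in_seconds`) are made up for this example and are not part of BOINC. It counts the "no 'finished' file" exits per result and measures the time between consecutive exits. The four sample lines are copied from the excerpt.

```python
from collections import defaultdict
from datetime import datetime

# Four lines copied from the Messages tab excerpt in the post above.
SAMPLE_LOG = """\
10/16/2006 11:47:16 PM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 12:44:09 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 1:40:48 AM|rosetta@home|Result DOC_1GLA_pose_u_pert_with_bbmin_1282_868_0 exited with zero status but no 'finished' file
10/17/2006 4:30:46 AM|rosetta@home|Result 1dtj__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_7016_0 exited with zero status but no 'finished' file
"""

PREFIX = "Result "
MARKER = " exited with zero status but no 'finished' file"
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed from the excerpt's timestamps


def parse_exits(text):
    """Map each result name to the timestamps of its 'no finished file' exits."""
    exits = defaultdict(list)
    for line in text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a date|project|message line
        stamp, _project, message = parts
        if message.startswith(PREFIX) and message.endswith(MARKER):
            name = message[len(PREFIX):-len(MARKER)]
            exits[name].append(datetime.strptime(stamp, STAMP_FORMAT))
    return dict(exits)


def gaps_in_seconds(times):
    """Whole seconds between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for name, times in parse_exits(SAMPLE_LOG).items():
        print(name, "exits:", len(times), "gaps (s):", gaps_in_seconds(times))
```

On the sample lines the gaps come out a little under an hour each. A regular interval like that does not identify the cause, but it is the sort of pattern worth comparing against scheduled jobs or power settings on the machine.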
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
This question was posed on the dev mailing list. Here was Dr. Anderson's reply:

David Anderson to Andre, boinc_projects (Oct 2):
The Manager isn't involved here. This error means that the core client stopped running (i.e. it crashed, or it's stopped in a debugger)
-- David

Andre Kerstens wrote:
> Hi all,
>
> Some of our crunchers at D@H have the following message in their stderr.txt:
>
> No heartbeat from core client for 31 sec - exiting
>
> Is this a problem where the boincmgr cannot contact the boinc core client or could it be a problem with our app? I suspect it is the first one, but like to find out if that is the case.

Then Rom Walton added:

Rom Walton to Nicolas, Andre, boinc_projects (Oct 2):
I have also seen this happen when the science applications have a memory corruption issue. An example would be when an application defines a static character array of x number of characters but uses a function like sprintf and formats a string of x+25 in size. Since the runtime library uses a bunch of static variables the linker lumps them all together when generating the binary and so they are sensitive to buffer overruns.
----- Rom |
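To make the "no heartbeat" idea in the quoted replies a bit more concrete, here is a generic heartbeat-timeout sketch in Python. It is not BOINC's code: the class names `HeartbeatWatchdog` and `FakeClock` are invented for this example, and the 30-second limit is an assumption picked only because the quoted message mentions 31 seconds. The worker records the time each heartbeat arrives and treats the sender as gone once the silence exceeds the limit.

```python
import time


class HeartbeatWatchdog:
    """Tracks how long it has been since the last heartbeat was seen.

    Hypothetical illustration of a heartbeat timeout, not BOINC's implementation.
    """

    def __init__(self, timeout_sec=30.0, clock=time.monotonic):
        self._timeout_sec = timeout_sec  # assumed limit, not taken from BOINC
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat just arrived."""
        self._last_beat = self._clock()

    def seconds_silent(self):
        """Seconds elapsed since the most recent heartbeat."""
        return self._clock() - self._last_beat

    def expired(self):
        """True once the silence has exceeded the timeout."""
        return self.seconds_silent() > self._timeout_sec


class FakeClock:
    """Manually advanced clock so the example runs without real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    watchdog = HeartbeatWatchdog(timeout_sec=30.0, clock=clock)
    watchdog.beat()
    clock.advance(31)
    if watchdog.expired():
        print(f"No heartbeat for {watchdog.seconds_silent():.0f} sec - exiting")
```

Under this pattern the worker cannot tell why the heartbeats stopped. A crashed sender, a sender paused in a debugger, and a machine that was asleep for a while all look the same from the worker's side.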
WendyR Send message Joined: 7 Dec 05 Posts: 10 Credit: 215,574 RAC: 0 |
I have seen those "exited with zero status but no 'finished' file" errors quite a few times too. I found there were a couple of events that "triggered" them in my case. I got one of those each time I closed the cover on my laptop to move it somewhere else. That triggers a "go to sleep" mode in my laptop, and the exact order in which things happen during the "sleep" and the corresponding "wakeup" process is probably what causes this. I also get that message when I try to do something with the "client_state.xml" file at the same time that the BOINC manager wants it. In my case, I was opening "BOINC Debt Viewer" when BOINC decided to switch between tasks. I know that other people are doing stuff with grep to monitor things in that file. Are you running something else that is attempting to look at that file? Is your virus checking software examining that file a lot because it is getting changed regularly? Do you run some automated backup or indexing software that is hitting that file regularly? Just some things to think about... |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
"Running BOINC Mgr v5.2.13, Rosetta v5.32" I have no idea if it has anything to do with your problem, but I would recommend upgrading to the latest version of BOINC: 5.4.11 |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
To Wendy R: No to all Qs. Short of installing a newer mgr, is there anything else that I can do/look for to correct this? I've had good luck with this version of the mgr and would consider installing another as a last resort. Can anyone offer any links, perhaps, where I could further research this matter? To mmciastro: Could you suggest any particular links/avenues for me to pursue? |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
My suggestion is to do nothing. Yes, that's right, nothing. That error message in and of itself is not uncommon. I see your last eleven WUs have not errored out. The only two that have errored out with just that message are both FRA 2rio WUs, and there is evidence from others that there seems to be an issue with that WU on some systems (they haven't nailed it down that I know). The only other errors were due to you "aborting" the work. So, I'd do nothing and just keep an eye on your upcoming work. tony |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
I aborted those WUs b/c they were assigned to the previous version of Rosetta (5.25 I think) and I wanted to see if the new version would fix the issue. I understand the 'if it ain't broke, don't fix it' basis for your suggestion, however, I'm inclined to believe that it is broke, b/c it's not functioning properly. This issue is wasting electricity and, unfortunately, I don't have the financial means to ignore this any further. Let me add that I've dedicated 100% of my DC resources to Rosetta and would like to give the project the most bang for its buck. Also, when something unexpected happens with regard to either the hardware or software comprising my system, I get very curious and go into research mode. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
OK, if you want to research, keep an eye on your tasks until one is completed and another starts. While it's running, start and stop BOINC a set number of times, then let it finish without letting it go into screensaver or changing any other conditions. After it's uploaded and reported, check the Result ID of that WU and see if you don't get that error once for each stop/start of BOINC. Play around with the screensaver settings on the next WU, forcing the screensaver to stop/start a recordable number of times, and see if you get them. You might find a correlation between them. This will only address that message. If you look at all the recent threads, you'll notice that many of the reported errors pertain to users attempting to run FRA 2rio. If the FRA 2rio WUs have problems on certain computers, then it falls to the project scientists to look at this issue and figure out why. |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
I'll try the stop/start method. Regarding the screensaver, I've never used it. Instead, I have one initiated by the OS. The only times my system is off are either during an electrical storm, or when I need to work on another PC. How long have the FRA 2rio WUs been circulating? Remember, this problem has been affecting my system for about a month now. This leads me to wonder if the WU 'family' is a moot point. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
The earliest FRA 2rio on my record was 14 Oct. The no heartbeat message has been around since early in the history of boinc. |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
To Wendy R: honestly, there is nothing wrong with the 5.4.11 version, and it fixes bugs in the earlier versions. If you are worried (don't know why, as it is in wide use now), stop BOINC, copy the folder, then install the updated version. If all goes wrong you can just delete and replace :-) Though in fact you can just install the update; if it crashes and burns, then just uninstall and reinstall the older version. Nothing lost. It's a much quicker test than sitting and watching. Team mauisun.org |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
It's not that I'm apprehensive about upgrading, it's just that I'm curious. I installed it last Dec, it works fine for @ six months, develops an issue, the issue goes away (if only I could remember what I did to make it go away), it works fine for a couple more months, then the same issue appears. I performed some alterations on my system this summer, but this issue resurfaced some time after the fact. The end result most likely will be that I'll perform the upgrade. Until then, I'd like to try some investigating first. |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
"It's not that I'm apprehensive about upgrading, it's just that I'm curious. I installed it last Dec, it works fine for @ six months, develops an issue, the issue goes away (if only I could remember what I did to make it go away), it works fine for a couple more months, then the same issue appears." I'm not saying it will work either, but I haven't seen that sort of error since the 5.2.x series. Don't know if it was the BOINC client or not though ;-) Could it be a virus scanner, firewall (known issues with some newer firewalls, hence the release of 5.4.11), defrag program, antivirus etc. kicking in and causing it to get pushed out of the way? Maybe your computer is trying to suspend or hibernate at that time, or your hard drive for some reason is trying to stop spinning (though an OS drive should not stop spinning). It is not just a particular type of task that is doing it; all of your tasks seem to be doing it. I would recommend a BOINC install as the test; it could just be something set up wrong. Also reset your preferences to default to start with. Team mauisun.org |
©2024 University of Washington
https://www.bakerlab.org