Message boards : Number crunching : Too many restarts with no progress
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2220 Credit: 42,306,715 RAC: 24,311 |
A new member of my team has just started crunching, and I'm not sure whether they're having problems. I'm guessing they are, even though all tasks are reporting success. This task ran for barely 30 minutes instead of 3 hours. Lots of messages came back as if it was struggling to run and had to restart repeatedly, culminating in the message "Too many restarts with no progress. Keep application in memory while preempted" before closing down cleanly. I have lots of RAM and one of my jobs looks like this. I've asked him to ensure he has "Leave applications in memory while suspended" ticked - many fewer messages in the task. He has a dual core with 1 GB of RAM, which may be tight. He's raised the memory BOINC may use while the computer is in use from 60% to 90%, and more recent jobs have run longer and with fewer messages. Does this seem to be a memory-related issue, as I suspect, or could it indicate some other problem? 1 GB of RAM (on an XP machine) ought to be plenty really. Any further suggestions I could look at? All advice appreciated. |
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
*Any further suggestions I could look at?* Morning Sid - from time to time I too have seen restarts on tasks with no other associated error messages - just something like "restarting from checkpoint ..." My first thought is that this is NOT related to the amount of available memory - but I will admit that, not having the source to BOINC or the application, I have not seen the logic behind triggering a restart. You make no mention of any system-related error messages, such as a segfault or a processor exception, so I would not rush to jump on a hardware issue. As far as available memory resources go, I would think that if real memory was not available you would either page fault and swap or the user would see the task go into the "waiting for memory" state - I saw that on my systems a few times when I first started using hex-core processors without upgrading memory first. One gig minus the OS overhead may not leave much for the Rosetta tasks, but I used to successfully run a tri-core AMD with two gig on Linux. If you still suspect system resources are the root cause of this issue, why not suggest that he bring up the system monitor and watch things for a while - he should be able to see free memory and swap activity. Another way to isolate the issue to a shortage of memory would be to go to the Computing Preferences page on his account and set the maximum number of processors to use to 1. If it is being caused by a memory shortage that should help - if it dogs out other systems on the account, oh well, at least you know what it is and what needs to be done to resolve it. Let us know what you come up with. |
mikey Joined: 5 Jan 06 Posts: 1896 Credit: 10,138,586 RAC: 20,966 |
*Any further suggestions I could look at?* Do you think it would help if he changed the 25% level at which BOINC stops crunching to 0%? I have done that on all of my machines and BOINC no longer stops crunching at all. All of my machines also have at least 2 gig of memory in them. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
One potential cause of the message that we sometimes forget is people rebooting their machine. If you reboot before a checkpoint is reached, that counts as "no progress" on the next restart. Do this several times and the application figures something isn't going well for this combination of task and host, so the task is sent home (I think it takes 5 restarts with no progress). Another thought: have you reviewed the BOINC settings for disk page file ("swap") space? If this were set very low, perhaps odd problems would arise. Rosetta Moderator: Mod.Sense |
Sid Celery Joined: 11 Feb 08 Posts: 2220 Credit: 42,306,715 RAC: 24,311 |
*Morning Sid - from time to time I too have seen restarts on tasks with no other associated error messages - just something like "restarting from checkpoint..."* It doesn't say that. It's just the repetition of previous messages. *You make no mention of any system related error messages - such as a segfault or a processor exception so I would not rush to jump on a hardware issue.* Fair comment. I'll ask if there are any other clues from the messages tab. At the moment I'm only going by the reported task details. *As far as available memory resources go, I would think that if real memory was not available you would either page fault and swap or the user would see the task go into the "waiting for memory" state - I saw that on my systems a few times when I first started using hex core processors without upgrading memory first.* Understood, but I'm not sure BOINC uses swap files too well. Still guessing here. *If you still suspect system resources are the root cause of this issue then why not suggest that he bring up the system monitor and sort of watch things for a while - he should be able to see free memory and swap activity.* I'm in recruitment mode and I'm reluctant to indicate BOINC/Rosetta needs this level of babysitting. I think I'd scare off more people than I recruit. *Another thing to try to isolate the issue to a shortage of memory would be to go to the Computing Preferences page on his account and set the maximum number of processors to use to 1. If it is being caused by a memory shortage that should help - if it dogs out other systems on the account oh well, at least you know what it is and what needs to be done to resolve it.* Nice idea. I'll keep that one up my sleeve for the moment in case things don't settle down. |
Sid Celery Joined: 11 Feb 08 Posts: 2220 Credit: 42,306,715 RAC: 24,311 |
*Do you think it would help if he changed the 25% level to stop crunching to 0%? I have done that on all of my machines and Boinc no longer stops crunching at all. All of my machines also have at least 2 gig of memory in them.* Very possible. This may explain better why it only seems to get so far then go back to the start unexpectedly. Good suggestions from everyone. I've pointed out this thread to the user - hopefully one or all of the suggestions makes the difference. Thanks. |
Sid Celery Joined: 11 Feb 08 Posts: 2220 Credit: 42,306,715 RAC: 24,311 |
*I've asked him to ensure he has "Leave applications in memory while suspended" ticked...* It wasn't ticked. *One potential cause of the message that we sometimes forget is people rebooting their machine. If you reboot before a checkpoint is reached, that is "no progress" on the next restart. Do this several times and the application figures something isn't going well for this combination of task and host, so that task is sent home (I think it takes 5 times restarting with no progress).* This may be an ongoing issue too. No task has outright failed yet, but being aware of possibilities always helps. Task details look much tidier now so I'm happy enough to close this issue now. Thanks all. |
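
For anyone trying to picture the rule Mod.Sense describes (a restart that finds no new checkpoint counts as "no progress", and enough of those in a row ends the task), here is a minimal sketch. It is not the Rosetta or BOINC source, which nobody in this thread has seen; `TaskState`, `on_restart` and the limit of 5 are hypothetical names and values chosen only to match the description above.

```python
from dataclasses import dataclass

# Assumed limit: the thread only says "I think it takes 5".
MAX_RESTARTS_WITHOUT_PROGRESS = 5


@dataclass
class TaskState:
    """Hypothetical per-task bookkeeping that survives a restart."""
    last_checkpoint: int = 0           # checkpoints written so far
    checkpoint_at_last_start: int = 0  # checkpoint count seen at the previous start
    stalled_restarts: int = 0          # consecutive restarts with no new checkpoint


def on_restart(state: TaskState) -> bool:
    """Record one restart and return True if the task should give up."""
    if state.last_checkpoint > state.checkpoint_at_last_start:
        # A checkpoint was written since the last start: progress was made.
        state.stalled_restarts = 0
    else:
        # Restarted before reaching a new checkpoint: no progress.
        state.stalled_restarts += 1
    state.checkpoint_at_last_start = state.last_checkpoint
    return state.stalled_restarts >= MAX_RESTARTS_WITHOUT_PROGRESS
```

Under these assumptions, a task that is removed from memory or rebooted five times in a row before its first checkpoint would end early, which fits the advice to tick "Leave applications in memory while suspended" so that a suspended task does not have to restart at all.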
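
The memory suggestions in the thread (raising the in-use limit from 60% to 90%, or dropping to one processor) can be compared with some rough arithmetic. This is only a sketch under stated assumptions: it takes "1 GB" as 1024 MB, treats the limit as a simple fraction of total RAM shared evenly between running tasks, and ignores what the OS itself uses. `memory_per_task_mb` is a hypothetical helper, not a BOINC function.

```python
def memory_per_task_mb(total_ram_mb: float, max_used_fraction: float,
                       running_tasks: int) -> float:
    """Rough per-task memory budget if the allowed RAM is split evenly."""
    if running_tasks < 1:
        raise ValueError("running_tasks must be at least 1")
    if not 0.0 <= max_used_fraction <= 1.0:
        raise ValueError("max_used_fraction must be between 0 and 1")
    return total_ram_mb * max_used_fraction / running_tasks


if __name__ == "__main__":
    total_ram_mb = 1024  # assumed size of the 1 GB machine in the thread
    for fraction, tasks in [(0.60, 2), (0.90, 2), (0.90, 1)]:
        budget = memory_per_task_mb(total_ram_mb, fraction, tasks)
        print(f"{fraction:.0%} limit, {tasks} task(s): about {budget:.0f} MB each")
```

On these assumptions the budget per task goes from roughly 307 MB (60%, two tasks) to roughly 461 MB (90%, two tasks) and roughly 922 MB (90%, one task), which is why limiting the machine to one processor is a reasonable way to test whether memory is the cause.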
©2025 University of Washington
https://www.bakerlab.org