Message boards : Number crunching : More checkpointing problems
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Seeing as this task hasn't finished yet, it may be worthwhile tracking how it's getting on, with just an excerpt of its attributes: Application Rosetta 4.07. So, 78 mins have passed with just 16 mins of CPU time, no further checkpoint, and the estimated time remaining has actually increased by 1 minute. No other PF tasks (or RB) are doing this. Two later PF tasks completed normally around the 8hr mark as expected. No idea what's going on. |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Followup data: The task with 8 hours uncheckpointed actually did checkpoint sometime before 10 hours, and it finally finished at around 12 hours. Right now I'm actually on a Linux box, one of my machines that rarely runs for a long period. It has a small supply of non PF... units and none of them appear to be sick puppies. I'm trying to avoid downloading any of the PF... units here, but worse than that, the project has apparently switched to the short-term rb... units. I see that one of them did the fancy finish with the Computation Error. If it crashed quickly (and I suspect it did), then there is little waste of my machine's computation time, but the Rosetta project is just wasting bandwidth for any data that was sent. It should NOT be a battle to participate "effectively" in the project. If the project is having trouble retaining volunteers, then perhaps there is a connection? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Application Rosetta 4.07 All a bit weird - still running... CPU time 09:44:54 Another 250mins have passed, only 50mins of CPU time further on, still no checkpoint, remaining time 3 minutes more <shrug> |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Ok, so it died not long after with a compute error. Final figures and std err report at the end. Application Rosetta 4.07 CPU time 09:50:24 Stderr report (edited for brevity) <core_client_version>7.12.1</core_client_version> |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
Sid, did you see the "Disk usage limit exceeded" error message in the STDERR? If BOINC exceeded your allocated disk space, disk writes would fail. <core_client_version>7.12.1</core_client_version> <![CDATA[ <message> Disk usage limit exceeded</message> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< <stderr_txt> sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Sid, did you see the "Disk usage limit exceeded" error message in the STDERR? No, I didn't notice it. Thanks for pointing it out. I have to say I was blinded by the extreme length of the report and glossed over that part. To be fair, this STDERR report is only revealed after the task has reported, so I didn't have any evidence of it earlier. That said, I allocate 10Gb of disk space to Rosetta and the ~40 tasks I hold in my buffer consume just short of 5Gb, leaving just over 5Gb spare. There was no sign of this getting called up while the job was running. I will add a couple of Gb more now though, as I have plenty to spare. While the disk line is obviously caused by 'something', I can't help looking at the 500 separate ERROR lines saying values are out of range. In my ignorance it does seem kind of relevant as to why this task has gone rogue the way it has. The job did run over 20 hours before crashing. Am I right to be more concerned by those 20hrs than the eventual crash it resulted in? I'll leave that to the experts, none of whom are me. I should emphasise, while I have plenty of issues with PF* tasks - reported over the last 8 months in the pinned thread - this particular one is a one-off. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience. One thing I haven't mentioned is that a lot of these PF tasks get to 567 hours still on the 1st model with like 580,000 steps. This particular one was on the 6th model, not just the 1st, if that makes a difference. This applies to pretty much all PF tasks I've looked at. Maybe this is why PF tasks generally lend themselves to problems, though I'm obviously guessing here. I'd appreciate it if someone took a look at the errors reported in the Rosetta 4.0x thread as well. Those show a much more common issue in my experience, resulting in Computing Errors. |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units. Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it managed to checkpoint itself while I'm writing this message). During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units. Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it managed to checkpoint itself while I'm writing this message). It was about the past PF problems. I've checked all my machines and I have no errors at all related to the current batch of PF jobs even though I definitely had the same issues as you last time. I do have some errors, but I think they're more related to my overclock - so, all about me, not the tasks. All my current running PF tasks on this machine have checkpointed within the last 11mins (1) 4mins (1) and under 2mins (6) |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Thanks for the data and sorry I haven't been checking in more frequently. Well, not really sorry, since that mostly means there are no problems that seem worth worrying about. Or back to the sorry side again, maybe not visiting just reflects a loss of hope of making things better... Latest peculiarities: (1) Tasks that terminate themselves en masse when the computer wakes up. Presumably there is another (possibly new) completion criterion related to wall clock time, and when the computer wakes up many of the tasks discover that they are now regarded as completed. Not bad as a sanity check of some sort. (2) Sick puppies from new projects, but nothing as prevalent and annoying as the previous ones. Still seeing about 20% of the rb tasks behaving badly, but mostly ignoring that problem except for the 3-day tasks (which still get nuked whenever I spot them in time) and for the one machine with the limited run time. Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts... #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts... Yup, try the top pinned thread. No tasks of any type currently available, 5 days before Christmas.... |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old. This one is mostly for checkpointing problems, which seem less severe than before. They have spread to some of the new subprojects, however. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 |
Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old. and the message you replied to was 13 days old... |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
More sick puppies to report. Names start with "Cx_" where I have noticed x values from 3 to 5. Especially annoying in that the tasks claim to be checkpointing properly, but are lying about it. If you look at the Properties, it will say there was a recent checkpoint, perhaps a minute ago, but if you then reboot the computer, it typically loses 20% of its progress, representing about two hours of work. The elapsed time is conserved. In today's example, the task had over 7 hours in the Elapsed column and Remaining was under an hour, but after rebooting the computer, Elapsed was still over 7, but Progress had fallen to 60% and Remaining was over 3 hours. Usually I spot these things on a computer that only runs for a few hours at a time. However, this time I actually noticed it during the major OS upgrades last month. Just confirmed it on the short-running computer. On your [the project management's] side it should probably show as a series of peaks in completion times. At least on the evidence I've noticed, the 2-hour loss seems to be consistent, so there would be one peak around 8 hours for uninterrupted tasks, a second around 10 hours for once-interrupted tasks, and smaller and smaller peaks every two hours after that for more and more interruptions. The rb sick puppies remain around 20% of all rb tasks. In their defense, at least they tell the truth about never completing a checkpoint. They seem to have been getting worse lately, often running from zero without a single checkpoint, so I'm back to scrubbing them from the short-running machine before they get a chance to start. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
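
The series of peaks described in the last post can be written out as simple arithmetic. This is only a minimal sketch under the poster's own rough figures (about 8 hours for an uninterrupted task, and a consistent loss of about 2 hours of work per restart). The function name and its parameters are hypothetical and are not part of BOINC or Rosetta.

```python
def expected_completion_hours(interruptions, base_hours=8.0, loss_per_restart=2.0):
    """Estimate elapsed hours to completion after `interruptions` restarts.

    Assumption (taken from the post, not from BOINC): each restart throws
    away a fixed amount of work, which then has to be recomputed.
    """
    if interruptions < 0:
        raise ValueError("interruptions must be non-negative")
    return base_hours + loss_per_restart * interruptions


# Where the completion-time peaks would sit for 0, 1, 2 and 3 interruptions.
peaks = [expected_completion_hours(k) for k in range(4)]
print(peaks)  # [8.0, 10.0, 12.0, 14.0]
```

If the real loss per restart varied instead of staying near 2 hours, the peaks would smear out rather than line up at regular intervals, so clean, evenly spaced peaks in the project's statistics would support the fixed-loss guess.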
©2024 University of Washington
https://www.bakerlab.org