More checkpointing problems

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89508 - Posted: 9 Sep 2018, 22:38:14 UTC - in response to Message 89506.  

Application Rosetta 4.07
Name PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State Running
CPU time 08:38:41
CPU time since checkpoint 01:15:19
Elapsed time 14:30:57
Estimated time remaining 00:16:47
Fraction done 98.108%

PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0

Seeing as this task hasn't finished yet, it may be worthwhile tracking how it's getting on, with just an excerpt of its attributes:

Application Rosetta 4.07
Name PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State Running
CPU time 08:54:25
CPU time since checkpoint 01:31:03
Elapsed time 15:48:38
Estimated time remaining 00:17:45
Fraction done 98.163%

So 78 mins of wall-clock time have passed but just 16 mins of CPU time, there has been no further checkpoint, and the estimated time remaining has actually increased by 1 minute.
No other PF tasks (or RB) are doing this. Two later PF tasks completed normally around the 8hr mark, as expected. No idea what's going on.
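
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes those differences from the two excerpts above. The helper names (`parse_hms`, `delta_minutes`) are made up for illustration and are not part of BOINC or Rosetta; the times are copied from the task properties as posted.

```python
from datetime import timedelta

def parse_hms(text):
    """Convert an 'HH:MM:SS' string from the task properties to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Values copied from the two property excerpts in this post.
first = {"cpu": "08:38:41", "elapsed": "14:30:57", "remaining": "00:16:47"}
second = {"cpu": "08:54:25", "elapsed": "15:48:38", "remaining": "00:17:45"}

def delta_minutes(key):
    """Minutes gained in the given field between the two excerpts."""
    diff = parse_hms(second[key]) - parse_hms(first[key])
    return diff.total_seconds() / 60

wall = delta_minutes("elapsed")         # wall-clock minutes between excerpts
cpu = delta_minutes("cpu")              # CPU minutes used in that interval
remaining = delta_minutes("remaining")  # change in the remaining-time estimate

print(f"wall clock: {wall:.1f} min, CPU: {cpu:.1f} min, "
      f"CPU share: {cpu / wall:.0%}, remaining change: {remaining:+.1f} min")
```

On these figures the task got roughly a fifth of the wall-clock interval as CPU time, which matches the rounded numbers quoted above.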
ID: 89508 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89510 - Posted: 10 Sep 2018, 2:42:29 UTC - in response to Message 89507.  

Follow-up data: The task that had gone 8 hours without a checkpoint did actually checkpoint sometime before the 10-hour mark, and it finally finished at around 12 hours.

Right now I'm actually on a Linux box, one of my machines that rarely runs for a long period. It has a small supply of non PF... units and none of them appear to be sick puppies. I'm trying to avoid downloading any of the PF... units here, but worse than that, the project has apparently switched to the short-term rb... units. I see that one of them did the fancy finish with the Computation Error. If it crashed quickly (and I suspect it did), then there is little waste of my machine's computation time, but the Rosetta project is just wasting bandwidth for any data that was sent.

It should NOT be a battle to participate "effectively" in the project. If the project is having trouble retaining volunteers, then perhaps there is a connection?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89510 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89511 - Posted: 10 Sep 2018, 2:48:57 UTC - in response to Message 89508.  

All a bit weird - still running...
CPU time 09:44:54
CPU time since checkpoint 02:21:32
Elapsed time 20:02:51
Estimated time remaining 00:20:33
Fraction done 98.319%

Another 250 mins have passed with only 50 mins more CPU time, still no checkpoint, and the remaining time is up by 3 minutes <shrug>
ID: 89511 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89516 - Posted: 10 Sep 2018, 12:48:09 UTC - in response to Message 89511.  

OK, so it died not long after with a compute error. Final figures and the stderr report are at the end.

CPU time 09:50:24
Elapsed time 20:29:39


Stderr report (edited for brevity)
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>
range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
[Deleted 460 repeated lines]
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range

Unhandled Exception Detected...

sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFCB531F2

Engaging BOINC Windows Runtime Debugger...

sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
[Deleted 30 more repeated lines]
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range


********************

BOINC Windows Runtime Debugger Version 7.9.0

Dump Timestamp : 09/10/18 04:11:44
Install Directory : C:\Program Files\BOINC
Data Directory : C:\ProgramData\BOINC
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
Loaded Library : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal rangeDebugger Engine : 4.0.5.0
Symbol Search Path: C:\ProgramData\BOINC\slots\6;C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*https://boinc.bakerlab.org/rosetta/symstore

[Deleted section]

*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 50841, Write: 480090826, Other 34490324

- I/O Transfers Counters -
Read: 362491692, Write: 1490402847, Other -408524486

- Paged Pool Usage -
QuotaPagedPoolUsage: 283472, QuotaPeakPagedPoolUsage: 283480
QuotaNonPagedPoolUsage: 15000, QuotaPeakNonPagedPoolUsage: 15720

- Virtual Memory Usage -
VirtualSize: 437776384, PeakVirtualSize: 1152479232

- Pagefile Usage -
PagefileUsage: 437776384, PeakPagefileUsage: 585773056

- Working Set Size -
WorkingSetSize: 439787520, PeakWorkingSetSize: 604778496, PageFaultCount: 1885542

*** Dump of thread ID 1196 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFCB531F2

- Registers -
rax=0000000000000000 rbx=0000000000000001 rcx=00000000432b82e0 rdx=0000000019b5f3e0 rsi=0000000000000000 rdi=0000000000000000
r8=0000000019b5f3e0 r9=00000000432b82d0 r10=0000000000000001 r11=0000000000000fff r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=00000000fcb531f2 rsp=0000000019b5f3b8 rbp=0000000000000000
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246

- Callstack -
ChildEBP RetAddr Args to Child

19b5f3b0 417358ee 00000001 19b5f3e0 19b5f3e0 432b82d0 KERNELBASE!DebugBreak+0x0
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
19b5f7f0 417368e0 00000000 00000000 00000000 00000000 rosetta_4.07_windows_x86_64!cppdb::backend::statements_cache::statements_cache+0x0
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
19b5fa50 76df59cd 00000000 00000000 00000000 00000000 rosetta_4.07_windows_x86_64!cppdb::backend::statements_cache::statements_cache+0x0
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
19b5fa80 76f5383d sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
00000000 00000000 00000000 00000000 kernel32!BaseThreadInitThunk+0x0
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
19b5fad0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0

*** Dump of thread ID 30689287 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 17179869184.000000, User Time: 21475590144.000000, Wait Time: 0.000000

- Registers -
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000

- Callstack -
ChildEBP RetAddr Args to Child
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
(-nosymbols- PC == 0)
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range


*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

ID: 89516 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 6,536
Message 89522 - Posted: 10 Sep 2018, 15:19:14 UTC - in response to Message 89516.  

Sid, did you see the "Disk usage limit exceeded" error message in the STDERR?

If BOINC exceeded your allocated disk space, disk writes would fail.


<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<stderr_txt>
range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
ID: 89522 · Rating: 0
Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 89523 - Posted: 10 Sep 2018, 17:50:38 UTC

I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience.
ID: 89523 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89524 - Posted: 10 Sep 2018, 21:04:07 UTC - in response to Message 89522.  

No, I didn't notice it. Thanks for pointing it out. I have to say I was blinded by the extreme length of the report and glossed over that part.

To be fair, this STDERR report is only revealed after the task has reported, so I didn't have any evidence of it earlier.

That said, I allocate 10 GB of disk space to Rosetta, and the ~40 tasks I hold in my buffer consume just short of 5 GB, leaving just over 5 GB spare. There was no sign of this being flagged while the job was running. I will add a couple more GB now though, as I have plenty to spare.
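
A rough way to see how much space a single task is actually using is to total the files under its slot folder. The sketch below is only an illustration: `directory_size_bytes` is a hypothetical helper, the demo runs against a temporary directory rather than a real BOINC folder, and whether the limit in the error message applies to the whole project allocation or to one task's own files is an assumption that this thread does not settle.

```python
import os
import tempfile

def directory_size_bytes(path):
    """Total size in bytes of all regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may vanish or be locked while a task is running; skip it.
                continue
    return total

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "stderr.txt"), "wb") as handle:
        handle.write(b"x" * 2048)
    os.makedirs(os.path.join(demo_dir, "sub"))
    with open(os.path.join(demo_dir, "sub", "model.out"), "wb") as handle:
        handle.write(b"y" * 1024)
    demo_total = directory_size_bytes(demo_dir)

print(f"{demo_total} bytes, {demo_total / 1024**3:.6f} GB")
```

Pointing the helper at a suspect task's slot folder under the BOINC data directory a few times while it runs would show whether its files are growing.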

While the disk line is obviously caused by 'something' I can't help looking at the 500 separate ERROR lines saying values are out of range. In my ignorance it does seem kind of relevant as to why this task has gone rogue the way it has. The job did run over 20 hours before crashing. Am I right to be more concerned by those 20hrs than the eventual crash it resulted in? I'll leave that to the experts, none of whom are me.
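
On the out-of-range lines: `-nan(ind)` is how the Microsoft C runtime prints an indeterminate NaN, and a NaN fails any range test because every ordered comparison involving it is false. A minimal Python sketch of that behaviour follows; `in_sin_cos_range` is a hypothetical stand-in, not Rosetta's actual check.

```python
import math

def in_sin_cos_range(value):
    """Hypothetical range check: True only when -1 <= value <= +1."""
    return -1.0 <= value <= 1.0

nan = float("nan")

checks = {
    "0.5 in range": in_sin_cos_range(0.5),
    "NaN in range": in_sin_cos_range(nan),
    # Arithmetic on a NaN yields another NaN, so the bad value spreads.
    "NaN propagates": math.isnan(nan * 0.5 + 1.0),
}
for label, result in checks.items():
    print(f"{label}: {result}")
```

Once one NaN appears, calculations built on it also produce NaN, which would be consistent with the same message repeating hundreds of times, though nothing in this thread establishes the root cause.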

I should emphasise, while I have plenty of issues with PF* tasks - reported over the last 8 months in the pinned thread - this particular one is a one-off.
ID: 89524 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89525 - Posted: 10 Sep 2018, 21:15:20 UTC - in response to Message 89523.  

I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience.

One thing I haven't mentioned is that a lot of these PF tasks get to 567 hours still on the 1st model with like 580,000 steps. This particular one was on the 6th model, not just the 1st, if that makes a difference.

This applies to pretty much all PF tasks I've looked at. Maybe this is why PF tasks generally lend themselves to problems, though I'm obviously guessing here.

I'd appreciate it if someone took a look at the errors reported in the Rosetta 4.0x thread as well. Those show a much more common issue in my experience, resulting in Computing Errors.
ID: 89525 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89797 - Posted: 29 Oct 2018, 6:49:05 UTC - in response to Message 89525.  

I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units. Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it managed to checkpoint itself while I'm writing this message).

During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on.
ID: 89797 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 89799 - Posted: 29 Oct 2018, 12:26:05 UTC - in response to Message 89797.  

It was about the past PF problems.
I've checked all my machines and I have no errors at all related to the current batch of PF jobs even though I definitely had the same issues as you last time.
I do have some errors, but I think they're more related to my overclock - so, all about me, not the tasks.
All the PF tasks currently running on this machine have checkpointed recently: one within the last 11 mins, one within 4 mins, and six within 2 mins.
ID: 89799 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 90036 - Posted: 19 Dec 2018, 23:11:09 UTC - in response to Message 89799.  

Thanks for the data and sorry I haven't been checking in more frequently. Well, not really sorry, since that mostly means there are no problems that seem worth worrying about. Or back to the sorry side again, maybe not visiting just reflects a loss of hope of making things better...

Latest peculiarities:

(1) Tasks that terminate themselves en masse when the computer wakes up. Presumably there is another (possibly new) completion criterion related to wall clock time, and when the computer wakes up many of the tasks discover that they are now regarded as completed. Not bad as a sanity check of some sort.

(2) Sick puppies from new projects, but nothing as prevalent and annoying as the previous ones. Still seeing about 20% of the rb tasks behaving badly, but mostly ignoring that problem except for the 3-day tasks (which still get nuked whenever I spot them in time) and for the one machine with the limited run time.

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...
ID: 90036 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 90040 - Posted: 20 Dec 2018, 10:29:12 UTC - in response to Message 90036.  

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...

Yup, try the top pinned thread. No tasks of any type currently available, 5 days before Christmas....
ID: 90040 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 90136 - Posted: 3 Jan 2019, 20:51:37 UTC - in response to Message 90040.  

Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

This one is mostly for checkpointing problems, which seem less severe than before. They have spread to some of the new subprojects, however.
ID: 90136 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 90156 - Posted: 7 Jan 2019, 4:00:28 UTC - in response to Message 90136.  

Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

and the message you replied to was 13 days old...
ID: 90156 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 90888 - Posted: 4 Jul 2019, 4:14:31 UTC

More sick puppies to report. Names start with "Cx_" where I have noticed x values from 3 to 5. Especially annoying in that the tasks claim to be checkpointing properly, but are lying about it. If you look at the Properties, it will say there was a recent checkpoint, perhaps a minute ago, but if you then reboot the computer, it typically loses 20% of its progress, representing about two hours of work. The elapsed time is conserved. In today's example, the task had over 7 hours in the Elapsed column and Remaining was under an hour, but after rebooting the computer, Elapsed was still over 7, but Progress had fallen to 60% and Remaining was over 3 hours.

Usually I spot these things on a computer that only runs for a few hours at a time. However, this time I actually noticed it during the major OS upgrades last month. Just confirmed it on the short-running computer.

On your [the project management's] side it should probably show as a series of peaks in completion times. At least on the evidence I've noticed, the 2-hour loss seems to be consistent, so there would be one peak around 8 hours for uninterrupted tasks, a second around 10 hours for once-interrupted tasks, and smaller and smaller peaks each two hours after that for more and more interruptions.
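
As a back-of-the-envelope version of that prediction, here is a minimal Python sketch. It assumes an 8-hour uninterrupted run and a fixed 2-hour loss per interruption; both figures are the rough estimates from this post, not project data, and the names are made up for illustration.

```python
BASE_HOURS = 8   # assumed completion time for an uninterrupted task
LOST_HOURS = 2   # assumed work lost per interruption

def expected_completion_hours(interruptions):
    """Completion time in hours after the given number of interruptions."""
    return BASE_HOURS + LOST_HOURS * interruptions

# Peak positions for zero to three interruptions.
peaks = [expected_completion_hours(n) for n in range(4)]
print(peaks)  # [8, 10, 12, 14]
```

If the loss per interruption were not constant, the peaks would smear out rather than sit at even 2-hour spacing.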

The rb sick puppies remain around 20% of all rb tasks. In their defense, at least they tell the truth about never completing a checkpoint. They seem to have been getting worse lately, often running from zero without a single checkpoint, so I'm back to scrubbing them from the short-running machine before they get a chance to start.
ID: 90888 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org