WUs stuck at 99.50% and no progress

Chris

Joined: 12 Apr 06
Posts: 6
Credit: 13,598,060
RAC: 0
Message 96362 - Posted: 11 May 2020, 13:20:57 UTC

I've got an old DL380 G7. A couple of days ago I installed Debian 10 on it and got BOINC running.
It's been going pretty well, but today I noticed 2 tasks stuck with 600 s of work remaining, and they have been like that for several hours now.
Can anyone give me a hint about what to do with them? I probably need to abort these, but I'd rather avoid this situation in the future.

Task details:
1) -----------
   name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310_0
   WU name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310
   project URL: https://boinc.bakerlab.org/rosetta/
   received: Sat May  9 17:50:35 2020
   report deadline: Tue May 12 17:50:35 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 420
   resources: 1 CPU
   estimated CPU time remaining: 600.567115
   CPU time at last checkpoint: 0.000000
   current CPU time: 112797.000000
   fraction done: 0.994708
   swap size: 379 MB
   working set size: 305 MB
2) -----------
   name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311_0
   WU name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311
   project URL: https://boinc.bakerlab.org/rosetta/
   received: Sat May  9 17:50:35 2020
   report deadline: Tue May 12 17:50:35 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 420
   resources: 1 CPU
   estimated CPU time remaining: 600.760101
   CPU time at last checkpoint: 0.000000
   current CPU time: 112682.000000
   fraction done: 0.994701
   swap size: 378 MB
   working set size: 304 MB
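
For anyone who wants to spot tasks like these automatically, here is a minimal Python sketch. It assumes the listing above is the text printed by `boinccmd --get_tasks` and that it has been captured into a string; the helper names `parse_tasks` and `looks_stuck`, and the 24-hour threshold, are made up for illustration and are not part of BOINC.

```python
import re

def parse_tasks(text):
    """Split a task listing like the one above into one dict per task."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"\d+\) -+", line):      # "1) -----------" starts a new task
            tasks.append({})
        elif tasks and ": " in line:
            key, value = line.split(": ", 1)     # split on the first ": " only (URLs contain ':')
            tasks[-1][key] = value.strip()
    return tasks

def looks_stuck(task, min_cpu_hours=24.0):
    """Heuristic (an assumption): many CPU hours and still no checkpoint."""
    cpu_seconds = float(task["current CPU time"])
    checkpoint_seconds = float(task["CPU time at last checkpoint"])
    return checkpoint_seconds == 0.0 and cpu_seconds / 3600.0 >= min_cpu_hours

# Excerpt of the first task from the post
sample = """\
1) -----------
   name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310_0
   project URL: https://boinc.bakerlab.org/rosetta/
   estimated CPU time remaining: 600.567115
   CPU time at last checkpoint: 0.000000
   current CPU time: 112797.000000
   fraction done: 0.994708
"""

for task in parse_tasks(sample):
    print(task["name"], "stuck?", looks_stuck(task))
```

The threshold is arbitrary; a task with a long runtime preference can legitimately run for many hours, so treat a hit as a reason to look closer rather than as proof.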

Bryn Mawr

Joined: 26 Dec 18
Posts: 390
Credit: 12,073,013
RAC: 4,827
Message 96363 - Posted: 11 May 2020, 14:09:21 UTC

I’d leave them running for a while.

When you first start processing on a new machine or for a new project, it takes BOINC a while to settle in and work out how long WUs are likely to take. This could be a symptom of that.
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 96364 - Posted: 11 May 2020, 14:10:55 UTC - in response to Message 96362.  

31 hours of runtime and no checkpoint? Check whether those tasks cause any CPU load. I'd guess they haven't done any work at all and the 99.5% progress is just fake. You can turn LAIM ("leave applications in memory") off, then suspend and resume the tasks to restart them from the beginning. You have enough time left, but check whether they work normally this time. Or abort them and leave them to somebody else.
Chris

Joined: 12 Apr 06
Posts: 6
Credit: 13,598,060
RAC: 0
Message 96368 - Posted: 11 May 2020, 14:49:14 UTC - in response to Message 96363.  
Last modified: 11 May 2020, 14:49:30 UTC

I’d leave them running for a while.

When you first start processing on a new machine or for a new project, it takes BOINC a while to settle in and work out how long WUs are likely to take. This could be a symptom of that.


The machine has been running for about 9 days now (uptime). Shouldn't that be enough for new tasks to settle?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 96369 - Posted: 11 May 2020, 14:49:36 UTC
Last modified: 11 May 2020, 14:49:50 UTC

The new watchdog will only kick in if the task has not ended ten hours past the runtime preference. What is your runtime preference set to? The default runtime is 8 hours.

Ending them will lose all of the work they've done. Running them a second time will most likely bring you back to the same situation. Since the maximum runtime preference is 36 hours, it would be possible for them to be normal up through 46 hours of CPU time. If 31 hours is already more than 10 hours past your runtime preference, then the watchdog did not catch them for some reason, and I would abort them. Then provide links if you can, so we can see if the "wingman" does any better with them.
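
The arithmetic above can be written down as a short Python sketch. It only encodes the numbers given in this post (10-hour grace period, 8-hour default, 36-hour maximum); the function names are hypothetical and this is not how the Rosetta watchdog itself is implemented.

```python
# Values taken from the post above; everything else is illustrative.
WATCHDOG_GRACE_HOURS = 10.0    # watchdog acts this long past the runtime preference
DEFAULT_RUNTIME_HOURS = 8.0    # default runtime preference
MAX_RUNTIME_HOURS = 36.0       # largest runtime preference

def watchdog_deadline_hours(runtime_pref_hours=DEFAULT_RUNTIME_HOURS):
    """CPU hours after which the watchdog would be expected to end the task."""
    return runtime_pref_hours + WATCHDOG_GRACE_HOURS

def is_past_watchdog(cpu_seconds, runtime_pref_hours=DEFAULT_RUNTIME_HOURS):
    """True if the task has used more CPU time than the watchdog should allow."""
    return cpu_seconds / 3600.0 > watchdog_deadline_hours(runtime_pref_hours)

print(watchdog_deadline_hours())                    # 18.0 with the default preference
print(watchdog_deadline_hours(MAX_RUNTIME_HOURS))   # 46.0 with the maximum preference

# The first task in the thread reported 112797 s of CPU time (about 31.3 h).
print(is_past_watchdog(112797))                     # True for the 8 h default
print(is_past_watchdog(112797, MAX_RUNTIME_HOURS))  # False for a 36 h preference
```

So whether 31 hours is abnormal depends entirely on the runtime preference, which is why the question about it matters.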
Rosetta Moderator: Mod.Sense
Chris

Joined: 12 Apr 06
Posts: 6
Credit: 13,598,060
RAC: 0
Message 96370 - Posted: 11 May 2020, 15:14:09 UTC - in response to Message 96364.  

31 hours of runtime and no checkpoint? Check whether those tasks cause any CPU load. I'd guess they haven't done any work at all and the 99.5% progress is just fake. You can turn LAIM ("leave applications in memory") off, then suspend and resume the tasks to restart them from the beginning. You have enough time left, but check whether they work normally this time. Or abort them and leave them to somebody else.


I also noticed the missing checkpoint, but I simply don't know what it could mean.
The first problematic task has PID 12258, and it is definitely doing something:
boinc    12258 99.9  2.5 388488 311956 ?       RNl  May10 1981:22 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu
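
To compare that `ps` line with the BOINC figures, the TIME column can be converted to hours. The sketch below assumes the column is in the procps `bsdtime` format (minutes:seconds), which is what `ps aux` normally prints on Linux; the helper name is made up.

```python
def bsdtime_to_seconds(value):
    """Convert a ps 'MMM:SS' cumulative CPU time into seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

cpu_seconds = bsdtime_to_seconds("1981:22")   # TIME column of PID 12258
print(cpu_seconds)                            # 118882
print(round(cpu_seconds / 3600.0, 2))         # 33.02 hours
```

About 33 hours of CPU time, together with the 99.9 in the %CPU column, suggests the process really is keeping one core busy rather than sitting idle.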


This alternate view from boinctui shows a bit more info (I think).


CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 96372 - Posted: 11 May 2020, 15:48:04 UTC
Last modified: 11 May 2020, 16:02:17 UTC

I also have two of these "SR5AGU10" work units running long on one of my machines.

WU1: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056154721
WU2: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056126717

The machine these are running on is set to the default 8hr runtime. (Machine in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3752305)

At present both of these WUs have been crunching for ~40 hours. One is stuck at 99.585%, the other at 99.578%.


Edit: After about 15 minutes the percentage on both has increased slightly, by around 0.003, so they aren't dead in the water.
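
As a rough check of what that rate would mean, here is a small Python sketch that extrapolates linearly from the figures in this post. The linear assumption is mine and is probably optimistic: if the displayed percentage slows down as it approaches 100%, the real time would be longer.

```python
def hours_to_finish(percent_done, gain_percent, gain_minutes):
    """Linear extrapolation: hours left if progress keeps the observed pace."""
    remaining_percent = 100.0 - percent_done
    rate_per_hour = gain_percent / (gain_minutes / 60.0)
    return remaining_percent / rate_per_hour

# Both work units gained about 0.003 percentage points in about 15 minutes.
for done in (99.585, 99.578):
    print(f"{done}% done -> roughly {hours_to_finish(done, 0.003, 15):.1f} h to go")
```

That works out to roughly 35 more hours for each task, on top of the ~40 already spent.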
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 96373 - Posted: 11 May 2020, 16:36:22 UTC - in response to Message 96370.  
Last modified: 11 May 2020, 17:32:53 UTC

I also saw the checkpoint missing, but I simply don't know what that could mean.
I don't know what that means in detail either, but I would think that in all that time the task hasn't reached the first intermediate point where something is worth saving.

First problematic task has PID 12258 and definitely it is doing something in there
Well, I'm surprised now. I've occasionally seen tasks with the clock ticking but nothing being done. Those continued fine after a restart. But I've never seen a task work that long without producing a result. Someone with more detailed knowledge will have to tell us how to know whether the task is actually making progress and will eventually finish.

By the way, I also have three of those running. Only around two hours now and nothing suspicious, except the displayed progress is quite high for that short time.
Addendum: The first task finished after three hours.
Energiequant

Joined: 19 Sep 05
Posts: 1
Credit: 595,842
RAC: 0
Message 96378 - Posted: 11 May 2020, 22:25:27 UTC
Last modified: 11 May 2020, 22:27:31 UTC

I also got one of those, running under Windows 10: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056151065

Application Rosetta 4.20
Name SR5AGU10_LVPlG_35_50416319_5mers_0001_0001_SAVE_ALL_OUT_927427_226
State Running
Received 09/05/2020 23:34:21
Report deadline 12/05/2020 23:34:25
Estimated computation size 80,000 GFLOPs
CPU time 1d 02:46:01
CPU time since checkpoint 1d 02:46:01
Elapsed time 1d 03:51:49
Estimated time remaining 00:10:24
Fraction done 99.381%
Virtual memory size 248.93 MB
Working set size 50.90 MB
Directory slots/4
Process ID 1128
Progress rate 3.600% per hour
Executable rosetta_4.20_windows_x86_64.exe

So it has only seen one checkpoint, roughly one hour after it started. Process 1128 is still running (CPU is at 12.5%, so one complete hyperthread) but with very low memory consumption (20.4 MB) according to Task Manager. I did not set "Target CPU run time", so I assume the watchdog should have aborted the WU 10 hours ago?

Looks like a generic issue with that SR5AGU10 job?
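
For reference, a small Python sketch of that watchdog estimate. It assumes the duration strings in the task properties have the form `[Nd ]HH:MM:SS` as shown above, and it reuses the 8-hour default and 10-hour grace period mentioned earlier in this thread; `parse_duration` is a hypothetical helper, not a BOINC function.

```python
import re

def parse_duration(text):
    """Convert strings like '1d 02:46:01' or '00:10:24' into seconds."""
    match = re.fullmatch(r"(?:(\d+)d )?(\d{2}):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days = int(match.group(1) or 0)
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

cpu_hours = parse_duration("1d 02:46:01") / 3600.0   # "CPU time" from the listing
watchdog_hours = 8.0 + 10.0                          # default preference + grace period
print(f"CPU time: {cpu_hours:.2f} h")
print(f"Past the expected watchdog limit by {cpu_hours - watchdog_hours:.2f} h")
```

With these assumptions the task is a little under 9 hours past the point where the watchdog would have been expected to step in.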
Chris

Joined: 12 Apr 06
Posts: 6
Credit: 13,598,060
RAC: 0
Message 96385 - Posted: 12 May 2020, 7:57:23 UTC

After (nearly) another 24 hours the tasks are mostly unchanged, so I'm going to abort them.

1) -----------
   name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310_0
   WU name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310
   project URL: https://boinc.bakerlab.org/rosetta/
   received: Sat May  9 17:50:35 2020
   report deadline: Tue May 12 17:50:35 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 420
   resources: 1 CPU
   estimated CPU time remaining: 600.986054
   CPU time at last checkpoint: 0.000000
   current CPU time: 179824.500000
   fraction done: 0.996672
   swap size: 379 MB
   working set size: 305 MB
2) -----------
   name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311_0
   WU name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311
   project URL: https://boinc.bakerlab.org/rosetta/
   received: Sat May  9 17:50:35 2020
   report deadline: Tue May 12 17:50:35 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 420
   resources: 1 CPU
   estimated CPU time remaining: 600.848945
   CPU time at last checkpoint: 0.000000
   current CPU time: 179697.900000
   fraction done: 0.996671
   swap size: 378 MB
   working set size: 304 MB
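
Comparing the two listings for the first task shows why the percentage barely moves. In the Python sketch below, the reported "fraction done" is compared with the simple ratio cpu / (cpu + estimated remaining). This is only an observation about the posted numbers, not a statement about how BOINC or Rosetta actually compute the value; the function name is made up.

```python
snapshots = [
    # (label, current CPU time [s], estimated CPU time remaining [s], reported fraction done)
    ("11 May", 112797.0, 600.567115, 0.994708),
    ("12 May", 179824.5, 600.986054, 0.996672),
]

def time_based_fraction(cpu_seconds, remaining_seconds):
    """Fraction done if it were derived purely from time spent and time remaining."""
    return cpu_seconds / (cpu_seconds + remaining_seconds)

for label, cpu, remaining, reported in snapshots:
    modelled = time_based_fraction(cpu, remaining)
    print(f"{label}: reported {reported:.6f}, cpu/(cpu+remaining) {modelled:.6f}")

extra_cpu_hours = (snapshots[1][1] - snapshots[0][1]) / 3600.0
print(f"CPU hours between the two listings: {extra_cpu_hours:.2f}")
```

Both reported values agree with the ratio to within about 0.00001, while the remaining-time estimate stays near 600 s. If the display really is time-based, then after more than 18 extra CPU hours the rising percentage says little about whether the tasks are getting closer to a result.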





