COVID 19 WU Errors

Message boards : Number crunching : COVID 19 WU Errors

TheMoD

Joined: 10 Feb 06
Posts: 3
Credit: 839,735
RAC: 134
Message 92381 - Posted: 27 Mar 2020, 9:54:39 UTC

Hello everybody

I have problems with the COVID 19 workunits.

They run too long and all produce an error:

Calculation error
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x757D4192

What can I do?

TheMoD
Falconet

Joined: 9 Mar 09
Posts: 353
Credit: 1,227,479
RAC: 1,506
Message 92384 - Posted: 27 Mar 2020, 10:37:19 UTC - in response to Message 92381.  

These work units use a lot of RAM, over 1 GB per task.
You need to allow BOINC more memory or run fewer Rosetta workunits at once.
TheMoD

Joined: 10 Feb 06
Posts: 3
Credit: 839,735
RAC: 134
Message 92391 - Posted: 27 Mar 2020, 15:10:31 UTC

Thanks a lot

I use a Celeron 4 core processor with 4GB RAM.
So far there have never been any problems with Rosetta or any other application.

What would I have to change to make the COVID WUs work?

Actually, no work units should be delivered that do not match the requirements, right?

Greetings TheMoD
Falconet

Joined: 9 Mar 09
Posts: 353
Credit: 1,227,479
RAC: 1,506
Message 92395 - Posted: 27 Mar 2020, 16:09:41 UTC - in response to Message 92391.  
Last modified: 27 Mar 2020, 16:14:17 UTC

Run fewer concurrent Rosetta@home workunits. These COVID-19 WUs are using a lot of RAM.
You can do it manually or by editing an .xml file, though I'm afraid I do not recall how.
Hopefully someone can help you further.
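
For reference, the .xml file mentioned above is most likely BOINC's app_config.xml, which lives in the project's folder inside the BOINC data directory and accepts a project_max_concurrent element. Below is a minimal Python sketch that builds such a file. The limit of 2 tasks is an assumed example value, not a recommendation from this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml text that caps concurrent tasks for one project."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    # project_max_concurrent limits how many tasks of this project run at once.
    ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


# Assumed example value: at most 2 Rosetta tasks at a time.
config_text = build_app_config(2)
print(config_text)
```

The printed text would be saved as app_config.xml in the Rosetta project folder (projects/boinc.bakerlab.org_rosetta under the BOINC data directory), after which the client needs to re-read its config files or be restarted.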
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92398 - Posted: 27 Mar 2020, 16:39:40 UTC

It seems that systems with 1GB of memory per CPU are having problems running some of the tasks being sent.

You might try running World Community Grid with the same resource share as R@h. WCG tasks tend to be lower memory usage, so running both projects typically results in a mix of low and high memory tasks.

Another approach is to reduce the number of CPUs that BOINC is allowed to use. This is a setting in your preferences.
Rosetta Moderator: Mod.Sense
TheMoD

Joined: 10 Feb 06
Posts: 3
Credit: 839,735
RAC: 134
Message 92405 - Posted: 27 Mar 2020, 18:18:24 UTC - in response to Message 92398.  

Thank you for your answers

I'll try it.

Have a nice weekend
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 92449 - Posted: 28 Mar 2020, 13:04:28 UTC

At the rate I am going, it could take days for each task to finish. These are fast computers with fast CPU cards. Also, the elapsed time moves inaccurately, showing 5 minutes after several hours. Time remaining is about 4-6 hours. What am I doing wrong here?

Thank you,

Miklos
Falconet

Joined: 9 Mar 09
Posts: 353
Credit: 1,227,479
RAC: 1,506
Message 92451 - Posted: 28 Mar 2020, 13:19:54 UTC - in response to Message 92449.  

The WUs should run for the duration set on this page https://boinc.bakerlab.org/rosetta/prefs.php?subset=project
I think the cut-off limit is +4 hours beyond whatever was set on that page, so if you have the default 10 hours, they should go for no more than 14 CPU hours.
That is my understanding.
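
The cut-off rule described above is simple arithmetic. As a small Python sketch: the 4-hour grace period is the figure quoted in this thread (an assumption here, not read from the Rosetta source), and the preference values are only examples.

```python
# Grace period quoted in this thread; an assumption, not read from Rosetta's source.
WATCHDOG_GRACE_HOURS = 4


def watchdog_cutoff_hours(runtime_preference_hours: float) -> float:
    """Latest CPU time (in hours) before the watchdog is expected to end a task."""
    return runtime_preference_hours + WATCHDOG_GRACE_HOURS


for preference in (1, 10, 24):
    cutoff = watchdog_cutoff_hours(preference)
    print(f"preference {preference:>2} h -> cutoff at about {cutoff} CPU hours")
```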
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92459 - Posted: 28 Mar 2020, 14:32:38 UTC - in response to Message 92449.  

Falconet's description of the "watch-dog" is correct. So, from your description, it sounds like tasks are running longer than your preference. When this happens, as the estimated runtime gets under 5 minutes, things are adjusted to correctly indicate that forward progress is being made, but since it has no better estimate to show you, it scales time exponentially into those last 5 minutes.

There is nothing that you need to do. If the work unit is hitting a long-running model that is causing it to run long, that gets reported back with the result so the algorithm being used can be reviewed.
Rosetta Moderator: Mod.Sense
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 92465 - Posted: 28 Mar 2020, 15:29:16 UTC

I tweaked my CPU's power setting, and it is now at 90%. I am not running too many tasks at once. Though it is a tiny bit faster now, it still looks like it will take over 10 hours to finish each task. The other two computers are just as fast as this one:
CPU: GenuineIntel Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz [Family 6 Model 85 Stepping 4] (36 processors)
GPU: [3] NVIDIA GeForce RTX 2080 Ti (4095MB) driver: 43521
OS: Linux Ubuntu, Ubuntu 18.04.4 LTS [5.3.0-40-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]

Can the run times be sped up?

Thank you,
Miklos
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92471 - Posted: 28 Mar 2020, 17:02:25 UTC

There is a Rosetta preference (configured from the website rather than the BOINC preferences on your machine) where you can define your workunit runtime preference... if that is what you meant.
Rosetta Moderator: Mod.Sense
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 92483 - Posted: 28 Mar 2020, 19:48:15 UTC - in response to Message 92471.  

Did someone here say that 4 hours is the maximum time, and that after that run time the WU errors out?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92486 - Posted: 28 Mar 2020, 20:13:41 UTC - in response to Message 92483.  
Last modified: 28 Mar 2020, 20:22:51 UTC

No, 4 hours is not the maximum time.

Yes, the watch-dog will kick in and clean up the WU if it runs more than 4 hours beyond the runtime preference. The runtime preference can be set between 1 and 24 hours.
Rosetta Moderator: Mod.Sense
Dayle

Joined: 6 Jan 14
Posts: 13
Credit: 759,731
RAC: 0
Message 92493 - Posted: 28 Mar 2020, 23:12:40 UTC
Last modified: 28 Mar 2020, 23:16:57 UTC

Just resubscribed to this project on a reliable PC that's been running Rosetta software on World Community Grid (their Microbiome Immunity Project) without issue.
Set WU size to 24 hours and ran a mix of the two feeds, weighted 50-50.
PC has 32 Threads and 16 gig of RAM.
When I went to bed last night there were two gigs of free system memory and a 16 GB page file just in case.

Looks like at some point there was a spike in RAM usage (while otherwise idle), and 5 work units errored without credit.
Total loss: two days, four hours of work on a modern system (plus five more hours of WCG tasks).
Maybe nothing over time but quite painful all at once, and not a great trend if it continues.

One of the failures didn't mention RAM, just "finish file present too long".
I'm hypothesizing that this task encountered a problem and got bigger and bigger, crashing the rest?
Output text is below.

It's also possible the crash took place when minirosetta tasks finished and were replaced by full size COVID tasks.

If anybody has any thoughts, they'd be appreciated.

Thanks,

Dayle

Task 1134921561
Name rb_03_27_19542_19448_ab_t000__h002_robetta_IGNORE_THE_REST_11_09_903961_5_0
Workunit 1022160017
Created 27 Mar 2020, 20:53:25 UTC
Sent 27 Mar 2020, 21:14:52 UTC
Report deadline 4 Apr 2020, 21:14:52 UTC
Received 28 Mar 2020, 22:40:43 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
Computer ID 3925665
Run time 21 hours 14 min 43 sec
CPU time 21 hours 14 min 43 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 4.49 GFLOPS
Application version Rosetta v4.07 windows_x86_64
Peak working set size 1,283.20 MB
Peak swap size 1,491.86 MB
Peak disk usage 492.17 MB
Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_x86_64.exe @rb_03_27_19542_19448_ab_t000__h002_robetta_FLAGS -in::file::fasta t000__h002.fasta -psipred_ss2 t000__h002.spider3_ss2 -kill_hairpins t000__h002.nobuformat.spider3_ss2 -abinitio::use_filters true -in:file:boinc_wu_zip rb_03_27_19542_19448_ab_t000__h002_robetta.zip -frag3 rb_03_27_19542_19448_ab_t000__h002_robetta.200.3mers.index.gz -fragA rb_03_27_19542_19448_ab_t000__h002_robetta.200.9mers.index.gz -fragB rb_03_27_19542_19448_ab_t000__h002_robetta.200.11mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2413747
Starting watchdog...
Watchdog active.
======================================================
DONE :: 1 starting structures 76648.4 cpu seconds
This process generated 5 decoys from 5 attempts
======================================================
BOINC :: WS_max 1.34554e+09

BOINC :: Watchdog shutting down...
13:39:55 (14096): called boinc_finish(0)

</stderr_txt>
<message>
finish file present too long</message>
]]>
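
The stderr output above already contains the numbers needed to compare the requested run time with the CPU time actually used. Below is a minimal Python sketch that pulls them out of a shortened excerpt of that output. Treating WS_max as a byte count is an assumption, though it agrees with the reported peak working set size of 1,283.20 MB.

```python
import re

# Shortened excerpt of the stderr output quoted above.
STDERR_EXCERPT = """\
-nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600
DONE :: 1 starting structures 76648.4 cpu seconds
BOINC :: WS_max 1.34554e+09
"""


def first_number(pattern: str, text: str):
    """Return the first captured number as a float, or None if absent."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def summarize(stderr_text: str) -> dict:
    target_s = first_number(r"-cpu_run_time\s+(\d+)", stderr_text)
    used_s = first_number(r"([\d.]+)\s+cpu seconds", stderr_text)
    ws_max = first_number(r"WS_max\s+([\d.eE+-]+)", stderr_text)
    return {
        "target_hours": None if target_s is None else target_s / 3600,
        "used_hours": None if used_s is None else used_s / 3600,
        # Assumption: WS_max is reported in bytes.
        "ws_max_mb": None if ws_max is None else ws_max / (1024 * 1024),
    }


summary = summarize(STDERR_EXCERPT)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

For this task the requested run time works out to 8 hours, while roughly 21.3 CPU hours were used.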
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 92494 - Posted: 28 Mar 2020, 23:32:25 UTC - in response to Message 92486.  

Thank you for clarifying it.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92501 - Posted: 29 Mar 2020, 4:34:02 UTC - in response to Message 92493.  

@Dayle, the machine registered with your account shows it has 32GB of memory. Are you saying that BOINC is configured to use only half of the memory?

It seems the COVID tasks are great lovers of memory. I'm unclear why the BOINC Manager is having trouble making things work. The Project Team will be tagging these tasks as being more memory intensive in the future. Obviously they are the new kids on the block here, so there are still tweaks to be made in how the WUs are created that will help things run smoother.

...but yes, it's possible one or more minirosetta tasks finished and were replaced by one or more full size COVID tasks.

Just by what I'm seeing people reporting, I'm saying that the COVID tasks need 2GB per active thread. In your case, running 50/50 with WCG is a great idea. Because WCG typically has tasks that run in a much smaller memory footprint. However, even if you get a balance of 16 WCG threads and 16 R@h threads, you still push hard against my 2GB per thread observation (and if you are only allowing BOINC to use 16GB, then this would still be too much).

I am hopeful that the BOINC Manager's ability to suspend a task in a "waiting for memory" state, will be more stable once all of the COVID WUs have the higher memory requirement defined in them.

There is a computing preference, on the disk and memory tab, to "leave applications in memory while suspended". When a task gets to a "waiting for memory" state, it is "suspended" by the BOINC Manager. I wonder whether the BOINC Manager hesitates to act when the WU is not near a checkpoint and will not be left in memory while suspended. I also point out that things do not truly stay "in memory"; instead they go out to your swap space. I recommend that folks with multiple projects, or in these high memory consumption cases, check the box and keep applications in memory while suspended.

Because you mentioned it, a 16GB page file starts to sound rather small as well. But I would think Windows would have been spitting messages if that were filling.
Rosetta Moderator: Mod.Sense
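
The 2GB per active thread observation above can be turned into a quick check. This is a rough Python sketch only: the 2 GB figure is the rule of thumb from this post, not a project specification, and the calculation ignores memory used by the operating system and by WCG tasks.

```python
# Rule of thumb from the post above; not a project specification.
GB_PER_ROSETTA_TASK = 2.0


def rosetta_demand_gb(rosetta_threads: int) -> float:
    """Memory the given number of Rosetta tasks would want, in GB."""
    return rosetta_threads * GB_PER_ROSETTA_TASK


def rosetta_tasks_that_fit(boinc_memory_gb: float) -> int:
    """How many Rosetta tasks fit in the memory BOINC is allowed to use."""
    return int(boinc_memory_gb // GB_PER_ROSETTA_TASK)


print(rosetta_demand_gb(16))         # 16 R@h threads on the 32-thread machine
print(rosetta_tasks_that_fit(32.0))  # BOINC allowed the full 32 GB
print(rosetta_tasks_that_fit(16.0))  # BOINC allowed only 16 GB
print(rosetta_tasks_that_fit(4.0))   # the 4 GB machine from earlier in the thread
```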
ktamail666

Joined: 25 Jun 06
Posts: 1
Credit: 379,335
RAC: 0
Message 92565 - Posted: 29 Mar 2020, 19:57:53 UTC

I had similar "out of memory" issues, but I don't understand why, because my machine has 32 GB of RAM. BOINC is allowed to use about 26 GB:
24-Mar-2020 19:09:18 [---] max memory usage when active: 26168.39 MB
24-Mar-2020 19:09:18 [---] max memory usage when idle: 26168.39 MB

This machine has 6 CPU cores and was running 6 WUs at the same time.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1136619062
https://boinc.bakerlab.org/rosetta/result.php?resultid=1135909251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1133534586

If I calculate with a peak RAM usage of 1.5 GB * 6 cores, that is still just 9 GB of memory. Of course, 32-bit applications are able to use up to 4 GB per process.
Currently I have limited the run time to 1 hour to avoid big WU losses. But as you can see, the OOM happened in 1136619062 at minute 47.

Does the Linux version use 64-bit, or just a 32-bit wrapper, as I read in another thread?

What do you think about these memory issues?
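
The estimate above can be checked against the limits in the quoted log lines. Below is a minimal Python sketch that parses those two lines and subtracts the estimated peak. Counting 1 GB as 1024 MB is an assumption, and the 1.5 GB per task figure is the poster's own estimate.

```python
import re

# The two client log lines quoted in the post above.
LOG_LINES = [
    "24-Mar-2020 19:09:18 [---] max memory usage when active: 26168.39 MB",
    "24-Mar-2020 19:09:18 [---] max memory usage when idle: 26168.39 MB",
]

LIMIT_PATTERN = re.compile(r"max memory usage when (active|idle): ([\d.]+) MB")


def parse_limits_mb(lines):
    """Map 'active' / 'idle' to the memory limit in MB found in the log lines."""
    limits = {}
    for line in lines:
        match = LIMIT_PATTERN.search(line)
        if match:
            limits[match.group(1)] = float(match.group(2))
    return limits


limits = parse_limits_mb(LOG_LINES)
# Poster's estimate: 6 tasks at 1.5 GB each (assuming 1 GB = 1024 MB).
estimated_peak_mb = 6 * 1.5 * 1024
headroom_mb = limits["active"] - estimated_peak_mb
print(f"estimated peak: {estimated_peak_mb:.0f} MB, headroom: {headroom_mb:.2f} MB")
```

On these numbers the six tasks sit far below the configured limit, which is consistent with the poster's point that the configured BOINC limit alone does not explain the errors.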
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 92566 - Posted: 29 Mar 2020, 21:26:26 UTC

I have two units that have been running for a day plus 10 hours so far and are still less than 70% finished, with an estimated 19 hours to go. Keep running or abort?
Dayle

Joined: 6 Jan 14
Posts: 13
Credit: 759,731
RAC: 0
Message 92569 - Posted: 29 Mar 2020, 23:44:25 UTC - in response to Message 92501.  

Hello Mod.Sense,

Thanks for taking the time to investigate this.

When I posted, I had 16 GB of memory in my system, and BOINC was allowed to use 90% of memory.
I have since cannibalized memory from another system, which is why it's now showing 32 GB.
The memory is mismatched, and even though it's DDR4 it's showing speeds lower than what I thought was possible for that standard (1067 MHz).

I also updated Rosetta to a one third share, with WCG at two thirds.
BOINC doesn't seem to care, and is running only Rosetta on all 32 threads, as if to make up for lost time.
Since adding 16 more gigabytes of memory, I've still lost a task to OOM errors.

I've always left applications in memory while suspended.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92576 - Posted: 30 Mar 2020, 1:48:03 UTC - in response to Message 92569.  

"I also updated Rosetta to a one third share, with WCG at two thirds.
BOINC doesn't seem to care, and is running only Rosetta on all 32 threads, as if to make up for lost time.
Since adding 16 more gigabytes of memory, I've still lost a task to OOM errors."


Give BOINC Manager some time to get used to the new project resource share. It will balance out when the work cache is refreshed.

Yes, there are OOM errors occurring. I am not certain when the WU configuration indicating that they require more memory will roll out, nor can I say exactly what BOINC Manager will do with the better information on the tasks. But I am hoping things settle down next week.
Rosetta Moderator: Mod.Sense



©2024 University of Washington
https://www.bakerlab.org