Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,005,376
RAC: 16,150
Message 91868 - Posted: 4 Mar 2020, 21:54:33 UTC - in response to Message 91867.  

What is a decoy? All my machines complete a task in 7.5 to 8.5 hours, and they're not particularly fast machines; one is 12 years old. I've seen no errors in over a week. There must be a pattern here.


Within the stdout file the app reports the number of scenarios that have been processed in the time you make available. In the standard 8-hour window you'll typically process maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys; the work units that are erroring out have not finished even the first run through the data after 10 hours (the 6-hour preference plus the 4-hour allowed overrun), at which point the watchdog aborts the process.


Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.


You'll find it under your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.


Strange: every project that has given me errors produces them well under the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.
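
As a rough illustration of pulling that decoy count out of a task's output: the stderr quoted later in this thread contains a summary line like "This process generated 3 decoys from 3 attempts", which a small script can pick up. This is a sketch only; the exact wording may differ between application versions.

import re

# Matches the summary line Rosetta prints near the end of a task,
# e.g. "This process generated 3 decoys from 3 attempts".
DECOY_LINE = re.compile(r"generated\s+(\d+)\s+decoys\s+from\s+(\d+)\s+attempts")

def decoy_count(stderr_text):
    """Return (decoys, attempts) from a task's stderr text, or None if absent."""
    match = DECOY_LINE.search(stderr_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

print(decoy_count("This process generated 3 decoys from 3 attempts"))  # (3, 3)
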
Bryn Mawr

Joined: 26 Dec 18
Posts: 395
Credit: 12,223,668
RAC: 10,295
Message 91871 - Posted: 5 Mar 2020, 11:59:25 UTC - in response to Message 91868.  



Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.


You'll find it under your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.


Strange: every project that has given me errors produces them well under the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.


Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit you set. Since a WU normally closes down at the end of a decoy once it predicts there isn't time to process another before the deadline, this can only really happen if a single decoy runs for more than 4 hours. I'd guess that implies the decoy is stuck in a loop, but I can't say that for certain.
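
A minimal sketch of the behaviour described above, assuming the 8-hour default target and the 4-hour overrun; the function names are illustrative and this is not the actual Rosetta code:

TARGET_CPU_HOURS = 8      # "target CPU time" project preference (default 8 hours)
WATCHDOG_GRACE_HOURS = 4  # overrun allowed before the watchdog steps in

def should_start_another_decoy(cpu_hours_used, estimated_decoy_hours):
    # A new decoy is only started if it is predicted to finish within
    # the target window, so tasks normally end a little early.
    return cpu_hours_used + estimated_decoy_hours <= TARGET_CPU_HOURS

def watchdog_should_abort(cpu_hours_used):
    # The watchdog only fires when the task is still running 4 hours
    # past the target, i.e. a single decoy has badly overrun.
    return cpu_hours_used > TARGET_CPU_HOURS + WATCHDOG_GRACE_HOURS
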
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,005,376
RAC: 16,150
Message 91875 - Posted: 5 Mar 2020, 19:01:19 UTC - in response to Message 91871.  

Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit you set. Since a WU normally closes down at the end of a decoy once it predicts there isn't time to process another before the deadline, this can only really happen if a single decoy runs for more than 4 hours. I'd guess that implies the decoy is stuck in a loop, but I can't say that for certain.


My Rosetta WUs all seem to finish within a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation: the Theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point somewhere in there. Often at only "2% completed" it jumps to 100% and says it was successful. I guess it's looking for an answer somewhere in the data and finds it early?
Bryn Mawr

Joined: 26 Dec 18
Posts: 395
Credit: 12,223,668
RAC: 10,295
Message 91876 - Posted: 5 Mar 2020, 20:43:13 UTC - in response to Message 91875.  

Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit you set. Since a WU normally closes down at the end of a decoy once it predicts there isn't time to process another before the deadline, this can only really happen if a single decoy runs for more than 4 hours. I'd guess that implies the decoy is stuck in a loop, but I can't say that for certain.


My Rosetta WUs all seem to finish within a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation: the Theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point somewhere in there. Often at only "2% completed" it jumps to 100% and says it was successful. I guess it's looking for an answer somewhere in the data and finds it early?


Pass, I’ve never looked at LHC so I wouldn’t know.
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,005,376
RAC: 16,150
Message 91877 - Posted: 5 Mar 2020, 21:01:09 UTC - in response to Message 91876.  

Pass, I’ve never looked at LHC so I wouldn’t know.


They've got ATLAS tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64-core task :-)
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92241 - Posted: 24 Mar 2020, 21:38:52 UTC

All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st-gen Apple TV running Linux (OSMC with the GUI etc. disabled) are failing for going over the RAM limit. See here. E.g.:
working set size > client RAM limit: 167.87MB > 167.55MB

Is there something wrong with how the working set size is matched against the amount of available RAM?

Or can I limit my host to the Rosetta Mini application only?
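
For what it's worth, that log line is just a comparison of the task's working set against the memory the client is allowed to use. A rough model, with the usable fraction standing in for the "use at most N% of memory" computing preference (the 0.65 here is an example value only, not the actual default):

def fits_in_ram(working_set_mb, physical_ram_mb, usable_fraction=0.65):
    # The client RAM limit is some fraction of physical memory, taken
    # from the BOINC computing preferences; 0.65 is illustrative only.
    client_ram_limit_mb = physical_ram_mb * usable_fraction
    return working_set_mb <= client_ram_limit_mb

# A ~168 MB working set on a 256 MB host just misses the cut,
# matching "167.87MB > 167.55MB" above.
print(fits_in_ram(167.87, 256))  # False
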
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92247 - Posted: 25 Mar 2020, 2:07:34 UTC - in response to Message 92241.  

Looks like the same thing is happening for Rosetta Mini tasks, e.g. task 1132535295:
working set size > client RAM limit: 170.39MB > 167.55MB
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92274 - Posted: 25 Mar 2020, 15:16:04 UTC

rlpm, your host profile shows 256MB of memory, and the "mini" tasks require just as much memory as any others. The documentation on minimum host requirements seems to have been moved on the R@h website, so I'm not finding it at the moment, but the basic guideline is 1GB of memory per active CPU core.

I might suggest that you attach the machine to World Community Grid instead. They have a number of bioscience projects running there, which generally run in a smaller memory footprint.
Rosetta Moderator: Mod.Sense
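
Put as arithmetic, the guideline above is simply memory >= 1 GB per active core. A one-line check (a sketch of the rule of thumb, not an official requirement):

def meets_rosetta_guideline(ram_gb, active_cores):
    # Mod.Sense's rule of thumb: roughly 1 GB of memory per active CPU core.
    return ram_gb >= 1.0 * active_cores

print(meets_rosetta_guideline(0.25, 1))  # False: the 256 MB Apple TV falls well short
print(meets_rosetta_guideline(4.0, 2))   # True: e.g. a 2-core laptop with 4 GB
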
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92279 - Posted: 25 Mar 2020, 16:40:34 UTC - in response to Message 92274.  

Thanks Mod.Sense.
It would be nice if BOINC automatically failed early, perhaps even at project attachment, if the host doesn't meet the minimum requirements for any app (RAM, disk, instruction set, OS).
I already have my old 1st-gen Raspberry Pis crunching on TN-Grid (gene sequencing) via BOINC, so I'll do the same with this Apple TV.
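
A hypothetical version of the check rlpm is asking for might look like the sketch below. BOINC does not currently do this, and the thresholds and field names are purely illustrative:

# Purely illustrative minimum requirements; not actual Rosetta@home values.
MIN_REQUIREMENTS = {"ram_gb": 1.0, "free_disk_gb": 2.0, "os_families": {"linux", "windows", "darwin"}}

def host_meets_requirements(host):
    """host is a dict such as {'ram_gb': 0.25, 'free_disk_gb': 8.0, 'os': 'linux'}."""
    return (host["ram_gb"] >= MIN_REQUIREMENTS["ram_gb"]
            and host["free_disk_gb"] >= MIN_REQUIREMENTS["free_disk_gb"]
            and host["os"] in MIN_REQUIREMENTS["os_families"])

print(host_meets_requirements({"ram_gb": 0.25, "free_disk_gb": 8.0, "os": "linux"}))  # False
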
bormolino
Joined: 16 May 13
Posts: 4
Credit: 160,977
RAC: 0
Message 92292 - Posted: 25 Mar 2020, 20:11:24 UTC

The graphics for the Rosetta 4.07 COVID-19 WUs do not work. The graphics window shows "Stage unknown" and "No shared mem".

The graphics for the other WUs work without any problems.
Sid Celery

Joined: 11 Feb 08
Posts: 2129
Credit: 41,382,114
RAC: 14,438
Message 92355 - Posted: 26 Mar 2020, 19:29:06 UTC

I've seen the Rosetta stats for the number of new users who've come on board recently - the user base has basically quadrupled, with massive throughput, which is great.
The number of in-progress tasks is similarly huge - well over a million - more than I can ever remember seeing.

A little earlier this afternoon I saw my buffers were smaller than usual and noticed that a few calls for new tasks had brought none down. This is hardly surprising.

Before I finally got to this page to mention the task shortage, more had come on stream, which is great.

I guess all I'm saying is, especially with all the new users around, if there's an interruption in task supply in the coming days or weeks, we (more accurately, I) need to have a little patience and understanding. It's going to happen, and it's surprising it hasn't happened already.

Great job on keeping the tasks coming through - thanks.
Shaky Jake

Joined: 26 Mar 07
Posts: 2
Credit: 55,684
RAC: 0
Message 92455 - Posted: 28 Mar 2020, 13:58:41 UTC - in response to Message 80621.  

I have an older desktop computer with a Pentium Duo CPU that is having a problem with the COVID-19 workunits. They are erroring out at about 2 minutes.

EXAMPLE:

Task 1134452442
Name 0ef4jx8h_jhr_design1_COVID-19_SAVE_ALL_OUT_903439_1_0

Workunit 1021756085
Created 27 Mar 2020, 9:12:21 UTC
Sent 27 Mar 2020, 9:38:35 UTC
Report deadline 4 Apr 2020, 9:38:35 UTC
Received 28 Mar 2020, 12:10:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 11 (0x0000000B) Unknown error code
Computer ID 3794680
Run time 2 min 15 sec
CPU time 1 min 59 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.08
x86_64-pc-linux-gnu
Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0ef4jx8h_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0ef4jx8h_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3902678
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>

I have seen a couple that did complete and were validated.

EXAMPLE:

Task 1133949909
Name 0gr1iv8s_jhr_design1_COVID-19_SAVE_ALL_OUT_903456_1_0
Workunit 1021309240
Created 26 Mar 2020, 20:05:44 UTC
Sent 26 Mar 2020, 20:22:20 UTC
Report deadline 3 Apr 2020, 20:22:20 UTC
Received 27 Mar 2020, 23:58:09 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x00000000)
Computer ID 3794680
Run time 13 hours 53 min 23 sec
CPU time 10 hours 30 min 46 sec
Validate state Valid
Credit 222.11
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.07
i686-pc-linux-gnu
Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0gr1iv8s_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0gr1iv8s_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3546964
Starting watchdog...
Watchdog active.
======================================================
DONE :: 3 starting structures 37846.6 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: WS_max 9.36336e-97

BOINC :: Watchdog shutting down...
18:53:10 (26863): called boinc_finish(0)

</stderr_txt>
]]>


Should I stop using this computer for this project, or let it continue? All of the other workunits appear to process with no problems.
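
One detail worth noticing in both command lines above is -cpu_run_time 28800, i.e. 28800 seconds = 8 hours, the default target CPU time discussed earlier in the thread. A small helper to pull that out of a command line like the ones quoted above (a sketch; only the flag value itself is taken from the post):

import shlex

def cpu_run_time_hours(command_line):
    """Extract the -cpu_run_time value (in seconds) from a Rosetta command
    line and convert it to hours."""
    tokens = shlex.split(command_line)
    seconds = int(tokens[tokens.index("-cpu_run_time") + 1])
    return seconds / 3600.0

print(cpu_run_time_hours("rosetta_4.08 -run:protocol jd2_scripting -cpu_run_time 28800 -watchdog"))  # 8.0
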
IBM01902

Joined: 23 Mar 20
Posts: 3
Credit: 43,044
RAC: 0
Message 92460 - Posted: 28 Mar 2020, 14:40:07 UTC - in response to Message 92455.  

I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92464 - Posted: 28 Mar 2020, 15:16:30 UTC - in response to Message 92460.  

<message>
process got signal 11
</message>

The process is crashing. More info:
 SIGSEGV      11       Core    Invalid memory reference

The people with access to the code will have to look into it. I don't know whether there are any crash reports (stack traces, etc.) that you can pull to provide more information to them.
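
For anyone wanting to decode such exit statuses themselves, Python's standard library can translate the signal number. This only illustrates the lookup; it won't tell you why the Rosetta binary segfaulted:

import signal

# Exit status 11 here reflects the signal that killed the process.
print(signal.Signals(11).name)  # 'SIGSEGV' on Linux
print(signal.strsignal(11))     # 'Segmentation fault' (Python 3.8+)
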
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,005,376
RAC: 16,150
Message 92468 - Posted: 28 Mar 2020, 16:21:24 UTC - in response to Message 92460.  

I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.


Working OK for me on all my computers. My oldest is an Intel Q8400 (about 10 years old).

It's a pity you can't select which sub-projects to run in the Rosetta preferences. Most projects let you pick which ones to allow, so you can block the ones that don't work on your machines.

I guess as long as some of them work, you should keep going. Sending one back with an error just means the server will try someone else.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92474 - Posted: 28 Mar 2020, 17:18:13 UTC - in response to Message 92455.  

@Shaky Jake. I see you have two machines. It appears the one with 2 CPUs and 2GB of memory is where the errors are occurring most (the other machine has 2 CPUs and 4GB). This is consistent with what I have gleaned from others as well. I believe the Project Team will be tagging the COVID tasks as requiring more memory in the coming days. This should help things run more smoothly going forward.
Rosetta Moderator: Mod.Sense
Shaky Jake

Joined: 26 Mar 07
Posts: 2
Credit: 55,684
RAC: 0
Message 92489 - Posted: 28 Mar 2020, 21:01:16 UTC - in response to Message 92455.  
Last modified: 28 Mar 2020, 21:10:21 UTC

I found the problem. I am short 0.1 GB of memory, so when 2 COVID-19 WUs try to run, one of them will fail due to lack of memory. I have ordered additional memory. Until it arrives I have set the computer to run only 1 WU at a time.


Thanks Mod.Sense

Everything seems to be running OK using only 1 core. I am going to upgrade to 4GB of memory, which I think will solve the problem. My other computer is a laptop with 2 cores and 4GB of memory, and it has had no problems.

Shaky Jake
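
For reference, besides lowering the "use at most N% of the CPUs" preference, one commonly used way to cap concurrent Rosetta tasks is an app_config.xml in the project directory (the directory name is visible in the command paths quoted earlier in the thread). This assumes a reasonably recent BOINC client; after saving the file, use the manager's "Read config files" option or restart the client:

<!-- projects/boinc.bakerlab.org_rosetta/app_config.xml -->
<app_config>
    <!-- Run at most one Rosetta task at a time on this host -->
    <project_max_concurrent>1</project_max_concurrent>
</app_config>
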
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92490 - Posted: 28 Mar 2020, 21:22:44 UTC - in response to Message 92489.  
Last modified: 28 Mar 2020, 21:28:47 UTC

The binaries should check that there's enough memory for the WU, both at process start time and by checking the results of malloc etc. at run time. Since the process on your computer hit a segfault, it may be that a memory allocation failed but the software didn't check the result of the allocation. There must be some checking in the 32-bit Linux version of the Rosetta and Rosetta Mini binaries, since I've encountered this error message on an older box with only 256MB of memory:
working set size > client RAM limit: 180.00MB > 179.51MB

(But it would be nice to have the check happen ahead of time -- before sending the WU to the computer.)
bormolino
Joined: 16 May 13
Posts: 4
Credit: 160,977
RAC: 0
Message 92491 - Posted: 28 Mar 2020, 21:24:50 UTC

The graphics for the Rosetta 4.07 COVID-19 WUs do not work. The graphics window shows "Stage unknown" and "No shared mem".

The graphics for the other WUs work without any problems.
EHM-1
Joined: 21 Mar 20
Posts: 23
Credit: 183,782
RAC: 0
Message 92534 - Posted: 29 Mar 2020, 15:37:52 UTC
Last modified: 29 Mar 2020, 15:41:28 UTC

Hello all - longtime SETI@home user here, new to Rosetta. I hope I'm posting in the right place; please advise me if not.
I attached several days ago, and the screensaver was displaying what I would expect for processing until a couple of days ago. Since at least yesterday morning (midday Mar 28 UT), the processing screen displays what I would call a blank template, with no indication that anything is being processed. See image below.
Any ideas? Anyone else encountering this? I could find no mention of anything similar in the forums.
Thanks in advance for any help.
Eric
PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post.




©2024 University of Washington
https://www.bakerlab.org