Problems and Technical Issues with Rosetta@home

Author	Message
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 92274 - Posted: 25 Mar 2020, 15:16:04 UTC rlpm, your host profile shows 256MB of memory. And the "mini" tasks require just as much memory as any others. They seem to have moved the documentation on minimum host requirements on the R@h website, so I'm not finding it at the moment. But the basic guideline is 1GB of memory per active CPU core. I might suggest that you attach the machine to World Community Grid. They have a number of bioscience projects running there, and generally can run in a smaller memory footprint. Rosetta Moderator: Mod.Sense ID: 92274 · Rating: 0 · rate: / Reply Quote

rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92279 - Posted: 25 Mar 2020, 16:40:34 UTC - in response to Message 92274. Thanks Mod.Sense. It would be nice if BOINC automatically failed early, perhaps even at project attachment, if the host doesn't meet the minimum requirements for any app (RAM, disk, instruction set, OS). I already have my old 1st gen RasPis crunching on TN-Grid (gene sequencing) via BOINC, so I'll do the same with this AppleTV. ID: 92279 · Rating: 0 · rate: / Reply Quote

bormolino Send message Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0	Message 92292 - Posted: 25 Mar 2020, 20:11:24 UTC The graphics of the Rosetta 4.07 WU for COVID-19 does not work. It shows "Stage unknown" and "No shared mem" inside the graphics-window. The graphics of the other WUs are working without any problems. ID: 92292 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 389	Message 92355 - Posted: 26 Mar 2020, 19:29:06 UTC I've seen the Rosetta stats for the number of new users who've come on board recently - basically quadrupled with massive throughput, which is great. The number of in-progress tasks is similarly huge - well over a million - more than I can ever remember seeing. A little earlier this afternoon I saw my buffers were smaller than usual and noticed that a few calls for new tasks had brought none down. This is hardly surprising. Before I finally got to this page to mention the task shortage, more had come on stream, which is great. I guess all I'm saying is, especially with all the new users around, if there's an interruption in task supply in the coming daysweeks, we (more accurately, I) need to have a little patience and understanding. It's going to happen and it's surprising it hasn't happened already. Great job on keeping the tasks coming through - thanks. ID: 92355 · Rating: 0 · rate: / Reply Quote

Shaky Jake Send message Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0	Message 92455 - Posted: 28 Mar 2020, 13:58:41 UTC - in response to Message 80621. I have an older desktop computer with a Pentium Duo cpu that is having a problem with the COVID-19 workunits. They are erroring out at about 2 min. EXAMPLE: Task 1134452442 Name 0ef4jx8h_jhr_design1_COVID-19_SAVE_ALL_OUT_903439_1_0 Workunit 1021756085 Created 27 Mar 2020, 9:12:21 UTC Sent 27 Mar 2020, 9:38:35 UTC Report deadline 4 Apr 2020, 9:38:35 UTC Received 28 Mar 2020, 12:10:42 UTC Server state Over Outcome Computation error Client state Compute error Exit status 11 (0x0000000B) Unknown error code Computer ID 3794680 Run time 2 min 15 sec CPU time 1 min 59 sec Validate state Invalid Credit 0.00 Device peak FLOPS 1.87 GFLOPS Application version Rosetta v4.08 x86_64-pc-linux-gnu Stderr output <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0ef4jx8h_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0ef4jx8h_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3902678 Starting watchdog... Watchdog active. </stderr_txt> ]]> I have seen a couple that did complete and were validated. EXAMPLE: Task 1133949909 Name 0gr1iv8s_jhr_design1_COVID-19_SAVE_ALL_OUT_903456_1_0 Workunit 1021309240 Created 26 Mar 2020, 20:05:44 UTC Sent 26 Mar 2020, 20:22:20 UTC Report deadline 3 Apr 2020, 20:22:20 UTC Received 27 Mar 2020, 23:58:09 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 3794680 Run time 13 hours 53 min 23 sec CPU time 10 hours 30 min 46 sec Validate state Valid Credit 222.11 Device peak FLOPS 1.87 GFLOPS Application version Rosetta v4.07 i686-pc-linux-gnu Stderr output <core_client_version>7.2.42</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0gr1iv8s_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0gr1iv8s_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3546964 Starting watchdog... Watchdog active. ====================================================== DONE :: 3 starting structures 37846.6 cpu seconds This process generated 3 decoys from 3 attempts ====================================================== BOINC :: WS_max 9.36336e-97 BOINC :: Watchdog shutting down... 18:53:10 (26863): called boinc_finish(0) </stderr_txt> ]]> Should I stop using this computer for this project or let it continue. All of the other workunits appear to process with no problems. ID: 92455 · Rating: 0 · rate: / Reply Quote

IBM01902 Send message Joined: 23 Mar 20 Posts: 3 Credit: 43,044 RAC: 0	Message 92460 - Posted: 28 Mar 2020, 14:40:07 UTC - in response to Message 92455. I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me. ID: 92460 · Rating: 0 · rate: / Reply Quote

rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92464 - Posted: 28 Mar 2020, 15:16:30 UTC - in response to Message 92460. <message> process got signal 11 </message> The process is crashing. More info: SIGSEGV 11 Core Invalid memory reference The people with access to the code will have to look into it. I don't know whether there are any crash reports (stack traces, etc.) that you can pull to provide more information to them. ID: 92464 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,011,009 RAC: 3	Message 92468 - Posted: 28 Mar 2020, 16:21:24 UTC - in response to Message 92460. I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me. Working ok for me on all my computers. My oldest is an Intel Q8400 (about 10 years old). It's a pity you can't select which sub projects to run in the Rosetta preferences. Most projects allow you to pick which ones, so you can block the ones that don't work on your machines. I guess as long as some of them work, you should keep going. Sending one back with an error just means the server will try someone else. ID: 92468 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 92474 - Posted: 28 Mar 2020, 17:18:13 UTC - in response to Message 92455. @Shaky Jake. I see you have two machines. It appears the one with 2 CPUs and 2GB of memory is where the errors are occurring the most (the other machine has 2CPUs and 4GB). This is consistent with what I have gleaned from others as well. I believe the Project Team will be tagging the COVID tasks as requiring more memory in the coming days. This should help things run smoother going forward. Rosetta Moderator: Mod.Sense ID: 92474 · Rating: 0 · rate: / Reply Quote

Shaky Jake Send message Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0	Message 92489 - Posted: 28 Mar 2020, 21:01:16 UTC - in response to Message 92455. Last modified: 28 Mar 2020, 21:10:21 UTC I found the problem. I am short .1 GB of memory so when 2 COVID-19 WUs try to run, one of them will fail due to lack of memory. I have ordered additional memory. Until it arrives I have set the computer to use run only 1 WU at a time. Thanks Mod.Sense Every thing seems to be running OK by using only 1 core. I am going to upgrade to 4GB of memory. I think that will solve the problem. My other computer is a laptop with 2 cores and 4GB memory and it has had no problems. Shaky Jake ID: 92489 · Rating: 0 · rate: / Reply Quote

rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92490 - Posted: 28 Mar 2020, 21:22:44 UTC - in response to Message 92489. Last modified: 28 Mar 2020, 21:28:47 UTC The binaries should check that there's enough memory for the WU, both at process start time, and checking results of malloc, etc. at run time. Since the process on your computer hit a segfault, it may have been due to a memory allocation failing but the software not checking the result of the allocation. There must be some checking in the 32-bit (for linux) version of the Rosetta & Rosetta Mini binaries, since I've encountered this error message on an older box with only 256MB of memory: working set size > client RAM limit: 180.00MB > 179.51MB (But it would be nice to have the check happen ahead of time -- before sending the WU to the computer.) ID: 92490 · Rating: 0 · rate: / Reply Quote

bormolino Send message Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0	Message 92491 - Posted: 28 Mar 2020, 21:24:50 UTC The graphics of the Rosetta 4.07 WU for COVID-19 does not work. It shows "Stage unknown" and "No shared mem" inside the graphics-window. The graphics of the other WUs are working without any problems. ID: 92491 · Rating: 0 · rate: / Reply Quote

EHM-1 Send message Joined: 21 Mar 20 Posts: 23 Credit: 183,782 RAC: 0	Message 92534 - Posted: 29 Mar 2020, 15:37:52 UTC Last modified: 29 Mar 2020, 15:41:28 UTC Hello all- Longtime SETI@Home user here, new to Rosetta. Hope I'm posting in the right place; please advise me if not. I attached several days ago, and the screensaver was displaying what I would expect for processing until a couple days ago. Since at least yesterday morning (midday Mar 28 UT), the processing screen displays what I would call a blank template, with no indication that anything is being processed. See image below. Any ideas? Anyone else encountering this? I could find no mention of anything similar in the forums. Thanks in advance for any help. Eric PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post. ID: 92534 · Rating: 0 · rate: / Reply Quote

bormolino Send message Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0	Message 92558 - Posted: 29 Mar 2020, 18:34:00 UTC - in response to Message 92534. PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post. Yes :D Same on my machines. ID: 92558 · Rating: 0 · rate: / Reply Quote

EHM-1 Send message Joined: 21 Mar 20 Posts: 23 Credit: 183,782 RAC: 0	Message 92572 - Posted: 30 Mar 2020, 0:17:14 UTC - in response to Message 92558. Follow-up to my earlier post: At the most recent screensaver invocation, the normal behavior resumed. Note: Though subscribed to this thread, I received no notification of bormolino's post. Eric ID: 92572 · Rating: 0 · rate: / Reply Quote

rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92573 - Posted: 30 Mar 2020, 0:23:19 UTC - in response to Message 92572. Note: Though subscribed to this thread, I received no notification of bormolino's post. Check your community prefs from your main account page. ID: 92573 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 389	Message 92581 - Posted: 30 Mar 2020, 2:21:05 UTC Not sure what this means atm 30/03/2020 3:17:00 \| Rosetta@home \| Scheduler request completed: got 0 new tasks 30/03/2020 3:17:00 \| Rosetta@home \| Server can't open database Also, entering this thread I initially got a message saying the site was down. Came back on a refresh ID: 92581 · Rating: 0 · rate: / Reply Quote

amgthis Send message Joined: 25 Mar 06 Posts: 81 Credit: 203,879,282 RAC: 0	Message 92582 - Posted: 30 Mar 2020, 2:43:32 UTC getting an 'temporarily failed upload of (w/u name here xxx ) transient http error' message on upload failure and time out. I'm guessing it's just some new message I've never seen and the project is just getting updated, etc. ID: 92582 · Rating: 0 · rate: / Reply Quote

HPE Belgium Send message Joined: 27 Mar 20 Posts: 16 Credit: 367,648,439 RAC: 0	Message 92589 - Posted: 30 Mar 2020, 9:22:12 UTC Hello, I have some servers that I want to use for R@H. Most of the servers use full CPU and all cores/logical CPU's, however I have 2 servers that only use half of the available logical processor. Both servers are ProLiant Gen9 servers. One server is a BL660c Gen9 with 32 logical CPU's but only half of them are working while I still have tasks "ready to start". Other server is DL380 Gen9 which takes 67% CPU load instad of 100% My other servers are Gen8 servers which take full load. Is there something I can do to fix this? Somebody that can help me troubleshoot? All my preferences are set to 100% load in my global preferences and this setting works fine on most of my servers. ID: 92589 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1926 Credit: 18,534,891 RAC: 0	Message 92591 - Posted: 30 Mar 2020, 9:37:39 UTC - in response to Message 92589. Is there something I can do to fix this? Somebody that can help me troubleshoot? All my preferences are set to 100% load in my global preferences and this setting works fine on most of my servers. Are they "Ready to start" or "Waiting on memory?"- they've got enough RAM to support all of those cores & threads? You haven't changed any settings in the BOINC Manager on those systems (local settings override web based ones)? Grant Darwin NT ID: 92591 · Rating: 0 · rate: / Reply Quote