Rosetta 4.1+ and 4.2+

Author	Message
MarkJ Joined: 28 Mar 20 Posts: 72 Credit: 25,292,180 RAC: 0	Message 97978 - Posted: 9 Jul 2020, 11:43:13 UTC
I had a bunch of TFSCAFFOLD0001 tasks fail. I’m currently up to 117 failures. My wingman on these has also failed, so I don’t think it’s my machines (9 of them). Some seem to work, but the dud ones fail within 18 seconds, so at least they don’t waste much compute time. The logs say it completed 1 decoy and then this:
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_9ji0sq4q_953352_10_1_r2051777753_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
BOINC blog

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 97981 - Posted: 9 Jul 2020, 14:26:16 UTC - in response to Message 97978. Last modified: 9 Jul 2020, 14:28:50 UTC
I've seen 4 recent tasks on my computer that failed with an upload error, each with names starting with TFSCAFFOLD0001_ and with very short run times. I also have two such tasks that have run for at least two hours each. A guess at what went wrong: The application encountered an error that it was not set up to report at all, and therefore declared that it was finished without doing anything to produce the file that failed to upload. Some of the event log:
7/9/2020 8:45:41 AM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0
7/9/2020 8:45:58 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 finished
7/9/2020 8:45:58 AM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0_r1023400029_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 absent
7/9/2020 8:46:01 AM | Rosetta@home | Starting task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0
7/9/2020 8:46:21 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0 finished
7/9/2020 8:46:21 AM | Rosetta@home | Output file TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0_r541959699_0 for task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0 absent
7/9/2020 8:47:12 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0 finished
7/9/2020 8:47:14 AM | Rosetta@home | Started upload of TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0_r1048254222_0
7/9/2020 8:47:19 AM | Rosetta@home | Finished upload of TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0_r1048254222_0
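For anyone who wants to count these failures rather than eyeball them, here is a minimal Python sketch that scans event-log lines like the ones above for the "Output file ... absent" pattern and reports how long each affected task ran. The line layout is assumed from the excerpt only, and the names `find_absent_outputs`, `LINE_RE` and `sample` are hypothetical helpers, not part of BOINC.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt: "<date> <time> AM/PM | <project> | <message>".
# An optional backslash before each pipe is tolerated, since some copies escape it.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+ [AP]M) \\?\| (?P<project>[^|\\]+?) \\?\| (?P<msg>.*)$"
)
START_RE = re.compile(r"^Starting task (\S+)$")
FINISH_RE = re.compile(r"^Computation for task (\S+) finished$")
ABSENT_RE = re.compile(r"^Output file \S+ for task (\S+) absent$")


def find_absent_outputs(lines, ts_format="%m/%d/%Y %I:%M:%S %p"):
    """Map each task whose output file was absent to its run time in seconds.

    The run time is None when the log excerpt does not contain the task's start.
    """
    started, runtime, absent = {}, {}, {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an event-log line in the assumed layout
        when = datetime.strptime(m["ts"], ts_format)
        msg = m["msg"].strip()
        if (s := START_RE.match(msg)):
            started[s[1]] = when
        elif (f := FINISH_RE.match(msg)):
            if f[1] in started:
                runtime[f[1]] = (when - started[f[1]]).total_seconds()
        elif (a := ABSENT_RE.match(msg)):
            absent[a[1]] = runtime.get(a[1])
    return absent


sample = [
    "7/9/2020 8:45:41 AM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0",
    "7/9/2020 8:45:58 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 finished",
    "7/9/2020 8:45:58 AM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0_r1023400029_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 absent",
    "7/9/2020 8:47:12 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0 finished",
]

for task, seconds in find_absent_outputs(sample).items():
    print(task, seconds)
```

Logs written with day-first dates (as on some other hosts in this thread) would need `ts_format="%d/%m/%Y %I:%M:%S %p"` instead; the timestamp format depends on the machine's locale, so treat the default as a guess.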

Curt3g Joined: 30 Mar 20 Posts: 4 Credit: 1,908,126 RAC: 0	Message 97982 - Posted: 9 Jul 2020, 14:34:37 UTC - in response to Message 97981.
"I've seen 4 recent tasks on my computer that failed with an upload error, each with names starting with TFSCAFFOLD0001_ and with very short run times."
Same here. I've had 17 failures of this species of WU within the last 24 hours (on a single machine).

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 0	Message 97984 - Posted: 9 Jul 2020, 15:16:34 UTC
Failures for me too, on all of my 6 Windows 10 PCs. Some scaffold tasks work, some don't. The failures only take 20 seconds, so I'll just leave them trying as some work. Non-scaffold tasks all work on all machines.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3792849&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=4360598&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=4368214&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3745283&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3746264&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3772248&offset=0&show_names=0&state=6&appid=
09/07/2020 3:13:32 PM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1
09/07/2020 3:13:54 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 finished
09/07/2020 3:13:54 PM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1_r1892657513_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 absent
09/07/2020 3:13:59 PM | Rosetta@home | Starting task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1
09/07/2020 3:14:02 PM | Rosetta@home | [sched_op] Deferring communication for 00:01:26
09/07/2020 3:14:02 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1
09/07/2020 3:14:17 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 finished
09/07/2020 3:14:17 PM | Rosetta@home | Output file TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1_r1282681604_0 for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 absent
09/07/2020 3:14:27 PM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1
09/07/2020 3:14:27 PM | Rosetta@home | [sched_op] Deferring communication for 00:02:36
09/07/2020 3:14:27 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1
09/07/2020 3:14:49 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 finished
09/07/2020 3:14:49 PM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1_r9418955_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 absent
09/07/2020 3:14:54 PM | Rosetta@home | [sched_op] Deferring communication for 00:06:53
09/07/2020 3:14:54 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1
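The `[sched_op] Deferring communication for ...` lines in the log above show the client waiting longer after each consecutive failed task (00:01:26, then 00:02:36, then 00:06:53). A small Python sketch for pulling those intervals out of a log, assuming the `HH:MM:SS` layout seen in this excerpt; `deferral_intervals` and `DEFER_RE` are hypothetical names:

```python
import re
from datetime import timedelta

# Matches the deferral message exactly as it appears in the excerpt above.
DEFER_RE = re.compile(
    r"\[sched_op\] Deferring communication for (\d+):(\d{2}):(\d{2})"
)


def deferral_intervals(lines):
    """Return the deferral durations found in the log lines, in log order."""
    found = []
    for line in lines:
        m = DEFER_RE.search(line)
        if m:
            hours, minutes, seconds = map(int, m.groups())
            found.append(timedelta(hours=hours, minutes=minutes, seconds=seconds))
    return found


log = [
    "09/07/2020 3:14:02 PM | Rosetta@home | [sched_op] Deferring communication for 00:01:26",
    "09/07/2020 3:14:27 PM | Rosetta@home | [sched_op] Deferring communication for 00:02:36",
    "09/07/2020 3:14:54 PM | Rosetta@home | [sched_op] Deferring communication for 00:06:53",
]
print([int(d.total_seconds()) for d in deferral_intervals(log)])
```

Three data points are not enough to say how the client chooses the delay; the sketch only makes the growth visible.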

Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,390,659 RAC: 0	Message 97988 - Posted: 9 Jul 2020, 22:15:52 UTC
8 TFSCAFFOLD0001 tasks here on my Ryzen Win10 machine, all upload failures with very short runtimes. Example:
TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1
<core_client_version>7.16.7</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol marA_boinc_v1.xml @flags_TFSCAFFOLD -in:file:silent TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.zip @TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2479129
Using database: database_357d5d93529_n_methylminirosetta_database
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0
10:58:15 (5968): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1_r428283726_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
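The two error codes reported so far in this thread (-161 "not found" and -240 "stat() failed") both sit inside the `<message>` element of the task output. As a rough sketch, the snippet below pulls the file name and error code out of such output with Python's standard library. It assumes the `<message>` block is well-formed XML on its own, as it is in the example above; `upload_errors` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# The full task output is not valid XML, so cut out each <message> block first.
MESSAGE_RE = re.compile(r"<message>.*?</message>", re.DOTALL)
# Splits "-240 (stat() failed)" into the numeric code and the description.
CODE_RE = re.compile(r"^\s*(-?\d+)\s*\((.*)\)\s*$")


def upload_errors(task_output):
    """Return a list of (file_name, error_code, description) tuples."""
    results = []
    for block in MESSAGE_RE.findall(task_output):
        try:
            root = ET.fromstring(block)
        except ET.ParseError:
            continue  # skip blocks that are not well-formed
        for err in root.iter("file_xfer_error"):
            name = (err.findtext("file_name") or "").strip()
            m = CODE_RE.match(err.findtext("error_code") or "")
            if m:
                results.append((name, int(m.group(1)), m.group(2)))
    return results


output = (
    "10:58:15 (5968): called boinc_finish(0) </stderr_txt> "
    "<message> upload failure: <file_xfer_error> "
    "<file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1_r428283726_0</file_name> "
    "<error_code>-240 (stat() failed)</error_code> "
    "</file_xfer_error> </message> ]]>"
)
print(upload_errors(output))
```

Grouping the results by error code would show whether -161 and -240 come from different hosts or client versions, which the thread does not settle.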

James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0	Message 97994 - Posted: 10 Jul 2020, 8:56:34 UTC - in response to Message 97988.
4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours. Example:
WU: 1092308808
Task: 1217343606
Application: Rosetta v4.20 windows_x86_64
Errors: Too many errors (may have bug), Too many total results
I was wingman for the original host, so we both errored out. Had the same error info as others reporting here. Hopefully the powers that be got word of this problem WU and will let the researcher(s) know to correct the problem.

Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,390,659 RAC: 0	Message 98003 - Posted: 11 Jul 2020, 6:08:59 UTC
Another TFSCAFFOLD0001 errored out on my end, that was expected. One task ended prematurely due to a power outage for me, although it validated fine and wasn't marked as a computational error.
11527f43_fold_SAVE_ALL_OUT_951790_903_0
<core_client_version>7.16.7</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers.index -in:file:native 00001.pdb -silent_gz 1 -frag9 00001.200.9mers.index -out:file:silent default.out -ex1 1 -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 5 -abinitio::use_filters false -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -beta 1 -abinitio::rg_reweight 0.5 -in:file:boinc_wu_zip 11527f43_data.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3121464
Using database: database_357d5d93529_n_methylminirosetta_database
ERROR: Fullatom mismatch in checkpointer.
ERROR:: Exit from: ......srcprotocolscheckpointCheckPointer.cc line: 380
01:10:25 (15132): called boinc_finish(0)
</stderr_txt>
]]>
Methinks something went wrong with the check-pointing system?

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 6	Message 98015 - Posted: 11 Jul 2020, 18:58:49 UTC - in response to Message 97994.
"4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours."
Still TFSCAFFOLD errors.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 98018 - Posted: 11 Jul 2020, 20:39:30 UTC - in response to Message 98015.
"4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours."
"Still TFSCAFFOLD errors."
I've had a few run to completion, but I'd estimate around a 95%+ failure rate for those particular Work Units, none making it to 30 seconds before erroring out. A very bad batch of work.
Grant Darwin NT

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 6	Message 98025 - Posted: 12 Jul 2020, 7:06:38 UTC - in response to Message 98018.
"A very bad batch of work."
The same TFSCAFFOLD batch went through Ralph@Home before, with the same errors. I cannot understand why they published these WUs on Rosetta...

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 0	Message 98030 - Posted: 12 Jul 2020, 17:34:32 UTC - in response to Message 98018.
"I've had a few run to completion, but I'd estimate around a 95%+ failure rate for those particular Work Units, none making it to 30 seconds before erroring out. A very bad batch of work."
Strange, I'm only getting about 20-30% failing. Anyway, nothing to worry about, since they only waste 20 seconds. They'll see them all coming back with a problem and sort it. The server won't resend them repeatedly; it'll give up after a few hosts reject it.
"resetting the project would have them marked as Abandoned."
If I reset a project, does it inform the project of the not-done work units? Or should I always abort and update first?
"One task ended prematurely due to a power outage for me, although it validated fine and wasn't marked as a computational error."
Presumably that's because Rosetta units comprise a number of modules, and you had probably completed some of them. It's strange it broke the unit though; normally it just continues, or resumes from the last checkpoint. Maybe the power went off as it was saving to disk?

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 98035 - Posted: 13 Jul 2020, 3:10:04 UTC - in response to Message 98030.
"If I reset a project, does it inform the project of the not-done work units? Or should I always abort and update first?"
Resetting aborts all workunits and they are put back in the queue... SOME projects resend them to the original user during the reset process, but not all do that. How many you get back, if a project does resend them, will depend on your cache settings. To directly answer your question: 'it's 6 of one, half a dozen of another' on the way you do it.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 98038 - Posted: 13 Jul 2020, 7:27:43 UTC - in response to Message 98035.
"If I reset a project, does it inform the project of the not-done work units? Or should I always abort and update first?"
"Resetting aborts all workunits and they are put back in the queue... SOME projects resend them to the original user during the reset process, but not all do that. How many you get back, if a project does resend them, will depend on your cache settings. To directly answer your question: 'it's 6 of one, half a dozen of another' on the way you do it."
The advantage of resetting is that it also deletes all the project files and then re-downloads them. So if you have a corrupted application (or support) file, resetting the project will sort out that issue, whereas aborting & updating won't (it just gets rid of the present cache of Tasks and then gets new ones).
Grant Darwin NT

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 6	Message 98040 - Posted: 13 Jul 2020, 9:31:35 UTC - in response to Message 98030.
"Anyway, nothing to worry about, since they only waste 20 seconds."
I am worried, because it's a waste of time and of downloads. The problem was reported days ago, and there is still no solution or message from the admins. This morning, 3 more errors.
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_7_SAVE_ALL_OUT_IGNORE_THE_REST_2nx9yq6a_953358_14_1_r142948493_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
Please, STOP THIS BATCH!

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 6	Message 98050 - Posted: 13 Jul 2020, 14:44:06 UTC
1220249128
[ ERROR ]: Caught exception:
File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
------------------------ Begin developer's backtrace -------------------------
BACKTRACE:
------------------------- End developer's backtrace --------------------------
AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.
</stderr_txt>

James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0	Message 98069 - Posted: 14 Jul 2020, 7:05:34 UTC - in response to Message 98018.
"4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours."
"Still TFSCAFFOLD errors."
"I've had a few run to completion, but I'd estimate around a 95%+ failure rate for those particular Work Units, none making it to 30 seconds before erroring out. A very bad batch of work."
In approximately the last 24 hours (7/13-7/14/20) I had 4 tasks error out on my 2 hosts, while 3 completed and validated. It appears quite random. Also, no WU had one valid and one failed task; each either validated with the first host or failed with both hosts. Definitely an issue with certain tasks having "issues."

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 0	Message 98114 - Posted: 15 Jul 2020, 19:40:41 UTC - in response to Message 98040. Last modified: 15 Jul 2020, 19:41:17 UTC
"I am worried, because it's a waste of time and of downloads. The problem was reported days ago, and there is still no solution or message from the admins. This morning, 3 more errors.
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_7_SAVE_ALL_OUT_IGNORE_THE_REST_2nx9yq6a_953358_14_1_r142948493_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
Please, STOP THIS BATCH!"
Is it really that big a deal? Your computer is only wasting 20 seconds on a buggered 8 hour task. Most of them work. Those that don't just get sent back, and they'll realise their error eventually. The project has plenty of bandwidth, unlike Universe.
"Resetting aborts all workunits and they are put back in the queue... SOME projects resend them to the original user during the reset process, but not all do that. How many you get back, if a project does resend them, will depend on your cache settings. To directly answer your question: 'it's 6 of one, half a dozen of another' on the way you do it."
Thanks, I'll do that in future. I used to set no new tasks, abort unstarted work, update the project, then reset. Actually, come to think of it, I more often detach. Does that also send back unfinished units?

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 98118 - Posted: 15 Jul 2020, 23:45:26 UTC - in response to Message 98114.
"Resetting aborts all workunits and they are put back in the queue... SOME projects resend them to the original user during the reset process, but not all do that. How many you get back, if a project does resend them, will depend on your cache settings. To directly answer your question: 'it's 6 of one, half a dozen of another' on the way you do it."
"Thanks, I'll do that in future. I used to set no new tasks, abort unstarted work, update the project, then reset. Actually, come to think of it, I more often detach. Does that also send back unfinished units?"
Yes, just not as quickly. They go back only after they expire, because the server is waiting for you to finish them; it has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first, but at most projects with deadlines of 10 to 14 days or more there's not a lot of difference, except for those people who have a small queue and are waiting for a wingman.

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 0	Message 98127 - Posted: 16 Jul 2020, 17:17:41 UTC - in response to Message 98118.
"Yes, just not as quickly. They go back only after they expire, because the server is waiting for you to finish them; it has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first, but at most projects with deadlines of 10 to 14 days or more there's not a lot of difference, except for those people who have a small queue and are waiting for a wingman."
I disagree: on a project that gives you two weeks to do them, you could be delaying stuff for another 2 weeks. BOINC really should tell the server about any tasks you won't be doing.
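The disagreement above comes down to simple date arithmetic. The Python sketch below models only what the posts describe: an aborted task that is reported can be reissued right away, while a task left behind on detach is reissued only once its deadline passes. This is a deliberately simplified picture, not a description of how any particular BOINC server is configured, and `earliest_reissue` is a hypothetical name.

```python
from datetime import datetime, timedelta


def earliest_reissue(sent, deadline_days, aborted_and_reported_at=None):
    """Earliest time the server could hand the task to another host.

    Simplified model (an assumption): without a report from the client,
    the server waits for the deadline; with one, it can act immediately.
    """
    deadline = sent + timedelta(days=deadline_days)
    if aborted_and_reported_at is not None:
        return min(aborted_and_reported_at, deadline)
    return deadline


sent = datetime(2020, 7, 15, 12, 0, 0)
left = sent + timedelta(hours=1)  # the user detaches one hour after download

for days in (3, 14):
    silent = earliest_reissue(sent, days)
    reported = earliest_reissue(sent, days, aborted_and_reported_at=left)
    print(f"{days}-day deadline: delay avoided by aborting first = {silent - reported}")
```

Under these assumptions, aborting first saves almost the full deadline in both cases, so the longer the deadline, the longer a wingman waits when a host leaves silently.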

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 98139 - Posted: 16 Jul 2020, 20:44:07 UTC - in response to Message 98127.
"Yes, just not as quickly. They go back only after they expire, because the server is waiting for you to finish them; it has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first, but at most projects with deadlines of 10 to 14 days or more there's not a lot of difference, except for those people who have a small queue and are waiting for a wingman."
"I disagree: on a project that gives you two weeks to do them, you could be delaying stuff for another 2 weeks. BOINC really should tell the server about any tasks you won't be doing."
PrimeGrid tracks the workunits to a degree, but I don't know of any project that does what you are suggesting.